Ch. 8. Lemmatisation of manuscript text

Preliminary revised version by OEH (25 August 2003)

8.1 Introduction
8.2 The attribute lemma
8.3 The attribute pos
8.4 Homography and zero values
8.5 Word classes

8.1 Introduction

In ch. 2.3 we suggested that the word, <w>, is a basic unit in any transcription. The <w> element can easily be supplied with information about the dictionary entry and the grammatical analysis for every word in a manuscript text. As specified in the TEI guidelines, this information can be provided by the two attributes lemma and pos. In the present chapter, the basic principles for lemmatisation and grammatical encoding of manuscript text are treated. It is important to note that this chapter should be seen as a suggestion rather than as a definitive set of guidelines. The elements and attributes to be discussed are:

Element   Contents
<w>       delimits a grammatical word.
lemma     gives the lexical form of the grammatical word.
pos       gives the morphosyntactic analysis of the grammatical word.

It is essential that the lemmatisation of Medieval Nordic manuscript text is done in adherence to the principles developed for handling large corpora in linguistic research. We have found the guidelines provided by EAGLES (1996) to be particularly useful, but have decided to deviate somewhat from these guidelines in order to produce a more self-explanatory, although slightly more verbose, system.

The model provided here is aimed at Medieval Norwegian and Icelandic texts. For Medieval Swedish and Danish texts and also for later Norwegian texts, we can expect a radical levelling in the grammatical system, e.g. in the nominal and verbal inflections. The model provided here will therefore overgenerate when applied to Medieval Swedish and Danish texts, and to late Medieval Norwegian texts.

8.2 The attribute lemma

The element <w> can be supplied with several lexicographical attributes for each word in a transcription. The attribute lemma provides the lexical form of each word based on the entries in standard dictionaries. For Medieval Norwegian and Icelandic texts we suggest that the word-list produced by the Arnamagnæan Commission's Ordbog over det norrøne prosasprog (ONP) at the University of Copenhagen is used to create the lemma base. The attribute would then be marked up as in this example, which states that the word "hefir" has "hafa" as its lemma:

<w lemma="hafa">hefir</w>

Lemmatised texts are useful for any language, and in particular for languages with complex morphology or variable orthography. The morphology of Old Norse is more complex than that of the modern Nordic languages, but not particularly difficult - it is rather like the morphology of Modern German. The orthography, however, was far from fixed, and since many transcriptions are likely to be fairly diplomatic, any lemma may be instantiated by a large number of orthographic forms. For example, the pronoun hann has only three forms in the normalised orthography of Old Norse: hann (nominative and accusative), hans (genitive), and honum (dative). In an actual transcription, however, a dozen or more forms may occur, as shown in the table below.

Form                          Lemma
hann                          hann
han<abbr>&bar;</abbr>         hann
h<abbr>&bar;</abbr>           hann
h<abbr>&bar;</abbr>n          hann
ha&scap;                      hann
hans                          hann
han&stall;                    hann
h<abbr>&bar;</abbr>s          hann
h<abbr>&bar;</abbr>&stall;    hann
honum                         hann
honom                         hann
h<abbr>&bar;</abbr>m          hann

In ch. 2.3 the use of <w> for the encoding of graphic words and information concerning their description is treated. Note the use of entities for special characters, such as "&fins;" and "&scap;", or abbreviations such as "&bar;". These are described in ch. 5 and ch. 6.

As stated in ch. 3, a text may be encoded on a single level of transcription, as exemplified with "hefir" above. If the text is transcribed on more than one level there is no need for any further attributes, since each word is contained within a single <w> element and the attribute is valid for the whole contents:

The next example is slightly more complicated since it contains an abbreviation on the <facs> level and a corresponding expansion in the <dipl> level, but the lemma attribute is unchanged:

In cases where a graphic word is included partially or completely in the element <unclear> this can be marked within the element <w> and be related to the attribute lemma.

Text included within the element <supplied> is not lemmatized. The following example shows how a character, word or phrase that has been supplied is marked with the element <w>, but without markup of the lemma as the text is not transcribed from the manuscript text.

<w>

<norm>leikti</norm>
</w>

This means that the forms that are not marked will not be included in the searchable database under the category lemma. We hereby avoid the problem of contamination between forms that are from the manuscript text and forms that have been supplied by a transcriber or encoder of the text. A basic principle is that the lemmatized text should be from the manuscript text.
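The effect of this principle on a lemma index can be sketched in Python. This is only an illustration under stated assumptions, not part of the guidelines: the wrapper element <text>, the sample data and the function name forms_by_lemma are hypothetical, and the forms are given without entities so that the sample parses without a DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: three forms of "hann", one of "hafa", and one
# word without a lemma attribute (as recommended for supplied text).
SAMPLE = """<text>
<w lemma="hann">hann</w> <w lemma="hann">hans</w>
<w lemma="hann">honom</w> <w lemma="hafa">hefir</w>
<w><norm>leikti</norm></w>
</text>"""


def forms_by_lemma(xml_text):
    """Map each lemma to the set of word forms recorded under it."""
    index = defaultdict(set)
    for w in ET.fromstring(xml_text).iter("w"):
        lemma = w.get("lemma")
        if lemma is None:
            continue  # words without a lemma stay out of the lemma index
        index[lemma].add("".join(w.itertext()).strip())
    return dict(index)


index = forms_by_lemma(SAMPLE)
print(sorted(index["hann"]))  # ['hann', 'hans', 'honom']
```

The word "leikti" is still present in the text, but since its <w> element carries no lemma attribute it never enters the index.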

8.3 The attribute pos

The attribute pos (for part of speech) adds information about the grammatical form of a word. To be able to make this analysis it is necessary to create a model which includes all possible morphological forms of each lemma. As stated above, the model is based on the morphology of Medieval Norwegian and Icelandic, as expounded in standard grammars of Old Norse or "norrønt".

We recommend a scheme in which the attribute pos contains a set of name tokens, one for each morphological category. White space separates each name token. We further recommend that the order of the name tokens should be fixed, and that there should be one specific order for each word class, as specified in ch. 8.5 below. For words with inflection, the first token specifies the word class and the following tokens the morphological categories relevant for this specific word class. Words belonging to word classes with no inflection, such as prepositions and subjunctions, will only receive a single name token for the word class itself. In addition to tokens for morphological categories such as gender, number and case, tokens for inflection class may be added.

Each name token consists of two parts. The first part specifies the category itself and is represented by a single lower-case letter. The second part specifies the value of the category and is given in one or more upper-case letters. As far as possible, mnemonic characters are used, e.g. "c" for "case" and "G" for "genitive". The name token "cG" is thus to be understood as "case: genitive" and is applicable to all words which can be inflected in genitive, such as nouns, adjectives, pronouns/determiners, numerals and verb participles.

In Old Norse, nouns are inflected for gender, number, case and species (definiteness). Below is an example of the mark-up for the word "hestum", dative plural indefinite of the masculine noun "hestr". The pos attribute opens with a name token for the word class, "xNC" for "noun, common", moving on to "gM" for "gender: masculine", "nP" for "number: plural", "cD" for "case: dative" and finally "sI" for "species: indefinite".

<w lemma="hestr" pos="xNC gM nP cD sI">hestum</w>

Prepositions, which are not inflected, will receive a much simpler encoding, consisting of a single name token, "xAP", in which "x" denotes word class and "AP" the actual class, prepositions.

<w lemma="fyrir" pos="xAP">fyrir</w>
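Because each name token has a fixed shape, a pos value can be taken apart mechanically. The following Python sketch is offered as an illustration only: the names parse_pos, CATEGORIES and NAME_TOKEN are hypothetical, and the category letters are those introduced in ch. 8.3 of this chapter. It handles a single analysis, not the alternatives discussed in ch. 8.4.

```python
import re

# Category letters as used in this chapter (ch. 8.3); the English labels
# and all Python names are illustrative assumptions.
CATEGORIES = {
    "x": "word class", "i": "inflectional class", "g": "gender",
    "n": "number", "c": "case", "s": "species", "r": "grade",
    "p": "person", "t": "tense", "m": "mood", "v": "voice",
    "f": "finiteness", "e": "enclitic", "y": "government",
}

# One lower-case category letter followed by an upper-case or numeric value.
NAME_TOKEN = re.compile(r"([a-z])([A-Z0-9]+)")


def parse_pos(pos):
    """Return {category label: value} for a single pos analysis."""
    analysis = {}
    for token in pos.split():
        match = NAME_TOKEN.fullmatch(token)
        if match is None or match.group(1) not in CATEGORIES:
            raise ValueError(f"not a valid name token: {token!r}")
        analysis[CATEGORIES[match.group(1)]] = match.group(2)
    return analysis


print(parse_pos("xNC gM nP cD sI"))
print(parse_pos("xAP"))
```

For "hestum" the result pairs "word class" with "NC", "gender" with "M", and so on, while the preposition yields a single entry for the word class.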

Old Norse has the most complex morphology of the Nordic vernaculars and is therefore a suitable starting point. For texts with less complex morphology it is simply a case of making a selection of relevant categories from the repertoire in this chapter. Cf. the discussion on zero values in ch. 8.4.3 below.

8.3.1 Invariable properties

Words in inflectional languages exhibit variable and invariable properties. Word class is the prime example of an invariable property, since a word can belong to one and only one word class - the noun "hestr" cannot be inflected in adjectival and verbal forms. For nouns, gender is an invariable property - once again, "hestr" cannot be inflected in feminine or neuter forms. Adjectives, on the other hand, are inflected in gender, so for this word class gender is a variable property. Other categories, such as case, number, grade etc., are all variable.

Information on inflectional classes can be added to the pos attribute, e.g. strong vs. weak verbs, stem classes of nouns etc. These are also invariable properties.

The name tokens will, in any case, make it clear which tokens refer to invariable properties and which refer to variable properties.

8.3.1.1 Word class

Word class is denoted by a name token consisting of the character "x" + an uppercase two-letter abbreviation for each class, including commonly recognised subclasses (such as the division between common and proper nouns). Inevitably, there will be some conflict of categorisation, especially among the pronouns and determiners. They will be discussed in ch. 8.5 below.

Name token  Word class                  Inflection
xNC         Noun, common                Yes
xNP         Noun, proper                Yes
xAJ         Adjective                   Yes
xPE         Pronoun, personal           Yes
xPQ         Pronoun, interrogative      Yes
xPI         Pronoun, indefinite         Yes
xDP         Determiner, possessive      Yes
xDD         Determiner, demonstrative   Yes
xDQ         Determiner, quantifier      Yes
xPD         Pronoun/Determiner          Yes
xNA         Numeral, cardinal           Yes
xNO         Numeral, ordinal            Yes
xVB         Verb                        Yes
xAV         Adverb                      Yes
xAT         Articles                    Yes
xAP         Preposition (adposition)    No
xCC         Conjunction, coordinating   No
xCS         Conjunction, subordinating  No
xIT         Interjection                No
xIM         Infinitive marker           No
xRP         Relative particle           No
xUA         Unassigned                  -

8.3.1.2 Inflectional class

Inflectional class is another invariable property and can usually be derived from a combination of the lemma and the word class. Thus, the lemma "fara" belonging to the word class "xVB" (verbs) will be classified as being a strong verb of the 6th class, according to most grammars of Old Norse. This is information which might be found in a dictionary or a lexicographical database of Old Norse.

If the encoder wishes to include information on the inflectional class we recommend that this be done by adding to the pos attribute a name token consisting of the lowercase character "i" + an uppercase abbreviation for each class. The table below contains examples for the verb class, but can easily be extended to other classes. Incidentally, the distinction between strong and weak inflection also applies to nouns.

Name token  Inflectional class
iST         Strong
iWK         Weak
iRD         Reduplicating
iPP         Preterito-Presentic
etc.

8.3.2 Variable properties

The list of variable properties is rather long for an inflectional language such as Old Norse. Note that the very first category in this list, gender, is a borderline case, since it is an invariable (inherent) property for nouns. For other word classes, such as adjectives, pronouns/determiners, numerals, articles and verb participles, it is a variable property. The remaining categories are variable.

8.3.2.1 Gender

This category applies to nouns, adjectives, pronouns/determiners, numerals and verb participles. Gender is denoted by a name token consisting of the lowercase character "g" + an uppercase abbreviation for each gender. The character "U" indicates unspecified cases.

Name token  Value
gM          Masculine
gF          Feminine
gN          Neuter
gU          Unspecified

Some nouns may have two genders, e.g. hungr 'hunger', which is either masculine or neuter. For words of this type we suggest using name tokens with more than one value, "gMF", "gMN" and "gFN".

Name token  Value
gMF         Masculine or Feminine
gMN         Masculine or Neuter
gFN         Feminine or Neuter

For words with three possible genders, we suggest using the character "U", meaning that the value is unspecified, "gU".
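A search tool has to decide which gender tokens count as hits for a given gender. The Python sketch below shows one possible reading, offered as an assumption rather than a rule: a token with two values matches either gender, while "gU" asserts no gender at all and therefore matches none (in line with the remarks on search behaviour in ch. 8.4.3). The function names possible_genders and may_be are hypothetical.

```python
GENDERS = set("MFN")


def possible_genders(token):
    """Return the set of genders that a gender name token allows."""
    if len(token) < 2 or token[0] != "g":
        raise ValueError(f"not a gender token: {token!r}")
    if token == "gU":
        return set()  # unspecified: no particular gender is asserted
    values = set(token[1:])
    if not values <= GENDERS:
        raise ValueError(f"unknown gender value in {token!r}")
    return values


def may_be(token, gender):
    """True if the token lists the given gender (M, F or N) as possible."""
    return gender in possible_genders(token)


print(sorted(possible_genders("gMN")))  # ['M', 'N']
print(may_be("gU", "M"))                # False
```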

8.3.2.2 Number

This category applies to nouns, adjectives, pronouns/determiners and verbs. Number is denoted by a name token consisting of the lowercase character "n" + an uppercase abbreviation for each number. The dual form occurs only in the inflection of personal pronouns. The character "U" indicates unspecified cases.

Name token  Value
nS          Singular
nD          Dual
nP          Plural
nU          Unspecified

8.3.2.3 Case

This category applies to nouns, adjectives, pronouns/determiners and numerals. Case is denoted by a name token consisting of the lowercase character "c" + an uppercase abbreviation for each case. The character "U" indicates unspecified cases.

Name token  Value
cN          Nominative
cG          Genitive
cD          Dative
cA          Accusative
cU          Unspecified

8.3.2.4 Species

This category applies to nouns and adjectives. Species (or definiteness) is denoted by a name token consisting of the lowercase character "s" + an uppercase abbreviation for each type of species. The character "U" indicates unspecified cases.

In Old Norse, nouns and adjectives can have either indefinite or definite forms, e.g. "hestr" (indefinite noun) vs. "hestrinn" (definite noun) or "hvítr [hestr]" (indefinite adjective) vs. "[inn] hvíti [hestr]" (definite adjective).

Name token  Value
sI          Indefinite
sD          Definite
sU          Unspecified

8.3.2.5 Grade

This category applies to adjectives and adverbs. Grade is denoted by a name token consisting of the lowercase character "r" + an uppercase abbreviation for each grade. The character "U" indicates unspecified cases.

Memory hint: since the character "c" has been reserved for "case", the character "r" can be interpreted as "relative", which refers to an aspect of the category of grade.

Name token  Value
rP          Positive
rC          Comparative
rS          Superlative
rU          Unspecified

8.3.2.6 Person

This category applies to verbs and some of the pronouns. Person is denoted by a name token consisting of the lowercase character "p" + an uppercase abbreviation for each person. The character "U" indicates unspecified cases.

Name token  Value
p1          1. person
p2          2. person
p3          3. person
pU          Unspecified

8.3.2.7 Tense

This category applies only to verbs. Tense is denoted by a name token consisting of the lowercase character "t" + an uppercase abbreviation for each tense. The character "U" indicates unspecified cases.

Name token  Value
tPS         Present
tPT         Preterite
tU          Unspecified

8.3.2.8 Mood

This category applies only to verbs. Mood is denoted by a name token consisting of the lowercase character "m" + an uppercase abbreviation for each mood. The character "U" indicates unspecified cases.

Name token  Value
mIN         Indicative
mSU         Subjunctive
mIP         Imperative
mU          Unspecified

8.3.2.9 Voice

This category applies only to verbs. Voice is denoted by a name token consisting of the lowercase character "v" + an uppercase abbreviation for each type of voice. The character "U" indicates unspecified cases.

Name token  Value
vA          Active
vR          Reflexive
vU          Unspecified

8.3.2.10 Finiteness

This category applies only to verbs. Finiteness is denoted by a name token consisting of the lowercase character "f" + an uppercase abbreviation for each type of finiteness. The character "U" indicates unspecified cases.

Name token  Value
fF          Finite
fI          Infinitive (non-finite)
fP          Participle (non-finite)
fU          Unspecified

8.3.2.11 Enclitics

Personal pronouns may be attached to finite verbs, e.g. "emk" for "em ek" or "fórtu" for "fórt þú". From a morphological point of view, this process is similar to the suffixation in definite noun forms, e.g. "hestr + inn" = "hestrinn", or reflexive verb forms, e.g. "kalla + sk" = "kallask". However, it may be argued that the enclitic pronoun retains its character as a word to a larger extent than the suffixed determiner "inn" or the reflexive pronoun "s(i)k". For this reason, we suggest that enclitic forms are encoded as a sequence of a verb and a pronoun, e.g.

<w>em</w><w>k</w>
<w>fort</w><w>u</w>

In lemmatised version:

<w lemma="vera">em</w><w lemma="ek">k</w>
<w lemma="fara">fort</w><w lemma="þú">u</w>

Note the lack of white space between the two word elements, indicating that the two words are written with no space in between. Cf. the discussion in ch. 2.3 above.
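Since the absence of white space between two <w> elements is meaningful, the written (graphic) words can be recovered from the encoding. The Python sketch below illustrates this under stated assumptions: the wrapper element <s> and the function name graphic_words are hypothetical, and the text between <w> elements is assumed to consist of white space only.

```python
import xml.etree.ElementTree as ET


def graphic_words(xml_text):
    """Join adjacent <w> elements that have no white space between them."""
    words, current = [], []
    for w in ET.fromstring(xml_text).iter("w"):
        current.append("".join(w.itertext()))
        # w.tail is the text between this element and the next one;
        # leading white space there closes the current graphic word.
        if w.tail and w.tail[0].isspace():
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


SAMPLE = (
    '<s><w lemma="vera">em</w><w lemma="ek">k</w> '
    '<w lemma="fara">fort</w><w lemma="þú">u</w></s>'
)
print(graphic_words(SAMPLE))  # ['emk', 'fortu']
```

Each graphic word is thus still lemmatised as two grammatical words, while the written form remains recoverable.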

Also note that the segmentation in some cases is open to discussion. Thus, the "t" in "fortu" may be seen as part of the verb form or as part of the pronoun. From a phonological point of view, it is an assimilation product of the final "t" in the verb and the initial "þ" in the pronoun. It is therefore useful to supply these verb and pronoun forms with a marker for enclitication. We suggest a name token "eE" for this purpose, to be used in the pos attribute of both words:

Name token  Value
eE          Enclitic pronoun

This category is only relevant for combinations of a verb and an enclitic pronoun. In all other cases, the name token is simply not used.

8.3.2.12 Government

In the Old Norwegian lemmatised corpus, prepositions are encoded for the case which they govern. Thus "fyrir" in the phrase. This is valuable syntactic information, but it is not really a morphological category. We therefore recommend that prepositions, which have no inflection in Old Norse (or possibly in any other language), are only encoded for word class in the pos attribute, "xAP".

However, to accommodate the information provided in the Old Norwegian lemmatised corpus without introducing attributes for syntactic categories we suggest using a name token for government, consisting of the lowercase character "y" + an uppercase abbreviation for each type of case government. This category would apply to prepositions, verbs and some adjectives.

Name token  Value
yG          Governing Genitive
yD          Governing Dative
yA          Governing Accusative
yU          Unspecified government

8.4 Homography and zero values

Two or more words sometimes have the same spelling, but different meaning. This is usually referred to as homography and it is a basic problem for all morphological analysis. We shall distinguish between two types of homography, external and internal. The first case must be handled by the lemma attribute, the second by the pos attribute.

For the discussion in this chapter, we shall adopt the distinction between word form, grammatical form and lemma (lexeme). The word form is the word as it is spelt in the text, whether normalised or unnormalised. The grammatical form is a specific morphological value of the word, referred to by the attribute pos. The lemma is the common denominator for all of these forms, typically given as a dictionary entry and referred to by the attribute lemma.

8.4.1. External homography

External homography means that one grammatical word can be mapped onto two or more lemmata. In some cases the alternative lemmata are different words from a semantic and etymological point of view, such as the feminine noun þýða 'friendship' in nominative singular and the verb þýða 'interpret' in infinitive. In all but a few cases, a semantic analysis will disambiguate these forms.

In other cases it is a question of related words with variant forms, such as the neuter nouns líf and lífi. In dative singular they happen to have the same form, lífi:

Lemma  Word form  Grammatical form
líf    lífi       xNC gN nS cD sI
lífi   lífi       xNC gN nS cD sI

For this case of external homography we recommend encoding each of the possible lemmata in full, using the vertical bar, "|", as delimiter (for the sake of simplicity we are using "í" rather than "&iacute;"):

... <w lemma="líf | lífi" pos="xNC gN nS cD sI">lifi</w> ...

A search engine would be able to pick out both "líf" and "lífi" as possible lemmata for "lífi", and also to keep this example separate from unambiguous ones, such as the genitive "lífs", which can only be mapped to the lemma "líf", or the nominative "lífi" which can only be mapped to the lemma "lífi".
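The search behaviour described here can be sketched in Python. The sketch is an assumption about how such a search tool might work, not a description of an existing one, and the function names lemma_alternatives and lemma_matches are hypothetical.

```python
def lemma_alternatives(lemma):
    """Split a lemma attribute value on the vertical bar."""
    return [part.strip() for part in lemma.split("|")]


def lemma_matches(lemma, wanted, unambiguous_only=False):
    """True if 'wanted' is one of the lemmata listed in the attribute.

    With unambiguous_only=True, values listing several lemmata are skipped.
    """
    alternatives = lemma_alternatives(lemma)
    if unambiguous_only and len(alternatives) > 1:
        return False
    return wanted in alternatives


print(lemma_alternatives("líf | lífi"))                           # ['líf', 'lífi']
print(lemma_matches("líf | lífi", "lífi"))                        # True
print(lemma_matches("líf | lífi", "lífi", unambiguous_only=True)) # False
```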

8.4.2 Internal homography

Internal homography means that one word form can be mapped onto two or more grammatical words. This is often referred to as syncretism, and is frequently found in many languages, typically as the result of linguistic development. The levelling of the morphological system in Medieval Nordic (except Icelandic) produced a large amount of syncretism.

The feminine noun kona is a case in point. It has the same form, konu, in all three non-nominative (oblique) cases in singular:

Lemma  Word form  Grammatical form
kona   kona       xNC gF nS cN sI
       konu       xNC gF nS cG sI
                  xNC gF nS cD sI
                  xNC gF nS cA sI

The encoder may choose to see these forms as syncretistic and simply encode case as unspecified for this word, using the value "U":

<w lemma="kona" pos="xNC gF nS cU sI">konu</w>

This encoding entails that the word kona has case as a relevant category, but that the exact value has not been determined by the encoder. A search engine would be able to list the form as an example of e.g. a feminine noun in singular, but not as an example of a feminine noun in dative.

In most cases, however, a syntactic or semantic analysis will yield a unique result. For example, in the phrase "til konu" the word form konu would be analysed as genitive since the preposition til only governs this particular case:

<w lemma="til" pos="xAP">til</w> <w lemma="kona" pos="xNC gF nS cG sI">konu</w>

In another phrase, e.g. "fyrir konu", the encoder might not be willing to make a definitive choice, since the preposition fyrir governs both accusative and dative. The encoder might then choose to list both alternatives in the pos attribute, using the vertical bar as a delimiter:

<w lemma="fyrir" pos="xAP">fyrir</w> <w lemma="kona" pos="xNC gF nS cA sI | xNC gF nS cD sI">konu</w>

A search engine would be able to pick out this instance of "kona" as a possible example of accusative and of dative. The presence of the delimiter would also make it possible to identify this as an instance of syncretism, so that this example would not be counted among the unambiguous examples of either accusative or dative. Note that the order of the alternatives is arbitrary; the encoding above does not imply that accusative is more likely than dative.
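A Python sketch of this behaviour follows. It is an illustration under assumptions, with hypothetical function names pos_alternatives and category_values: the pos value is split on the vertical bar, and for a given category letter the values from all alternatives are collected. More than one value signals an ambiguous analysis.

```python
def pos_alternatives(pos):
    """Split a pos value on "|" into lists of name tokens."""
    return [alternative.split() for alternative in pos.split("|")]


def category_values(pos, letter):
    """Collect the values one category takes across all alternatives."""
    return {
        token[1:]
        for alternative in pos_alternatives(pos)
        for token in alternative
        if token[0] == letter
    }


konu = "xNC gF nS cA sI | xNC gF nS cD sI"
print(sorted(category_values(konu, "c")))  # ['A', 'D']
print(sorted(category_values(konu, "n")))  # ['S']
```

Because the result is a set, the order of the alternatives in the attribute plays no role, in keeping with the remark that this order is arbitrary.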

Finally, it should be pointed out that it is a moot question whether konu should be seen as a single word form, or as three homographic word forms representing three distinct grammatical forms, konu-GEN, konu-DAT and konu-ACC. The answer to this question depends on the morphological analysis of the linguistic stage in question. One might possibly claim, for example, that in Medieval Norwegian case is a relevant distinction to make for all nouns, but that in Late Medieval Norwegian the case distinction has collapsed, and that the lemma kona only has two grammatical forms, the nominative kona and the non-nominative (oblique) konu.

8.4.3 Zero values

We believe it is convenient to distinguish between two types of zero values in morphological encoding, not applicable and not specified.

(a) Not applicable

No words have the complete set of morphological categories listed in 8.3 above. For example, although verb participles belong to the verb class, they are not inflected for mood. There is no need to encode participles for "mood:zero" - it is sufficient to leave out the name token for mood. In other words, the absence of the name token implies that mood is not a relevant category for the word in question.

(b) Not specified

In other cases, a word is inflected for a certain category, but the encoder is not able to specify a value. This may be the case with some proper nouns, for which no gender can be given. This is a different type of "zero" value, and we therefore suggest indicating these cases with the character "U", to be read as "unspecified". An example:

<w lemma="Byblos" pos="xNP gU">Byblos</w>

This encoding entails that the word in question is a noun and that it does have a gender (it is thus not a case of non-applicability), but that the encoder does not know which gender that would be.

Another example: In Old Norse, there is no gender distinction in genitive or dative plural of any adjective or determiner. It is possible to encode adjectives and determiners for gender based on concord with a noun (if there happens to be one), so that in a genitive plural phrase like "spakra manna" the adjective "spakra" might be ascribed masculine gender on the basis of the noun maðr, which is masculine. From experience, we know that this is time-consuming and not really informative encoding. A less specified option would be to use the character "U" to indicate non-specification:

<w lemma="spakr" pos="xAJ rP gU nP cG sI">spakra</w>

A search engine would be able to pick out spakra as an example of an adjective in genitive plural, but not as an adjective in masculine (or feminine, or neuter) gender.
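The difference between the two zero values can be made explicit in a small Python sketch. It is an illustration only, the function name category_status is hypothetical, and it assumes a pos value with a single analysis (no vertical bar).

```python
def category_status(pos, letter):
    """Report the value of one category in a single pos analysis.

    Returns "not applicable" when no name token for the category is
    present, "unspecified" when the value is "U", and otherwise the value.
    """
    for token in pos.split():
        if token[0] == letter:
            return "unspecified" if token[1:] == "U" else token[1:]
    return "not applicable"


print(category_status("xNP gU", "g"))  # unspecified
print(category_status("xNP gU", "m"))  # not applicable
```

In the first call gender is relevant but unknown; in the second, mood is simply not a category of the word in question.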

8.5 Word classes

This chapter contains examples of encoding for each word class. We strongly recommend a fixed order of name tokens for each class, beginning with the name token for the word class itself. Note, however, that non-relevant categories can simply be left out, as recommended in ch. 8.4.3 above. Thus, for late Medieval texts the encoding of many word classes may be shorter than the one exemplified here.
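A fixed order is easy to check mechanically. The Python sketch below is a minimal illustration under stated assumptions: the table ORDER is hypothetical and partial, covering only nouns, adjectives and personal pronouns with the orders given in ch. 8.5.1 to 8.5.3.1, and the function name is_well_ordered is likewise an assumption. Categories may be left out, but those present must follow the prescribed order.

```python
# Partial, illustrative table: category letters in their fixed order.
ORDER = {
    "NC": "xgncs", "NP": "xgncs",  # nouns: gender, number, case, species
    "AJ": "xrgncs",                # adjectives: grade first
    "PE": "xpgnc",                 # personal pronouns: person first
}


def is_well_ordered(pos):
    """True if the name tokens follow the fixed order for their word class."""
    tokens = pos.split()
    if not tokens or tokens[0][0] != "x":
        return False  # the word class token must come first
    word_class = tokens[0][1:]
    if word_class not in ORDER:
        raise ValueError(f"no order recorded for word class {word_class!r}")
    remaining = iter(ORDER[word_class])
    # Each test consumes the iterator up to the letter found, so the
    # letters must appear as a subsequence of the prescribed order.
    return all(token[0] in remaining for token in tokens)


print(is_well_ordered("xNC gM nS cA sI"))  # True
print(is_well_ordered("xNC nS gM cA sI"))  # False
print(is_well_ordered("xPE p1 nD cN"))     # True
```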

8.5.1 Nouns (NC and NP)

Nouns are divided into two subgroups, common nouns (xNC) and proper nouns (xNP). They are further encoded for gender, number, case and species.

Example: Encoding of the noun "ymr" in the phrase "þá heyrðu þeir ym mikinn ok gny":

<w lemma="ymr" pos="xNC gM nS cA sI">ym</w>

Word class  Gender  Number  Case  Species
xNC         gM      nS      cN    sI
xNP         gF      nP      cG    sD
            gN      nU      cD    sU
            gU              cA
                            cU

Possibly, a separate name token for oblique case, "cO", might be added. The concept of the oblique case covers all non-nominative cases, i.e. genitive, dative and accusative.

8.5.2 Adjectives (AJ)

Adjectives are encoded for grade, gender, number, case and species.

Example: Encoding of the adjective "langr" in the phrase "seint er um langan veg at spyrja tíðenda":

<w lemma="langr" pos="xAJ rP gM nS cA sI">langan</w>

Word class  Grade  Gender  Number  Case  Species
xAJ         rP     gM      nS      cN    sI
            rC     gF      nP      cG    sD
            rS     gN      nU      cD    sU
            rU     gU              cA
                                   cU

Note that in the comparative form, adjectives only have weak (definite) inflection. Nevertheless, we recommend that they are encoded for species, "sI", throughout. Also note that some adjectives have defective comparison, but we still recommend that they are encoded for grade.

8.5.3 Pronouns proper (PE, PQ and PI)

In recent grammars the traditional category pronoun is usually divided into pronouns in a strict sense (words replacing a noun) and determiners (adjunct words), and that is our recommendation as well, cf. ch. 8.5.3 and 8.5.4 below. However, in some projects (e.g. the Old Norwegian lemmatised corpus) there is only a single category pronoun, and we have therefore added in ch. 8.5.5 a combined category, pronouns and determiners.

Although pronouns in the strict sense of "words replacing a noun" form a smaller category than the traditional one, there are nonetheless three distinct sub-categories. In the following these are treated separately to provide an overview.

8.5.3.1 Personal pronouns (PE)

Personal pronouns are encoded for person, gender, number and case. Note that only personal pronouns in 3. person have a gender distinction; for pronouns in 1. and 2. person this category is simply left out.

Example: Encoding of the personal pronoun "vit" in the phrase "vit erum fegnir" (leaving out the gender category):

<w lemma="vit" pos="xPE p1 nD cN">vit</w>

Word class  Person  Gender  Number  Case
xPE         p1      gM      nS      cN
            p2      gF      nD      cG
            p3      gN      nP      cD
            pU      gU      nU      cA
                                    cU

8.5.3.2 Interrogative pronouns (PQ)

Interrogative pronouns are encoded for gender, number and case. Memory hint: in the name token "xPQ" the last character stands for "question".

Example: Encoding of the interrogative pronoun "hverr" in the phrase "Frigg spurði hverr sá væri með ásum":

<w lemma="hverr" pos="xPQ gM nS cN">hverr</w>

Word class  Gender  Number  Case
xPQ         gM      nS      cN
            gF      nD      cG
            gN      nP      cD
            gU      nU      cA
                            cU

8.5.3.3 Indefinite pronouns (PI)

Indefinite pronouns are encoded for gender, number and case.

Example: Encoding of the indefinite pronoun "einnhverr" in the phrase "vill hann taka til at þreyta drykkju við einhvern mann":

<w lemma="einnhverr" pos="xPI gM nS cA">einhvern</w>

Word class  Gender  Number  Case
xPI         gM      nS      cN
            gF      nD      cG
            gN      nP      cD
            gU      nU      cA
                            cU

8.5.4 Determiners (DP, DD and DQ)

The contents of the word class determiners vary between languages and grammars. In the present analysis, determiners comprise a large part of the traditional word class pronouns (as defined in many grammars of Old Norse) with the exception of pronouns proper. Determiners have three subcategories: possessives, demonstratives and quantifiers.

Note that articles and numerals are often analysed as determiners, but these traditional classes have been retained here.

8.5.4.1 Possessives (DP)

Possessives are encoded for gender, number and case.

Example: Encoding of the possessive "sinn" in the phrase "hann hugðisk þá at reyna afl sitt":

<w lemma="sinn" pos="xDP gN nS cA">sitt</w>

Word class  Gender  Number  Case
xDP         gM      nS      cN
            gF      nD      cG
            gN      nP      cD
            gU      nU      cA
                            cU

8.5.4.2 Demonstratives (DD)

Demonstratives are encoded for gender, number and case.

Example: Encoding of the demonstrative "hinn" in the phrase "hitt fjall er hátt":

<w lemma="hinn" pos="xDD gN nS cN">hitt</w>

Word class  Gender  Number  Case
xDD         gM      nS      cN
            gF      nD      cG
            gN      nP      cD
            gU      nU      cA
                            cU

8.5.4.3 Quantifiers (DQ)

Quantifiers are encoded for gender, number and case. This category may overlap with Indefinite pronouns.

Example: Encoding of the quantifier "margr" in the phrase "mart folk hefir komit hér":

<w lemma="margr" pos="xDQ gN nS cN">mart</w>

Word class  Gender  Number  Case
xDQ         gM      nS      cN
            gF      nD      cG
            gN      nP      cD
            gU      nU      cA
                            cU

8.5.5 Pronouns/determiners (PD)

This is the traditional category of pronoun, as defined in the grammars of e.g. Noreen 1923 and Iversen 1973. From an inflectional point of view this is a heterogeneous category, but since it has been used in much lexicographical work, it is given here as an alternative to the two classes pronouns proper (8.5.3) and determiners (8.5.4).

Pronouns/determiners are encoded for person (only personal pronouns), gender, number and case.

Example: Encoding of the pronoun "engi" in the phrase "ormrinn er slœgari en ekki annat kvikendi" (no name token for person, since this category is not relevant):

<w lemma="engi" pos="xPD gN nS cN">ekki</w>

Word class  Person  Gender  Number  Case
xPD         p1      gM      nS      cN
            p2      gF      nD      cG
            p3      gN      nP      cD
            pU      gU      nU      cA
                                    cU

8.5.6 Numerals (NA and NO)

The numerals are divided into two sub-categories: cardinals (NA) and ordinals (NO). The character U is used for "unspecified", so that "xNU" comprises both cardinal and ordinal numerals. That happens to be the case for the Old Norwegian lemmatised corpus.

Numerals are encoded for gender (only the cardinals 1-4), number (only ordinals), case, and species (only relevant for the numerals einn, fyrstr, and annarr). Memory hint: since the obvious candidate "NC" for "numeral, cardinal" has been reserved for "nouns, common", the character "A" in "NA" can be seen as referring to the vowel "a" which occurs two times in the word "cardinal".

The numerals hundrað 'one hundred (and twenty)' and þúsund 'one thousand (two hundred)' are treated as nouns.

Example: Encoding of the numeral "sjaundi" in the phrase "in sjaunda borg":

<w lemma="sjaundi" pos="xNO gF nS cN sD">sjaunda</w>

Word class  Gender  Number  Case  Species
xNA         gM      nS      cN    sI
xNO         gF      nP      cG    sD
xNU         gN      nU      cD    sU
            gU              cA
                            cU

8.5.7 Articles (AT)

In recent grammars the traditional word class articles is usually classified as part of the word class determiners. However, in some projects (e.g. the Old Norwegian lemmatised corpus) articles are treated as a separate class, and we suggest that as an alternative they may be classified as such.

Articles are encoded for gender, number, case, and species.

Example: Encoding of the article "einn" in the phrase "ein kona":

<w lemma="einn" pos="xAT gF nS cN sI">ein</w>

Word class	Gender	Number	Case	Species
xAT	gM gF gN gU	nS nP nU	cN cG cD cA cU	sI sD sU
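
Once a text has been encoded in this way, the lemma and the analysis can be read back from each <w> element with any XML parser. A minimal sketch using Python's standard library follows; the function name read_word is hypothetical, and the sketch assumes a single well-formed <w> element without namespaces or entity references, which a real transcription will usually contain.

```python
import xml.etree.ElementTree as ET


def read_word(fragment):
    """Return the form, lemma and name tokens of a single <w> element."""
    w = ET.fromstring(fragment)
    return {
        "form": w.text,
        "lemma": w.get("lemma"),
        "pos": w.get("pos", "").split(),  # empty list if pos is missing
    }


record = read_word('<w lemma="einn" pos="xAT gF nS cN sI">ein</w>')
print(record["lemma"])  # einn
print(record["pos"])    # ['xAT', 'gF', 'nS', 'cN', 'sI']
```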

8.5.8 Verbs (VB)

Verbs are either finite or infinite. In the former category, they are inflected for tense, mood, person, number and voice. In the latter category, participles are basically inflected as adjectives, while infinitives have a very restricted inflection. For practical reasons, we recommend that finite and infinite forms are treated separately.

8.5.8.1 Finite forms

Finite verbs are encoded for tense, mood, person, number, and voice. Optionally, verbs may be encoded for inflectional class. This may prove practical since Old Norse has some "pair verbs" with identical lemma forms, such as the strong verb brenna and the weak verb brenna. In the Old Norwegian lemmatised corpus, verbs are divided into four inflectional classes, as exemplified in the table below.

Example: Encoding of the verb "taldi" in the phrase "hon taldi" (leaving out inflectional class):

<w lemma="telja" pos="xVB fF tPT mIN p3 nS vA">taldi</w>

Word class	Finiteness	Tense	Mood	Person	Number	Voice	Infl. class
xVB	fF	tPS tPT tU	mIN mSU mIP mU	p1 p2 p3 pU	nS nP nU	vA vR vU	iST iWK iRD iPP iU
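
The table above can be used directly to check finite verb encodings. The sketch below is one possible way of doing so; the names FINITE_VERB and check_finite_verb are hypothetical, and treating every category except inflectional class as obligatory is an assumption of this sketch rather than a rule stated in the guidelines.

```python
# (prefix, allowed values, required?) for each column of the finite verb table.
FINITE_VERB = [
    ("x", {"VB"}, True),
    ("f", {"F"}, True),
    ("t", {"PS", "PT", "U"}, True),
    ("m", {"IN", "SU", "IP", "U"}, True),
    ("p", {"1", "2", "3", "U"}, True),
    ("n", {"S", "P", "U"}, True),
    ("v", {"A", "R", "U"}, True),
    ("i", {"ST", "WK", "RD", "PP", "U"}, False),  # optional inflectional class
]


def check_finite_verb(pos):
    """Return a list of problems; an empty list means the value fits the table."""
    tokens = {token[0]: token[1:] for token in pos.split()}
    problems = []
    for prefix, allowed, required in FINITE_VERB:
        value = tokens.pop(prefix, None)
        if value is None:
            if required:
                problems.append(f"missing {prefix}")
        elif value not in allowed:
            problems.append(f"bad value {prefix}{value}")
    # Anything left over belongs to a category not used for finite verbs.
    problems.extend(f"unexpected {p}{v}" for p, v in tokens.items())
    return problems


print(check_finite_verb("xVB fF tPT mIN p3 nS vA"))  # []
print(check_finite_verb("xVB fF tPT mIN p3 nD vA"))  # ['bad value nD']
```

With the optional inflectional class, the two verbs brenna can be kept apart by ending the pos value in iST or iWK respectively; both values pass the check above.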

8.5.8.2 Infinite forms

Infinite forms are either participles or infinitives, and may be distinguished by the name token finiteness with "fP" for participles and "fI" for infinitives.

(a) Participles

Participles are inflected for the verbal categories tense and voice (the latter only in the supine), and for the nominal categories gender, number, case and species. Optionally, participles may be encoded for inflectional class.

Note that present participles only have weak (definite) declension. Preterite (perfect) participles usually have strong (indefinite) declension, but may sometimes occur with weak (definite) forms. Voice is only relevant for the supine, cf. e.g. "hann hefir kallat" vs. "hann hefir kallazk".

Example: Encoding of the verb "koma" in the phrase "hann er kominn":

<w lemma="koma" pos="xVB fP tPT gM nS cN sI">kominn</w>

Word class	Finiteness	Tense	Voice	Gender	Number	Case	Species	Infl. class
xVB	fP	tPS tPT tU	vA vR vU	gM gF gN gU	nS nP nU	cN cG cD cA cU	sI sD sU	iST iWK iRD iPP iU
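
When the analysis is produced by a program rather than typed by hand, the <w> element can be generated from its parts. The following sketch builds the markup for the participle kominn in the phrase "hann er kominn"; the function name make_w is hypothetical, and the sketch relies only on Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def make_w(form, lemma, tokens):
    """Serialise one word as a <w> element with lemma and pos attributes."""
    w = ET.Element("w", {"lemma": lemma, "pos": " ".join(tokens)})
    w.text = form
    return ET.tostring(w, encoding="unicode")


print(make_w("kominn", "koma", ["xVB", "fP", "tPT", "gM", "nS", "cN", "sI"]))
# <w lemma="koma" pos="xVB fP tPT gM nS cN sI">kominn</w>
```

Letting the serialiser write the element means that any special characters in the word form are escaped correctly.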

(b) Infinitives

Infinitives are inflected only for the verbal categories tense and voice, and tense only applies to three verbs, munu, skulu and vilja (which have preterite forms). Optionally, infinitives may be encoded for inflectional class.

Example: Encoding of the verb "fara" in the phrase "hann mun fara" (with optional information on inflectional class):

<w lemma="fara" pos="xVB fI vA iST">fara</w>

Word class	Finiteness	Tense	Voice	Infl. class
xVB	fI	tPS tPT tU	vA vR vU	iST iWK iRD iPP iU
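
Since finite forms, participles and infinitives are encoded for different categories, the finiteness token decides which table applies to a given verb. A minimal sketch of that lookup follows; the names CATEGORIES and verb_categories are hypothetical, and the optional inflectional class is left out of the lists.

```python
# Categories encoded for each kind of verb form, following 8.5.8.1 and 8.5.8.2.
CATEGORIES = {
    "fF": ["tense", "mood", "person", "number", "voice"],
    "fP": ["tense", "voice", "gender", "number", "case", "species"],
    "fI": ["tense", "voice"],
}


def verb_categories(pos):
    """Return the categories that apply to a verb, based on its finiteness token."""
    tokens = pos.split()
    if "xVB" not in tokens:
        raise ValueError(f"not a verb: {pos!r}")
    for token in tokens:
        if token in CATEGORIES:
            return CATEGORIES[token]
    raise ValueError(f"no finiteness token in {pos!r}")


print(verb_categories("xVB fI vA iST"))  # ['tense', 'voice']
```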

8.5.9 Adverbs (AV)

Adverbs are only encoded for grade.

Example: Encoding of the adverb "sterkliga" in the phrase "hann svaf ok hraut sterkliga":

<w lemma="sterkliga" pos="xAV rP">sterkliga</w>

Word class	Grade
xAV	rP rC rS rU

Note that some adverbs have defective comparison, but we still recommend that they are encoded for grade.

8.5.10 Prepositions (AP)

Prepositions are not inflected and are only encoded for word class, xAP. This token is an abbreviation for "adposition", which is the superordinate term for "preposition" and "postposition" (the latter found in e.g. Japanese, but not in the Nordic languages).

Example: Encoding of the preposition "at" in the phrase "koma þeir at kveldi til eins búanda":

<w lemma="at" pos="xAP">at</w>

Word class

xAP

As stated in 8.3.2.12 above, prepositions in the Old Norwegian lemmatised corpus are encoded for the case they govern. Using the name token "y" + case, the example above would receive this encoding:

<w lemma="at" pos="xAP yD">at</w>

Word class	Government
xAP	yN yG yD yA yU
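
Because the government token is optional, a program reading prepositions has to cope with both encodings. The sketch below returns the governed case when a "y" token is present and None otherwise; the names CASES and governed_case are hypothetical.

```python
# Case names for the value part of the "y" token.
CASES = {"N": "nominative", "G": "genitive", "D": "dative",
         "A": "accusative", "U": "unspecified"}


def governed_case(pos):
    """Return the case governed by a preposition, or None if it is not encoded."""
    tokens = pos.split()
    if not tokens or tokens[0] != "xAP":
        raise ValueError(f"not a preposition: {pos!r}")
    for token in tokens[1:]:
        if token.startswith("y"):
            return CASES[token[1:]]
    return None


print(governed_case("xAP yD"))  # dative
print(governed_case("xAP"))     # None
```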

8.5.11 Conjunctions and subjunctions (CC and CS)

In recent grammars, the traditional word class conjunctions is usually divided into two separate classes, conjunctions (e.g. "ok", "en") and subjunctions (e.g. "at", "ef"). The former category connects phrases on the same syntactical level, while the latter category typically introduces clauses. In traditional terminology, this is reflected in the subdivision of conjunctions into coordinating and subordinating. We recommend making a distinction between conjunctions proper = coordinating conjunctions (xCC) and subjunctions = subordinating conjunctions (xCS).

However, in some schemes (e.g. the Old Norwegian lemmatised corpus) only a single word class conjunctions is recognised. In that case, the word class may be designated "xCU" using the character "U" for "unspecified".

Example: Encoding of the conjunction "ok" in the phrase "Logi hafði etit slátr allt ok beinin með":

<w lemma="ok" pos="xCC">ok</w>

Example: Encoding of the subjunction "at" in the phrase "hon sagði at Baldr hafði þar riðit":

<w lemma="at" pos="xCS">at</w>

Word class

xCC
xCS
xCU
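
A text encoded with the two classes xCC and xCS can be mapped mechanically onto the single-class scheme, whereas the opposite direction requires a new analysis of each word. A minimal sketch of the mapping, using the hypothetical function name collapse_conjunctions:

```python
def collapse_conjunctions(pos):
    """Replace xCC and xCS by the unspecified class xCU; leave other tokens as they are."""
    return " ".join(
        "xCU" if token in ("xCC", "xCS") else token
        for token in pos.split()
    )


print(collapse_conjunctions("xCC"))     # xCU
print(collapse_conjunctions("xCS"))     # xCU
print(collapse_conjunctions("xAP yD"))  # xAP yD
```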

8.5.12 Interjections (IT)

Interjections are not inflected and only marked for word class, xIT.

Word class

xIT

8.5.13 Infinitive marker (IM)

The infinitive marker is not inflected and is encoded as xIM. In Old Norse it usually has the form at.

Word class

xIM

8.5.14 Relative particle (RP)

The relative particle is not inflected and only marked as xRP. In Old Norse it usually has the form er or sem. Some grammarians would classify the relative particle as a subjunction, while others tend to look upon it as a pronoun.

Word class

xRP

8.5.15 Unassigned (UA)

Some words are corrupt, difficult to analyse, belong to another language or are indeterminate for other reasons. These words are marked as unassigned, xUA.

Word class

xUA


Created 25 August 2003.


Form	Lemma
hann	hann
han<abbr>&bar;</abbr>	hann
h<abbr>&bar;</abbr>	hann
h<abbr>&bar;</abbr>n	hann
ha&scap;	hann
hans	hann
han&stall;	hann
h<abbr>&bar;</abbr>s	hann
h<abbr>&bar;</abbr>&stall;	hann
honum	hann
honom	hann
h<abbr>&bar;</abbr>m	hann

Name token	Word class	Inflection
xNC	Noun, common	Yes
xNP	Noun, proper	Yes
xAJ	Adjective	Yes
xPE	Pronoun, personal	Yes
xPQ	Pronoun, interrogative	Yes
xPI	Pronoun, indefinite	Yes
xDP	Determiner, possessive	Yes
xDD	Determiner, demonstrative	Yes
xDQ	Determiner, quantifier	Yes
xPD	Pronoun/Determiner	Yes
xNA	Numeral, cardinal	Yes
xNO	Numeral, ordinal	Yes
xVB	Verb	Yes
xAV	Adverb	Yes
xAT	Articles	Yes
xAP	Preposition (adposition)	No
xCC	Conjunction, coordinating	No
xCS	Conjunction, subordinating	No
xIT	Interjection	No
xIM	Infinitive marker	No
xRP	Relative particle	No
xUA	Unassigned	-

Name token	Inflectional class
iST	Strong
iWK	Weak
iRD	Reduplicating
iPP	Preterito-Presentic
etc.

Name token	Value
gMF	Masculine or Feminine
gMN	Masculine or Neuter
gFN	Feminine or Neuter

Name token	Value
cN	Nominative
cG	Genitive
cD	Dative
cA	Accusative
cU	Unspecified

Name token	Value
rP	Positive
rC	Comparative
rS	Superlative
rU	Unspecified

Name token	Value
p1	1. person
p2	2. person
p3	3. person
pU	Unspecified

Name token	Value
mIN	Indicative
mSU	Subjunctive
mIP	Imperative
mU	Unspecified

Name token	Value
fF	Finite
fI	Infinitive (non-finite)
fP	Participle (non-finite)
fU	Unspecified

Name token	Value
yG	Governing Genitive
yD	Governing Dative
yA	Governing Accusative
yU	Unspecified government

Lemma	Word form	Grammatical form
líf	lífi	xNC gN nS cD sI
lífi	lífi	xNC gN nS cD sI

Word class	Gender	Number	Case	Species
xNC xNP	gM gF gN gU	nS nP nU	cN cG cD cA cU	sI sD sU

Word class	Person	Gender	Number	Case
xPE	p1 p2 p3 pU	gM gF gN gU	nS nD nP nU	cN cG cD cA cU

Word class	Finiteness	Tense	Mood	Person	Number	Voice	Infl. class
xVB	fF	tPS tPT tU	mIN mSU mIP mU	p1 p2 p3 pU	nS nP nU	vA vR vU	iST iWK iRD iPP iU