Ch. 8. Lemmatisation of manuscript text

by Karl G. Johansson

8.1 Introduction
8.2 The attribute lemma
8.3 The attribute pos
8.4 General problems
8.5 Word classes
8.6 Menota tags for a morphological database

8.1 Introduction

In ch. 2.3 we suggested that the unit word, <w>, should be marked in the transcription of manuscript text so that abbreviations and their expansions can be treated consistently. The element <w> can also carry the lemma and a grammatical analysis for every word in the manuscript text. This information is best given as the values of the two attributes lemma and pos. This chapter sets out the basic principles for the lemmatization of manuscript text. It is important to note, however, that the presentation should be seen as a sketch rather than as definitive guidelines. The elements and attributes that will be discussed are:

Element	Contents
<w>	delimits a grammatical word.
lemma	gives the lexical form of the grammatical word.
pos	gives the morphosyntactic analysis of the grammatical word.

It is essential that the lemmatization of Medieval Scandinavian manuscript text follows the principles developed for handling large corpora in linguistic research. We therefore recommend the guidelines provided by EAGLES (1996) as a starting point. In some respects, however, the principles presented below diverge from those suggested by EAGLES, as the Old Scandinavian languages present particular problems for the encoder.

The model provided here is adapted to Old Norse-Icelandic grammar. For Old Swedish and Old Danish texts we can expect a radical levelling of the grammatical system, e.g. in nominal and verbal inflection. The model will therefore overgenerate when applied to Swedish and Danish texts from the period.

8.2 The attribute lemma

Within the element <w> it is possible to provide a variety of information for every graphic word. The attribute lemma, for example, records the lexical form of the graphic word, which enables us to search for all graphic and grammatical forms of that word. The lemma should preferably be identical to the headword found in the lexicon. For Old Norse-Icelandic texts we suggest that the word-list produced by the Arnamagnæan Commission's Ordbog over det norrøne prosasprog (ONP) at the University of Copenhagen is used to create the lemma base. The attribute would then be marked up as follows:

<w lemma="hafa">ha&fins;i</w>

In ch. 2.3 the use of <w> for the markup of graphic words and information concerning their description is treated. In the example used here the graphic word contains an Old Norse-Icelandic character which is not used in modern script. In a transcription of manuscript text this character is given with an entity name "&fins;" as described in ch. 5, but in the lemmatized form the character is normalized according to the principles of the ONP. The resulting structure will look as follows:

In the following example we can see how a more complicated form with abbreviations and expansions can be presented in the elements <orig> and <expForm> respectively, included in the element <w>, and thereby all be related to the attribute lemma.

In cases where a graphic word is included partially or completely in the element <unclear> this can be marked within the element <w> and be related to the attribute lemma.

Text included within the element <supplied> is not lemmatized. The following example shows how a character, word or phrase that has been supplied is marked with the element <w>, but without markup of the lemma as the text is not transcribed from the manuscript text.

<w>

<reg>leikti</reg>
</w>

This means that forms without lemma markup will not be included in the searchable database under the category lemma. In this way we avoid contamination between forms that come from the manuscript text and forms that have been supplied by a transcriber or encoder. A basic principle is that the lemmatized text should come from the manuscript.
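
The effect of leaving supplied text without a lemma can be illustrated with a small sketch. It is not part of the Menota guidelines: the sample fragment, the function name build_lemma_index and the form hefir are assumptions made for the illustration, and entity references are written as plain characters so that a standard XML parser accepts the fragment without a DTD.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample text; the last <w> has no lemma, as for supplied text.
SAMPLE = """<text>
  <w lemma="hafa">hafi</w>
  <w lemma="hafa">hefir</w>
  <w lemma="hestr">hestum</w>
  <w><reg>leikti</reg></w>
</text>"""


def build_lemma_index(xml_text):
    """Map each lemma to the word forms recorded under it."""
    index = defaultdict(list)
    for w in ET.fromstring(xml_text).iter("w"):
        lemma = w.get("lemma")
        if lemma is None:
            continue  # no lemma attribute: the word stays out of the index
        index[lemma].append("".join(w.itertext()).strip())
    return dict(index)


index = build_lemma_index(SAMPLE)
print(index)  # {'hafa': ['hafi', 'hefir'], 'hestr': ['hestum']}
```

Because the word without a lemma attribute is skipped, a search by lemma only ever returns forms that were lemmatized from the manuscript text.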

8.3 The attribute pos

With the attribute pos we can add information about the morphosyntactic form of the individual representation of a lemma, i.e. the form provided in the element <orig> is described morphosyntactically. To be able to make this analysis it is necessary to create a model for the encoding that describes all the possible morphological forms of each lemma. In the following this description is tentatively built from the basic categories with sub-categories to provide a full description of the Old Norse-Icelandic grammar.

For the noun hestr 'horse' in dative plural this can be described as follows:

<w lemma="hestr" pos="NCMPD">
<orig>hestu&bar;</orig>
<expForm>hestu<expan>m</expan></expForm>
<reg>hestum</reg>
</w>

The lemmas can primarily be divided into word classes. The first character in the value of the attribute pos above represents the word class noun (N). The following character defines the noun as a nomen appellativum, or common noun (C). There is also a need for information about gender (masculine, M), number (plural, P) and case (dative, D). For every word class the categories are given in a fixed order. Where a category is not in use, its position is marked with an *.

8.4 General problems

The manuscript texts of Medieval Scandinavia display a wide range of graphematic and orthographic variation, the Old Norse-Icelandic texts to a higher degree than the East Nordic texts. Further, the medieval languages of Scandinavia are highly inflected, which causes problems if we wish to use computers to analyse the graphical forms in the manuscript text, what we call graphic words, automatically into lemma and lemmatic forms, i.e. the grammatical forms available for the analysed lemma. In the following we propose a model for lemmatization and analysis into lemmatic forms.

The variation we find in the manuscript text, what we can call form variation, is an initial problem in the first phase of the lemmatization. We need to be able to identify all possible graphic forms that can represent a lemma in the manuscript text. A good example is some of the graphic variation for the pronoun hann 'he' in different cases.

Form	Lemma
hann	hann
han&bar;	hann
h&bar;	hann
h&bar;n	hann
ha&scap;	hann
hans	hann
han&stall;	hann
h&bar;s	hann
h&bar;&stall;	hann
honum	hann
honom	hann
h&bar;m	hann
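
As a minimal sketch of the first phase, the table above can be held as a lookup from graphic form to lemma. The names FORMS_OF_HANN, FORM_TO_LEMMA and lemma_for are hypothetical, and the entity names are kept as plain strings exactly as they appear in the transcription.

```python
# Graphic forms of the pronoun hann, copied from the table above.
FORMS_OF_HANN = [
    "hann", "han&bar;", "h&bar;", "h&bar;n", "ha&scap;", "hans",
    "han&stall;", "h&bar;s", "h&bar;&stall;", "honum", "honom", "h&bar;m",
]

# Every listed graphic form points to the same lemma.
FORM_TO_LEMMA = {form: "hann" for form in FORMS_OF_HANN}


def lemma_for(graphic_form):
    """Return the lemma for a graphic form, or None if the form is not listed."""
    return FORM_TO_LEMMA.get(graphic_form)


print(lemma_for("h&bar;m"))  # hann
print(lemma_for("hestum"))   # None
```

A form that returns None has not yet been identified as a variant of any lemma and would need to be added to the list by hand.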

The inflectional diversity of the Medieval Scandinavian languages gives rise to many cases of homography between lemmatic forms of the same lemma, what we call internal homography. This can be seen in the following example, where the nominative singular of the feminine noun hetja has the same form as the genitive plural (NCFSNI | NCFPGI), the oblique cases of the singular share one form (NCFSGI | NCFSDI | NCFSAI), and the nominative and accusative plural share another (NCFPNI | NCFPAI). The homography is marked with | between the tags for each lemmatic form:

Form	Lemma	Tag
hetja	hetja	NCFSNI | NCFPGI
hetju	hetja	NCFSGI | NCFSDI | NCFSAI
hetjur	hetja	NCFPNI | NCFPAI

In the initial markup of lemmatic forms it is suggested that all possible tags are given in the attribute pos. This is, however, not satisfactory if we wish to have a consistent markup of the morphosyntactic analysis. In cases where the morphosyntactic analysis can be disambiguated consistently, this should of course be done.
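
A pos value that lists several tags can be split on the | separator to see which words still await disambiguation. The following is a sketch only; the helper names split_tags and is_ambiguous are assumptions and not part of the guidelines.

```python
def split_tags(pos_value):
    """Split a pos value such as 'NCFSGI | NCFSDI' into its individual tags."""
    return [tag.strip() for tag in pos_value.split("|") if tag.strip()]


def is_ambiguous(pos_value):
    """True if the pos value still lists more than one possible tag."""
    return len(split_tags(pos_value)) > 1


print(split_tags("NCFSGI | NCFSDI | NCFSAI"))  # ['NCFSGI', 'NCFSDI', 'NCFSAI']
print(is_ambiguous("NCMPDI"))                  # False
```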

Further, we must take into account the possibility that forms of different lemmas are homographic at the level of the lemmatic form, what we call external homography. An example is the neuter noun vár 'spring' in nominative singular (NCNSNI) and the possessive determinative várr 'our' in feminine singular nominative and neuter plural nominative and accusative (DPFSN | DPNPN | DPNPA).

Form	Lemma	Tag
vár	vár	NCNSNI
vár	várr	DPFSN | DPNPN | DPNPA

The graphic forms can also be homographic for different lemmas, as in the feminine noun þýða 'friendship' in nominative singular and the verb þýða 'interpret' in the infinitive.

Form	Lemma	Tag
þýða	þýða	NCFSNWI | V*pres***I

In these cases the morphosyntactic analysis has to be made manually. An alternative is to give all possible lemmatic forms in the attribute pos as in the above example.
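
One way to find the forms that need manual analysis is to collect every (lemma, tag) candidate per graphic form and flag the forms that point to more than one lemma. The sketch below uses the vár rows from the table above; the names ROWS, candidates_by_form and needs_manual_analysis are hypothetical.

```python
from collections import defaultdict

# Rows taken from the external homography table: (form, lemma, pos value).
ROWS = [
    ("vár", "vár", "NCNSNI"),
    ("vár", "várr", "DPFSN | DPNPN | DPNPA"),
]


def candidates_by_form(rows):
    """Collect all (lemma, tag) candidates for each graphic form."""
    table = defaultdict(list)
    for form, lemma, pos in rows:
        for tag in pos.split("|"):
            table[form].append((lemma, tag.strip()))
    return dict(table)


def needs_manual_analysis(form, table):
    """True if the form can belong to more than one lemma."""
    return len({lemma for lemma, _ in table.get(form, [])}) > 1


table = candidates_by_form(ROWS)
print(table["vár"])
print(needs_manual_analysis("vár", table))  # True
```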

8.5 Word classes

8.5.1 Nouns (N)

Nouns can be divided into two categories, appellatives and propria. They are all marked with an N for noun in the first field. The second field distinguishes the two categories: NC for appellatives (common nouns) and NP for propria (proper nouns).

Nouns should also be marked for gender. In the Old Scandinavian languages we define three gender categories, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

There are two categories for number. Singular and plural should be marked in the fourth field as S and P respectively.

There are four categories for case in the Old Scandinavian languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

A noun can occur in definite and indefinite form. This is marked in the sixth field as D and I respectively. Of personal names and place-names, only the latter can occur in definite form. Names are always defined as definite and therefore marked either as D or with an *.

The field order for the nouns can now be given as in the following example:

<w lemma="hestr" pos="NCMPDI">hestum</w>

This defines the lemma hestr 'horse' as represented by the form hestum, i.e. the indefinite masculine appellative hestr is represented by a form in plural dative.

The categories for the noun can be given as follows:

Noun	Gender	Number	Case	Species
NC NP	M F N	S P	N G D A	D I
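
Since every noun field is a single character in a fixed position, a tag can be decoded by position. The following is a minimal sketch under that assumption; NOUN_FIELDS and decode_noun_tag are hypothetical names, and the English labels are only illustrative.

```python
# One entry per field after the initial N, in the order given in the table above.
NOUN_FIELDS = [
    ("type", {"C": "common", "P": "proper"}),
    ("gender", {"M": "masculine", "F": "feminine", "N": "neuter"}),
    ("number", {"S": "singular", "P": "plural"}),
    ("case", {"N": "nominative", "G": "genitive", "D": "dative", "A": "accusative"}),
    ("species", {"D": "definite", "I": "indefinite"}),
]


def decode_noun_tag(tag):
    """Decode a six-character noun tag such as 'NCMPDI' into named categories."""
    if len(tag) != 6 or tag[0] != "N":
        raise ValueError(f"not a six-field noun tag: {tag!r}")
    decoded = {}
    for code, (name, values) in zip(tag[1:], NOUN_FIELDS):
        if code == "*":
            decoded[name] = None  # category not in use
        elif code in values:
            decoded[name] = values[code]
        else:
            raise ValueError(f"unknown {name} code {code!r} in {tag!r}")
    return decoded


print(decode_noun_tag("NCMPDI"))
```

For the tag NCMPDI from the hestum example this yields common, masculine, plural, dative and indefinite.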

8.5.2 Adjectives (AJ)

Adjectives (AJ) are inflected for grade in three levels, positive, comparative and superlative, which are marked in the second field as P, C and S respectively.

Adjectives should also be marked for gender. In the Old Scandinavian languages we define three gender categories, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

There are two categories for number. Singular and plural should be marked in the fourth field as S and P respectively.

There are four categories for case in the Old Scandinavian languages, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

The field order is given in the following example. If the adjective hvítr 'white' functions as an attribute to hestum from the example given above for the nouns, i.e. hvítum, this will be marked as follows:

<w lemma="hv&iac;tr" pos="AJPMPD">hv&iac;tum</w>

The categories for the adjective can be given as follows:

Adjective	Grade	Gender	Number	Case
AJ	P C S	M F N	S P	N G D A

8.5.3 Pronouns (P)

There are a number of sub-categories for the pronouns. In the following these are treated separately to provide an overview. All pronouns are marked with P in the first field. In the second field the sub-category is given as described in the following sections.

Personal pronouns

The personal pronouns (PPer) are inflected in first, second and third person. This is marked in the third field as 1, 2 and 3 respectively.

The inflection for gender varies for the personal pronouns, but we can generally account for three categories, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively. In some categories there is no grammatical marking of gender (see the list of tags below). In these cases the fourth field has an *.

Personal pronouns in the first and second person have three categories for number, singular, plural and dual, which are marked in the fifth field as S, P and D respectively. Personal pronouns in the third person are not inflected for number. The fifth field in this case has an *.

The personal pronouns are inflected in four cases, nominative, genitive, dative and accusative, which are marked in the sixth field as N, G, D and A respectively.

The attribute pos for a personal pronoun can for example be given as:

<w lemma="vit" pos="PPer1*DN">vit</w>

This markup indicates that the lemma vit 'we two' is represented by the first person dual nominative form vit. The categories for the personal pronouns can be given as follows:

Pronoun category	Person	Gender	Number	Case
PPer	1 2 3	M F N	S D P	N G D A
Interrogative pronouns

The interrogative pronouns (PInt) are not inflected for person. The third field should therefore be marked with an *. They are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Interrogative pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Further, the interrogative pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth field as N, G, D and A respectively.

The attribute pos for an interrogative pronoun could be formed as follows:

<w lemma="hverr" pos="PInt*FPA">hverjar</w>

This markup indicates that a lemma hverr 'who' is represented by a form in feminine plural accusative hverjar. The categories for the interrogative pronouns can be given as follows:

Pronoun category	Person	Gender	Number	Case
PInt	*	M F N	S P	N G D A

Indefinite pronouns

The indefinite pronouns (PInd) are not inflected for person. The third field should therefore be marked with an *. They are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the fourth field as M, F and N respectively.

Indefinite pronouns are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Further, the indefinite pronouns are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the sixth field as N, G, D and A respectively.

The attribute pos for an indefinite pronoun could be formed as follows:

<w lemma="einnhverr" pos="PInd*MPD">einhverjum</w>

This markup indicates that a lemma einnhverr 'anyone' is represented by a form in masculine plural dative einhverjum. The categories for the indefinite pronouns can be given as follows:

Pronoun category	Person	Gender	Number	Case
PInd	*	M F N	S P	N G D A

8.5.4 Determinatives

There are a number of sub-categories for the determinatives. In the following these are treated separately to provide an overview. All determinatives are marked with D in the first field. In the second field the sub-category is given as described in the following sections.

Possessive determinatives

The possessive determinatives (DP) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Possessive determinatives are inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Further, the possessive determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

The attribute pos for a possessive determinative could be formed as follows:

<w lemma="v&aac;rr" pos="DPNSD">v&aac;ru</w>

This markup indicates that the lemma várr 'our' is represented by the neuter singular dative form váru. The categories for the possessive determinatives can be given as follows:

Determinative category	Gender	Number	Case
DP	M F N	S P	N G D A

Demonstrative determinatives

The demonstrative determinatives (DD) are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the third field as M, F and N respectively.

Demonstrative determinatives are inflected in two categories for number, singular and plural, which are marked in the fourth field as S and P respectively.

Further, the demonstrative determinatives are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fifth field as N, G, D and A respectively.

The attribute pos for a demonstrative determinative could be formed as follows:

<w lemma="s&aac;" pos="DDMSG">&th;ess</w>

This markup indicates that a lemma sá 'the one' is represented by a form in masculine singular genitive þess. The categories for the demonstrative determinatives can be given as follows:

Determinative category	Gender	Number	Case
DD	M F N	S P	N G D A

8.5.5 Numerals (NU)

The numerals are divided into two sub-categories, cardinals and ordinals. In the first field this is marked as NUC and NUO respectively.

The cardinals 1-4 are inflected in three categories for gender, masculine, feminine and neuter, which are marked in the second field as M, F and N respectively.

The cardinals are not inflected for number. The third field (number) is therefore marked with an *.

The cardinals 1-4 are inflected in four categories for case, nominative, genitive, dative and accusative, which are marked in the fourth field as N, G, D and A respectively.

The rest of the cardinals are not inflected and are therefore marked with an * for all categories.

The ordinals 1-4 have the same inflection as the cardinals 1-4. Further, they are inflected in two categories for number, singular and plural, which are marked in the third field as S and P respectively.

The numerals hundrað 'one hundred and twenty' and þúsund 'one thousand' are marked as nouns.

The attribute pos for a numeral could be formed as follows:

<w lemma="sjaundi" pos="NUOFSN">sjaunda</w>

This markup indicates that the lemma sjaundi 'seventh' is represented by the feminine singular nominative form sjaunda. The categories for the numerals can be given as follows:

Numerals	Gender	Number	Case
NUC NUO	M F N *	S P *	N G D A *

8.5.6 Verbs (V)

The verbs are marked in the first field as V. The verbs are inflected in two categories for tense, present and preterite, which are marked in the second field as Pres and Pret respectively.

Further, the verbs are inflected in three categories for mood, indicative, subjunctive and imperative, which are marked in the third field as Ind, Sub and Imp respectively.

In the personal inflection there are three categories, first, second and third person, which are marked in the fourth field as 1, 2 and 3 respectively.

The verbs are inflected in two categories for number, singular and plural, which are marked in the fifth field as S and P respectively.

Finally, there are two categories to distinguish non-finite and finite verb forms (finiteness). This should be done in the sixth field as I and F respectively.

The attribute pos for a verb could be formed as follows:

<w lemma="telja" pos="VPresInd1SF">tel</w>

This markup indicates that a lemma telja 'to count' is represented by a form in present indicative first person singular tel.

Verb	Tense	Mood	Person	Number	Finiteness
V	Pres Pret	Ind Sub Imp	1 2 3	S P	I F
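
Unlike the noun tags, the verb tags contain fields of more than one character (Pres, Ind), so they cannot be decoded by character position alone. The sketch below uses a regular expression instead. It follows the mood codes Ind, Sub and Imp given in the prose, allows * in any field for categories not in use, and the names VERB_TAG and decode_verb_tag are assumptions.

```python
import re

# One named group per field, in the order given in this section.
VERB_TAG = re.compile(
    r"V(?P<tense>Pres|Pret|\*)"
    r"(?P<mood>Ind|Sub|Imp|\*)"
    r"(?P<person>[123*])"
    r"(?P<number>[SP*])"
    r"(?P<finiteness>[IF*])"
)


def decode_verb_tag(tag):
    """Split a verb tag such as 'VPresInd1SF' into its named fields."""
    match = VERB_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a recognised verb tag: {tag!r}")
    return match.groupdict()


print(decode_verb_tag("VPresInd1SF"))
# {'tense': 'Pres', 'mood': 'Ind', 'person': '1', 'number': 'S', 'finiteness': 'F'}
```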

Present and preterite participles are treated as adjectives in the grammar of the Old Scandinavian languages, but in our description of the categories below we treat them as verbs. This mixes two categories in one tag, which is perhaps not optimal. The fields that are not used for the participle inflection are marked with an *. After the fields for the verbal inflection we have placed a field, Form, which marks the participles (P), followed by the fields for the adjectival inflection in the categories gender and case.

Verb	Tense	Mood	Person	Number	Finiteness	Form	Gender	Case
V	Pres Pret	*	*	S P	*	P	M F N	N G D A
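
The field order in the participle table can be sketched as a small function that composes a tag and fills the unused verbal fields with *. The function name participle_tag is hypothetical, and the resulting string is derived only from the table above, not checked against the complete Menota tag list.

```python
def participle_tag(tense, number, gender, case):
    """Compose a participle tag in the order
    V, tense, mood (*), person (*), number, finiteness (*), P, gender, case."""
    allowed = {
        "tense": (tense, ("Pres", "Pret")),
        "number": (number, ("S", "P")),
        "gender": (gender, ("M", "F", "N")),
        "case": (case, ("N", "G", "D", "A")),
    }
    for name, (value, values) in allowed.items():
        if value not in values:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"V{tense}**{number}*P{gender}{case}"


# Preterite participle, singular, masculine, nominative.
print(participle_tag("Pret", "S", "M", "N"))  # VPret**S*PMN
```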

8.5.7 Adverbs (AV)

Adverbs are inflected only for grade. We therefore only have to account for two fields: the word class (AV) in the first field and, in the second field, the three categories positive, comparative and superlative, which are marked as P, C and S respectively.

8.5.8 Prepositions (AP)

Prepositions are uninflected and only marked as AP.

8.5.9 Conjunctions (C)

Conjunctions are marked as C. They are divided into two sub-categories, coordinating, CC, and subordinating, CS.

8.5.10 Interjections (I)

Interjections are uninflected and only marked as I.

8.6 Menota tags for a morphological database

A complete table of possible tags for Old Norse morphology is located here:

Morphological tags for Old Norse - PDF file [44 kB]

Requires Acrobat Reader.

Created 08.04.2002 by KGJ. Latest update 08.04.2002 by KGJ.
