Translation-based Word Sense Disambiguation
This thesis investigates the use of the translation-based Mirrors method (Dyvik, 2005, inter alia) for Word Sense Disambiguation (WSD) for Norwegian. Word Sense Disambiguation is the process of automatically determining the relevant sense of an ambiguous word in context. Automated WSD is relevant for Natural Language Processing systems such as machine translation (MT), information retrieval, information extraction and content analysis. The most successful WSD approaches to date are so-called supervised machine learning (ML) techniques, in which the system ‘learns’ the contextual characteristics of each sense from a training corpus that contains concrete examples of contexts in which a word sense typically occurs. This approach suffers from a knowledge acquisition problem, since word senses are not overtly available in corpus text. First, we therefore need a sense inventory which is computationally tractable. Subjectively defined sense distinctions have been the norm in WSD research (especially those of the Princeton WordNet; Fellbaum, 1998). But WSD studies increasingly show that the WordNet senses are too fine-grained for efficient WSD, which has made WordNet less attractive for machine-learned WSD. Ide and Wilks (2006) instead recommend approximating word senses by way of cross-lingual sense definitions. Second, we need a method for sense-tagging context examples with the relevant sense given the context. Preparing such sense-tagged training corpora manually is costly and time-consuming, in particular because statistical methods require large numbers of training examples, and automated methods are therefore desirable. This thesis introduces an experimental lexical knowledge source which derives word senses and relations between word senses on the basis of translational correspondences in a parallel corpus, resulting in a structured semantic network (Dyvik, 2009).
The Mirrors method is applicable to any language pair for which a parallel corpus and word alignment are available. The appeal of the Mirrors method and its translational basis for lexical semantics is that it offers an objective and consistent, and hence testable, criterion, as opposed to the traditional subjective judgements in lexicon classification (cf. the Princeton WordNet). But due to the lack of intersubjective “gold standards” for lexical semantics, it is not an easy task to evaluate the Mirrors method. The main research question of this thesis may thus be formulated as follows: are the translation-based senses and semantic relations in the Mirrors method linguistically motivated from a monolingual point of view? To answer this question, the thesis proposes to use the monolingual task of WSD as a practical framework for evaluating the usefulness of the Mirrors method as a lexical knowledge source. This is motivated by the idea that a well-defined end-user application may provide a stable framework within which the benefits and drawbacks of a resource or a system can be demonstrated (e.g. Ng & Lee, 1996; Stevenson & Wilks, 2001; Yarowsky & Florian, 2002; Specia et al., 2009). The innovative aspect of applying the Mirrors method to WSD is twofold. First, the Mirrors method is used to obtain sense-tagged data automatically (using cross-lingual data), providing a SemCor-like corpus which allows us to exploit semantically analysed context features in a subsequent WSD classifier. Second, we test whether a classifier trained on semantically analysed context features, based on information from the Mirrors method, resolves other instances than a ‘traditional’ classifier trained on words. In the absence of existing data sets for WSD for Norwegian, an automatically sense-tagged parallel corpus and a manually verified lexical sample of fifteen target words were developed for Norwegian as part of this thesis.
The proposed automatic sense-tagging method is based on the Mirrors sense inventory and on the translational correspondents of each word occurrence. The sense-tagger provides a partially semantically analysed context; partially, because the translation-based sense-tagger can only sense-tag tokens that were successfully word-aligned. The sense-tagged English-Norwegian Parallel Corpus (the ENPC) is comparable in size to the existing SemCor. The sense-tagged material formed the basis for a series of controlled experiments in which the knowledge source was varied while the experimental framework was kept constant in terms of the classification algorithm, data sets, lexical sample and sense inventory. First, a WSD classifier was trained on the actually co-occurring context WORDS. This knowledge source functions as a point of reference to indicate how well a traditional word-based classifier could be expected to perform, given our specific data sample and using the Mirrors sense inventory. Second, two Mirrors-derived knowledge sources were tentatively implemented, both of which attempt to generalise from the actually occurring context words as a means of alleviating the sparse data problem in WSD. For instance, suppose the noun phone was found to co-occur with the ambiguous noun bill in the ‘invoice’ sense. If the classifier can generalise from this to include words that are semantically close to phone, such as telephone, then the presence of only one of them during learning could make both of them ‘known’ to the classifier at classification time. In other words, it might be desirable to study not only word co-occurrences, as unanalysed and isolated units, but also how words enter into relations with other words (classes of words) in the structured network that constitutes the vocabulary of a language.
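As a rough illustration of the translation-based sense-tagging described in this paragraph, the following Python sketch assigns a sense to a token by looking up its aligned translation in a sense inventory. It is a minimal sketch under stated assumptions, not the thesis's implementation: the inventory entries, the sense labels, the Norwegian translations and the function names `tag_token` and `tag_sentence` are all invented for illustration, and the example uses English source tokens only to stay close to the bill example above.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical toy sense inventory (invented for illustration): each sense of
# a word is identified by the set of translations assumed to signal it.
SENSE_INVENTORY: Dict[str, Dict[str, FrozenSet[str]]] = {
    "bill": {
        "bill_invoice": frozenset({"regning", "faktura"}),
        "bill_law": frozenset({"lovforslag"}),
    }
}


def tag_token(word: str, aligned: Optional[str]) -> Optional[str]:
    """Return a sense label for `word` given its aligned translation.

    Returns None when the token was not word-aligned, or when the aligned
    translation is not covered by the (toy) sense inventory.
    """
    if aligned is None:
        # Unaligned tokens cannot be tagged: the context is only partially analysed.
        return None
    for sense, translations in SENSE_INVENTORY.get(word, {}).items():
        if aligned in translations:
            return sense
    return None


def tag_sentence(
    tokens: List[Tuple[str, Optional[str]]]
) -> List[Tuple[str, Optional[str]]]:
    """Tag every (word, aligned_translation) pair in a sentence."""
    return [(word, tag_token(word, aligned)) for word, aligned in tokens]


# Each pair is (source token, aligned translation or None if unaligned).
sentence = [("the", None), ("phone", "telefon"), ("bill", "regning")]
tagged = tag_sentence(sentence)
```

In this toy run only bill receives a sense label: the is unaligned and phone is missing from the invented inventory, which mirrors the point that the resulting context is only partially sense-tagged.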
In ML terms, it might be interesting to build a WSD model which learns, not how a word sense correlates with isolated words, but rather how a word sense correlates with certain classes of semantically related words. Such a tool for generalisation is clearly desirable in the face of sparse data and in view of the fact that most content words have a relatively low frequency even in larger text corpora. The first of the two Mirrors-based knowledge sources rests on so-called SEMANTIC-FEATURES that are shared between word senses in the Mirrors network. Since SEMANTIC-FEATURES may include a very high number of related words, a second knowledge source, RELATED-WORDS, was also developed, which attempts to select a stricter class of near-related word senses in the wordnet-like Mirrors network. The results indicated that the gain in abstracting from context words to classes of semantically related word senses was only marginal, in that the two Mirrors-based knowledge sources only knew marginally more of the context words at classification time compared to a traditional word-based classifier. Regarding classification accuracy, the Mirrors-based SEMANTIC-FEATURES seemed to suffer from including too broad semantic information and performed significantly worse than the other two knowledge sources. The Mirrors-based RELATED-WORDS, on the other hand, was as good as, and sometimes better than, the traditional word model, but the differences were not found to be statistically significant. Although unfortunate for the purpose of enriching a traditional WSD model with Mirrors-derived information, the lack of a difference between the traditional word model and RELATED-WORDS nevertheless provides promising indications with regard to the plausibility of the Mirrors method.
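The idea of generalising from context words to classes of related words, and the comparison of how many context words are ‘known’ at classification time, can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `RELATED` table stands in for a lookup in the Mirrors network, the training and test contexts are invented, and `known_vocabulary` and `coverage` are hypothetical helper names rather than parts of the system described in the thesis.

```python
from typing import Dict, Iterable, Set

# Hypothetical stand-in for a related-words lookup (invented entries).
RELATED: Dict[str, Set[str]] = {
    "phone": {"telephone"},
    "telephone": {"phone"},
}


def known_vocabulary(training_context: Iterable[str], generalise: bool) -> Set[str]:
    """Return the context words 'known' to a classifier after training.

    With generalise=False only the words actually seen count as known.
    With generalise=True each seen word also makes its related words known.
    """
    known = set(training_context)
    if generalise:
        for word in list(known):
            known |= RELATED.get(word, set())
    return known


def coverage(test_context: Iterable[str], known: Set[str]) -> float:
    """Fraction of test-time context words that are known to the classifier."""
    words = list(test_context)
    if not words:
        return 0.0
    return sum(word in known for word in words) / len(words)


# Invented contexts for the 'invoice' sense of bill.
train = ["phone", "pay", "month"]
test = ["telephone", "pay", "overdue", "month"]

words_cov = coverage(test, known_vocabulary(train, generalise=False))
related_cov = coverage(test, known_vocabulary(train, generalise=True))
```

In this toy case the word-only model knows two of the four test words, while the generalising model also knows telephone and so covers three of four. The sketch shows only the mechanism; the size of any such gain on real data is an empirical matter, and the thesis reports it as marginal.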