Vis enkel innførsel

dc.contributor.authorTriki, Jonas Folkvord
dc.date.accessioned2021-08-18T00:00:45Z
dc.date.available2021-08-18T00:00:45Z
dc.date.issued2021-06-01
dc.date.submitted2021-08-17T22:00:11Z
dc.identifier.urihttps://hdl.handle.net/11250/2769947
dc.description.abstractOver the last few years, advances in natural language processing (NLP) have enabled us to learn more from textual data. To this end, word embedding models learn vectorized representations of words by training on big sets of texts (e.g. the entire Wikipedia corpus). Word2vec is a word embedding model which learns single vector representations of words. However, by creating such single vector representations of words, it becomes hard to separate between word meanings, as the single vector representations have to cover all the word meanings. Words with multiple meanings are called polysemous, and determining the word meanings is a challenging problem in NLP. Traditionally, word embeddings from word2vec are analyzed using analogy and cluster analysis. In analogy analysis of word embeddings, it is common to show relationships between words, e.g. that the relationship between king and man is the same as that between queen and woman, whereas, in cluster analysis of word embeddings, it is common to show how similar words cluster together, e.g. the clustering of country-related words. Moreover, due to recent developments in the field of topological data analysis, a topological measure of polysemy was introduced, which attempts to identify polysemous words from their word embeddings. The goal of this thesis is to show how word embeddings traditionally are analyzed using analogies and clustering algorithms and to use methods such as topological polysemy for identifying polysemous words of various word embeddings. Our results show that we are effectively able to cluster word embeddings into groups of varying sizes. Results also revealed that the measure of topological polysemy was inconsistent across word embeddings, and our proposed supervised models attempt to overcome and improve on this work.
dc.language.isoeng
dc.publisherThe University of Bergen
dc.rightsCopyright the Author. All rights reserved
dc.titleAnalysis of Word Embeddings: A Clustering and Topological Approach
dc.typeMaster thesis
dc.date.updated2021-08-17T22:00:11Z
dc.rights.holderCopyright the Author. All rights reserved
dc.description.degreeMasteroppgave i informatikk
dc.description.localcodeINF399
dc.description.localcodeMAMN-PROG
dc.description.localcodeMAMN-INF
dc.subject.nus754199
fs.subjectcodeINF399
fs.unitcode12-12-0


Tilhørende fil(er)

Thumbnail

Denne innførselen finnes i følgende samling(er)

Vis enkel innførsel