Analysis of Word Embeddings: A Clustering and Topological Approach

Triki, Jonas Folkvord

dc.contributor.author	Triki, Jonas Folkvord
dc.date.accessioned	2021-08-18T00:00:45Z
dc.date.available	2021-08-18T00:00:45Z
dc.date.issued	2021-06-01
dc.date.submitted	2021-08-17T22:00:11Z
dc.identifier.uri	https://hdl.handle.net/11250/2769947
dc.description.abstract	Over the last few years, advances in natural language processing (NLP) have enabled us to learn more from textual data. To this end, word embedding models learn vectorized representations of words by training on large text corpora (e.g. the entire Wikipedia corpus). Word2vec is a word embedding model that learns a single vector representation for each word. However, because a single vector has to cover all of a word's meanings, it becomes hard to distinguish between those meanings. Words with multiple meanings are called polysemous, and determining word meanings is a challenging problem in NLP. Traditionally, word embeddings from word2vec are analyzed using analogy and cluster analysis. In analogy analysis of word embeddings, it is common to show relationships between words, e.g. that the relationship between king and man is the same as that between queen and woman, whereas in cluster analysis of word embeddings it is common to show how similar words cluster together, e.g. the clustering of country-related words. Moreover, following recent developments in the field of topological data analysis, a topological measure of polysemy was introduced, which attempts to identify polysemous words from their word embeddings. The goal of this thesis is to show how word embeddings are traditionally analyzed using analogies and clustering algorithms, and to use methods such as topological polysemy to identify polysemous words in various word embeddings. Our results show that we are able to effectively cluster word embeddings into groups of varying sizes. The results also revealed that the measure of topological polysemy was inconsistent across word embeddings; our proposed supervised models attempt to overcome this inconsistency and improve on that work.
dc.language.iso	eng
dc.publisher	The University of Bergen
dc.rights	Copyright the Author. All rights reserved
dc.title	Analysis of Word Embeddings: A Clustering and Topological Approach
dc.type	Master thesis
dc.date.updated	2021-08-17T22:00:11Z
dc.rights.holder	Copyright the Author. All rights reserved
dc.description.degree	Master's thesis in informatics
dc.description.localcode	INF399
dc.description.localcode	MAMN-PROG
dc.description.localcode	MAMN-INF
dc.subject.nus	754199
fs.subjectcode	INF399
fs.unitcode	12-12-0
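
The analogy analysis mentioned in the abstract (king is to man as queen is to woman) is commonly illustrated with vector arithmetic and cosine similarity. The following is a minimal sketch of that idea, not the method used in the thesis: the three-dimensional vectors, the word list, and the helper names `cosine_similarity` and `solve_analogy` are all hypothetical and chosen only for illustration. Real word2vec embeddings are learned from text and have many more dimensions.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# They are NOT taken from the thesis or from any trained model.
TOY_EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as ? is to c' by ranking words against a - b + c."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Exclude the three query words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    # king - man + woman should land closest to queen in this toy space.
    print(solve_analogy("king", "man", "woman", TOY_EMBEDDINGS))
```

With these toy vectors, the query words are excluded from the candidates, which is a common convention in analogy evaluation because the nearest neighbor of the target vector is otherwise often one of the inputs.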

Associated file(s)

Name: Master-s-Thesis-in-ML---Jonas- ...
Size: 3.809Mb
Format: PDF
Description: master thesis

This item appears in the following collection(s)

Master theses [197]