Automated analysis of Norwegian text

Johansen, Bjarte

dc.contributor.author	Johansen, Bjarte
dc.date.accessioned	2019-10-03T07:50:38Z
dc.date.available	2019-10-03T07:50:38Z
dc.date.issued	2019-06-28
dc.identifier	container/a5/57/5a/6c/a5575a6c-2df9-4aec-8a16-d82833d7c5c8
dc.identifier.isbn	9788230848753
dc.identifier.isbn	9788230866757
dc.identifier.uri	http://hdl.handle.net/1956/20906
dc.description.abstract	In this thesis we look at how we can develop automated analysis tools for Norwegian text. We look at three different tasks: Part-of-Speech (PoS) tagging, Named-Entity Chunking (NEC), and Named-Entity Recognition (NER). For our work on PoS tagging, we extend the work done on the OBT+Stat tagger by training a new model that allows it to also disambiguate Nynorsk. We work with Google's SyntaxNet and train it for PoS tagging of Bokmål and Nynorsk, showing state-of-the-art results at the time of the research. We train a Support Vector Machine for NEC of Bokmål, the task of extracting names from text. Next, we develop a NER model using deep learning and provide a NER sequence tagger for Bokmål and Nynorsk. The Nynorsk tagger is the first NER model for Nynorsk that we are aware of. The best performing model is trained on both language forms and shows better performance on both Bokmål and Nynorsk than the models we trained individually on each language form. Finally, we show how we can use NEC and NER together with Social Network Analysis tools to investigate two case studies around the news story discussing the consequence study of drilling for oil in Lofoten, Vesterålen, and Senja. In the first case study we show that it is possible to find the thematic structures of a news story by analysing the relationships between the entities in the text. In the second case study, using topic modelling, we find the topics and the most important persons for each topic.	en_US
dc.language.iso	eng	eng
dc.publisher	The University of Bergen	eng
dc.rights	Attribution-NonCommercial (CC BY-NC)	eng
dc.rights.uri	http://creativecommons.org/licenses/by-nc/4.0/	eng
dc.title	Automated analysis of Norwegian text	eng
dc.type	Doctoral thesis	en_US
dc.rights.holder	Copyright the author.	en_US
dc.identifier.cristin	1708253

Associated file(s)

File name: drthesis_BjarteJohansen_2019.pdf
Size: 2.444Mb
Format: PDF
Description: pdf

This item appears in the following collection(s)

Department of Information Science and Media Studies [858]

Unless otherwise stated, this item is licensed as Attribution-NonCommercial (CC BY-NC)