Show simple item record

dc.contributor.author: Kotlyarov, Alexandr
dc.date.accessioned: 2016-12-05T09:49:28Z
dc.date.available: 2016-12-05T09:49:28Z
dc.date.issued: 2016-09-01
dc.date.submitted: 2016-09-01 [eng]
dc.identifier.uri: https://hdl.handle.net/1956/15228
dc.description.abstract: This thesis proposes an approach to the automatic evaluation of the stylistic quality of natural texts through data-driven methods of Natural Language Processing. The advantages of data-driven methods and their dependency on the size of the training data are discussed. The advantages of using Wikipedia as a source for textual data mining are also presented. The method in this project crucially involves a program for quick automatic extraction of sentences edited by users from the Wikipedia Revision History. The resulting edits have been compiled into a large-scale corpus of examples of stylistic editing. The complete modular structure of the extraction program is described and its performance is analyzed. Furthermore, the need to separate stylistic edits from factual ones is discussed, and a number of Machine Learning classification algorithms for this task are proposed and tested. The program developed in this project was able to process approximately 10% of the whole Russian Wikipedia Revision History (200 gigabytes of textual data) in one month, resulting in the extraction of more than two million user edits. The best algorithm for classifying edits as factual or stylistic achieved 86.2% cross-validation accuracy, which is comparable with the state-of-the-art performance of similar models described in published papers. [en_US]
dc.format.extent: 1502197 bytes [eng]
dc.format.mimetype: application/pdf [eng]
dc.language.iso: eng [eng]
dc.publisher: The University of Bergen [eng]
dc.title: Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-purpose corpus of stylistic edits from the Wikipedia revision history [eng]
dc.type: Master thesis
dc.rights.holder: Copyright the author. All rights reserved [eng]
dc.description.degree: Master i Datalingvistikk og språkteknologi
dc.description.localcode: MAHF-DASP
dc.description.localcode: DASP350
dc.subject.nus: 711726 [eng]
fs.subjectcode: DASP350


Associated file(s)

This item appears in the following collection(s)
