Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-purpose corpus of stylistic edits from the Wikipedia revision history

Kotlyarov, Alexandr
Master thesis
Open
149870132.pdf (1.432Mb)
Permanent link
https://hdl.handle.net/1956/15228
Publication date
2016-09-01
Collections
  • Department of Linguistics, Literary and Aesthetic Studies
Abstract
This thesis proposes an approach to the automatic evaluation of the stylistic quality of natural texts using data-driven methods of Natural Language Processing. The advantages of data-driven methods and their dependence on the size of the training data are discussed, as are the advantages of using Wikipedia as a source for textual data mining. The method crucially involves a program for the fast automatic extraction of user-edited sentences from the Wikipedia revision history. The resulting edits have been compiled into a large-scale corpus of examples of stylistic editing. The complete modular structure of the extraction program is described and its performance is analyzed. Furthermore, the need to separate stylistic edits from factual ones is discussed, and a number of Machine Learning classification algorithms for this task are proposed and tested. The program developed in this project was able to process approximately 10% of the whole Russian Wikipedia revision history (200 gigabytes of textual data) in one month, resulting in the extraction of more than two million user edits. The best algorithm for classifying edits as factual or stylistic achieved 86.2% cross-validation accuracy, which is comparable with the state-of-the-art performance of similar models described in published papers.
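The extraction step mentioned in the abstract can be illustrated with a rough sketch. This is not the thesis's program, whose internals are not given here: the function names, the naive punctuation-based sentence splitter, and the rule of keeping only one-to-one sentence replacements are all assumptions made for illustration. It uses only Python's standard `difflib` and `re` modules to compare two revisions of a text and return the sentences that were rewritten.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_edited_sentences(old_text, new_text):
    """Return (old_sentence, new_sentence) pairs rewritten between two revisions."""
    old_sents = split_sentences(old_text)
    new_sents = split_sentences(new_text)
    # Align the two revisions at the sentence level.
    matcher = difflib.SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only one-to-one replacements; whole-sentence insertions
        # and deletions are skipped in this sketch.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(old_sents[i1:i2], new_sents[j1:j2]))
    return pairs


if __name__ == "__main__":
    old = "The cat sat on the mat. It was very happy. The end."
    new = "The cat sat on the mat. It was extremely happy. The end."
    for before, after in extract_edited_sentences(old, new):
        print(before, "->", after)
```

Pairs collected this way would still mix factual and stylistic changes, which is why the abstract describes a separate classification step.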
Publisher
The University of Bergen
Copyright
Copyright the author. All rights reserved
