• norsk
    • English
  • English 
    • norsk
    • English
  • Login
View Item 
  •   Home
  • Faculty of Humanities
  • Department of Linguistics, Literary and Aestetic Studies
  • Department of Linguistics, Literary and Aestetic Studies
  • View Item
  •   Home
  • Faculty of Humanities
  • Department of Linguistics, Literary and Aestetic Studies
  • Department of Linguistics, Literary and Aestetic Studies
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-­purpose corpus of stylistic edits from the Wikipedia revision history

Kotlyarov, Alexandr
Master thesis
Thumbnail
View/Open
149870132.pdf (1.432Mb)
URI
https://hdl.handle.net/1956/15228
Date
2016-09-01
Metadata
Show full item record
Collections
  • Department of Linguistics, Literary and Aestetic Studies [814]
Abstract
This thesis proposes an approach to automatic evaluation of the stylistic quality of natural texts through data-driven methods of Natural Language Processing. Advantages of data driven methods and their dependency on the size of training data are discussed. Also the advantages of using Wikipedia as a source for textual data mining are presented. The method in this project crucially involves a program for quick automatic extraction of sentences edited by users from the Wikipedia Revision History. The resulting edits have been compiled in a large-scale corpus of examples of stylistic editing. The complete modular structure of the extraction program is described and its performance is analyzed. Furthermore, the need to separate stylistic edits stylistic edits from factual ones is discussed and a number of Machine Learning classification algorithms for this task are proposed and tested. The program developed in this project was able to process approximately 10% of the whole Russian Wikipedia Revision history (200 gigabytes of textual data) in one month, resulting in the extraction of more than two millions of user edits. The best algorithm for the classification of edits into factual and stylistic ones achieved 86.2% cross-validation accuracy, which is comparable with state-of-the-art performance of similar models described in published papers.
Publisher
The University of Bergen
Copyright
Copyright the author. All rights reserved

Contact Us | Send Feedback

Privacy policy
DSpace software copyright © 2002-2019  DuraSpace

Service from  Unit
 

 

Browse

ArchiveCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsDocument TypesJournalsThis CollectionBy Issue DateAuthorsTitlesSubjectsDocument TypesJournals

My Account

Login

Statistics

View Usage Statistics

Contact Us | Send Feedback

Privacy policy
DSpace software copyright © 2002-2019  DuraSpace

Service from  Unit