BORA - UiB
  •   BORA Home
  • Faculty of Humanities
  • Department of Linguistics, Literary and Aesthetic Studies
  • Linguistics

Towards the automatic evaluation of stylistic quality of natural texts: constructing a special-purpose corpus of stylistic edits from the Wikipedia revision history

Type
Master thesis
Not peer reviewed
File
149870132.pdf (1.432 MB)
Date
2016-09-01
Author
Kotlyarov, Alexandr
Abstract
This thesis proposes an approach to the automatic evaluation of the stylistic quality of natural texts using data-driven methods of Natural Language Processing. The advantages of data-driven methods and their dependence on the size of the training data are discussed, as are the advantages of using Wikipedia as a source for textual data mining. The method in this project crucially involves a program for fast automatic extraction of user-edited sentences from the Wikipedia revision history. The resulting edits have been compiled into a large-scale corpus of examples of stylistic editing. The complete modular structure of the extraction program is described and its performance is analyzed. Furthermore, the need to separate stylistic edits from factual ones is discussed, and a number of Machine Learning classification algorithms for this task are proposed and tested. The program developed in this project was able to process approximately 10% of the whole Russian Wikipedia revision history (200 gigabytes of textual data) in one month, resulting in the extraction of more than two million user edits. The best algorithm for classifying edits as factual or stylistic achieved 86.2% cross-validation accuracy, which is comparable with the state-of-the-art performance of similar models described in published papers.
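The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the program developed in the thesis: the function names `split_sentences` and `extract_edits`, the `min_ratio` threshold, and the use of Python's standard `difflib` module are all assumptions made for illustration, and the sentence splitter is deliberately naive (real Russian text would need a proper tokenizer).

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_edits(old_text, new_text, min_ratio=0.5):
    """Return (old_sentence, new_sentence) pairs modified between two revisions.

    Hypothetical helper: aligns the sentence lists of two revisions and keeps
    one-to-one replacements whose character similarity is at least min_ratio.
    """
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip unchanged, inserted and deleted sentences, and uneven replacements.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for before, after in zip(old[i1:i2], new[j1:j2]):
            similarity = difflib.SequenceMatcher(
                a=before, b=after, autojunk=False
            ).ratio()
            # Very dissimilar pairs are more likely rewrites than edits.
            if similarity >= min_ratio:
                edits.append((before, after))
    return edits


old_rev = "The city is very big. It was founded in 1703. It lies on a river."
new_rev = "The city is large. It was founded in 1703. It lies on a river."
print(extract_edits(old_rev, new_rev))
```

Pairs collected this way would still mix stylistic and factual changes, which is why the abstract treats classification as a separate step.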
URI
http://hdl.handle.net/1956/15228
Publisher
The University of Bergen
Collections
  • Linguistics 62
Copyright the author. All rights reserved

University of Bergen Library
Contact Us | Send Feedback
 

 
