Automated Identification of Severe Errors in Speech to Text Transcripts
Master thesis
Permanent link
https://hdl.handle.net/11250/3072573
Date of issue
2023-06-02
Collections
- Master theses [246]
Abstract
In this thesis we explore how problematic misplaced words can be automatically identified in speech-to-text transcripts. Automatic Speech Recognition (ASR) systems automatically generate text from human speech. Because spoken natural language is complex, with dialects, variations in speaking rate, and mismatches between how people talk and the data the systems were trained on, ASR systems may introduce errors. Sometimes these errors are severe enough to become problematic. Post-processing an ASR system's output means finding such errors after the text has been generated. We investigate to what degree word probabilities computed with pre-trained language models can be used to solve this problem, and to what degree these probabilities can be used to build a classifier that detects problematic words. In our approach, we synthetically introduce problematic words into text documents, then compute probabilities for both problematic and non-problematic words in these documents to investigate whether the models treat them differently. We show that the models generally assign lower probabilities to problematic words and higher probabilities to correct words. We then train a logistic regression classifier on these probabilities to classify words. Our results show that, using probabilities from NorBERT1 and NorBERT2, a logistic regression classifier can accurately detect problematic words, while NB-BERT performs worse than a baseline bigram model.
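As an illustration of the approach summarized above, the sketch below shows how a masked language model's probability for each word in a transcript could be computed and used as the single feature of a logistic regression classifier. It is a minimal sketch under stated assumptions: the Hugging Face model id "ltg/norbert2", the whole-word masking strategy, the first-subword scoring, and the toy labels are illustrative assumptions, not the thesis's exact setup.

```python
# Sketch: score each word with a masked LM and classify words as
# problematic/non-problematic from those probabilities.
# Assumptions: model id "ltg/norbert2", single-feature logistic regression,
# toy sentences/labels; not the thesis's exact pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("ltg/norbert2")
model = AutoModelForMaskedLM.from_pretrained("ltg/norbert2")
model.eval()


def word_probability(words, index):
    """Probability the masked LM assigns to words[index] given its context."""
    masked = list(words)
    masked[index] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    # Score the first subword piece of the original word as a proxy for the word.
    target_id = tokenizer(words[index], add_special_tokens=False)["input_ids"][0]
    return probs[target_id].item()


# Toy training data: one probability feature per word, label 1 = problematic.
sentences = [["dette", "er", "en", "setning"], ["her", "kom", "banan", "feil"]]
labels = [[0, 0, 0, 0], [0, 0, 1, 0]]

X, y = [], []
for words, labs in zip(sentences, labels):
    for i, lab in enumerate(labs):
        X.append([word_probability(words, i)])
        y.append(lab)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X))  # per-word probability of being problematic
```

In this sketch the classifier sees only the masked-LM probability, which mirrors the idea that problematic words receive lower probabilities than correct words; a real setup would use held-out evaluation data and the specific models compared in the thesis (NorBERT1, NorBERT2, NB-BERT, and a bigram baseline).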