dc.contributor.author: Rosenvinge, Frederik Hjelde
dc.date.accessioned: 2023-06-22T00:05:19Z
dc.date.available: 2023-06-22T00:05:19Z
dc.date.issued: 2023-06-02
dc.date.submitted: 2023-06-21T22:01:18Z
dc.identifier.uri: https://hdl.handle.net/11250/3072573
dc.description.abstract: In this thesis we explore how problematic misplaced words can be automatically identified in speech-to-text transcripts. Automatic Speech Recognition (ASR) systems automatically generate text from human speech. Natural spoken language is complex: dialects, variations in speaking rate, and differences between how people talk and the training data can all cause an ASR system to introduce errors. Some of these errors are severe enough to be problematic. Post-processing an ASR system's output means finding such errors after the text has been generated. We investigate to what degree word probabilities computed with pre-trained language models can be used to solve this problem, and to what degree these probabilities can be used to build a classifier that detects problematic words. In our approach, we synthetically introduce problematic words into text documents, then compute the probabilities of both problematic and non-problematic words in these documents to investigate whether the models treat them differently. We show that the models generally assign lower probabilities to problematic words and higher probabilities to good words. We then train a logistic regression classifier on these probabilities to classify words. Our results show that, using probabilities from NorBERT1 and NorBERT2, a logistic regression classifier can accurately detect problematic words. We also show that NB-BERT performs worse than a baseline bigram model.
dc.language.iso: eng
dc.publisher: The University of Bergen
dc.rights: Copyright the Author. All rights reserved
dc.title: Automated Identification of Severe Errors in Speech to Text Transcripts
dc.type: Master thesis
dc.date.updated: 2023-06-21T22:01:18Z
dc.rights.holder: Copyright the Author. All rights reserved
dc.description.degree: Master's thesis in Information Science
dc.description.localcode: INFO390
dc.description.localcode: MASV-INFO
dc.subject.nus: 735115
fs.subjectcode: INFO390
fs.unitcode: 15-17-0
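
The pipeline outlined in the abstract (inject problematic words, score every word with a language model, fit a logistic regression classifier on the scores) can be illustrated with a minimal sketch. This is not the thesis implementation, which this record does not describe. It assumes an add-one-smoothed bigram model as a stand-in for the pre-trained language models, a hypothetical toy corpus in English, out-of-vocabulary noise words as the injected errors, and a single feature (the word's log-probability) fitted by plain gradient descent. All function and variable names are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy corpus and noise words; the thesis data is not reproduced here.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat lay on the rug",
    "the dog lay on the mat",
]
NOISE = ["banana", "engine", "purple"]

def train_bigram(sentences):
    """Count unigrams and bigrams, with a <s> start token per sentence."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def word_log_probs(tokens, unigrams, bigrams):
    """Add-one-smoothed log P(word | previous word) for every word."""
    vocab = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    padded = ["<s>"] + tokens
    return [
        math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
        for prev, word in zip(padded, tokens)
    ]

def make_dataset(sentences, unigrams, bigrams):
    """Corrupt each position in turn; label 1 = injected word, 0 = original word."""
    xs, ys = [], []
    for s in sentences:
        tokens = s.split()
        for i in range(len(tokens)):
            corrupted = tokens[:i] + [NOISE[i % len(NOISE)]] + tokens[i + 1:]
            xs.extend(word_log_probs(corrupted, unigrams, bigrams))
            ys.extend(1 if j == i else 0 for j in range(len(tokens)))
    return xs, ys

def standardize(xs):
    """Shift and scale the feature to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs], mean, std

def train_logreg(xs, ys, lr=0.5, epochs=2000):
    """One-feature logistic regression fitted by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

unigrams, bigrams = train_bigram(CORPUS)
xs, ys = make_dataset(CORPUS, unigrams, bigrams)
zs, mean, std = standardize(xs)
w, b = train_logreg(zs, ys)

def problem_score(log_prob):
    """Classifier's estimated probability that a word is problematic."""
    return 1.0 / (1.0 + math.exp(-(w * (log_prob - mean) / std + b)))
```

Calling `problem_score` on the log-probability of a word from `word_log_probs` yields a score between 0 and 1. In this toy setup the learned weight is negative, so lower-probability words receive higher scores, which matches the direction of the effect the abstract reports. Replacing `word_log_probs` with probabilities from a masked language model would be the natural next step, but that is outside what this sketch shows.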

