Large language models for document classification
Master thesis
Date
2024-06-03
Collections
- Master theses [247]
Abstract
This thesis explores the efficacy of large language models for document classification. It began as a project in collaboration with the "Byggebot" project, started by the PolarisMedia group, after the project lead contacted my thesis advisor, Professor Andreas Lothe Opdahl, to see if he was interested in the idea. The original goal was to see whether large language models could be used to classify "news-worthy" stories, with a focus on municipal building permit documents. I decided to work with the municipality of Tromsø, as one of the local newspapers was working on a similar project and was originally interested in collaborating. I eventually ran into problems with poor data: the documents had to be acquired manually, and the municipal database of Tromsø had less than desirable document markings and classifications. After some trial and error, I decided to shift the focus of the thesis to document classification itself, as this was where most of the project time had been spent, and it seemed like an interesting angle to research. After extensive attempts at fine-tuning models and verifying them against semi-manually classified documents, it remained unclear whether the models were being verified well enough, because of the lack of data. In the end, I decided to take multiple state-of-the-art models and compare them against each other, testing the agreement between them on multiple datasets, with each model classifying each dataset using a similar prefixed prompt. The resulting scores seem to indicate that certain models tend to agree with each other. The argument is then that, when multiple models agree to this extent, the probability of the classification being wrong is moderately low.
The models are by no means perfect, but there seems to be a trend for certain models to be more effective than others in some of the domains tested.
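The comparison of agreement between models described in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which agreement measure the thesis uses, so raw percent agreement and Cohen's kappa are chosen here as common options, and the model names and labels are hypothetical placeholders, not data from the thesis.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of documents on which two models assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Agreement between two models corrected for chance (Cohen's kappa)."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability that both models pick the same label
    # if each labelled independently with its own label frequencies.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both models used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def pairwise_agreement(predictions):
    """Map each pair of model names to (percent agreement, Cohen's kappa)."""
    return {
        (m1, m2): (
            percent_agreement(predictions[m1], predictions[m2]),
            cohen_kappa(predictions[m1], predictions[m2]),
        )
        for m1, m2 in combinations(sorted(predictions), 2)
    }


if __name__ == "__main__":
    # Hypothetical labels from three hypothetical models on four documents.
    example = {
        "model_a": ["permit", "other", "permit", "other"],
        "model_b": ["permit", "other", "other", "other"],
        "model_c": ["permit", "other", "permit", "other"],
    }
    for pair, (agreement, kappa) in pairwise_agreement(example).items():
        print(pair, round(agreement, 2), round(kappa, 2))
```

High pairwise agreement alone does not establish correctness, since models can share the same mistakes; kappa is included because it discounts the agreement that would be expected by chance when one label dominates a dataset.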