Creating an Agglomerative Clustering Approach Using GDELT
Master thesis
Permanent link
https://hdl.handle.net/11250/3072548
Publication date
2023-06-01
Collections
- Master theses [248]
Abstract
GDELT is a project that maintains a large-scale, continuously updated database providing a real-time picture of global news coverage, published as files that anyone can download and use. However, this data is of low granularity, and each data source provides little information on its own. This thesis attempts to leverage the large amount of available data by using a hierarchical agglomerative clustering method to identify news articles that report on the same real-life event. To do this, the thesis also explores whether the GDELT data is granular enough to be used without extensive preprocessing, and whether a distance metric for the clustering algorithm can be created. The findings show promising results when assessed qualitatively, but the quantitative measures are not yet optimized. Inherent flaws in GDELT and in clustering algorithms are hurdles to overcome before the real potential of GDELT's data can be realized; this thesis explores some of these difficulties and makes recommendations for how to circumvent them in future work.
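
To make the general idea of hierarchical agglomerative clustering with a custom distance metric concrete, the following is a minimal sketch and not the method used in the thesis. The `Record` fields, the weights inside `distance`, the average-linkage rule, and the `threshold` value are all assumptions chosen for illustration; they do not reflect the actual GDELT schema or the metric the thesis develops.

```python
from dataclasses import dataclass
from itertools import combinations
from math import hypot


@dataclass(frozen=True)
class Record:
    # Hypothetical, simplified fields; not the actual GDELT schema.
    url: str
    day: int          # days since an arbitrary reference date
    lat: float
    lon: float
    category: str     # placeholder category label


def distance(a: Record, b: Record) -> float:
    """Illustrative distance: time gap + crude location gap + category mismatch."""
    time_gap = abs(a.day - b.day)                      # in days
    place_gap = hypot(a.lat - b.lat, a.lon - b.lon)    # in degrees, not true geodesic
    category_gap = 0.0 if a.category == b.category else 1.0
    return time_gap + place_gap + 2.0 * category_gap   # weights are arbitrary


def cluster_distance(c1: list, c2: list) -> float:
    """Average linkage: mean pairwise distance between two clusters."""
    return sum(distance(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(records: list, threshold: float) -> list:
    """Repeatedly merge the two closest clusters until none are within threshold."""
    clusters = [[r] for r in records]  # start with one cluster per record
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Made-up records: three close in time and place, one far away.
records = [
    Record("a.example/1", 0, 59.9, 10.7, "X"),
    Record("b.example/2", 0, 59.9, 10.8, "X"),
    Record("c.example/3", 1, 60.0, 10.7, "X"),
    Record("d.example/4", 9, 40.7, -74.0, "Y"),
]
clusters = agglomerate(records, threshold=3.0)
for c in clusters:
    print([r.url for r in c])
```

In this toy setup the three nearby records end up in one cluster and the distant one stays alone. The naive loop recomputes all pairwise cluster distances on every merge, so it only suits very small inputs; data at GDELT's scale would need a more efficient implementation.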