From Data to Decisions: ML-Based Job Failure Analysis for the ALICE experiment
Master thesis
Date
2024-06-03
Collections
- Master theses [218]
Abstract
This thesis investigates whether machine learning can be used to locate factors that contribute to job failures on the ALICE grid, focusing on the grid site at the University of Bergen. A prototype system has been developed for data collection, management, analysis, and machine learning. The analysis data originates from the ALICE monitoring system, MonALISA, and from the ALICE grid middleware, JAliEn. The research uses a custom transformer model that addresses memory constraints in the project test environment by processing subsets of the complete input related to a job execution at a time.