Show simple item record

dc.contributor.author: Dyrland, Kjetil
dc.date.accessioned: 2022-09-27T23:46:45Z
dc.date.available: 2022-09-27T23:46:45Z
dc.date.issued: 2022-06-01
dc.date.submitted: 2022-09-27T22:01:00Z
dc.identifier.uri: https://hdl.handle.net/11250/3021973
dc.description.abstract: Drug discovery plays a critical role in today’s society in treating and preventing illness, including potentially deadly viral infections. In early drug discovery, the main challenge is to find candidate molecules that can be used as drugs to treat a disease. This also means assessing the key properties wanted in the interaction between molecules and proteins. The problem is very difficult because the molecular space is so large and complex. Drug development is estimated to take around 12–15 years on average, and the cost of developing a single drug amounts to $2.8 billion in the US.

Modern drug discovery and development often start with finding candidate drug molecules (‘compounds’) that can bind to a target, usually a protein in our body. Since there are billions of possible molecules to test, this becomes an endless search for compounds that show promising bioactivity. The search method is called high-throughput screening (HTS), or virtual HTS (VHTS) when carried out in a virtual environment. The traditional approach to HTS has been to test every compound one by one. More recent approaches use robotics and features extracted from the molecule, combined with machine learning algorithms, in an effort to make the process more automated. Research has shown that this still leads to human errors and bias. So, how can we use machine learning algorithms to make this approach more cost-efficient and more robust to human errors? This project tried to address these issues and led to two scientific papers.

The first paper explores how common evaluation metrics used for classification can actually be unsuited to the task, leading to severe consequences when put into a real application. The argument is based on basic principles of Decision Theory, which is recognized in the field of machine learning but has not been put to much use. Decision Theory makes a distinction between predicting the most probable class and predicting the most valuable class in terms of the “costs” or “gains” for the classes. In an algorithm for classifying a particular disease in a patient, a wrong classification could lead to a life-or-death situation. The principles also apply to drug discovery, where the cost of further developing and optimizing a “useless” drug could be huge. The goal of the classifier should therefore not be to guess the correct class but to choose the optimal class, and the metric must depend on the type of classification problem. We show that common metrics such as precision, balanced accuracy, F1-score, Area Under the Curve, Matthews Correlation Coefficient, and Fowlkes-Mallows index are affected by this problem, and we propose an evaluation method grounded in Decision Theory as a solution. The metric presented, called utility, takes into account gains and losses for each correct or incorrect classification in the confusion matrix. For this to work effectively, the output of the machine learning algorithm needs to be a set of sensible probabilities for each class.

This brings us to the second paper. Machine learning algorithms usually output a set of real numbers for the classes they try to predict, which, possibly after some transformation (for example the ‘softmax’ function), are meant to represent probabilities for the classes. However, these numbers cannot be reliably interpreted as actual probabilities, in the sense of degrees of belief. In the paper, we propose implementing a probability transducer to transform the output of the algorithm into sensible probabilities. These are then used in conjunction with the utilities to choose the class with the maximal expected utility. The results show that the transducer gives better scores, in terms of the utilities, in all cases compared to the standard method used in machine learning.
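
The distinction the abstract draws between the most probable class and the class with maximal expected utility can be illustrated with a small sketch. This is not code from the thesis: the two-class setup, the numbers in `UTILITY`, and the function names are all hypothetical, and averaging the utilities over a test set is only one plausible reading of the "utility" metric described above.

```python
# Hypothetical utility matrix for a two-class screening problem (made-up numbers).
# UTILITY[d][c] is the gain (negative = loss) of choosing class d when the true class is c.
# Classes: 0 = inactive compound, 1 = active compound.
UTILITY = [
    [1.0, -10.0],  # choose "inactive": small gain if correct, large loss if an active is missed
    [-1.0, 10.0],  # choose "active": small loss if wrong, large gain if correct
]


def expected_utilities(probs, utility=UTILITY):
    """Expected utility of each possible choice, given class probabilities."""
    return [sum(u * p for u, p in zip(row, probs)) for row in utility]


def choose_class(probs, utility=UTILITY):
    """Return the index of the choice with maximal expected utility."""
    eu = expected_utilities(probs, utility)
    return max(range(len(eu)), key=eu.__getitem__)


def average_utility(true_classes, chosen_classes, utility=UTILITY):
    """Average utility obtained over a test set (assumed form of the metric)."""
    total = sum(utility[d][c] for c, d in zip(true_classes, chosen_classes))
    return total / len(true_classes)


# A compound judged 80% likely to be inactive.
probs = [0.8, 0.2]
most_probable = max(range(len(probs)), key=probs.__getitem__)
best = choose_class(probs)
print(most_probable, best)
```

With these assumed utilities the most probable class is "inactive", yet choosing "active" has the higher expected utility, because missing an active compound is taken to be far more costly than testing an inactive one. The sketch only makes sense if `probs` are trustworthy probabilities, which is the problem the second paper's probability transducer is meant to address.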
dc.language.iso: eng
dc.publisher: The University of Bergen
dc.rights: Copyright the Author. All rights reserved
dc.subject: screening
dc.subject: Decision Theory
dc.subject: classifiers
dc.subject: Machine Learning
dc.subject: Deep learning
dc.subject: Evaluation
dc.subject: Drug Discovery
dc.subject: medicine
dc.subject: probabilities
dc.subject: drugs
dc.subject: metrics
dc.title: Evaluation and Improvement of Machine Learning Algorithms in Drug Discovery
dc.type: Master thesis
dc.date.updated: 2022-09-27T22:01:00Z
dc.rights.holder: Copyright the Author. All rights reserved
dc.description.degree: Master's thesis in Software Engineering, in collaboration with HVL
dc.description.localcode: PROG399
dc.description.localcode: MAMN-PROG
dc.subject.nus: 754199
fs.subjectcode: PROG399
fs.unitcode: 12-12-0

