Machine learning approaches for high-dimensional genome-wide association studies

Malik, Muhammad Ammar

dc.contributor.author	Malik, Muhammad Ammar
dc.date.accessioned	2022-08-10T08:27:33Z
dc.date.available	2022-08-10T08:27:33Z
dc.date.issued	2022-08-24
dc.date.submitted	2022-08-10T01:34:48Z
dc.identifier	container/e6/3d/91/2d/e63d912d-8fb5-4e27-ad8b-32fca902e778
dc.identifier.isbn	9788230850626
dc.identifier.isbn	9788230853634
dc.identifier.uri	https://hdl.handle.net/11250/3011013
dc.description.abstract	The aim of genome-wide association studies (GWAS) is to find statistical associations between genetic variants and traits of interest. Genetic variants that explain much of the variation in genome-wide gene expression can confound expression quantitative trait loci (eQTL) analyses. To account for these confounding factors, we presented the LVREML method in Article I, a method conceptually analogous to estimating fixed and random effects in linear mixed models (LMM). We showed that the maximum-likelihood latent variables can always be chosen orthogonal to the known factors (such as genetic variants). This indicates that the maximum-likelihood variables explain the sample variance that is not already explained by the genetic variants in the model. To map which traits are affected by the identified genetic variants, we must reverse the functional relation between genotypes and traits. In this context, a multi-trait method is more advantageous than studying the traits individually. The multi-trait method benefits from increased power from considering covariances across traits, and from a reduced multiple-testing burden, because a single test suffices to test for associations with a set of traits. In Article II we analyzed various machine learning methods (Naive Bayes/independent univariate correlation, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate the methods. In Article III we extended the above method to human data. An important difference between the data in Article II and Article III is that no ground-truth data are available for the latter. 
We used genotype and magnetic resonance imaging (MRI) data obtained from the ADNI database. The results from both Article II and Article III showed that genotype prediction performance varied across genetic variants. This helped identify genomic regions associated with a large number of traits in high-dimensional phenotypic data. We also observed that the coefficients of the machine learning models correlated with the strength of the associations between variants and traits. Our results also showed that non-linear machine learning methods such as random forests identified genetic variants more clearly than the linear methods. In particular, we observed in Article III that random forests were able to identify single-nucleotide polymorphisms (SNPs) that differed from those identified by the ridge and lasso regression methods. Further analysis showed that the identified SNPs belonged to genes previously associated with brain-related disorders.	en_US
dc.description.abstract	Genome-wide association studies (GWAS) aim to find statistical associations between genetic variants and traits of interest. Genetic variants that explain a large share of the variation in genome-wide gene expression may lead to confounding in expression quantitative trait loci (eQTL) analyses. To account for these confounding factors, in Article I we proposed LVREML, a method conceptually analogous to estimating fixed and random effects in linear mixed models (LMM). We showed that the maximum-likelihood latent variables can always be chosen orthogonal to the known factors (such as genetic variants). This indicates that the maximum-likelihood variables explain the sample covariance that is not already explained by the genetic variants in the model. To identify which traits are affected by the identified genetic variants, we need to reverse the functional relation between genotypes and traits. In this regard, multi-trait approaches are more advantageous than studying the traits individually. Multi-trait approaches benefit from increased power, gained by considering cross-trait covariances, and from a reduced multiple-testing burden, because a single test suffices to test for associations with a set of traits. In Article II, we analyzed various machine learning methods (ridge regression, Naive Bayes/independent univariate correlation, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate the methods. In Article III, we extended the above approach to a human dataset. An important difference between the data from Article II and Article III is that no ground-truth data are available for the latter. We used genotypes and brain-imaging features extracted from MRIs obtained from the ADNI database. 
The results from both Article II and Article III showed that genotype prediction performance varied across genetic variants. This helped in identifying genomic regions that are associated with a high number of traits in high-dimensional phenotypic data. We also observed that the feature coefficients of fitted machine learning models correlated with the strength of association between variants and traits. Our results also showed that non-linear machine learning methods such as random forests identified genetic variants distinct from those identified by the linear methods. In particular, we observed in Article III that random forests were able to identify single-nucleotide polymorphisms (SNPs) that were distinct from the ones identified by ridge and lasso regression. Further analysis showed that the identified SNPs belonged to genes previously associated with brain-related disorders.	en_US
dc.language.iso	eng	en_US
dc.publisher	The University of Bergen	en_US
dc.relation.haspart	Paper 1. Malik MA. and Michoel T. (2022), Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders, G3 12, 2, 2022; jkab410. The article is available at: <a href="https://hdl.handle.net/11250/3011196" target="blank">https://hdl.handle.net/11250/3011196</a>	en_US
dc.relation.haspart	Paper 2. Malik MA., Ludl AA., and Michoel T. (2022), High-dimensional multi-trait GWAS by reverse prediction of genotypes using machine learning methods. The article is available in the thesis. The article is also available at: <a href="https://doi.org/10.48550/arXiv.2111.00108" target="blank">https://doi.org/10.48550/arXiv.2111.00108</a>	en_US
dc.relation.haspart	Paper 3. Malik MA., Lundervold AS. and Michoel T. (2022), rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes. The article is available in the thesis. The article is also available at: <a href="https://doi.org/10.48550/arXiv.2204.00067" target="blank">https://doi.org/10.48550/arXiv.2204.00067</a>	en_US
dc.rights	Attribution (CC BY). This item's rights statement or license does not apply to the included articles in the thesis.
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.title	Machine learning approaches for high-dimensional genome-wide association studies	en_US
dc.type	Doctoral thesis	en_US
dc.date.updated	2022-08-10T01:34:48Z
dc.rights.holder	Copyright the Author.	en_US
dc.contributor.orcid	0000-0001-6613-2303
dc.description.degree	Doctoral thesis
fs.unitcode	12-12-0

Associated file(s)

Filename: archive.pdf
Size: 49.14Mb
Format: PDF
Description: PDF

This item appears in the following collection(s)

Department of Informatics [928]
