Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data

Zhang, Xiaokang

Doctoral thesis

URI

https://hdl.handle.net/1956/24159

Date

2020-10-30

Collections

Department of Informatics

Abstract

My PhD is affiliated with the dCod 1.0 project (https://www.uib.no/en/dcod): decoding the systems toxicology of Atlantic cod (Gadus morhua), which aims to better understand how cod adapt and react to stressors in the environment. One of its research topics is the discovery of biomarkers that discriminate fish in a normal biological state from fish that have been exposed to toxicants.

A biomarker, or biological marker, is an indicator of a biological state in response to an intervention, for example toxic exposure (in toxicology), disease (such as cancer), or drug response (in precision medicine). Biomarker discovery is an important research topic in toxicology, cancer research, and related fields. A good set of biomarkers can give insight into disease or toxicant response mechanisms and can help determine whether a person has a disease or a fish has been exposed to a toxicant.

On the molecular level, a biomarker can be a genotype, for instance a single nucleotide variant linked with a particular disease or susceptibility; it can also be the expression level of a gene or a set of genes. This thesis focuses on the latter, aiming to find the informative genes in gene expression profiles that help distinguish samples from different groups. Several transcriptomics technologies can generate the necessary data; among them, DNA microarrays and RNA sequencing (RNA-Seq) have become the most useful methods for whole-transcriptome gene expression profiling. RNA-Seq in particular has become an attractive alternative to microarrays since its introduction.

Before gene expression can be analyzed, RNA-Seq data must go through a series of processing steps, so a workflow that automates the process is highly desirable. Although many workflows have been proposed, their application is usually limited, for example to model organisms, high-performance computers, or users fluent in computing. To fill these gaps, we developed a general-purpose RNA-Seq analysis workflow, RNA-Seq Analysis Snakemake Workflow (RASflow), which is applicable to a wide range of applications and requires little programming skill. It takes sequencing data as input and maps the reads to either a transcriptome or a genome for quantification. The resulting gene expression profile then goes through normalization and statistical testing to identify differentially expressed genes. This work was presented in Paper I and Paper II.
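The last two steps of that pipeline, normalization followed by a per-gene statistical test, can be illustrated with a minimal Python sketch. It is not RASflow's implementation: counts-per-million scaling and Welch's t statistic are stand-ins chosen for brevity, and the function names, gene names, and counts are all hypothetical.

```python
import math
from statistics import mean, variance


def counts_per_million(sample_counts):
    """Scale one sample's raw counts so that they sum to one million."""
    total = sum(sample_counts.values())
    return {gene: 1e6 * count / total for gene, count in sample_counts.items()}


def welch_t(a, b):
    """Welch's t statistic for two groups of values (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def differential_expression(control, treated):
    """Return {gene: (log2 fold change, Welch t)}, treated relative to control."""
    ctrl = [counts_per_million(s) for s in control]
    trt = [counts_per_million(s) for s in treated]
    results = {}
    for gene in ctrl[0]:
        # log2(CPM + 1) avoids log(0) and compresses the dynamic range
        a = [math.log2(s[gene] + 1) for s in ctrl]
        b = [math.log2(s[gene] + 1) for s in trt]
        results[gene] = (mean(b) - mean(a), welch_t(b, a))
    return results


# Hypothetical raw counts: three control and three treated samples
control = [
    {"geneA": 100, "geneB": 500, "geneC": 400},
    {"geneA": 110, "geneB": 480, "geneC": 410},
    {"geneA": 90, "geneB": 520, "geneC": 390},
]
treated = [
    {"geneA": 400, "geneB": 510, "geneC": 390},
    {"geneA": 420, "geneB": 490, "geneC": 405},
    {"geneA": 380, "geneB": 505, "geneC": 395},
]

results = differential_expression(control, treated)
for gene, (log2_fc, t_stat) in sorted(results.items()):
    print(f"{gene}: log2FC={log2_fc:.2f}, t={t_stat:.2f}")
```

Because counts per million are relative, the increase in geneA also lowers the normalized values of geneB and geneC in this toy data. A real analysis would also convert the test statistics to p-values and correct them for multiple testing.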

Differential expression analysis, as used in RASflow, and other univariate methods are widely used in biomarker discovery for their simplicity and interpretability. However, they rely on the assumption that variables are independent, so they can only identify variables that carry significant information by themselves. Biological processes usually involve many variables with complex interactions. Multivariate methods, which take the interactions between variables into account, are therefore also popular for biomarker discovery. To study whether either category has a significant advantage over the other, we conducted a comparative study of various methods from both categories and evaluated them on two aspects: stability and prediction accuracy. We found that a method's performance is quite data-dependent. This work was presented in Paper III.
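One common way to quantify stability is the average pairwise Jaccard similarity between the feature sets selected on different resamples of the same data. The Python sketch below assumes that measure for illustration; the abstract does not say which stability measure the thesis uses, and the gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def selection_stability(selections):
    """Average pairwise Jaccard similarity over all pairs of selected sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two selections to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical top-3 gene lists from three resampled runs of one method
runs = [
    ["g1", "g2", "g3"],
    ["g1", "g2", "g4"],
    ["g1", "g2", "g3"],
]
stability = selection_stability(runs)
print(f"stability = {stability:.3f}")
```

A value of 1 means every run selected the same genes, and a value near 0 means the selections barely overlap.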

Since biomarker discovery methods perform quite differently on different datasets, how should one choose the most appropriate method for a particular dataset? One solution is the function perturbation strategy, which combines the outputs of multiple methods. Function perturbation maintains prediction accuracy compared with the individual methods, but its stability is not satisfactory. Data perturbation follows a similar ensemble learning logic: it first generates multiple datasets by resampling the original dataset and then combines the results from those datasets. Data perturbation has been shown to improve the stability of biomarker discovery methods. We therefore proposed a framework that combines function perturbation with data perturbation, Ensemble Feature Selection Integrating Stability (EFSIS), which achieves both high prediction accuracy and high stability. This work was presented in Paper IV.
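The general idea of combining the two perturbation strategies can be sketched in Python as follows. This is not the EFSIS algorithm itself, whose details the abstract does not give. It is a minimal illustration that assumes binary labels, a stratified bootstrap for data perturbation, two simple scoring functions for function perturbation, and mean-rank aggregation. All names and data are hypothetical.

```python
import random
from statistics import mean, pstdev


def class_values(X, y, j):
    """Split the values of feature j by class label (1 versus 0)."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return pos, neg


def mean_difference(X, y, j):
    """Score feature j by the absolute difference of its class means."""
    pos, neg = class_values(X, y, j)
    return abs(mean(pos) - mean(neg))


def signal_to_noise(X, y, j):
    """Score feature j by class-mean difference over summed class spreads."""
    pos, neg = class_values(X, y, j)
    return abs(mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg) + 1e-9)


def ranks(scores):
    """Rank features from 1 (highest score) downwards."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for r, j in enumerate(order, start=1):
        rank[j] = r
    return rank


def ensemble_select(X, y, scorers, n_resamples=20, k=2, seed=0):
    """Return the k features with the best mean rank over all runs."""
    rng = random.Random(seed)
    n_features = len(X[0])
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    all_ranks = []
    for _ in range(n_resamples):
        # Data perturbation: bootstrap within each class so both stay present
        idx = [i for members in by_class.values()
               for i in rng.choices(members, k=len(members))]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # Function perturbation: apply every scoring method to the resample
        for scorer in scorers:
            all_ranks.append(ranks([scorer(Xb, yb, j) for j in range(n_features)]))
    mean_rank = [mean(r[j] for r in all_ranks) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: mean_rank[j])[:k]


# Hypothetical data: features 0 and 1 separate the classes, 2 and 3 are noise
X = [
    [1.0, 5.0, 3.1, 2.0], [1.2, 5.3, 2.9, 2.2],
    [0.9, 4.8, 3.0, 1.9], [1.1, 5.1, 3.2, 2.1],
    [6.0, 11.0, 3.0, 2.1], [6.2, 11.2, 3.1, 2.0],
    [5.9, 10.9, 2.9, 2.2], [6.1, 11.1, 3.2, 1.9],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected = ensemble_select(X, y, [mean_difference, signal_to_noise])
print(sorted(selected))
```

Averaging ranks rather than raw scores keeps scoring functions with different scales comparable. Other aggregation rules, such as selection frequency, would fit the same structure.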

Has parts

Paper I: Yadetie, F., Zhang, X., Hanna, E. M., Aranguren-Abadía, L., Eide, M., Blaser, N., Brun, M., Jonassen, I., Goksøyr, A., & Karlsen, O. A. (2018). RNA-Seq analysis of transcriptome responses in Atlantic cod (Gadus morhua) precision-cut liver slices exposed to benzo[a]pyrene and 17α-ethynylestradiol. Aquatic Toxicology, 201, 174-186. The article is available in the main thesis. The article is also available at: https://doi.org/10.1016/j.aquatox.2018.06.003

Paper II: Zhang, X., & Jonassen, I. (2020). RASflow: an RNA-Seq analysis workflow with Snakemake. BMC Bioinformatics, 21(1), 1-9. The article is available in the main thesis. The article is also available at: https://doi.org/10.1186/s12859-020-3433-x

Paper III: Zhang, X., & Jonassen, I. (2019). A Comparative Analysis of Feature Selection Methods for Biomarker Discovery in Study of Toxicant-Treated Atlantic Cod (Gadus morhua) Liver. In Symposium of the Norwegian AI Society, Communications in Computer and Information Science (pp. 114-123). Springer, Cham. An accepted version of the article is available at: http://hdl.handle.net/1956/21642

Paper IV: Zhang, X., & Jonassen, I. (2019). An Ensemble Feature Selection Framework Integrating Stability. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 2792-2798). IEEE. An accepted version of the article is available at: http://hdl.handle.net/1956/22457

Publisher

The University of Bergen

Copyright

Attribution-NonCommercial (CC BY-NC). This item's Creative Commons-license does not apply to the included articles in the thesis.