Variable selection optimization for multivariate classification of metabolomics data
Master thesis
Permanent link
https://hdl.handle.net/1956/7080
Publication date
2013-05-21
Collections
- Department of Chemistry [449]
Abstract
Variable selection is an important step in multivariate calibration in which the number of variables in the independent variable matrix is reduced by eliminating those that are not related to the response. Many methods based on different criteria have been developed for this purpose. These include competitive adaptive reweighted sampling (CARS), subwindow permutation analysis (SPA) and random forest (RF), which can be implemented prior to the construction of both regression and classification models. When applied to metabolomics datasets, variable selection can aid in the discovery of potential biomarkers for a particular disorder. In this study, the mechanisms of the three abovementioned methods, as described in the literature, have been investigated and compared. Their performance when applied to three different metabolomics datasets for multivariate classification was also studied. Although the most favorable method varied from dataset to dataset, model prediction performance was found to improve when variable selection was carried out by means of any of the methods. However, because the parameters of the methods were left at their default settings for this comparison, optimizing them is recommended to obtain a more appropriate comparison. In an attempt to optimize the variable selection stage for the creation of classification models for the three metabolomics datasets of interest, the original CARS algorithm was modified to simultaneously optimize three different parameters. Although promising results were obtained with this modification, a discrepancy was detected in the validation process embedded in the algorithm. A new variable selection method based on the separate optimization of the identity and the number of informative variables was developed. However, its implementation did not increase model prediction performance compared with the original or modified CARS, or with using all variables in the original dataset.
Some of the aspects identified as possible pathways to improve the method's performance were tested, only to be discarded. Further study regarding other untested pathways is needed for the improvement of this method.
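
To make the general idea of variable selection concrete, the following is a minimal sketch of reducing an independent variable matrix before two-class classification. It is not CARS, SPA or RF, and it is not the method developed in the thesis: it is a plain univariate filter swapped in for illustration, and the function names (`rank_variables`, `select_top`), the separation score and the toy data are all assumptions made for this example.

```python
from statistics import mean, pstdev

def rank_variables(X, y):
    """Rank columns of X by a simple two-class separation score (hypothetical helper).

    Score = |mean(class 0) - mean(class 1)| / (std(class 0) + std(class 1)).
    """
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = pstdev(a) + pstdev(b)
        # Guard against constant columns, which would give zero spread.
        scores.append(abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    # Column indices ordered from most to least discriminating.
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)

def select_top(X, ranking, k):
    """Keep only the k best-ranked columns of X (hypothetical helper)."""
    keep = sorted(ranking[:k])
    return [[row[j] for j in keep] for row in X], keep

# Toy data (invented): column 0 separates the classes, columns 1 and 2 do not.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 4.0, 0.9],
    [0.9, 6.0, 0.5],
    [3.1, 5.5, 0.4],
    [2.9, 4.5, 0.8],
    [3.0, 5.0, 0.6],
]
y = [0, 0, 0, 1, 1, 1]

ranking = rank_variables(X, y)
X_reduced, kept = select_top(X, ranking, k=1)
print("ranking:", ranking)
print("kept columns:", kept)
```

In this sketch both the identity of the retained variables (the ranking) and their number (`k`) are fixed by hand; the abstract's point is that such choices, like the parameter settings of CARS, SPA and RF, would need to be optimized and validated rather than left at defaults.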