Number of components and prediction error in partial least squares regression determined by Monte Carlo resampling strategies
Peer reviewed, Journal article
- Department of Chemistry 
Original version: Kvalheim OM, Grung B, Rajalahti T. Number of components and prediction error in partial least squares regression determined by Monte Carlo resampling strategies. Chemometrics and Intelligent Laboratory Systems. 2019;188:79-86. https://doi.org/10.1016/j.chemolab.2019.03.006
Using a metabolomics data set with 1057 serum samples, we designed and assessed different procedures based on Monte Carlo resampling schemes to determine the optimal number of components to include in partial least squares (PLS) regression models. Corresponding estimates of prediction error were calculated and compared in a single algorithm comprising (i) a single-loop Monte Carlo approach that repeatedly and randomly splits the samples into calibration and validation sets, (ii) a double-loop validation that splits the samples into calibration/validation and prediction sets, and (iii) independent sample sets in a third loop. To mimic the common situation in which only a moderate number of samples is available for building the model, only a fraction of the 1057 samples was randomly selected from the total sample set and used in the algorithm. The results show that if the samples available for modelling are representative of the future samples to be predicted from the model, the single-loop Monte Carlo procedure consistently provides the same estimates of prediction error as the double-loop resampling procedures, and in 75% of the cases these estimates are the same as for independent prediction sets. This has important implications for optimal use of a training set for component selection and estimation of prediction error. Two methods were developed and compared for selecting the optimal number of PLS components, defined as the number beyond which including additional components yields no statistically significant improvement in prediction error. Both methods determine a probability measure and provide similar results for model selection in this application.
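The single-loop Monte Carlo idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify their two selection methods, so the stopping rule below (add a component only while it lowers the validation error in at least `min_win_frac` of the random splits) is an assumed stand-in for "no statistically significant improvement". The function names, the 50/50 split, the threshold of 0.9, and the plain NIPALS PLS1 implementation are likewise illustrative assumptions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pls1_fit(X, y, max_comp):
    """Fit single-response PLS (PLS1) by NIPALS on mean-centred data."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                                # y residuals
    comps = []
    for _ in range(max_comp):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(dot(w, w)) or 1.0   # guard against an all-zero weight vector
        w = [v / norm for v in w]
        t = [dot(row, w) for row in E]       # scores
        tt = dot(t, t) or 1.0
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = dot(f, t) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps


def pls1_predict(model, x):
    """Return the predictions of the 1..A component models for one sample."""
    x_mean, y_mean, comps = model
    e = [v - mu for v, mu in zip(x, x_mean)]
    y_hat, preds = y_mean, []
    for w, p, q in comps:
        t = dot(e, w)
        y_hat += q * t
        e = [ei - t * pj for ei, pj in zip(e, p)]
        preds.append(y_hat)
    return preds


def monte_carlo_rmsep(X, y, max_comp, n_rep=100, val_frac=0.5, seed=0):
    """Single loop: repeated random calibration/validation splits.

    Returns errors[r][a - 1], the validation RMSEP of the a-component
    model in repetition r.
    """
    rng = random.Random(seed)
    n = len(X)
    n_val = max(1, int(round(val_frac * n)))
    errors = []
    for _ in range(n_rep):
        idx = list(range(n))
        rng.shuffle(idx)
        val, cal = idx[:n_val], idx[n_val:]
        model = pls1_fit([X[i] for i in cal], [y[i] for i in cal], max_comp)
        sq = [0.0] * max_comp
        for i in val:
            for a, pred in enumerate(pls1_predict(model, X[i])):
                sq[a] += (y[i] - pred) ** 2
        errors.append([math.sqrt(s / n_val) for s in sq])
    return errors


def select_components(errors, min_win_frac=0.9):
    """Assumed rule: stop at A when A+1 components give a lower RMSEP than A
    components in fewer than min_win_frac of the repetitions."""
    max_comp = len(errors[0])
    for a in range(max_comp - 1):
        wins = sum(1 for e in errors if e[a + 1] < e[a])
        if wins / len(errors) < min_win_frac:
            return a + 1
    return max_comp
```

Calling `monte_carlo_rmsep(X, y, max_comp)` on a list of sample rows `X` and responses `y`, then passing the result to `select_components`, gives a component count together with the per-split error distributions from which a prediction error estimate (for example the median RMSEP at the selected count) can be read. A double-loop variant would wrap this in an outer loop that first sets aside a prediction set and runs the selection only on the remaining samples.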