Throughput and robustness of bioinformatics pipelines for genome-scale data analysis
Abstract
The post-genomic era has been heavily influenced by the rapid development of high-throughput molecular-screening technologies, which has enabled genome-wide analysis approaches on an unprecedented scale. The constantly decreasing cost of producing experimental data has resulted in a data deluge, which has led to technical challenges in distributed bioinformatics infrastructure and computational biology methods. At the same time, the advances in deep sequencing allowed intensified interrogation of human genomes, leading to prominent discoveries linking our genetic makeup with numerous medical conditions. The fast and cost-effective sequencing technology is expected to soon become instrumental in personalized medicine. The transition of the methodology related to genome sequencing and high-throughput data analysis from the research domain to a clinical service is challenging in many aspects. One of them is providing medical personnel with accessible, robust, and accurate methods for analysis of sequencing data.
The computational protocols used for analysis of the sequencing data are complex, parameterized, and in continuous development, making results of data analysis sensitive to factors such as the software used and the parameter values selected. However, the influence of parameters on results of computational pipelines has not been systematically studied. To fill this gap, we investigated the robustness of a genetic variant discovery pipeline against changes of its parameter settings. Using two sensitivity screening methods, we evaluated parameter influence on the identified genetic variants, and found that the parameters have irregular effects and are inter-dependent. Only a fraction of the parameters were found to have a considerable impact on the results, suggesting that screening parameter sensitivity can lead to simpler pipeline configuration. Our results showed that, although a simple metric can be used to examine parameter influence, more informative results are obtained using a criterion related to the accuracy of pipeline results. Using the results of sensitivity screening, we have shown that the influential pipeline parameters can be adjusted to effectively increase the accuracy of variant discovery. Such information is invaluable for researchers tuning pipeline parameters, and can guide the search for optimal settings for computational pipelines in a clinical setting. Contrasting the two applied screening methods, we learned more about specific requirements of robustness analysis of computational methods, and were able to suggest a more tailored strategy for parameter screening. Our contributions demonstrate the importance and the benefits of systematic robustness analysis of bioinformatics pipelines, and indicate that more efforts are needed to advance research in this area.
Web services are commonly used to provide interoperable, programmatic access to bioinformatics resources, and consequently, they are natural building blocks of bioinformatics analysis workflows. However, in the light of the data deluge, their usability for data-intensive applications has been questioned. We investigated the applicability of standard Web services to high-throughput pipelines, and showed how throughput and performance of such pipelines can be improved. By developing two complementary approaches that take advantage of established and proven optimization mechanisms, we were able to enhance Web service communication in a non-intrusive manner. The first strategy increases throughput of Web service interfaces by a stream-like invocation pattern. This additionally allows for data-pipelining between consecutive steps of a workflow. The second approach facilitates peer-to-peer data transfer between Web services to increase the capacity of the workflow engine. We evaluated the impact of the enhancements on genome-scale pipelines, and showed that high-throughput data analysis using standard Web service pipelines is possible when the technology is used sensibly. However, considering the contemporary data volumes and their expected growth, methods capable of handling even larger data should be sought.
Systematic analysis of pipeline robustness requires intensive computations, which are particularly demanding for high-throughput pipelines. Providing more efficient methods of pipeline execution is fundamental for enabling such examinations on a large scale. Furthermore, the standardized interfaces of Web services facilitate automated executions, and are perfectly suited for coordinating large computational experiments. I speculate that, given wide adoption of Web service technology in bioinformatics pipelines, large-scale quality control studies, such as robustness analysis, could be automated and performed routinely on newly published computational methods. This work contributes to realizing such a vision, providing a technical basis for building the necessary infrastructure and suggesting a methodology for robustness analysis.
Has parts
Paper I: Paweł Sztromwasser, Pål Puntervoll, and Kjell Petersen. Data partitioning enables the use of standard SOAP Web Services in genome-scale workflows. Journal of Integrative Bioinformatics, 8(2):163, 2011. The article is available at: http://hdl.handle.net/1956/7904
Paper II: Sattanathan Subramanian, Paweł Sztromwasser, Pål Puntervoll, and Kjell Petersen. Direct data transfer between SOAP web services in Orchestration. In the International Conference on Information Integration and Web-based Applications & Services (iiWAS). ACM, 2012. The article is available at: http://hdl.handle.net/1956/7905
Paper III: Sattanathan Subramanian, Paweł Sztromwasser, Pål Puntervoll, and Kjell Petersen. Pipelined Data-flow Delegated Orchestration for Data-Intensive eScience Workflows. International Journal of Web Information Systems, 9(3):204-218, 2013. The article is not available in BORA due to publisher restrictions. The published version is available at: http://dx.doi.org/10.1108/ijwis-05-2013-0012
Paper IV: Paweł Sztromwasser, Kjell Petersen, and Inge Jonassen. Sensitivity screening reveals influential parameters of a variant calling pipeline. The article is not available in BORA.