Characteristics of Pyrosequencing Data – Analysis, Methods, and Tools
Abstract
The introduction of this thesis provides background knowledge on the 454 sequencing technology and a detailed review of the most relevant sequencing artifacts. Chapter 1 puts the 454 sequencing technology into a historical context. Chapter 2 gives an overview of where 454 sequencing is applied, focusing on the most common application areas. Chapter 3 provides a detailed description of how 454 sequencing works, from library preparation to sequencing, imaging and data output. Here, the distinction between the different detail levels of sequencing information is crucial since data aggregation involves information loss. Chapter 4 describes where errors and artifacts can arise, how they are manifested in the sequencing data, and what impact they can have on downstream analyses. Finally, Chapter 5 puts the contributions into their respective analytical contexts and discusses their relevance for the research community.
The first paper, published in Bioinformatics in September 2010 and presented at the European Conference on Computational Biology (ECCB) in Belgium the same year, comprises the exploration, modeling and simulation of 454 data. Under the title “Characteristics of 454 pyrosequencing data – enabling realistic simulation with Flowsim”, we present a detailed analysis of sequencing data and a simulation tool that facilitates the design of sequencing projects. The tool can be used to examine and quantify the impact of read length, coverage, sequencing errors and signal degradation on genome assembly. Furthermore, it enables the testing and benchmarking of known and novel algorithms, methods and tools in a number of application areas such as whole genome assembly, read alignment, read correction, single-nucleotide polymorphism (SNP) identification and metagenomics.
The second paper, “Systematic exploration of error sources in pyrosequencing flowgram data”, was published in Bioinformatics in July 2011 and presented at the Intelligent Systems for Molecular Biology (ISMB)/ECCB conference in Austria the same year. We added several features and modules to the existing simulation pipeline. These were based on the observation of several error sources, such as copying errors introduced through polymerase chain reaction (PCR), a method used in 454 sequencing for amplification of the templates. These errors appear as mutations and are virtually impossible to distinguish from true sequence variants.
Similar to the second paper, the third paper, “Filtering duplicate reads from 454 pyrosequencing data”, focuses on a single error type, namely artificially duplicated reads. Our JATAC tool enables removal of this artifact on the most detailed sequencing data level, outperforming existing tools. The paper was published in Bioinformatics in April 2013.
Has parts
Paper I: Characteristics of 454 pyrosequencing data – enabling realistic simulation with flowsim, Balzer S, Malde K, Lanzén A, Sharma A and Jonassen I. Published in Bioinformatics 2010, Volume 26, Issue 18, Pp. i420-i425. The article is available at: http://hdl.handle.net/1956/7456
Paper II: Systematic exploration of error sources in pyrosequencing flowgram data, Balzer S, Malde K and Jonassen I. Published in Bioinformatics 2011, Volume 27, Issue 13, Pp. i304-i309. The article is available at: http://hdl.handle.net/1956/7457
Paper III: Filtering duplicate reads from 454 pyrosequencing data, Balzer S, Malde K, Grohme MA and Jonassen I. Published in Bioinformatics 2013, Volume 29, Issue 7, Pp. 830-836. The article is available at: http://hdl.handle.net/1956/7458