Modelling the structure, function and evolution of Polycomb/Trithorax Response Elements
Abstract
The correct development of animals and plants depends on carefully coordinated gene regulation. Polycomb/Trithorax Group (PcG/TrxG) proteins are conserved epigenetic regulators that are recruited to Polycomb/Trithorax Response Elements (PREs), a class of DNA cis-regulatory elements (CREs) originally discovered in the fruit fly. The structure and function of PREs have been progressively unravelled over the past three decades, with the identification of sequence motifs and the subsequent motif-based modelling and prediction of PREs, and with the genome-wide experimental mapping of PcG/TrxG binding. Whereas binding patterns vary between cells, computational prediction holds the potential to predict PREs comprehensively. In this thesis, we exploit the recent explosion of data to conduct new investigations into the structure, function and evolution of PREs, presenting two papers that report scientific investigations and two that present software tools.
Previous studies for computationally predicting fruit fly PREs have used a small training set and selections of known PRE motifs, leaving open the question of how training with genome-wide data might affect generalization. To address this, we trained PRE-predictors using genome-wide PcG binding sites, which we found improves generalization to independent PREs. We also trained models using different motif sets, where the addition of the GTGT motif further improved generalization. We were interested in how well a more advanced model would generalize, and we developed the Support Vector Machine Motif Occurrence Combinatorics Classification Algorithm (SVM-MOCCA), a hierarchical method that trains one Support Vector Machine (SVM) for each motif in a set and combines motif predictions. SVM-MOCCA significantly improved generalization to independent PREs. We predict large new sets of candidate PREs in the fruit fly genome that are enriched in experimental PcG/TrxG signals.
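To illustrate the kind of input a motif-based predictor works from, the following is a minimal sketch of motif occurrence counting, not the SVM-MOCCA implementation. It assumes motifs are exact strings matched on both strands (motifs in practice may be degenerate or weighted), and the function names are hypothetical. Only the GTGT motif is taken from the text above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq))


def count_occurrences(seq, motif):
    """Count overlapping exact matches of motif on the given strand."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


def motif_count_features(seq, motifs):
    """Hypothetical feature map: one both-strand occurrence count per motif.

    Assumption: exact string matching. Palindromic motifs are counted
    once per strand, i.e. twice per site.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    return [count_occurrences(seq, m) + count_occurrences(rc, m) for m in motifs]


features = motif_count_features("GTGTGT", ["GTGT"])
```

A vector of such counts, computed per sequence window, is one simple form of input that a classifier such as an SVM could be trained on.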
The low number of verified vertebrate PREs and a limited knowledge of relevant motifs have hampered the application of motif-based PRE predictors to vertebrate genomes. Methods such as k-spectrum SVMs can learn motifs from sequences, but the resulting models are high-dimensional and the specification of negative training sets is complicated. Previous computational studies of vertebrate PcG target prediction have focused exclusively either on predicting PcG target genes or on modelling genome-wide clusters of a small set of PcG markers. We developed a reinforcement learning regimen that exploits larger arsenals of genome-wide experimental data for the training of non-linear k-spectrum SVMs, yielding iteratively more precise models. We applied our methods to the fruit fly, mouse and human genomes. The final fruit fly model is competitive with models that incorporate prior motif knowledge. For all three species, we predict candidate PcG target sites genome-wide. We performed model analysis, which revealed a variety of motifs, subsets of which are conserved between models.
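As a rough illustration of the k-spectrum representation mentioned above, the sketch below counts every overlapping k-mer in a sequence and compares two sequences by the dot product of their count vectors. This is the plain linear spectrum kernel, not the non-linear variant or the training regimen described in the thesis, and the function names are hypothetical.

```python
from collections import Counter


def k_spectrum(seq, k):
    """Count every overlapping k-mer in seq (the k-spectrum feature map)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Linear k-spectrum kernel: dot product of the two k-mer count vectors."""
    spec_a, spec_b = k_spectrum(a, k), k_spectrum(b, k)
    # Counter returns 0 for k-mers absent from spec_b.
    return sum(count * spec_b[kmer] for kmer, count in spec_a.items())


similarity = spectrum_kernel("GTGTGT", "GTGT", 2)
```

Because the feature space has one dimension per possible k-mer (4^k for DNA), the dimensionality grows quickly with k, which is the high-dimensionality concern raised above.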
The success of SVM-MOCCA in predicting PREs prompted me to develop a polished and configurable implementation that can be useful for the broader community of CRE researchers: the Motif Occurrence Combinatorics Classification Algorithms (MOCCA) suite. MOCCA provides polished implementations of SVM-MOCCA and baseline methods, and also the ability to combine feature set formulations with machine learning methods. Additionally, MOCCA presents RF-MOCCA, a derivative of SVM-MOCCA using the method of Random Forests (RFs). For ease of use, MOCCA implements functionality for generating negative training data and performing genome-wide prediction. We applied our methods for modelling fruit fly PREs and boundary elements. Our MOCCA-based methods improved generalization to both classes of CREs compared with previous methods. MOCCA is open source and extensible.
A Python package that streamlines the specification and application of CRE sequence models has been lacking. I developed Gnocis, a feature-rich package for Python 3 that provides tools for data preparation and analysis, a flexible vocabulary for feature set and model specification, and functionality for model evaluation and genome-wide prediction. Gnocis integrates with Scikit-learn and TensorFlow for state-of-the-art machine learning. We demonstrated the use of Gnocis by modelling fruit fly PREs using a selection of methods, including a 5-spectrum mismatch kernel SVM and a Convolutional Neural Network. Gnocis is open source and extensible, and can be installed from the Python Package Index (PyPI).
Has parts
Paper I: Bredesen, B.A. and Rehmsmeier, M., 2019. DNA sequence models of genome-wide Drosophila melanogaster Polycomb binding sites improve generalization to independent Polycomb Response Elements. Nucleic Acids Research, 47(15):7781-7797. The article is available in the main thesis. The article is also available at: https://doi.org/10.1093/nar/gkz617
Paper II: Bredesen B. A., Rehmsmeier M. Biomarker reinforcement learning with k-spectra enables precise Polycomb target site prediction without prior motif knowledge. Full text not available in BORA.
Paper III: Bredesen B. A., Rehmsmeier M. MOCCA: A flexible suite for modelling DNA sequence motif occurrence combinatorics. Full text not available in BORA.
Paper IV: Bredesen B. A., Rehmsmeier M. Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3. Full text not available in BORA.