MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics

Bredesen, Bjørn André; Rehmsmeier, Marc

dc.contributor.author	Bredesen, Bjørn André
dc.contributor.author	Rehmsmeier, Marc
dc.date.accessioned	2022-02-18T07:55:31Z
dc.date.available	2022-02-18T07:55:31Z
dc.date.created	2021-06-15T17:54:05Z
dc.date.issued	2021
dc.identifier.issn	1471-2105
dc.identifier.uri	https://hdl.handle.net/11250/2979875
dc.description.abstract	Background: Cis-regulatory elements (CREs) are DNA sequence segments that regulate gene expression. Among CREs are promoters, enhancers, Boundary Elements (BEs) and Polycomb Response Elements (PREs), all of which are enriched in specific sequence motifs that form particular occurrence landscapes. We have recently introduced a hierarchical machine learning approach (SVM-MOCCA) in which Support Vector Machines (SVMs) are applied at the level of individual motif occurrences, modelling local sequence composition, and then combined for the prediction of whole regulatory elements. We used SVM-MOCCA to predict PREs in Drosophila and found that it was superior to other methods. However, we did not publish a polished implementation of SVM-MOCCA, which could be useful for other researchers, and we only tested SVM-MOCCA with IUPAC motifs and PREs. Results: We here present an expanded suite for modelling CRE sequences in terms of motif occurrence combinatorics: Motif Occurrence Combinatorics Classification Algorithms (MOCCA). MOCCA contains efficient implementations of several modelling methods, including SVM-MOCCA, and a new method, RF-MOCCA, a Random Forest derivative of SVM-MOCCA. We used SVM-MOCCA and RF-MOCCA to model Drosophila PREs and BEs in cross-validation experiments, making this the first study to model PREs with Random Forests and the first study that applies the hierarchical MOCCA approach to the prediction of BEs. Both models significantly improve generalization to PREs and BEs beyond that of previous methods (including 4-spectrum and motif occurrence frequency Support Vector Machines and Random Forests), with RF-MOCCA yielding the best results. Conclusion: MOCCA is a flexible and powerful suite of tools for the motif-based modelling of CRE sequences in terms of motif composition. MOCCA can be applied to any new CRE modelling problem where motifs have been identified. MOCCA supports IUPAC and Position Weight Matrix (PWM) motifs. For ease of use, MOCCA implements generation of negative training data, and additionally a mode that requires only that the user specify positives, motifs and a genome. MOCCA is licensed under the MIT license and is available on GitHub at https://github.com/bjornbredesen/MOCCA.	en_US
dc.language.iso	eng	en_US
dc.publisher	BMC	en_US
dc.rights	Attribution 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/deed.no	*
dc.title	MOCCA: a flexible suite for modelling DNA sequence motif occurrence combinatorics	en_US
dc.type	Journal article	en_US
dc.type	Peer reviewed	en_US
dc.description.version	publishedVersion	en_US
dc.rights.holder	Copyright The Author(s) 2021	en_US
dc.source.articlenumber	234	en_US
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	2
dc.identifier.doi	10.1186/s12859-021-04143-2
dc.identifier.cristin	1915975
dc.source.journal	BMC Bioinformatics	en_US
dc.identifier.citation	BMC Bioinformatics. 2021, 22, 234.	en_US
dc.source.volume	22	en_US
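
Note on the abstract: the "motif occurrence frequency" features it mentions can be illustrated with a minimal sketch. This is not MOCCA's implementation or interface; the function names, the motif label and the forward-strand-only scan are assumptions made for illustration. Only the IUPAC nucleotide code table is standard.

```python
# Illustrative sketch only: NOT taken from MOCCA. Function names are hypothetical.
# Standard IUPAC nucleotide codes mapped to the bases each code matches.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_occurrences(motif, sequence):
    """Return 0-based start positions where an IUPAC motif matches.

    Assumption: only the forward strand is scanned, and overlapping
    matches are all reported.
    """
    motif = motif.upper()
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        if all(base in IUPAC[code] for code, base in zip(motif, window)):
            hits.append(start)
    return hits


def occurrence_frequencies(motifs, sequence):
    """Return occurrences per kilobase for each named motif.

    The resulting dict is a simple feature vector of the kind a
    classifier (for example an SVM or Random Forest) could be trained on.
    """
    if not sequence:
        return {name: 0.0 for name in motifs}
    kilobases = len(sequence) / 1000.0
    return {
        name: len(find_occurrences(motif, sequence)) / kilobases
        for name, motif in motifs.items()
    }
```

For instance, `find_occurrences("GAGAG", "TTGAGAGAGTT")` reports the two overlapping matches at positions 2 and 4. The hierarchical approach described in the abstract goes further by modelling the sequence around each individual occurrence, which this sketch does not attempt.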

Associated file(s)

File name: s12859-021-04143-2.pdf
Size: 1.741 MB
Format: PDF
Description: PDF

This item appears in the following collection(s)

Department of Informatics [928]
Registrations from Cristin [9791]

Except where otherwise noted, this item is licensed under Attribution 4.0 International