# PMMdeNovo

### From Jstacs

by Ralf Eggeling, Teemu Roos, Petri MyllymĂ¤ki, and Ivo Grosse.

## Description

### Background

Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery.

### Results

To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice. Conclusions

### Conclusions

The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies.

## Paper

The paper **Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data** has been published in BMC Bioinformatics.

## Runnable JARs

The application consists of three independent tools. All tools have mandatory (no default values) and optional arguments. Default values can be used by assigning "def". Alternatively, a shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values.

### ModelTrainer

The tool ModelTrainer performs a de novo motif discovery on a set of putative non aligned sequences. It infers an inhomogenous PMM of arbitrary order, where order 0 corresponds to a PWM model. Run by calling

`java -jar ModelTrainer.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output`

where the arguments have the following semantics:

name | type | default | comment |

inputFile | String | -- | The location of a text file containing the input sequences. If the first character in the file is '>' the content is interpreted interpreted as fasta file. Otherwise it is interpreted as plain text, i.e., each line corresponding to one sequence. |

motifWidth | Integer | 20 | The width of the motif to be inferred. |

motifOrder | Integer | 2 | The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies. |

flankingOrder | Integer | 2 | The order of the homogenous Markov model, which is used for modeling the flanking sequences that do not belong to the motif. |

initSteps | Integer | 50 | The number of initial iterations steps that the algorithm is always run for each restart. |

addSteps | Integer | 10 | The number of additional iterations steps, i.e., the number of iterations that have to be performed after having obtained the last optimal model structure before termination is allowed. |

restarts | Integer | 10 | The number of restarts of the algorithm. |

output | String | model | The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the graphViz representation of the learned PCT structures. |

### BindingSitePrediction

The tool BindingSitePrediction predicts instances of binding sites in a positive data set based on a previously learned model. Run by calling

`java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output`

where the arguments have the following semantics:

name | type | default | comment |

modelFile | String | -- | The location of the .xml representation (output of ModelTrainer) of the learned model. |

dataPos | String | -- | The location of the positive data (fasta file or plain text) in which binding site locations are to be identified. |

dataNeg | String | -- | The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold. |

alpha | Integer | 1E-4 | Significance level on negative data. |

output | String | bindingSites.txt | Location of output file for writing the predicted binding sites. |

### Classification

The tool Classification performs first a motif discovery with subsequent fragment-based classification by using positive data that is assumed to contain an instance of the motif, and negative data that is assumed not to contain the motif. This tool can be used for performing a single step of a K-fold cross validation experiment. Run by calling

`java -jar Classification.jar filePosTrain fileNegTrain filePosTest fileNegTest motifWidth motifOrder flankingOrder initSteps addSteps restarts`

where the arguments have the following semantics:

name | type | default | comment |

filePosTrain | String | -- | The location of a text file containing the positive training sequences (fasta or plain text). |

fileNegTrain | String | -- | The location of a text file containing the negative training sequences (fasta or plain text). |

filePosTest | String | -- | The location of a text file containing the positive test sequences (fasta or plain text). |

fileNegTest | String | -- | The location of a text file containing the negative test sequences (fasta or plain text). |

motifWidth | Integer | 20 | The width of the motif to be inferred. |

motifOrder | Integer | 2 | The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies. |

flankingOrder | Integer | 2 | The order of the homogenous Markov model, which is used for modeling the flanking sequences that do not belong to the motif. |

initSteps | Integer | 50 | The number of initial iterations steps that the algorithm is always run for each restart. |

addSteps | Integer | 10 | The number of additional iterations steps, i.e., the number of iterations that have to be performed after having obtained the last optimal model structure before termination is allowed. |

restarts | Integer | 10 | The number of restarts of the algorithm. |

The tool returns (i) the model complexity, i.e., the number of leaves of all parsimonious context trees of the learned motif model, and (ii) performance of the classifier measured by the area under the ROC curve.

## Data

The exemplary data sets contain extracted ChIP seq sequences of 50 different human transcription factors from the ENCODE project, as well as corresponding negative data. All data sets are split into 10 different subsets for enabling a reproducible 10-fold cross validation.

## Source code

Building the source code requires Jstacs 2.1.