__NOTOC__
by Ralf Eggeling, Teemu Roos, Petri Myllymäki, and Ivo Grosse.

== Paper ==
The paper '''Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data''' has been published in [http://www.biomedcentral.com/1471-2105/16/375 BMC Bioinformatics].

'''Note:''' The software on this site is mainly intended to enable reproducibility of the results from the publication.
For other purposes, please consider using the more recent [http://jstacs.de/index.php/InMoDe InMoDe] software, which contains the methodology of PMMdeNovo as well as more advanced features, speedups, better user interfaces, and automatic visualization.

== Description ==

=== Background ===
Statistical modeling of transcription factor binding sites is one of the classical fields in bioinformatics. The position weight matrix (PWM) model, which assumes statistical independence among all nucleotides in a binding site, used to be the standard model for this task for more than three decades but its simple assumptions are increasingly put into question. Recent high-throughput sequencing methods have provided data sets of sufficient size and quality for studying the benefits of more complex models. However, learning more complex models typically entails the danger of overfitting, and while model classes that dynamically adapt the model complexity to data have been developed, effective model selection is to date only possible for fully observable data, but not, e.g., within de novo motif discovery.
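The independence assumption of the PWM model means that the probability of a site is a product of per-position nucleotide probabilities. The following is a minimal sketch of that computation, not code from PMMdeNovo or Jstacs; the class name <code>PwmSketch</code>, the column order A, C, G, T, and the example matrix are assumptions made for illustration.

```java
/** Sketch: log-probability of a candidate site under a PWM (an order-0 inhomogeneous model). */
final class PwmSketch {
    // Column order of the matrix; an assumption of this sketch.
    private static final String ALPHABET = "ACGT";

    /**
     * @param pwm  pwm[i][a] is the probability of nucleotide a at motif position i
     * @param site string over A, C, G, T whose length equals the motif width
     * @return sum over positions of log pwm[i][site[i]]
     */
    static double logProb(double[][] pwm, String site) {
        if (site.length() != pwm.length) {
            throw new IllegalArgumentException("site length must equal motif width");
        }
        double sum = 0.0;
        for (int i = 0; i < pwm.length; i++) {
            int a = ALPHABET.indexOf(site.charAt(i));
            if (a < 0) {
                throw new IllegalArgumentException("unexpected symbol: " + site.charAt(i));
            }
            // Independence: each position contributes its own factor, regardless of its neighbors.
            sum += Math.log(pwm[i][a]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical motif of width 3.
        double[][] pwm = {
            {0.7, 0.1, 0.1, 0.1},
            {0.1, 0.1, 0.7, 0.1},
            {0.25, 0.25, 0.25, 0.25}
        };
        System.out.println(logProb(pwm, "AGT"));
    }
}
```

A model of higher order would instead condition each factor on preceding nucleotides within the motif, which is what allows it to capture intra-motif dependencies.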
=== Results ===
To address this issue, we propose a stochastic algorithm for performing robust model selection in a latent variable setting. This algorithm yields a solution without relying on hyperparameter-tuning via massive cross-validation or other computationally expensive resampling techniques. Using this algorithm for learning inhomogeneous parsimonious Markov models, we study the degree of putative higher-order intra-motif dependencies for transcription factor binding sites inferred via de novo motif discovery from ChIP-seq data. We find that intra-motif dependencies are prevalent and not limited to first-order dependencies among directly adjacent nucleotides, but that second-order models appear to be the significantly better choice.
=== Conclusions ===
The traditional PWM model appears to be indeed insufficient to infer realistic sequence motifs, as it is on average outperformed by more complex models that take into account intra-motif dependencies. Moreover, using such models together with an appropriate model selection procedure does not lead to a significant performance loss in comparison with the PWM model for any of the studied transcription factors. Hence, we find it worthwhile to recommend that any modern motif discovery algorithm should attempt to take into account intra-motif dependencies.

== Runnable JARs ==
The application consists of three independent tools. Each tool has mandatory arguments (no default values) and optional arguments.
To use the default value of an optional argument, pass "def" in its place. Alternatively, provide a shorter list of arguments, in which case all missing arguments take their default values.
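The two ways of requesting defaults described above can be sketched as follows. This is only an illustration of the described behavior, not the parsing code of the tools; the class <code>ArgDefaults</code>, the method <code>resolve</code>, and the choice to reject "def" for mandatory arguments are assumptions.

```java
/** Sketch: resolving positional arguments against defaults, with "def" as a placeholder. */
final class ArgDefaults {
    /**
     * @param given    arguments as typed on the command line (possibly fewer than defaults)
     * @param defaults default per position; null marks a mandatory argument
     * @return the effective value for every position
     */
    static String[] resolve(String[] given, String[] defaults) {
        if (given.length > defaults.length) {
            throw new IllegalArgumentException("too many arguments");
        }
        String[] effective = new String[defaults.length];
        for (int i = 0; i < defaults.length; i++) {
            // A position falls back to its default if it is missing or given as "def".
            boolean useDefault = i >= given.length || given[i].equals("def");
            if (useDefault && defaults[i] == null) {
                throw new IllegalArgumentException("argument " + i + " is mandatory");
            }
            effective[i] = useDefault ? defaults[i] : given[i];
        }
        return effective;
    }

    public static void main(String[] args) {
        // Defaults taken from the ModelTrainer table; "seqs.fa" is a hypothetical input file.
        String[] defaults = {null, "20", "2", "2", "50", "10", "10", "model"};
        String[] given = {"seqs.fa", "def", "1"};
        System.out.println(String.join(" ", resolve(given, defaults)));
    }
}
```

In this sketch, the shortened call keeps the default motif width, sets the motif order to 1, and leaves all remaining arguments at their defaults.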

=== ModelTrainer ===
The tool [https://www.cs.helsinki.fi/u/eggeling/PMMdenovo/ModelTrainer.jar ModelTrainer] performs de novo motif discovery on a set of putative non-aligned sequences. It infers an inhomogeneous PMM of arbitrary order, where order 0 corresponds to a PWM model.
Run by calling

<code>java -jar ModelTrainer.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>inputFile</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input sequences. If the first character in the file is '>', the content is interpreted as a fasta file. Otherwise, it is interpreted as plain text, i.e., each line corresponds to one sequence.</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps for which the algorithm is always run in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after having obtained the last optimal model structure before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>model</td>
<td>The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the Graphviz representation of the learned PCT structures.</td>
</tr>
</table>
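The input convention for <code>inputFile</code> (fasta if the first character is '>', otherwise one sequence per line) can be sketched as below. This is an illustration under assumptions, not the reader used by ModelTrainer; the class <code>SequenceReader</code> and the handling of blank lines and multi-line fasta records are choices of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: interpret file content as fasta or as plain text, based on the first character. */
final class SequenceReader {
    static List<String> parse(String content) {
        List<String> sequences = new ArrayList<>();
        String[] lines = content.split("\\R"); // split on any line break
        if (content.startsWith(">")) {
            // fasta: a header line starts a new record; following lines are concatenated.
            StringBuilder current = null;
            for (String line : lines) {
                if (line.startsWith(">")) {
                    if (current != null) {
                        sequences.add(current.toString());
                    }
                    current = new StringBuilder();
                } else {
                    current.append(line.trim());
                }
            }
            if (current != null) {
                sequences.add(current.toString());
            }
        } else {
            // Plain text: each non-empty line is one sequence.
            for (String line : lines) {
                if (!line.trim().isEmpty()) {
                    sequences.add(line.trim());
                }
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        System.out.println(parse(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"));
        System.out.println(parse("ACGTTTGA\nGGCC\n"));
    }
}
```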

=== BindingSitePrediction ===
The tool [https://www.cs.helsinki.fi/u/eggeling/PMMdenovo/BindingSitePrediction.jar BindingSitePrediction] predicts instances of binding sites in a positive data set based on a previously learned model.
Run by calling

<code>java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output</code>

where the arguments have the following semantics:

<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>modelFile</td>
<td>String</td>
<td>--</td>
<td>The location of the .xml representation (output of ModelTrainer) of the learned model.</td>
</tr>
<tr>
<td>dataPos</td>
<td>String</td>
<td>--</td>
<td>The location of the positive data (fasta file or plain text) in which binding site locations are to be identified.</td>
</tr>
<tr>
<td>dataNeg</td>
<td>String</td>
<td>--</td>
<td>The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold.</td>
</tr>
<tr>
<td>alpha</td>
<td>Double</td>
<td>1E-4</td>
<td>Significance level on negative data.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>bindingSites.txt</td>
<td>Location of output file for writing the predicted binding sites.</td>
</tr>
</table>
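The page does not specify how the prediction threshold is derived from the negative data. One common reading of a "significance level on negative data" is an empirical quantile: choose the threshold so that at most a fraction alpha of the negative scores exceed it. The sketch below follows that reading only; the class <code>ThresholdSketch</code> and the scoring of negatives as a plain array are assumptions.

```java
import java.util.Arrays;

/** Sketch: empirical score threshold at significance level alpha on negative data. */
final class ThresholdSketch {
    /**
     * Returns a threshold t such that at most floor(alpha * n) of the n negative
     * scores are strictly greater than t. A site would be predicted if its score exceeds t.
     */
    static double threshold(double[] negativeScores, double alpha) {
        if (negativeScores.length == 0 || alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("need scores and alpha in [0, 1]");
        }
        double[] sorted = negativeScores.clone();
        Arrays.sort(sorted); // ascending
        int allowed = (int) Math.floor(alpha * sorted.length); // false positives tolerated
        if (allowed >= sorted.length) {
            return Double.NEGATIVE_INFINITY; // every negative may exceed the threshold
        }
        return sorted[sorted.length - 1 - allowed];
    }

    public static void main(String[] args) {
        double[] scores = {5, 1, 9, 3, 7, 2, 10, 4, 8, 6}; // hypothetical negative scores
        System.out.println(threshold(scores, 0.2));
        System.out.println(threshold(scores, 1E-4));
    }
}
```

With a small alpha such as the default 1E-4, the threshold only moves below the maximum negative score once the negative data contains at least 1/alpha scored positions.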

=== Classification ===
The tool [https://www.cs.helsinki.fi/u/eggeling/PMMdenovo/Classification.jar Classification] first performs motif discovery, followed by fragment-based classification, using positive data that is assumed to contain an instance of the motif and negative data that is assumed not to contain the motif. This tool can be used for performing a single step of a K-fold cross-validation experiment.
Run by calling

<code>java -jar Classification.jar filePosTrain fileNegTrain filePosTest fileNegTest motifWidth motifOrder flankingOrder initSteps addSteps restarts</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>filePosTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>filePosTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps for which the algorithm is always run in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after having obtained the last optimal model structure before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
</table>
The tool returns (i) the model complexity, i.e., the total number of leaves of all parsimonious context trees of the learned motif model, and (ii) the performance of the classifier, measured by the area under the ROC curve.
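The area under the ROC curve equals the probability that a randomly chosen positive test sequence receives a higher score than a randomly chosen negative one, with ties counted as one half. The following sketch computes it directly from that definition; it illustrates the measure and is not the implementation used by the Classification tool. The class <code>AucSketch</code> and the example scores are hypothetical.

```java
/** Sketch: area under the ROC curve via pairwise comparison of positive and negative scores. */
final class AucSketch {
    static double auc(double[] positiveScores, double[] negativeScores) {
        if (positiveScores.length == 0 || negativeScores.length == 0) {
            throw new IllegalArgumentException("both classes need at least one score");
        }
        double wins = 0.0;
        for (double p : positiveScores) {
            for (double n : negativeScores) {
                if (p > n) {
                    wins += 1.0;   // positive ranked above negative
                } else if (p == n) {
                    wins += 0.5;   // tie counts as half
                }
            }
        }
        return wins / ((double) positiveScores.length * negativeScores.length);
    }

    public static void main(String[] args) {
        double[] pos = {0.9, 0.8, 0.4}; // hypothetical classifier scores
        double[] neg = {0.5, 0.3};
        System.out.println(auc(pos, neg));
    }
}
```

The quadratic loop keeps the sketch short; for large test sets, a rank-based computation after sorting gives the same value more efficiently.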

== Data ==
The example [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/data.tar.gz data sets] contain extracted ChIP-seq sequences of 50 different human transcription factors from the [http://genome.ucsc.edu/ENCODE ENCODE project], as well as corresponding negative data. All data sets are split into 10 subsets to enable a reproducible 10-fold cross-validation.
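Since the Classification tool takes one training file and one test file per class, each step of a 10-fold cross-validation uses one subset for testing and the union of the other nine for training. The sketch below only enumerates that plan; the file layout of the archive is not described here, so subsets are referred to by the hypothetical indices 0 to 9, and the class <code>FoldPlan</code> is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: which subsets form the training data when one subset is held out for testing. */
final class FoldPlan {
    static List<Integer> trainingSubsets(int numSubsets, int testSubset) {
        if (testSubset < 0 || testSubset >= numSubsets) {
            throw new IllegalArgumentException("test subset out of range");
        }
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < numSubsets; i++) {
            if (i != testSubset) {
                training.add(i); // every subset except the held-out one
            }
        }
        return training;
    }

    public static void main(String[] args) {
        // One line per cross-validation step, i.e., per call of the Classification tool.
        for (int test = 0; test < 10; test++) {
            System.out.println("test subset " + test + ", training subsets " + trainingSubsets(10, test));
        }
    }
}
```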

== Source code ==
Building the [https://www.cs.helsinki.fi/u/eggeling/PMMdenovo/PMMdenovo_sources.zip source code] requires Jstacs 2.1.

InMoDe

2017-07-22T12:38:22Z

Eggeling:

[[File:InMoDe-test.png|100px|left]]
by Ralf Eggeling, Ivo Grosse, and Jan Grau.

InMoDe is a collection of seven tools for learning, leveraging, and visualizing '''in'''tra-'''mo'''tif '''de'''pendencies within DNA binding sites and similar functional nucleotide sequences.

For a detailed description of the functionality of InMoDe see the [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDe_userGuide-1.1.pdf user guide].

== Paper ==

If you use InMoDe, please cite

R. Eggeling, I. Grosse, and J. Grau. [https://academic.oup.com/bioinformatics/article/33/4/580/2666342/InMoDe-tools-for-learning-and-visualizing-intra InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites]. ''Bioinformatics'', 2017; 33(4): 580-582. doi: 10.1093/bioinformatics/btw689

== Download and installation ==

InMoDe offers three user interfaces.

* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeGUI-1.1.jar InMoDeGUI.jar] -- graphical user interface (version 1.1)
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeCLI-1.1.jar InMoDeCLI.jar] -- command line interface (version 1.1)
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeGalaxy-1.1.jar InMoDeGalaxy.jar] -- for integration into your own Galaxy instance (version 1.1)

that can be started by

java -jar filename.jar

and require an existing Java installation (8u74 or later).

In addition, there are two user-friendly alternatives for installing the GUI variant of InMoDe (version 1.0), namely (i) a [http://www.jstacs.de/downloads/InMoDe-1.0.dmg DMG for installation under Mac OS X], and (ii) a [http://www.jstacs.de/downloads/InMoDe-1.0.exe Windows installer].

Neither requires a recent Java installation, as both automatically install the required libraries on the local machine.

== Webserver ==

A server with all tools of InMoDe (version 1.0) is available for public use at [http://galaxy.informatik.uni-halle.de].
The web server limits the complexity of the jobs that can be run with the learning tools.
For unlimited use, please download InMoDe and install it on your local machine or your own Galaxy instance.

== Version history ==

=== Version 1.1 ===
''Minor improvements ([https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/changelog-1.1.txt changelog]) for ISMB 2017 ([https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/poster.pdf poster])''

* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDe_userGuide-1.1.pdf InMoDe User Guide (version 1.1)]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeGUI-1.1.jar InMoDeGUI-1.1.jar]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeCLI-1.1.jar InMoDeCLI-1.1.jar]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.1/InMoDeGalaxy-1.1.jar InMoDeGalaxy-1.1.jar]

=== Version 1.0 ===
''Initial release''

* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.0/InMoDe_userGuide-1.0.pdf InMoDe User Guide (version 1.0)]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.0/InMoDeGUI-1.0.jar InMoDeGUI-1.0.jar]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.0/InMoDeCLI-1.0.jar InMoDeCLI-1.0.jar]
* [https://www.cs.helsinki.fi/u/eggeling/InMoDe/1.0/InMoDeGalaxy-1.0.jar InMoDeGalaxy-1.0.jar]

Main Page

2017-02-14T09:18:52Z

Eggeling:

__NOTOC__
== A Java framework for statistical analysis and classification of biological sequences ==

Sequence analysis is one of the major subjects of
[http://en.wikipedia.org/wiki/Bioinformatics bioinformatics].
Several existing libraries combine the representation of biological sequences with exact and approximate pattern matching as well as
alignment algorithms.
We present Jstacs, an [http://en.wikipedia.org/wiki/Open_source open source] Java library, which focuses on the statistical analysis of biological sequences instead. Jstacs comprises an
efficient representation of sequence data and provides implementations of many statistical models with generative and discriminative approaches
for parameter learning. Using Jstacs, classifiers can be assessed and
compared on test datasets or by cross-validation experiments evaluating several performance measures. Due to its strictly object-oriented
design Jstacs is easy to use and readily extensible.

Jstacs is a joint project of the groups [http://www.informatik.uni-halle.de/arbeitsgruppen/bioinformatik/ Bioinformatics] and [http://www.informatik.uni-halle.de/arbeitsgruppen/mustererkennung/ Pattern Recognition and Bioinformatics] at the [http://www.informatik.uni-halle.de/ Institute of Computer Science] of [http://www.uni-halle.de/ Martin Luther University Halle-Wittenberg] and the Bioinformatics group of the [http://www.jki.bund.de/en/startseite/home.html Julius Kuehn Institute]. Initially, the project was also developed at the [http://www.ipk-gatersleben.de Leibniz Institute of Plant Genetics and Crop Plant Research].

Jstacs is listed in the [http://mloss.org/software/ machine learning open-source software (mloss)] repository.

== Licensing Information ==
Jstacs is free software: you can redistribute it and/or modify it under the terms of the [http://www.gnu.org/licenses/gpl-3.0.html GNU General Public License version 3] or (at your option) any later version as published by the [http://www.fsf.org/ Free Software Foundation].

== Current release ==
You can download Jstacs version 2.2 [[Downloads | here]]. 
''You can find an overview of the new features in the [[Version history]].''
We also provide an [http://www.jstacs.de/api/index.html API documentation], a [[Cookbook]], and a [http://www.jstacs.de/downloads/refcard.pdf Reference card] for this release.

== Getting started & Cookbook ==
For set-up instructions, a list of basic requirements, and suggestions for your first steps with Jstacs, please see [[Getting started]].

Since version 2.0, we offer a [[Cookbook]] for Jstacs in addition to the [http://www.jstacs.de/api/index.html API documentation].
This cookbook comprises a general description of the structure of Jstacs including data handling, statistical models, classifiers, and assessments.
The cookbook is accompanied by a number of [[Recipes]] or [[Code examples]] that can serve as a starting point of your own applications.

For a quick reference, we also provide a [http://www.jstacs.de/downloads/refcard.pdf Reference card].

== Publication ==
The [http://jmlr.csail.mit.edu/papers/v13/grau12a.html paper about Jstacs] has been published in the Journal of Machine Learning Research.
If you use Jstacs in your research, please cite

J. Grau, J. Keilwagen, A. Gohr, B. Haldemann, S. Posch, and I. Grosse. ''Jstacs: A java framework for statistical analysis and classification of biological sequences''. Journal of Machine Learning Research, '''13'''(Jun):1967–1971, 2012.

[http://www.jstacs.de/downloads/jstacs_citation.bib BibTeX entry]
== Applications ==
Applications currently using Jstacs:
* [[MotifAdjuster]]
* [[Dispom]]
* [[TALgetter]]
* [[TALENoffer]]
* [[Dimont]]
* [[GeMoMa]]
* [[AnnoTALE]]

== Bug reports & Feature requests ==
You can submit bug reports and feature requests by mail to [mailto:jstacs@informatik.uni-halle.de jstacs@informatik.uni-halle.de]. 


== Latest Papers ==
The paper '''''[[InMoDe | InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites]]''''' has been published in [https://academic.oup.com/bioinformatics/article/33/4/580/2666342/InMoDe-tools-for-learning-and-visualizing-intra Bioinformatics].

The paper '''''[[AnnoTALE | AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences]]''''' has been published in [http://www.nature.com/articles/srep21077 Scientific Reports].

The paper '''''[[GeMoMa | Using intron position conservation for homology-based gene predictions]]''''' has been published in [https://nar.oxfordjournals.org/content/early/2016/02/17/nar.gkw092 Nucleic Acids Research].

The paper '''''[[PMMdeNovo | Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data]]''''' has been published in [http://www.biomedcentral.com/1471-2105/16/375 BMC Bioinformatics].

The paper '''''[[Slim | Varying levels of complexity in transcription factor binding motifs]]''''' has been published in [http://nar.oxfordjournals.org/content/early/2015/06/23/nar.gkv577.abstract Nucleic Acids Research].

The paper '''''[[AUC-PR | Area under Precision-Recall Curves for Weighted and Unweighted Data]]''''' has been published in [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0092209 PLOS ONE].

The paper '''''[[Dimont | A general approach for discriminative de-novo motif discovery from high-throughput data]]''''' has been published in [http://nar.oxfordjournals.org/content/41/21/e197.abstract.html?etoc Nucleic Acids Research].

Further papers and projects can be found under [[Projects]].

PMMdeNovo

2015-11-09T17:05:17Z

Eggeling: paper published

__NOTOC__

== Paper ==
The paper [http://www.biomedcentral.com/1471-2105/16/375 Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data] by Ralf Eggeling, Teemu Roos, Petri Myllymäki, and Ivo Grosse has been published in BMC Bioinformatics.

== Runnable JARs ==
The application consists of three independent tools. All tools have mandatory (no default values) and optional arguments.
Default values can be used by assigning "def". Alternatively, a shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values.

=== ModelTrainer ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/ModelTrainer.jar ModelTrainer] performs de novo motif discovery on a set of putative non-aligned sequences. It infers an inhomogeneous PMM of arbitrary order, where order 0 corresponds to a PWM model.
Run by calling

<code>java -jar ModelTrainer.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>inputFile</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input sequences. If the first character in the file is '>', the content is interpreted as a fasta file. Otherwise, it is interpreted as plain text, i.e., each line corresponds to one sequence.</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps for which the algorithm is always run in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after having obtained the last optimal model structure before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>model</td>
<td>The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the Graphviz representation of the learned PCT structures.</td>
</tr>
</table>

=== BindingSitePrediction ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/BindingSitePrediction.jar BindingSitePrediction] predicts instances of binding sites in a positive data set based on a previously learned model.
Run by calling

<code>java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output</code>

where the arguments have the following semantics:

<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>modelFile</td>
<td>String</td>
<td>--</td>
<td>The location of the .xml representation (output of ModelTrainer) of the learned model.</td>
</tr>
<tr>
<td>dataPos</td>
<td>String</td>
<td>--</td>
<td>The location of the positive data (fasta file or plain text) in which binding site locations are to be identified.</td>
</tr>
<tr>
<td>dataNeg</td>
<td>String</td>
<td>--</td>
<td>The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold.</td>
</tr>
<tr>
<td>alpha</td>
<td>Double</td>
<td>1E-4</td>
<td>Significance level on negative data.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>bindingSites.txt</td>
<td>Location of output file for writing the predicted binding sites.</td>
</tr>
</table>
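
How a significance level on negative data can be turned into a prediction threshold is sketched below in Python. The page does not state how BindingSitePrediction scores sequences or derives its threshold, so the rule used here (choose the cutoff such that at most a fraction <code>alpha</code> of the negative scores lie strictly above it) and the name <code>threshold_from_negatives</code> are assumptions for illustration.

```python
import math
from typing import Sequence


def threshold_from_negatives(neg_scores: Sequence[float], alpha: float) -> float:
    """Return a cutoff that at most a fraction alpha of neg_scores exceed.

    Assumed rule for illustration; not taken from the tool's source code.
    """
    if not neg_scores:
        raise ValueError("at least one negative score is required")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ranked = sorted(neg_scores, reverse=True)
    # Number of negative scores that may lie strictly above the cutoff.
    allowed = math.floor(alpha * len(ranked))
    # At most `allowed` scores are strictly greater than ranked[allowed].
    return ranked[allowed]
```

Under this rule, a site in the positive data would be reported if its score is strictly greater than the returned cutoff. With the default <code>alpha</code> of 1E-4 and fewer than 10,000 negative scores, the cutoff equals the highest negative score.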

=== Classification ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/Classification.jar Classification] first performs motif discovery, followed by fragment-based classification, using positive data that is assumed to contain an instance of the motif and negative data that is assumed not to contain the motif. This tool can be used for performing a single step of a K-fold cross-validation experiment.
Run by calling

<code>java -jar Classification.jar filePosTrain fileNegTrain filePosTest fileNegTest motifWidth motifOrder flankingOrder initSteps addSteps restarts</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>filePosTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>filePosTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
</table>
The tool returns (i) the model complexity, i.e., the number of leaves of all parsimonious context trees of the learned motif model, and (ii) the performance of the classifier, measured by the area under the ROC curve.
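
For readers unfamiliar with the second quantity, the area under the ROC curve equals the probability that a randomly drawn positive test sequence receives a higher score than a randomly drawn negative one, with ties counted as one half. The following Python sketch computes it directly from that definition. The function name <code>roc_auc</code> is hypothetical, and the code is not taken from the Classification tool.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos_scores) * len(neg_scores))
```

A value of 1.0 means that every positive sequence scores higher than every negative one, and 0.5 corresponds to random guessing. The pairwise loop is quadratic in the number of sequences. It is used here for clarity, not for speed.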

== Data ==
The example [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/data.tar.gz data sets] contain extracted ChIP-seq sequences of 50 different human transcription factors from the [http://genome.ucsc.edu/ENCODE ENCODE project], as well as corresponding negative data. All data sets are split into 10 subsets to enable a reproducible 10-fold cross-validation.
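
The way fixed subsets translate into cross-validation steps can be sketched as follows in Python. The layout of the archive is not described here, so the sketch works on sequences already loaded into memory, and the name <code>cross_validation_folds</code> is hypothetical. In each step one subset serves as test data and the remaining subsets are pooled as training data.

```python
from typing import Iterator, List, Sequence, Tuple


def cross_validation_folds(
    subsets: Sequence[Sequence[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training, test) sequence lists, one pair per subset."""
    for k, test in enumerate(subsets):
        # Pool every subset except the k-th one for training.
        train = [seq for i, subset in enumerate(subsets) if i != k for seq in subset]
        yield train, list(test)
```

Because the subsets are fixed in advance, repeating the experiment gives the same training and test split in every step. Each pair would be written to files and passed to one run of Classification, once for the positive and once for the negative data.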

== Source code ==
Building the [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/PMMdenovo_sources.zip source code] requires Jstacs 2.1.

PMMdeNovo

2015-02-23T10:16:11Z

Eggeling:

__NOTOC__
by Ralf Eggeling, Teemu Roos, Petri Myllymäki, and Ivo Grosse

== Runnable JARs ==
The application consists of three independent tools. All tools have mandatory (no default values) and optional arguments.
Default values can be used by assigning "def". Alternatively, a shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values.
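
The two ways of requesting default values can be illustrated with a small Python helper that assembles the positional argument list. The helper <code>build_arguments</code> and the file name <code>seqs.fa</code> are hypothetical. Only the <code>def</code> placeholder and the rule that missing trailing arguments take their defaults come from the description above.

```python
from typing import Dict, List, Sequence

# Argument order of ModelTrainer as listed in its command line on this page.
MODEL_TRAINER_ARGS = [
    "inputFile", "motifWidth", "motifOrder", "flankingOrder",
    "initSteps", "addSteps", "restarts", "output",
]


def build_arguments(
    names: Sequence[str],
    values: Dict[str, object],
    mandatory: Sequence[str] = ("inputFile",),
) -> List[str]:
    """Build the positional argument list for one tool call."""
    missing = [n for n in mandatory if n not in values]
    if missing:
        raise ValueError("mandatory arguments without value: " + ", ".join(missing))
    # Unset optional arguments are replaced by the placeholder "def".
    args = [str(values[n]) if n in values else "def" for n in names]
    # Trailing defaults can simply be omitted from the call.
    while args and args[-1] == "def":
        args.pop()
    return args
```

For example, <code>build_arguments(MODEL_TRAINER_ARGS, {"inputFile": "seqs.fa", "motifOrder": 1})</code> returns <code>["seqs.fa", "def", "1"]</code>: the motif width keeps its default via the placeholder, and all arguments after the motif order are left out.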

=== ModelTrainer ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/ModelTrainer.jar ModelTrainer] performs de novo motif discovery on a set of putative non-aligned sequences. It infers an inhomogeneous PMM of arbitrary order, where order 0 corresponds to a PWM model.
Run by calling

<code>java -jar InhPMM.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>inputFile</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input sequences. If the first character in the file is '>', the content is interpreted as a FASTA file. Otherwise it is interpreted as plain text, i.e., each line corresponds to one sequence.</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>model</td>
<td>The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the Graphviz representation of the learned PCT structures.</td>
</tr>
</table>

=== BindingSitePrediction ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/BindingSitePrediction.jar BindingSitePrediction] predicts instances of binding sites in a positive data set based on a previously learned model.
Run by calling

<code>java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output</code>

where the arguments have the following semantics:

<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>modelFile</td>
<td>String</td>
<td>--</td>
<td>The location of the .xml representation (output of ModelTrainer) of the learned model.</td>
</tr>
<tr>
<td>dataPos</td>
<td>String</td>
<td>--</td>
<td>The location of the positive data (fasta file or plain text) in which binding site locations are to be identified.</td>
</tr>
<tr>
<td>dataNeg</td>
<td>String</td>
<td>--</td>
<td>The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold.</td>
</tr>
<tr>
<td>alpha</td>
<td>Double</td>
<td>1E-4</td>
<td>Significance level on negative data.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>bindingSites.txt</td>
<td>Location of output file for writing the predicted binding sites.</td>
</tr>
</table>

=== Classification ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/Classification.jar Classification] first performs motif discovery, followed by fragment-based classification, using positive data that is assumed to contain an instance of the motif and negative data that is assumed not to contain the motif. This tool can be used for performing a single step of a K-fold cross-validation experiment.
Run by calling

<code>java -jar Classification.jar filePosTrain fileNegTrain filePosTest fileNegTest motifWidth motifOrder flankingOrder initSteps addSteps restarts</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>filePosTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>filePosTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
</table>
The tool returns (i) the model complexity, i.e., the number of leaves of all parsimonious context trees of the learned motif model, and (ii) the performance of the classifier, measured by the area under the ROC curve.

== Data ==
The example [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/data.tar.gz data sets] contain extracted ChIP-seq sequences of 50 different human transcription factors from the [http://genome.ucsc.edu/ENCODE ENCODE project], as well as corresponding negative data. All data sets are split into 10 subsets to enable a reproducible 10-fold cross-validation.

== Source code ==
Building the [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/PMMdenovo_sources.zip source code] requires Jstacs 2.1.

PMMdeNovo

2015-02-21T13:28:14Z

Eggeling:

__NOTOC__
by Ralf Eggeling, Teemu Roos, Petri Myllymäki, and Ivo Grosse

== Runnable JARs ==
The application consists of three independent tools. All tools have mandatory (no default values) and optional arguments.
Default values can be used by assigning "def". Alternatively, a shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values.

=== ModelTrainer ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/ModelTrainer.jar ModelTrainer] performs de novo motif discovery on a set of putative non-aligned sequences. It infers an inhomogeneous PMM of arbitrary order, where order 0 corresponds to a PWM model.
Run by calling

<code>java -jar InhPMM.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>inputFile</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input sequences. If the first character in the file is '>', the content is interpreted as a FASTA file. Otherwise it is interpreted as plain text, i.e., each line corresponds to one sequence.</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>model</td>
<td>The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the Graphviz representation of the learned PCT structures.</td>
</tr>
</table>

=== BindingSitePrediction ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/BindingSitePrediction.jar BindingSitePrediction] predicts instances of binding sites in a positive data set based on a previously learned model.
Run by calling

<code>java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output</code>

where the arguments have the following semantics:

<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>modelFile</td>
<td>String</td>
<td>--</td>
<td>The location of the .xml representation (output of ModelTrainer) of the learned model.</td>
</tr>
<tr>
<td>dataPos</td>
<td>String</td>
<td>--</td>
<td>The location of the positive data (fasta file or plain text) in which binding site locations are to be identified.</td>
</tr>
<tr>
<td>dataNeg</td>
<td>String</td>
<td>--</td>
<td>The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold.</td>
</tr>
<tr>
<td>alpha</td>
<td>Double</td>
<td>1E-4</td>
<td>Significance level on negative data.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>bindingSites.txt</td>
<td>Location of output file for writing the predicted binding sites.</td>
</tr>
</table>

=== Classification ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/Classification.jar Classification] first performs motif discovery, followed by fragment-based classification, using positive data that is assumed to contain an instance of the motif and negative data that is assumed not to contain the motif.
Run by calling

<code>java -jar Classification.jar filePosTrain fileNegTrain filePosTest fileNegTest motifWidth motifOrder flankingOrder initSteps addSteps restarts</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>filePosTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTrain</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative training sequences (fasta or plain text).</td>
</tr>
<tr>
<td>filePosTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the positive test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>fileNegTest</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the negative test sequences (fasta or plain text).</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
</table>
The tool returns the classification results to the standard output.

== Data ==
The example [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/data.tar.gz data sets] contain extracted ChIP-seq sequences of 50 different human transcription factors from the [http://genome.ucsc.edu/ENCODE ENCODE project], as well as corresponding negative data. All data sets are split into 10 subsets to enable a reproducible 10-fold cross-validation.

== Source code ==
Building the [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/PMMdenovo_sources.zip source code] requires Jstacs 2.1.

PMMdeNovo

2015-02-21T13:05:54Z

Eggeling:

__NOTOC__
by Ralf Eggeling, Teemu Roos, Petri Myllymäki, and Ivo Grosse

== Runnable JARs ==
The application consists of three independent tools.

=== ModelTrainer ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/ModelTrainer.jar ModelTrainer] performs de novo motif discovery on a set of putative non-aligned sequences. It infers an inhomogeneous PMM of arbitrary order, where order 0 corresponds to a PWM model.
Run by calling

<code>java -jar InhPMM.jar inputFile motifWidth motifOrder flankingOrder initSteps addSteps restarts output</code>

where the arguments have the following semantics:
<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>inputFile</td>
<td>String</td>
<td>--</td>
<td>The location of a text file containing the input sequences. If the first character in the file is '>', the content is interpreted as a FASTA file. Otherwise it is interpreted as plain text, i.e., each line corresponds to one sequence.</td>
</tr>
<tr>
<td>motifWidth</td>
<td>Integer</td>
<td>20</td>
<td>The width of the motif to be inferred.</td>
</tr>
<tr>
<td>motifOrder</td>
<td>Integer</td>
<td>2</td>
<td>The initial order of the inhomogeneous PMM, i.e., the number of context positions that can be taken into account for modeling intra-motif dependencies.</td>
</tr>
<tr>
<td>flankingOrder</td>
<td>Integer</td>
<td>2</td>
<td>The order of the homogeneous Markov model, which is used for modeling the flanking sequences that do not belong to the motif.</td>
</tr>
<tr>
<td>initSteps</td>
<td>Integer</td>
<td>50</td>
<td>The number of initial iteration steps that the algorithm always runs in each restart.</td>
</tr>
<tr>
<td>addSteps</td>
<td>Integer</td>
<td>10</td>
<td>The number of additional iteration steps, i.e., the number of iterations that have to be performed after the last optimal model structure was obtained before termination is allowed.</td>
</tr>
<tr>
<td>restarts</td>
<td>Integer</td>
<td>10</td>
<td>The number of restarts of the algorithm.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>model</td>
<td>The path and file prefix for the output files. The tool produces two files, namely (i) output.xml containing the learned model and (ii) output.dot containing the Graphviz representation of the learned PCT structures.</td>
</tr>
</table>

=== BindingSitePrediction ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/BindingSitePrediction.jar BindingSitePrediction] predicts instances of binding sites in a positive data set based on a previously learned model.
Run by calling

<code>java -jar BindingSitePrediction.jar modelFile dataPos dataNeg alpha output</code>

where the arguments have the following semantics:

<table border=0 cellpadding=10 align="center">
<tr>
<td>name</td>
<td>type</td>
<td>default</td>
<td>comment</td>
</tr>
<tr><td colspan=4><hr></td></tr>
<tr>
<td>modelFile</td>
<td>String</td>
<td>--</td>
<td>The location of the .xml representation (output of ModelTrainer) of the learned model.</td>
</tr>
<tr>
<td>dataPos</td>
<td>String</td>
<td>--</td>
<td>The location of the positive data (fasta file or plain text) in which binding site locations are to be identified.</td>
</tr>
<tr>
<td>dataNeg</td>
<td>String</td>
<td>--</td>
<td>The location of the negative data (fasta file or plain text) that is used for computing the prediction threshold.</td>
</tr>
<tr>
<td>alpha</td>
<td>Double</td>
<td>1E-4</td>
<td>Significance level on negative data.</td>
</tr>
<tr>
<td>output</td>
<td>String</td>
<td>bindingSites.txt</td>
<td>Location of output file for writing the predicted binding sites.</td>
</tr>
</table>

=== Classification ===
The tool [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/Classification.jar Classification] first performs motif discovery, followed by fragment-based classification, using positive data that is assumed to contain an instance of the motif and negative data that is assumed not to contain the motif. The tool returns the classification results to the standard output.

== Data ==
The example [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/data.tar.gz data sets] contain extracted ChIP-seq sequences of 50 different human transcription factors from the [http://genome.ucsc.edu/ENCODE ENCODE project], as well as corresponding negative data. All data sets are split into 10 subsets to enable a reproducible 10-fold cross-validation.

== Source code ==
Building the [http://www2.informatik.uni-halle.de/agbio/publications/PMMdenovo/PMMdenovo_sources.zip source code] requires Jstacs 2.1.