Dispom

Latest revision as of 08:23, 6 February 2015

by Jens Keilwagen, Jan Grau, Ivan A. Paponov, Stefan Posch, Marc Strickert and Ivo Grosse.

Recently, we published Dimont, a successor of Dispom. Dimont uses some heuristics and is hence much faster than Dispom. Typically, it provides results within one hour for large data sets. If you are interested in Dimont, please visit the Dimont project homepage.

Description

Background

Transcription factors are a main component of gene regulation as they activate or repress gene expression by binding to specific binding sites in promoters. The de-novo discovery of transcription factor binding sites in target regions obtained by wet-lab experiments is a challenging problem in computational biology, which has not been fully solved yet.

Results

We present a de-novo motif discovery tool for finding differentially abundant transcription factor binding sites that models existing positional preferences of binding sites and adjusts the length of the motif in the learning process. Based on the evaluation of 18 various benchmark data sets we find that the prediction performance of this tool is superior to existing tools for de-novo motif discovery. Finally, we apply the tool to discover binding sites enriched in promoters of auxin-responsive genes extracted from Arabidopsis thaliana microarray data, and we find a motif that can be interpreted as an elongated auxin-responsive element predominately positioned in the 250-bp region upstream of the transcription start site. Using an independent data set of auxin-responsive genes, we find that the refined motif increases the auxin specificity by more than three orders of magnitude in genome-wide predictions compared to the canonical auxin-responsive element.

Conclusions

We find that the combination of searching for differentially abundant motifs and inferring a position distribution from the data is beneficial for de-novo motif discovery. Hence, we make the tool freely available as a component of the open-source Java framework Jstacs and as a stand-alone application.

Paper

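The paper "De-novo discovery of differentially abundant transcription factor binding sites including their positional preference" (http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001070) has been published in PLoS Computational Biology (http://www.ploscompbiol.org/).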

Download

The original version of Dispom contained a trivial bug that, unfortunately, prevented Dispom from predicting any binding sites. However, the training process was not affected by this bug and worked completely as expected. For convenience, the updated version of Dispom contains a second program called DispomPredictor (see documentation below), which can be used to predict binding sites given a trained classifier.

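  • Dispom can be downloaded here: http://www.jstacs.de/download.php?which=Dispom1.1
  • An alternative version adapted to Jstacs 2.0 can be downloaded here: http://www.jstacs.de/download.php?which=Dispom
  • The benchmark data sets with implanted binding sites from the Jaspar database (http://jaspar.cgb.ki.se/) can be downloaded here: http://www.jstacs.de/downloads/benchmark.zip
  • The auxin data sets can be downloaded here: http://www.jstacs.de/downloads/auxin.zip
  • The position frequency matrices (PFMs) of the predictions on the metazoan compendium (http://acgt.cs.tau.ac.il/amadeus/suppl/results_metazoan.html) can be downloaded here: http://www.jstacs.de/downloads/meatzoan-pfms-1E-4.txt
  • The sources of Dispom are available as part of the Jstacs sources. You can find the main class of Dispom at projects.dispom.Dispom.java.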

Start instructions

Once you have unzipped the archive, you can start Dispom, e.g., by invoking

java -jar Dispom.jar home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt init=best-random=100 p-val=1E-4

to search for motifs that are over-represented in path/to/data/directory/fgfile.txt but not in path/to/data/directory/bgfile.txt, initialize Dispom with the best of 100 randomly drawn starting values, and search for motif occurrences with a p-value of less than 1E-4.

The arguments have the following meaning (given as name, type, and comment):

  • home (String): the path to the data directory, default = ./
  • ignore (Character): the char that is used to mask comment lines in data files, e.g., '>' in a FASTA file, default = >
  • fg (String): the file name of the foreground data file (target data set, i.e. the file containing sequences which are expected to contain binding sites of a common motif)
  • bg (String): the file name of the background data file (control data set), OPTIONAL
  • position (String): a switch whether to use a uniform, skew-normal, or mixture position distribution, range = {UNIFORM, SKEW_NORMAL, MIXTURE}, default = MIXTURE
  • mean (Double): the mean of the a priori TFBS distribution, default = 250.0
  • sd (Double): the sd of the a priori TFBS distribution, valid range = [1.0, Infinity], default = 150.0
  • motifs (Integer): the number of motifs to be searched for, valid range = [1, 5], default = 1
  • length (Integer): the motif length that is used at the beginning, valid range = [1, 50], default = 15
  • flankOrder (Integer): the Markov order of the model for the flanking sequence and the background sequence, valid range = [0, 5], default = 0
  • motifOrder (Integer): the Markov order of the motif model, valid range = [0, 3], default = 0
  • bothStrands (Boolean): a switch whether to use both strands or not, default = true
  • init (String=[Integer | String]): the method that is used for initialization, one of 'best-random=<number>', 'best-random-plugin=<number>', 'best-random-motif=<number>', 'enum-all=<length>', 'enum-data=<length>', 'heuristic=<number>', and 'specific=<sequence or file of sequences>' (also see Initialization strategies below)
  • adjust (Boolean): a switch whether to adjust the motif length, i.e., either to shrink or expand, default = true
  • maxPos (Boolean): a switch whether to use max. pos. in the heuristic or not, default = true
  • learning (String): a switch for the learning principle, range = {ML, MAP, MCL, MSP}, default = MSP
  • threads (Integer): the number of threads that are used to evaluate the objective function and its gradient, valid range = [1, 128], default = 4
  • starts (Integer): the number of independent starts of Dispom, valid range = [1, 100], default = 1
  • xml (String): the file name of the xml file the classifier is written to, default = ./classifier.xml
  • p-val (Double): a p-value for predicting binding sites, valid range = [0.0, 1.0], OPTIONAL
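As an illustration of the valid ranges listed for the arguments above, the following shell sketch checks a few integer arguments before a potentially long training run is started. The helper functions in_range and check_dispom_args are hypothetical and not part of Dispom, and the example only prints the resulting command line instead of executing it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Dispom): check integer arguments against
# the valid ranges documented for motifs, length, threads, and starts.

# in_range VALUE MIN MAX: succeeds if VALUE is a non-negative integer
# with MIN <= VALUE <= MAX
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_dispom_args MOTIFS LENGTH THREADS STARTS
check_dispom_args() {
    in_range "$1" 1 5   || { echo "motifs must be in [1, 5]" >&2; return 1; }
    in_range "$2" 1 50  || { echo "length must be in [1, 50]" >&2; return 1; }
    in_range "$3" 1 128 || { echo "threads must be in [1, 128]" >&2; return 1; }
    in_range "$4" 1 100 || { echo "starts must be in [1, 100]" >&2; return 1; }
}

# Example: two motifs, initial length 12, 8 threads, 1 start.
# The command is only printed here; remove "echo" to actually run it.
if check_dispom_args 2 12 8 1; then
    echo java -jar Dispom.jar home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt motifs=2 length=12 threads=8 starts=1
fi
```

Checking the ranges up front is merely a convenience; Dispom itself remains the authority on which values it accepts.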

Memory requirements: If your data sets are rather large, i.e. contain a great number of sequences or rather long sequences, the standard memory allocation of Java may not be sufficient. In such cases, you can increase the amount of memory requested by the virtual machine by specifying the -Xms and -Xmx VM arguments. If, for instance, you want Dispom to start with an initial memory of 512 MB, which may increase up to 1 GB, the above command line must be changed to

java -Xms512M -Xmx1G -jar Dispom.jar home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt init=best-random=100 p-val=1E-4


Initialization strategies

The following initialization strategies are available for Dispom (argument init, see above):

  • best-random=<number>: choose the best (according to conditional likelihood) among <number> different random initializations of the ZOOPS model
  • best-random-motif=<number>: choose the best among <number> different random initializations of the motif model
  • best-random-plugin=<number>: choose the best among <number> motif models, initialized from a randomly chosen substring of the target data
  • enum-all=<length>: choose the best among motif models, each initialized from one of all possible <length>-mers
  • enum-data=<length>: choose the best among motif models, each initialized from one of all <length>-mers present in the data (allows for greater <length> than enum-all without greatly increased runtime)
  • heuristic=<number>: use a heuristic based on the differential occurrence of similar k-mers in the target compared to the control data set
  • specific=<sequence or file of sequences>: initialize the motif model either from a single sequence, e.g. specific=ACGTACGT, or from a file containing a set of sequences

If you have a rough guess of the motif you want to find, we recommend using init=specific=<sequence or file of sequences> with your guess as input sequence or file.

Otherwise, if available computation time is rather limited, we recommend starting with init=heuristic=100, and proceeding with init=enum-data=6 or init=enum-data=8, i.e. initializing from 6-mers or 8-mers.

If more computation time is available, several starts with init=best-random-plugin=100 may lead to improved motifs. However, these come at the expense of computation time.

Case studies

In the case studies presented in the paper, we started Dispom 50 times.

  • 47 times, we used init=best-random-plugin=100.
  • Once, we used init=heuristic=100.
  • Once, we used init=enum-data=6.
  • Once, we used init=enum-data=8.

For predicting binding sites, we used p-val=1E-4.
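The schedule of 50 starts described above could be scripted roughly as follows. This is a minimal sketch under several assumptions: the variable DISPOM_CMD, the function init_for_run, the order of the runs, the data file names, and the per-run xml file names are all hypothetical. By default, the script only prints the commands (dry run); set DISPOM_CMD="java -jar Dispom.jar" to execute them.

```shell
#!/bin/sh
# Sketch of the 50-start schedule from the case studies.
# Default is a dry run that only prints each command line.
DISPOM_CMD="${DISPOM_CMD:-echo java -jar Dispom.jar}"

# init_for_run N: print the init argument for the N-th of 50 starts.
# The assignment of strategies to run numbers is an arbitrary choice.
init_for_run() {
    case "$1" in
        48) echo "heuristic=100" ;;
        49) echo "enum-data=6" ;;
        50) echo "enum-data=8" ;;
        *)  echo "best-random-plugin=100" ;;
    esac
}

run=1
while [ "$run" -le 50 ]; do
    # Write each classifier to its own xml file so that later runs
    # do not overwrite the results of earlier ones.
    $DISPOM_CMD home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt \
        init="$(init_for_run "$run")" p-val=1E-4 xml="./classifier-$run.xml"
    run=$((run + 1))
done
```

Writing one xml file per run keeps all trained classifiers available for later comparison or for use with the Dispom predictor.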

Dispom Predictor

In addition to the Dispom binary, we also provide a program (DispomPredictor.jar) that can be used to predict binding sites using an already trained classifier. Possible applications of DispomPredictor.jar are predicting binding sites of a motif, which has been found on some set of training data, in additional, independent test data, or testing different p-values for predictions without the need to repeat the training process.

You can start the Dispom predictor by invoking

java -jar DispomPredictor.jar home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt p-val=1E-4 xml=./classifier.xml

The arguments have the following meaning (given as name, type, and comment):

  • home (String): the path to the data directory, default = ./
  • ignore (Character): the char that is used to mask comment lines in data files, e.g., '>' in a FASTA file, default = >
  • fg (String): the file name of the foreground data file (the file containing sequences which are expected to contain binding sites of a common motif)
  • bg (String): the file name of the background data file, OPTIONAL
  • xml (String): the file name of the xml file the classifier has been written to, default = ./classifier.xml
  • p-val (Double): a p-value for predicting binding sites, valid range = [0.0, 1.0]
  • one-histogram (Boolean): if no background file is specified, p-values are computed either using a joint histogram (true), or a sequence-wise histogram (false), default = true, OPTIONAL
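
Testing different p-values with one trained classifier, as mentioned above, could look like the following sketch. The variable PREDICTOR_CMD, the function predict_with_pval, the data file names, and the chosen thresholds are assumptions for illustration. By default, the commands are only printed (dry run); set PREDICTOR_CMD="java -jar DispomPredictor.jar" to execute them.

```shell
#!/bin/sh
# Sketch: reuse one trained classifier to compare several p-value thresholds.
# Default is a dry run that only prints each command line.
PREDICTOR_CMD="${PREDICTOR_CMD:-echo java -jar DispomPredictor.jar}"

# predict_with_pval PVAL: one DispomPredictor call for the given threshold,
# using only the arguments documented in the list above.
predict_with_pval() {
    $PREDICTOR_CMD home=path/to/data/directory/ fg=fgfile.txt bg=bgfile.txt \
        p-val="$1" xml=./classifier.xml
}

# The thresholds are example values within the valid range [0.0, 1.0].
for pval in 1E-3 1E-4 1E-5; do
    predict_with_pval "$pval"
done
```

Since the classifier is read from the xml file, no training is repeated between the calls.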