Disentangler: Difference between revisions

From Jstacs
Disentangler
by Ralf Eggeling.
 
Disentangler comprises two tools for analyzing complex features in a set of aligned transcription factor binding sites (TFBS); they can be used individually or within a joint pipeline. ''Intermixture detection'' (IMD) tests whether putative complexity can be explained by intermixture with binding sites of different TFs or by other contamination, and corrects for such artifacts. ''Motif complexity analysis'' (MCA) selects an optimal model of TFBS complexity, choosing among dependence models, mixture models, and variants in between.
 
== Paper ==
 
If you use Disentangler, please cite
 
R. Eggeling. [https://tba.org Disentangling transcription factor binding site complexity]. ''Nucleic Acids Research'', 2018; doi: 10.1093/nar/gky683 (to appear)
 
 
== Download and installation ==
 
Disentangler offers two user interfaces.
* [https://tba.org DisentanglerGUI] -- graphical user interface
* [https://tba.org DisentanglerCLI] -- command line interface
 
Both can be started with
 
 java -jar filename.jar
 
and require an existing Java installation (8u74 or later).
 
== Functionality ==
The software contains the two subtools described in the paper, called ''Intermixture detection'' and ''Motif complexity analysis''. In addition, there is a tool ''Sequence scan'' that can be used to search for motif hits within target sequences based on models that are returned by ''Motif complexity analysis''.
 
All tools expect a set of aligned, gapless TFBS of the same length as input. If the content of the input file starts with '>', it is interpreted as a FASTA file. Otherwise it is interpreted as plain text, where every line contains a single sequence.
The input may contain upper- and lowercase letters of the standard DNA alphabet {A,C,G,T}. If other symbols from the IUPAC code (such as N) are encountered, each is replaced by a random sample from the distribution of {A,C,G,T} in the data set.
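
The input rules above can be illustrated with a minimal Python sketch. It is not the Disentangler implementation: the function name <code>read_binding_sites</code> and the <code>seed</code> parameter are hypothetical, and the sketch assumes that every symbol outside {A,C,G,T} is replaced in the same way as an IUPAC ambiguity code.

```python
import random
from collections import Counter

DNA = "ACGT"


def read_binding_sites(text, seed=None):
    """Parse FASTA or one-sequence-per-line text into uppercase sites of equal length.

    Hypothetical helper for illustration; not part of Disentangler.
    """
    rng = random.Random(seed)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    if text.startswith(">"):
        # FASTA: a header line starts a new record, sequence lines are concatenated.
        seqs, current = [], []
        for ln in lines:
            if ln.startswith(">"):
                if current:
                    seqs.append("".join(current))
                current = []
            else:
                current.append(ln)
        if current:
            seqs.append("".join(current))
    else:
        # Plain text: every non-empty line is one sequence.
        seqs = lines

    seqs = [s.upper() for s in seqs]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("expected a non-empty set of sites of equal length")

    # Empirical distribution of A, C, G, T over the whole data set.
    counts = Counter(c for s in seqs for c in s if c in DNA)
    if not counts:
        raise ValueError("input contains no A, C, G, or T")
    weights = [counts[c] for c in DNA]

    def resolve(c):
        # Assumption: any symbol outside {A,C,G,T} is resampled.
        return c if c in DNA else rng.choices(DNA, weights=weights)[0]

    return ["".join(resolve(c) for c in s) for s in seqs]
```

Because the replacement is random, two runs on the same file can yield different sequences at ambiguous positions unless a seed is fixed.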
 
=== Intermixture detection ===
If "JSD weights" is disabled, the intermixture measure is computed using an unweighted Jensen-Shannon divergence. This option is included for experimental purposes; for practical use, keeping the default is strongly recommended.
Smaller values for "Restarts", "Time limit" and "Termination threshold" can speed up every recursive step, which can be beneficial for testing purposes, but they may reduce the quality of the results.
The tool returns a text file with the intermixture number and, for every cluster produced by IMD, a text file of its binding sites and a sequence logo of its mononucleotide statistics.
The values for the intermixture measure at each recursive step can be found in the protocol.
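
To illustrate the difference between a weighted and an unweighted Jensen-Shannon divergence, the following Python sketch compares the mononucleotide statistics of two clusters position by position. It is only a generic illustration: the exact intermixture measure is defined in the paper, and the function names, the weighting by cluster size, and the summation over positions are assumptions made here.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(p, q, w_p=0.5):
    """Jensen-Shannon divergence of p and q with weights w_p and 1 - w_p."""
    w_q = 1.0 - w_p
    mixture = [w_p * a + w_q * b for a, b in zip(p, q)]
    return entropy(mixture) - w_p * entropy(p) - w_q * entropy(q)


def column_freqs(sites, pos):
    """Relative frequencies of A, C, G, T at one position of a cluster."""
    counts = Counter(s[pos] for s in sites)
    return [counts[c] / len(sites) for c in "ACGT"]


def cluster_divergence(cluster_a, cluster_b, weighted=True):
    """Sum of per-position divergences between two clusters (illustrative only)."""
    total = len(cluster_a) + len(cluster_b)
    # Assumption: "weighted" means weights proportional to cluster size.
    w_a = len(cluster_a) / total if weighted else 0.5
    length = len(cluster_a[0])
    return sum(
        jsd(column_freqs(cluster_a, i), column_freqs(cluster_b, i), w_a)
        for i in range(length)
    )
```

With equal weights, a small cluster influences the divergence as much as a large one; size-proportional weights reduce the influence of small clusters.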
 
=== Motif complexity analysis ===
The tool learns proximal and distal dependence models and mixtures thereof. Note: Learning distal dependence models of order greater than one can be very time- and memory-consuming if the input sequences are long.
Doing so is not recommended for motifs of length greater than 20.
The tool returns a text file containing the intra-motif complexity measure of the data set, a visualization of the learned model, and a model file (.xml) that can be used as input to ''Sequence scan''.
The mixture weights and model complexities of each component can be found in the protocol.
 
=== Sequence scan ===
This tool is a variant of the [http://www.jstacs.de/index.php/InMoDe InMoDe] ScanApp with extended support for additional model types, namely mixture models and distal dependence models.
"Input model" needs to be a model file (in .xml format) produced by ''Motif complexity analysis''.
The "FPR" here refers to the number of sequences that have at least one hit.
The tool returns a list with coordinates of motif hits as well as the extracted binding sites.
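
The sequence-level reading of the FPR can be made concrete with a short Python sketch. It does not reproduce how ''Sequence scan'' determines its threshold, which is not described here; the function names and the generic <code>score</code> callable are hypothetical. The sketch only shows what it means to count sequences with at least one hit rather than individual positions.

```python
def best_window_score(seq, score, width):
    """Highest score over all windows of the given width in one sequence."""
    return max(score(seq[i:i + width]) for i in range(len(seq) - width + 1))


def sequence_level_fpr(background, score, width, threshold):
    """Fraction of background sequences with at least one window at or above threshold.

    `score` is any function mapping a window (string) to a number; it stands in
    for a learned motif model and is an assumption of this sketch.
    """
    hits = sum(
        1
        for seq in background
        if len(seq) >= width and best_window_score(seq, score, width) >= threshold
    )
    return hits / len(background)
```

For example, with a toy score that counts the letter A in a window of width 2 and a threshold of 2, a background sequence counts once if it contains AA anywhere, no matter how many times AA occurs in it.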
 
== Example data ==
These [https://tba.org data sets] are discussed in the paper in detail (Section "Application examples"), which makes them ideal candidates for testing the functionality of Disentangler.
 
== Source code ==
Building the [https://tba.org source code] requires Jstacs 2.3.

Revision as of 10:53, 29 July 2018
