Disentangler
Latest revision as of 14:39, 16 November 2018
by Ralf Eggeling.
Disentangler comprises two tools for analyzing complex features in a set of aligned transcription factor binding sites (TFBS). The tools can be used individually or within a joint pipeline.
Paper
If you use Disentangler, please cite
R. Eggeling. Disentangling transcription factor binding site complexity. Nucleic Acids Research, 2018; 46(20): e121. doi: 10.1093/nar/gky683
Download and installation
Disentangler offers two user interfaces:
- DisentanglerGUI -- graphical user interface
- DisentanglerCLI -- command line interface
Both can be started with
java -jar filename.jar
and require an existing Java installation (8u74 or later). The GUI is recommended for testing purposes or for analyzing a single data set; the CLI is better suited to more elaborate applications (multiple data sets, use on a cluster, etc.).
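The Java requirement can be checked before launching either interface. The following is a minimal sketch, assuming that `java -version` prints a quoted version string such as "1.8.0_74" (legacy scheme) or "11.0.2" (current scheme); the function names are illustrative and not part of Disentangler.

```python
import re
import subprocess


def parse_java_version(version_output):
    """Return (major, update) from the text printed by `java -version`."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        raise ValueError("no quoted version string found")
    numbers = [int(n) for n in re.findall(r"\d+", match.group(1))]
    if not numbers:
        raise ValueError("version string contains no numbers")
    if numbers[0] == 1 and len(numbers) > 1:
        # Legacy scheme, e.g. "1.8.0_74": major is 8, update is 74.
        major = numbers[1]
        update = numbers[3] if len(numbers) > 3 else 0
    else:
        # Current scheme, e.g. "11.0.2" or "17": any such release is newer than 8.
        major, update = numbers[0], 0
    return major, update


def java_is_recent_enough(version_output, min_major=8, min_update=74):
    """True if the reported version is 8u74 or later."""
    major, update = parse_java_version(version_output)
    return major > min_major or (major == min_major and update >= min_update)


def installed_java_version_output():
    """Run `java -version`; the version text is written to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr
```

Calling `java_is_recent_enough(installed_java_version_output())` then gives a quick yes/no answer on the machine at hand.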
Functionality
The software contains the two subtools described in the paper, called Intermixture detection (IMD) and Motif complexity analysis (MCA). In addition, there is a tool Sequence scan that can be used to search for motif hits within target sequences based on models that are returned by Motif complexity analysis.
All tools expect a set of aligned, gapless TFBS of the same length as input. If the content of the input file starts with '>', it is interpreted as a FastA file. Otherwise it is interpreted as plain text, where every line contains a single sequence. The input may contain upper- and lowercase letters of the standard DNA alphabet {A,C,G,T}. If other symbols from the IUPAC code (such as N) are encountered, they are replaced by a random sample from the distribution of {A,C,G,T} in the data set.
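The input rules above can be mirrored in a few lines when preparing or checking data sets outside the tool. This is a minimal sketch of the described behavior, not Disentangler's own parser; the function names are hypothetical, and converting everything to uppercase is an assumption made here for simplicity.

```python
import random
from collections import Counter

DNA = "ACGT"


def parse_sites(text):
    """Return the sequences from FastA or one-sequence-per-line plain text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if text.startswith(">"):
        # FastA: header lines start with '>'; sequence lines may wrap.
        sites, current = [], []
        for ln in lines:
            if ln.startswith(">"):
                if current:
                    sites.append("".join(current))
                current = []
            else:
                current.append(ln)
        if current:
            sites.append("".join(current))
    else:
        sites = lines
    # Assumption: case is not significant, so normalize to uppercase.
    return [s.upper() for s in sites]


def replace_ambiguous(sites, rng=None):
    """Replace non-ACGT symbols by draws from the data set's ACGT distribution."""
    rng = rng or random.Random()
    counts = Counter(c for s in sites for c in s if c in DNA)
    if not counts:
        raise ValueError("no A, C, G or T symbols in the data set")
    weights = [counts[c] for c in DNA]
    return [
        "".join(c if c in DNA else rng.choices(DNA, weights=weights)[0] for c in s)
        for s in sites
    ]


def load_sites(text, rng=None):
    """Parse, check that all sites have equal length, and resolve ambiguity."""
    sites = parse_sites(text)
    if len({len(s) for s in sites}) != 1:
        raise ValueError("binding sites must all have the same length")
    return replace_ambiguous(sites, rng)
```

Passing a seeded `random.Random` makes the replacement of ambiguous symbols reproducible across runs.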
Intermixture detection (IMD)
IMD tests whether putative complexity can be explained by intermixture with binding sites from different TFs or by other contamination, and can correct for such artifacts.
The default value of the intermixture threshold (0.19) is a robust choice; slight variations within the interval (0.15,0.3) have little impact in the majority of cases (see paper).
The tool returns a text file with the intermixture number, together with all clusters produced by IMD, each as a text file of binding sites and a sequence logo of its mononucleotide statistics. The values of the intermixture measure in each recursive step can be found in the protocol.
If "JSD weights" is disabled, the intermixture measure is computed from a non-weighted Jensen-Shannon divergence. This option is included for experimental purposes; keeping the default is strongly recommended for practical applications, as the intermixture threshold might otherwise need to be adjusted.
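To illustrate the difference between the two settings, the sketch below computes the generalized Jensen-Shannon divergence of several discrete distributions, with and without weights. It uses the standard textbook definition (entropy of the mixture minus the weighted mean of the entropies). How Disentangler chooses the weights and combines positions is not described here, so the use of cluster sizes as weights is an assumption for illustration only.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon(dists, weights=None):
    """Generalized Jensen-Shannon divergence of the given distributions.

    With weights=None every distribution counts equally (non-weighted case).
    Otherwise the weights are normalized to sum to one, e.g. cluster sizes
    (an assumption made for this example).
    """
    n = len(dists)
    if weights is None:
        weights = [1.0] * n
    total = sum(weights)
    weights = [w / total for w in weights]
    mixture = [
        sum(w * d[i] for w, d in zip(weights, dists))
        for i in range(len(dists[0]))
    ]
    return entropy(mixture) - sum(w * entropy(d) for w, d in zip(weights, dists))


# Nucleotide distributions (A, C, G, T) of one position in two hypothetical clusters.
cluster_a = [1.0, 0.0, 0.0, 0.0]
cluster_b = [0.0, 1.0, 0.0, 0.0]
unweighted = jensen_shannon([cluster_a, cluster_b])
weighted = jensen_shannon([cluster_a, cluster_b], weights=[300, 100])
```

Because the weighted and non-weighted variants yield values on different scales for unbalanced clusters, a threshold tuned for one does not transfer directly to the other.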
The default values for "Restarts", "Time limit" and "Termination threshold" are fairly conservative. Smaller values can speed up every recursive step, which can be beneficial for testing purposes, but they may reduce the quality of the results.
Motif complexity analysis (MCA)
MCA allows selecting an optimal model of TFBS complexity, choosing among dependence models, mixtures of PWM models, and variants in between.
For each run, the tool requires a concrete model to be chosen as input. It returns a text file containing the intra-motif complexity (IMC) measure of the data set under the given model. To compare different models according to IMC, the tool therefore needs to be run multiple times.
In addition, each run also outputs a visualization of the learned model and a storable (.xml) file that can be used as input to "Sequence scan" (see below).
Learning distal dependence models of order greater than one can be very time- and memory-consuming if the input sequences are long. It is not recommended for data sets with sequence length greater than 20.
The default values for "Restarts", "Time limit" and "Termination threshold" are fairly conservative. Smaller values can speed up every recursive step, which can be beneficial for testing purposes, but they may reduce the quality of the results.
Sequence scan
This tool is a variant of the InMoDe (http://www.jstacs.de/index.php/InMoDe) ScanApp with extended support for further types of models, namely mixture models and distal dependence models.
"Input model" needs to be a model file (in .xml format) produced by "Motif complexity analysis". The "FPR" here pertains to the number of sequences that have at least one hit.
The tool returns a list with coordinates of motif hits as well as the extracted binding sites.
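One way to read the sequence-level "FPR" is sketched below: a sequence counts as a false positive if its best-scoring window reaches the score threshold. The sketch derives a threshold from the per-sequence maximum scores of background sequences. The scores, function names, and the thresholding procedure itself are assumptions for illustration; how "Sequence scan" determines its threshold internally is not described here.

```python
import math


def sequence_level_fpr(max_scores, threshold):
    """Fraction of background sequences whose best window score reaches the threshold."""
    return sum(s >= threshold for s in max_scores) / len(max_scores)


def threshold_for_fpr(max_scores, target_fpr):
    """Smallest observed score whose sequence-level FPR does not exceed target_fpr.

    Returns math.inf if even the highest observed score exceeds the target,
    meaning that no background sequence may contain a hit.
    """
    for candidate in sorted(set(max_scores)):
        if sequence_level_fpr(max_scores, candidate) <= target_fpr:
            return candidate
    return math.inf


# Hypothetical best window score of each of ten background sequences.
background_max_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
threshold = threshold_for_fpr(background_max_scores, target_fpr=0.2)
```

With this definition, a long sequence containing many hits still contributes only once to the rate, which differs from an FPR counted per scanned position.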
Example data
These data sets are discussed in the paper in detail (Section "Application examples"), which makes them suitable for testing the functionality of Disentangler.
Source code
Building the source code (https://www.cs.helsinki.fi/u/eggeling/Disentangler/Disentangler-sources.zip) requires Jstacs 2.3 and JstacsFX 1.0. For compilation instructions, see the included README.txt file.