Disentangler
by Ralf Eggeling.
Disentangler comprises two tools for analyzing complex features in a set of aligned transcription factor binding sites (TFBS). The tools can be used individually or within a joint pipeline. Intermixture detection (IMD) tests whether putative complexity can be explained by intermixture with binding sites of different TFs or by other contamination, and corrects for such artifacts. Motif complexity analysis (MCA) selects an optimal model of TFBS complexity, choosing among dependence models, mixture models, and variants in between.
Paper
If you use Disentangler, please cite
R. Eggeling. Disentangling transcription factor binding site complexity. Nucleic Acids Research, gky683, 2018; doi: 10.1093/nar/gky683 (to appear)
Download and installation
Disentangler offers two user interfaces.
- DisentanglerGUI -- graphical user interface
- DisentanglerCLI -- command line interface
Both can be started with
java -jar filename.jar
and require an existing Java installation (8u74 or later).
Functionality
The software contains the two subtools described in the paper, called Intermixture detection and Motif complexity analysis. In addition, there is a tool Sequence scan that can be used to search for motif hits within target sequences based on models that are returned by Motif complexity analysis.
All tools expect a set of aligned, gapless TFBS of the same length as input. If the content of the input file starts with '>', it is interpreted as a FastA file. Otherwise it is interpreted as plain text, where every line contains a single sequence. The input may contain upper- and lowercase letters of the standard DNA alphabet {A,C,G,T}. If other symbols from the IUPAC code (such as N) are encountered, each is replaced by a random sample from the distribution of {A,C,G,T} in the data set.
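The input rules above can be sketched in a few lines of Python. This is an illustration of the described behavior, not the code used by Disentangler: the function names parse_sites, replace_ambiguous and load_sites are hypothetical, and raising an error on unequal lengths is an assumption about how a violation might be handled.

```python
import random
from collections import Counter

DNA = "ACGT"

def parse_sites(text):
    """Read sequences from FastA (content starts with '>') or from plain text."""
    lines = [ln.strip() for ln in text.splitlines()]
    if text.startswith(">"):
        seqs, current = [], []
        for ln in lines:
            if ln.startswith(">"):
                # A new header closes the previous record, if any.
                if current:
                    seqs.append("".join(current))
                current = []
            elif ln:
                current.append(ln)
        if current:
            seqs.append("".join(current))
    else:
        # Plain text: every non-empty line is one sequence.
        seqs = [ln for ln in lines if ln]
    # Upper- and lowercase letters are treated alike.
    return [s.upper() for s in seqs]

def replace_ambiguous(seqs, rng=None):
    """Replace every non-ACGT symbol by a draw from the data set's base frequencies."""
    rng = rng or random.Random(0)
    counts = Counter(c for s in seqs for c in s if c in DNA)
    weights = [counts[b] for b in DNA]
    if sum(weights) == 0:
        weights = [1, 1, 1, 1]  # assumption: fall back to uniform if no A/C/G/T is present
    return [
        "".join(c if c in DNA else rng.choices(DNA, weights=weights)[0] for c in s)
        for s in seqs
    ]

def load_sites(text, rng=None):
    seqs = replace_ambiguous(parse_sites(text), rng)
    if len({len(s) for s in seqs}) > 1:
        # Assumption: unequal lengths are rejected outright.
        raise ValueError("all binding sites must have the same length")
    return seqs
```

For example, load_sites(">a\nacgt\n>b\nANGT\n") returns two sequences of length four, with the N replaced by one of A, C, G or T.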
Intermixture detection
If "JSD weights" is disabled, the intermixture measure is computed on a non-weighted Jensen-Shannon divergence. This option is included for experimental purposes; for practical use, keeping the default is strongly recommended. Smaller values for "Restarts", "Time limit" and "Termination threshold" speed up every recursive step, which can be beneficial for testing purposes, but they may reduce the quality of the results. The tool returns a text file with the intermixture number, and all clusters produced by IMD, each as a text file of the binding sites and a sequence logo of the mononucleotide statistics. The values of the intermixture measure at each recursive step can be found in the protocol.
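To make the difference between the weighted and the non-weighted variant concrete, here is a minimal sketch of the general Jensen-Shannon divergence between two discrete distributions. It shows the textbook definition only; how IMD derives the weights and combines the divergence into the intermixture measure is described in the paper and is not reproduced here. The function names entropy and jsd are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jsd(p, q, w_p=0.5, w_q=0.5):
    """Jensen-Shannon divergence of p and q with weights w_p + w_q = 1.

    Equal weights give the non-weighted variant. Unequal weights could, for
    instance, reflect differing cluster sizes (an assumption for illustration).
    """
    mixture = [w_p * a + w_q * b for a, b in zip(p, q)]
    return entropy(mixture) - w_p * entropy(p) - w_q * entropy(q)
```

With equal weights, two distributions with disjoint support have a divergence of one bit, and identical distributions have a divergence of zero. Shifting the weights toward one distribution lowers the value for the same pair.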
Motif complexity analysis
The tool learns proximal and distal dependence models and mixtures thereof. Note: Learning distal dependence models of order greater than one can be very time- and memory-consuming if the input sequences are long. It is not recommended for motifs of length greater than 20. The tool returns a text file containing the intra-motif complexity measure of the data set, a visualization of the learned model, and a storable (.xml) file that can be used as input to Sequence scan. The mixture weights and model complexities of each component can be found in the protocol.
Sequence scan
This tool is a variant of the InMoDe ScanApp with increased support for different types of models, namely mixture models and distal dependence models. "Input model" needs to be a model file (in .xml format) produced by Motif complexity analysis. The "FPR" here pertains to the number of sequences that have at least one hit. The tool returns a list with the coordinates of motif hits as well as the extracted binding sites.
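The sequence-level reading of "FPR" can be illustrated with a small sketch. It assumes that per-position scores on a set of background sequences are already available and that a hit is a position scoring strictly above a threshold. Both are assumptions for illustration, as are the function names, and the way Sequence scan actually derives its threshold may differ.

```python
def sequence_level_threshold(scores_per_sequence, fpr):
    """Pick a threshold so that at most a fraction `fpr` of sequences contain a hit."""
    # Only the best-scoring position decides whether a sequence has any hit.
    maxima = sorted((max(s) for s in scores_per_sequence), reverse=True)
    allowed = int(fpr * len(maxima))  # number of sequences allowed to contain a hit
    if allowed >= len(maxima):
        return float("-inf")  # every sequence may contain a hit
    # Hits must score strictly above the (allowed + 1)-th largest maximum.
    return maxima[allowed]

def sequence_fpr(scores_per_sequence, threshold):
    """Fraction of sequences with at least one position scoring above the threshold."""
    hits = sum(1 for s in scores_per_sequence if max(s) > threshold)
    return hits / len(scores_per_sequence)
```

Counting sequences rather than positions means that a long sequence with many high-scoring positions still counts only once.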
Example data
These data sets are discussed in the paper in detail (Section "Application examples"), which makes them suitable for testing the functionality of Disentangler.
Source code
Building the source code (to be released soon) requires Jstacs 2.3.