PrediTALE
Revision as of 11:50, 3 May 2019
PrediTALE predicts TALE target boxes using a novel model learned from quantitative data based on the RVD sequence of a TALE. A pre-print describing the method behind PrediTALE and comparing its performance to other tools for TALE target prediction is available from biorxiv (doi:). In addition to PrediTALE, we also provide DerTALE, a tool for filtering genome-wide target site predictions by mapped RNA-seq data after Xanthomonas infection. Both tools are described in more detail below.
PrediTALE and DerTALE are available as a command line application, but have also been integrated into AnnoTALE, which is available with a graphical user interface.
PrediTALE is also available as a web-application at http://galaxy.informatik.uni-halle.de.
Command line tool
PrediTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
PrediTALE and DerTALE are packaged in one runnable JAR that may be run from the command line with
java -jar PrediTALE.jar
which lists the tools available and usage information
Available tools:

	preditale - PrediTALE
	dertale - DerTALE

Syntax: java -jar PrediTALE.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar PrediTALE.jar <toolname> info

Tool parameters are listed with
	java -jar PrediTALE.jar <toolname>
You get a list of the tool parameters by calling PrediTALE.jar with the corresponding tool name, e.g.,
java -jar PrediTALE.jar preditale
The meaning of the individual tool parameters is described below.
Source code
Source code of PrediTALE and DerTALE is available from GitHub, where the PrediTALE and DerTALE classes may be found in sub-packages of projects.tals.
PrediTALE
As input, PrediTALE requires a set of sequences that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences (FastA format). For computing p-values, PrediTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data. The prediction threshold may be defined either by means of a p-value or an approximate number of expected sites. The latter is also converted to a p-value internally, so the specified number of expected sites is, in general, not met exactly. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format that is output by the TALE Analysis tool of AnnoTALE. Finally, it can be specified whether both strands or only one of the strands are scanned; in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to 0 in case of genome-wide predictions.
The parameters of PrediTALE are summarized in the following table:
parameter | name (comment) | type
s | Sequences (The sequences (e.g., a genome) to scan for binding sites) | FILE
b | Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set, range={sub-sample, background sequences}, default = sub-sample) |
t | Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level) |
TALEs | TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format) | FILE
Strand | Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands) |
outdir | The output directory, defaults to the current working directory (.) | STRING
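The syntax line shown earlier (`java -jar PrediTALE.jar <toolname> [<parameter=value> ...]`) suggests that each parameter from the table is passed as one `key=value` token. The following Python sketch assembles such a call as an argument list. The file names `promoters.fa` and `tales.fa`, the directory `results`, and the helper `build_command` are hypothetical and only serve as illustration.

```python
from typing import Dict, List


def build_command(tool: str, params: Dict[str, str], jar: str = "PrediTALE.jar") -> List[str]:
    """Assemble 'java -jar <jar> <toolname> <parameter=value> ...' as an argument list."""
    args = ["java", "-jar", jar, tool]
    # Each parameter is passed as a single key=value token.
    args.extend(f"{key}={value}" for key, value in params.items())
    return args


# Hypothetical file names, for illustration only.
command = build_command(
    "preditale",
    {"s": "promoters.fa", "TALEs": "tales.fa", "outdir": "results"},
)
print(" ".join(command))
```

Keeping the call as a list rather than a single string means it can be handed to `subprocess.run(command)` without shell quoting issues, for example when a value such as `background sequences` contains a space.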
The parameters s, bs (if b="background sequences"), and TALEs require input files in FastA format.
In case of s (Input sequences) and bs (Background sequences), these are just FastA files of chromosomes, promoters, etc.
For TALEs, the FastA file contains the RVD sequences of individual TALEs, where RVDs are separated by dashes (-). RVDs in standard repeats are indicated by uppercase letters, whereas lowercase letters indicate repeats of aberrant lengths.
An example file of TALEs could look like
>TalAO16 Xoo PXO142 [3183414-3186582:+1]
NI-NN-N*-NG-NS-NN-NN-NN-NI-NN-NI-NG-HD-HD-NI-NG
>TalAP15 Xoo PXO142 [3173259-3176976:+1]
HD-HD-HD-NG-N*-NN-HD-HD-N*-NI-NI-NN-HD-HI-ND-HD-NI-HD-NG-NG
>TalAS12 Xoo PXO142 [1226203-1230511:-1]
NI-HG-NI-NI-HG-HD-NN-HD-HD-HD-NI-NI-nn-NI-HD-HD-HD-HG-NN-NN-HD-NS-NN-HD-NG-NS-N*
where repeat 13 (nn) of TalAS12 is an aberrant repeat.
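The TALE file format described above can be read with a few lines of Python. This is a minimal sketch, not the parser used by PrediTALE: the functions `parse_tale_fasta` and `aberrant_repeats` are hypothetical names, and taking the first word of the header as the TALE name is an assumption made for illustration.

```python
from typing import Dict, List, Optional


def parse_tale_fasta(text: str) -> Dict[str, List[str]]:
    """Map each TALE name (assumed: first word of the header) to its list of RVDs."""
    tales: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            tales[name] = []
        elif name is not None:
            # RVDs are separated by dashes; skip empty tokens from wrapped lines.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


def aberrant_repeats(rvds: List[str]) -> List[int]:
    """Return 1-based positions of repeats written in lowercase letters."""
    return [i for i, rvd in enumerate(rvds, start=1) if rvd != rvd.upper()]


example = """>TalAO16 Xoo PXO142 [3183414-3186582:+1]
NI-NN-N*-NG-NS-NN-NN-NN-NI-NN-NI-NG-HD-HD-NI-NG
>TalAS12 Xoo PXO142 [1226203-1230511:-1]
NI-HG-NI-NI-HG-HD-NN-HD-HD-HD-NI-NI-nn-NI-HD-HD-HD-HG-NN-NN-HD-NS-NN-HD-NG-NS-N*
"""

tales = parse_tale_fasta(example)
for tale_name, rvds in tales.items():
    print(tale_name, len(rvds), aberrant_repeats(rvds))
```

Comparing each RVD with its uppercase form leaves RVDs such as `N*` untouched, since the asterisk has no case.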
The output of PrediTALE is a file Predicted_binding_sites_for_<tal>.tsv, and an example of this output is shown below:
# Seq-ID	Position	Strand	Score	Sequence	Approx. p-value	RVDs	TALE
CA06g21040	185	+	0.5830908419447522	TATATAAACCTGACCCCCT	2.670610876887025E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA03g22700	150	+	0.5771893086290628	TATATAAACCTGACCCTTT	3.5133014164578924E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA07g16760	19	+	0.562284609978066	TCTATAAAACTTACCCTCA	6.95027071451193E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA10g21700	13	+	0.5355239120812225	TCTTTAAAACTTCCCATCT	2.2789662912359177E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA12g19020	163	-	0.5236377005594375	TCTTTACACCTTGCCATCT	3.8030633153773863E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA06g22640	32	-	0.52299565079445	TCTCTAAAACTCCTCCTCT	3.908676923236598E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA05g04530	210	+	0.5147214141360666	TATATAAACCATCCCCTCA	5.549688084638404E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA06g12590	99	-	0.5122527127048724	TCTTTAACCATTCCCCTCT	6.156103310450689E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
The first column contains the identifier of the sequence from the FastA file of input sequences. This could, for instance, be the IDs of the downstream genes of promoters or chromosome names in case of genome-wide scans.
The second and third columns contain the start position and strand orientation of the predicted target box within that sequence.
The fourth column contains the PrediTALE score, and the fifth column the corresponding predicted target box.
The sixth column contains an approximate p-value, which results from a monotonic mapping from scores to p-values based on a Gaussian distribution estimated from the background scores.
The seventh and eighth columns echo the RVD sequence and name of the input TALE (according to the FastA header of the file provided to the TALEs parameter), which might be handy in case multiple prediction outputs should be combined in a joint file (e.g., across all TALEs of a strain).
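The approximate p-value described above can be illustrated with a small sketch: fit a normal distribution to the background scores and report the upper-tail probability of a score. This is not PrediTALE's implementation. The background values are made up, and the use of the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def gaussian_p_value(score: float, background: Sequence[float]) -> float:
    """Upper-tail probability of `score` under a normal fit to `background`."""
    mu = mean(background)
    sigma = stdev(background)  # sample standard deviation (assumption)
    z = (score - mu) / sigma
    # Survival function of the standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Made-up background scores, for illustration only.
background_scores = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13, 0.07, 0.10]
for s in (0.10, 0.20, 0.30):
    print(s, gaussian_p_value(s, background_scores))
```

Because the survival function decreases strictly with the score, sorting by score and sorting by p-value give the same order for a single TALE, which is the monotonicity the text refers to.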
The reported target boxes of a TALE are already sorted by score (and, as the mapping is monotonic, also by p-value), so the most likely target boxes should appear at the very top of the list. Hence, we recommend investigating matches in the PrediTALE output in the given order. Typically, the best target boxes may be found in the top 20 predictions listed, but there are rare examples where true targets also appear at later ranks.
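For inspecting the top of the list programmatically, the output file can be read with Python's standard `csv` module. The sketch below assumes that the file is tab-separated (as the `.tsv` extension suggests) and that the first line is a header starting with `#`. The function `read_predictions` is a hypothetical helper, and the two data rows are copied from the example output.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_predictions(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab-separated predictions; the first non-empty line is the header."""
    header = None
    rows: List[Dict[str, str]] = []
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue
        if header is None:
            # Drop the leading '# ' of the header line.
            fields[0] = fields[0].lstrip("# ")
            header = fields
            continue
        rows.append(dict(zip(header, fields)))
    return rows


RVDS = "HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG"
sample = "\n".join([
    "\t".join(["# Seq-ID", "Position", "Strand", "Score", "Sequence",
               "Approx. p-value", "RVDs", "TALE"]),
    "\t".join(["CA06g21040", "185", "+", "0.5830908419447522",
               "TATATAAACCTGACCCCCT", "2.670610876887025E-8", RVDS, "AvrBs3"]),
    "\t".join(["CA03g22700", "150", "+", "0.5771893086290628",
               "TATATAAACCTGACCCTTT", "3.5133014164578924E-8", RVDS, "AvrBs3"]),
])

rows = read_predictions(io.StringIO(sample))
top = rows[:20]  # the file is already sorted by score
for row in top:
    print(row["Seq-ID"], row["Strand"], float(row["Score"]), float(row["Approx. p-value"]))
```

Reading the column names from the header line, rather than hard-coding positions, keeps the sketch independent of the exact column order.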
PrediTALE scores are already normalized to the length of the TALE (i.e., number of repeats). However, scores might still be hard to compare between different TALEs, and ranking by p-values (which are conceptually comparable between TALEs) might be a more reasonable choice for ranking lists for multiple TALEs. P-values should be corrected for multiple testing before applying a significance level as a filtering step to PrediTALE predictions.
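The text does not prescribe a particular correction method. As one common choice, the following sketch applies the Benjamini-Hochberg procedure to a list of p-values. The function name and the input values are made up for illustration.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, for illustration only.
p_values = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

The sketch uses the number of supplied p-values as the number of tests. For a genome-wide scan, the appropriate number of tests may be larger than the number of reported sites, which is a separate decision not covered here.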
DerTALE
As input, DerTALE requires a list of target box predictions as generated by the Predict and Intersect Targets tool of AnnoTALE or by PrediTALE. In addition, DerTALE accepts prediction outputs of other tools like TALE-NT or Talvez.
For determining differentially expressed regions, DerTALE also needs mapped RNA-seq data after Xanthomonas infection (treatment) and control in BAM format, which is the standard output format of most mappers, and may be generated from the SAM format using samtools. For each BAM file, DerTALE also needs an index file with the same base name as the BAM file but additional extension .bai (as generated by samtools).
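Before running DerTALE, it can be useful to check that every BAM file has its index next to it. The following sketch derives the expected index name by appending `.bai` to the BAM file name, as described above. The helper names and the file name `treatment_rep1.bam` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def index_path(bam: str) -> Path:
    """Expected index file: the BAM file name with '.bai' appended."""
    return Path(bam + ".bai")


def missing_indexes(bam_files: Iterable[str]) -> List[str]:
    """Return the BAM files for which no index file exists on disk."""
    return [bam for bam in bam_files if not index_path(bam).is_file()]


# Hypothetical file name; a missing index can be created with
# 'samtools index treatment_rep1.bam'.
print(index_path("treatment_rep1.bam").name)
```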
Further parameters that can be specified include the number of predictions in the list that are considered (counting from the top), the width of the region in which differential expression is considered, the width of the window that needs to be differentially expressed, a pseudo count on the count profile, the measure for comparing replicates, and a threshold on the log (base 2) differential abundance (e.g., 1 for a two-fold induction).
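To make the roles of window width, pseudo count and threshold concrete, the following sketch scans two coverage profiles with a sliding window and reports windows whose mean log2 ratio reaches the threshold. This is an illustration under stated assumptions, not DerTALE's algorithm: how DerTALE aggregates counts within a window or across replicates is not described here, and the profiles are made up.

```python
import math
from typing import List, Sequence


def log2_ratio(treatment: float, control: float, pseudo_count: float = 1.0) -> float:
    """Log2 differential abundance with a pseudo count added to both counts."""
    return math.log2((treatment + pseudo_count) / (control + pseudo_count))


def induced_windows(
    treatment: Sequence[float],
    control: Sequence[float],
    width: int,
    threshold: float = 1.0,
    pseudo_count: float = 1.0,
) -> List[int]:
    """Start indices of windows whose mean log2 ratio reaches the threshold."""
    starts = []
    for start in range(0, len(treatment) - width + 1):
        ratios = [
            log2_ratio(t, c, pseudo_count)
            for t, c in zip(treatment[start:start + width], control[start:start + width])
        ]
        if sum(ratios) / width >= threshold:
            starts.append(start)
    return starts


# Made-up per-base coverage profiles, for illustration only.
treatment_profile = [1, 1, 3, 3, 3, 1]
control_profile = [1, 1, 1, 1, 1, 1]
print(log2_ratio(3, 1))
print(induced_windows(treatment_profile, control_profile, width=3))
```

With a pseudo count of 1, a treatment count of 3 against a control count of 1 gives (3 + 1) / (1 + 1) = 2, i.e., a log2 ratio of 1, which corresponds to the two-fold induction mentioned above. The pseudo count also keeps the ratio defined where the control coverage is zero.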
The parameters of DerTALE are summarized in the following table:
parameter | name (comment) | type
p | Predictions (Predictions output file) | FILE
The following parameter(s) can be used multiple times:
The following parameter(s) can be used multiple times:
n | Number of predictions (Number of (top) predictions considered, default = 100) | INT
r | Region width (Number of bases around the predicted site, default = 3000) | INT
w | Window width (Width of the window considered for differential abundance, default = 300) | INT
pc | Pseudo count (Pseudo count on the count profile, default = 1.0) | DOUBLE
Compare | Compare (Measure for comparing replicates, range={EXTREMES, MEDIAN, MEAN}, default = MEAN) | STRING
Threshold | Threshold (Threshold on the log differential abundance, default = 1.0) | DOUBLE
outdir | The output directory, defaults to the current working directory (.) | STRING
We provide an R script for plotting profiles from the DerTALE output in R.