PrediTALE

From Jstacs
Revision as of 15:16, 2 May 2019 by Grau

PrediTALE predicts TALE target boxes from the RVD sequence of a TALE, using a novel model learned from quantitative data. A pre-print describing the method behind PrediTALE and comparing its performance to other tools for TALE target prediction is available from bioRxiv (doi:). In addition to PrediTALE, we also provide DerTALE, a tool for filtering genome-wide target site predictions by mapped RNA-seq data after Xanthomonas infection. Both tools are described in more detail below.

PrediTALE and DerTALE are available as a command line application, but have also been integrated into AnnoTALE, which is available with a graphical user interface.

PrediTALE is also available as a web-application at http://galaxy.informatik.uni-halle.de.

Command line tool

PrediTALE is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

PrediTALE and DerTALE are packaged in one runnable JAR that may be run from the command line with

java -jar PrediTALE.jar

which lists the available tools and usage information:

Available tools:

	preditale - PrediTALE
	dertale - DerTALE

Syntax: java -jar PrediTALE.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar PrediTALE.jar <toolname> info

Tool parameters are listed with
	java -jar PrediTALE.jar <toolname>

You get a list of the tool parameters by calling PrediTALE.jar with the corresponding tool name, e.g.,

java -jar PrediTALE.jar preditale

The meaning of the individual tool parameters is described below.
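When PrediTALE is called from a script, the `<toolname> [<parameter=value> ...]` syntax shown above can be assembled programmatically. The following is a minimal sketch, not part of PrediTALE itself: the helper `build_command` and the file names are hypothetical, while the parameter names `s`, `TALEs`, and `outdir` are those listed for PrediTALE in the parameter table.

```python
import shlex


def build_command(tool, params, jar="PrediTALE.jar"):
    """Assemble the argument list for `java -jar <jar> <tool> name=value ...`."""
    args = ["java", "-jar", jar, tool]
    # Each tool parameter is passed as a single name=value token.
    args.extend(f"{name}={value}" for name, value in params.items())
    return args


# Hypothetical input files and output directory, for illustration only.
cmd = build_command(
    "preditale",
    {"s": "promoters.fa", "TALEs": "tales.fa", "outdir": "results"},
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.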

Source code

Source code of PrediTALE and DerTALE is available from github, where the PrediTALE and DerTALE classes may be found in sub-packages of projects.tals.

PrediTALE

As input, PrediTALE requires a set of sequences in FastA format that are scanned for putative TALE target boxes. These sequences could be promoters of genes but also complete genomic sequences. For computing p-values, PrediTALE additionally needs a background set of sequences, which is by default generated as a sub-sample of the original input data. The prediction threshold may be defined either by means of a p-value (significance level) or by an approximate number of expected sites. The latter is also converted to a p-value internally, so the specified number of expected sites is, in general, not met exactly. TALEs are specified by a FastA file containing their RVD sequences, where individual RVDs are separated by dashes (-). This is the same format that is output by the TALE Analysis tool of AnnoTALE. Finally, it can be specified whether both strands or only one of the strands are scanned; in the former case, a penalty may be assigned to predictions on the reverse strand. While this penalty may be reasonable when scanning promoters, it should usually be set to 0 in case of genome-wide predictions.
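To illustrate why a background set is needed: the approximate p-values reported by PrediTALE are based on a Gaussian distribution estimated from background scores. The sketch below shows one way such a mapping could look. It is an assumption-laden illustration, not the PrediTALE implementation: the function name `gaussian_p_value`, the use of the sample mean and sample standard deviation, and the toy background scores are all hypothetical.

```python
import math
import statistics


def gaussian_p_value(score, background_scores):
    """Upper-tail probability of `score` under a normal fit to the background."""
    mu = statistics.fmean(background_scores)
    sigma = statistics.stdev(background_scores)
    z = (score - mu) / sigma
    # Survival function of the standard normal distribution, P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy background scores (made up for illustration).
background = [0.10, 0.15, 0.20, 0.25, 0.30]
p_low = gaussian_p_value(0.20, background)   # score at the background mean
p_high = gaussian_p_value(0.58, background)  # score far above the background
print(p_low, p_high)
```

Because the mapping is monotonic, a threshold on the p-value corresponds to a threshold on the score, and higher scores always receive smaller p-values.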

The parameters of PrediTALE are summarized in the following table:


name	comment	type

s	Sequences (The sequences (e.g., a genome) to scan for binding sites)	FILE
b	Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set, range={sub-sample, background sequences}, default = sub-sample)
	No parameters for selection "sub-sample"
	Parameters for selection "background sequences":
		bs	Background sequences (The sequences (e.g., a genome) for determining the prediction threshold)	FILE
t	Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)
	Parameters for selection "significance level":
		sl	Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 0.01], default = 1.0E-4)	DOUBLE
	Parameters for selection "number of sites":
		n	Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000)	INT
TALEs	TALEs (The RVD sequences of the TALE, separated by dashes, in FastA format)	FILE
Strand	Strand (Prediction target sites on both strands, or the forward or reverse strand, range={both strands, forward strand, reverse strand}, default = both strands)
	Parameters for selection "both strands":
		r	Reverse penalty (Penalty for predictions on the reverse strand, valid range = [0.0, 1.7976931348623157E308], default = 0.01)	DOUBLE
	No parameters for selection "forward strand"
	No parameters for selection "reverse strand"
outdir	The output directory, defaults to the current working directory (.)	STRING

The parameters s, bs (if b="background sequences"), and TALEs require input files in FastA format. In case of s (Input sequences) and bs (Background sequences), these are plain FastA files of chromosomes, promoters, etc. For TALEs, the FastA file contains the RVD sequences of individual TALEs, where RVDs are separated by dashes (-). RVDs in standard repeats are indicated by uppercase letters, whereas lowercase letters indicate repeats of aberrant lengths. An example file of TALEs could look like

>TalAO16 Xoo PXO142 [3183414-3186582:+1]
NI-NN-N*-NG-NS-NN-NN-NN-NI-NN-NI-NG-HD-HD-NI-NG
>TalAP15 Xoo PXO142 [3173259-3176976:+1]
HD-HD-HD-NG-N*-NN-HD-HD-N*-NI-NI-NN-HD-HI-ND-HD-NI-HD-NG-NG
>TalAS12 Xoo PXO142 [1226203-1230511:-1]
NI-HG-NI-NI-HG-HD-NN-HD-HD-HD-NI-NI-nn-NI-HD-HD-HD-HG-NN-NN-HD-NS-NN-HD-NG-NS-N*


where repeat 13 (nn) of TalAS12 is an aberrant repeat.
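The TALE file format described above is simple enough to check with a few lines of code before starting a scan. The following is a minimal sketch, assuming that the first word of each FastA header is the TALE name; the helper names `parse_tale_fasta` and `aberrant_repeats` are hypothetical and not part of PrediTALE.

```python
def parse_tale_fasta(text):
    """Return {TALE name: list of RVDs} from FastA-formatted RVD sequences."""
    tales = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the first word of the header is the TALE name.
            name = line[1:].split()[0]
            tales[name] = []
        elif name is not None:
            # RVDs are separated by dashes; skip empty tokens from trailing dashes.
            tales[name].extend(rvd for rvd in line.split("-") if rvd)
    return tales


def aberrant_repeats(rvds):
    """1-based positions of RVDs written in lowercase (aberrant repeat lengths)."""
    return [i for i, rvd in enumerate(rvds, start=1) if rvd != rvd.upper()]


example = """\
>TalAO16 Xoo PXO142 [3183414-3186582:+1]
NI-NN-N*-NG-NS-NN-NN-NN-NI-NN-NI-NG-HD-HD-NI-NG
>TalAS12 Xoo PXO142 [1226203-1230511:-1]
NI-HG-NI-NI-HG-HD-NN-HD-HD-HD-NI-NI-nn-NI-HD-HD-HD-HG-NN-NN-HD-NS-NN-HD-NG-NS-N*
"""
tales = parse_tale_fasta(example)
for tale_name, rvds in tales.items():
    print(tale_name, len(rvds), aberrant_repeats(rvds))
```

For the two records used here, the sketch reports no aberrant repeats for TalAO16 and position 13 for TalAS12, matching the example file.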

The output of PrediTALE is a file Predicted_binding_sites_for_<tal>.tsv and an example of this output is shown below:

# Seq-ID	Position	Strand	Score	Sequence	Approx. p-value	RVDs	TALE
CA06g21040	185	+	0.5830908419447522	TATATAAACCTGACCCCCT	2.670610876887025E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA03g22700	150	+	0.5771893086290628	TATATAAACCTGACCCTTT	3.5133014164578924E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA07g16760	19	+	0.562284609978066	TCTATAAAACTTACCCTCA	6.95027071451193E-8	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA10g21700	13	+	0.5355239120812225	TCTTTAAAACTTCCCATCT	2.2789662912359177E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA12g19020	163	-	0.5236377005594375	TCTTTACACCTTGCCATCT	3.8030633153773863E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA06g22640	32	-	0.52299565079445	TCTCTAAAACTCCTCCTCT	3.908676923236598E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA05g04530	210	+	0.5147214141360666	TATATAAACCATCCCCTCA	5.549688084638404E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3
CA06g12590	99	-	0.5122527127048724	TCTTTAACCATTCCCCTCT	6.156103310450689E-7	HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG	AvrBs3

The first column contains the identifier of the sequence from the FastA file of input sequences. This could, for instance, be the IDs of the downstream genes of promoters or chromosome names in case of genome-wide scans. The second and third columns contain the start position and strand orientation of the predicted target box within that sequence. The fourth column contains the PrediTALE score, and the fifth column the corresponding predicted target box. The sixth column contains an approximate p-value, which results from a monotonic mapping from scores to p-values based on a Gaussian distribution estimated from the background scores. The seventh and eighth columns echo the RVD sequence and name of the input TALE (according to the FastA header of the file provided to the TALEs parameter), which might be handy in case multiple prediction results should be combined in a joint file (e.g., across all TALEs of a strain).
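Since the output is plain tab-separated text with a header line starting with "#", it can be read with standard tooling, for instance to combine predictions of several TALEs. The sketch below uses only the Python standard library; the function name `read_predictions` is hypothetical, and the two data rows are copied from the example output of this page.

```python
import csv
import io


def read_predictions(handle):
    """Yield one dict per predicted target box from a PrediTALE .tsv file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return
    header = [col.strip() for col in header]
    header[0] = header[0].lstrip("# ")  # the header line starts with "# "
    for row in reader:
        if not row:
            continue
        rec = dict(zip(header, row))
        # Convert the numeric columns; all other columns stay strings.
        rec["Position"] = int(rec["Position"])
        rec["Score"] = float(rec["Score"])
        rec["Approx. p-value"] = float(rec["Approx. p-value"])
        yield rec


example = (
    "# Seq-ID\tPosition\tStrand\tScore\tSequence\tApprox. p-value\tRVDs\tTALE\n"
    "CA12g19020\t163\t-\t0.5236377005594375\tTCTTTACACCTTGCCATCT\t3.8030633153773863E-7\t"
    "HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG\tAvrBs3\n"
    "CA06g21040\t185\t+\t0.5830908419447522\tTATATAAACCTGACCCCCT\t2.670610876887025E-8\t"
    "HD-NG-NS-NG-NI-NI-NI-HD-HD-NG-NS-NS-HD-HD-HD-NG-HD-NG\tAvrBs3\n"
)
records = list(read_predictions(io.StringIO(example)))
# Sort by score, e.g. after concatenating the records of several TALEs.
records.sort(key=lambda rec: rec["Score"], reverse=True)
for rec in records:
    print(rec["TALE"], rec["Seq-ID"], rec["Position"], rec["Strand"], rec["Score"])
```

For a file on disk, the same function can be applied to `open(path, newline="")`.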


DerTALE

As input, DerTALE requires a list of target box predictions as generated by the Predict and Intersect Targets tool of AnnoTALE or by PrediTALE. In addition, DerTALE accepts prediction outputs of other tools like TALE-NT or Talvez.

For determining differentially expressed regions, DerTALE also needs mapped RNA-seq data after Xanthomonas infection (treatment) and from a control experiment in BAM format, which is the standard output format of most mappers and may be generated from the SAM format using samtools. For each BAM file, DerTALE also needs an index file with the same base name as the BAM file but the additional extension .bai (as generated by samtools).

Further parameters that can be specified include the number of predictions in the list that are considered (counting from the top), the width of the region in which differential expression is considered, the width of the window that needs to be differentially expressed, a pseudo count on the count profile, the measure for comparing replicates, and a threshold on the log (base 2) differential abundance (e.g., 1 for a two-fold induction).
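How window width, pseudo count, and threshold interact can be illustrated with a small calculation. The following is a rough sketch under stated assumptions, not the DerTALE implementation: it assumes that the pseudo count is added at every position of the count profile and that windows are compared by the log (base 2) ratio of their summed counts. Replicate handling (EXTREMES, MEDIAN, MEAN) is omitted, and the function name and toy profiles are hypothetical.

```python
import math


def windowed_log2_ratio(treatment, control, window, pseudo=1.0):
    """log2 ratio of summed (count + pseudo) values for every window position."""
    if len(treatment) != len(control):
        raise ValueError("profiles must have the same length")

    def window_sums(profile):
        sums, running = [], 0.0
        for i, value in enumerate(profile):
            running += value + pseudo
            if i >= window:
                # Drop the position that just left the window.
                running -= profile[i - window] + pseudo
            if i >= window - 1:
                sums.append(running)
        return sums

    return [
        math.log2(t / c)
        for t, c in zip(window_sums(treatment), window_sums(control))
    ]


# Toy per-base read counts around a predicted site (made up for illustration).
treatment = [0, 0, 7, 7, 7, 0]
control = [0, 0, 1, 1, 1, 0]
ratios = windowed_log2_ratio(treatment, control, window=3, pseudo=1.0)
threshold = 1.0  # corresponds to a two-fold induction
print([round(r, 2) for r in ratios], max(ratios) >= threshold)
```

With a positive pseudo count, windows without any reads in the control yield a finite ratio instead of a division by zero, which is the usual motivation for such a parameter.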

The parameters of DerTALE are summarized in the following table:

name	comment	type

p	Predictions (Predictions output file)	FILE
The following parameter(s) can be used multiple times:
	t	Treatment BAM (BAM file of mapped reads from treatment experiment. BAM file must have an index with additional extension .bai.)	FILE
The following parameter(s) can be used multiple times:
	c	Control BAM (BAM file of mapped reads from control experiment. BAM file must have an index with additional extension .bai.)	FILE
n	Number of predictions (Number of (top) predictions considered, default = 100)	INT
r	Region width (Number of bases around the predicted site, default = 3000)	INT
w	Window width (Width of the window considered for differential abundance, default = 300)	INT
pc	Pseudo count (Pseudo count on the count profile, default = 1.0)	DOUBLE
Compare	Compare (Measure for comparing replicates, range={EXTREMES, MEDIAN, MEAN}, default = MEAN)	STRING
Threshold	Threshold (Threshold on the log differential abundance, default = 1.0)	DOUBLE
outdir	The output directory, defaults to the current working directory (.)	STRING

We provide an R script for plotting profiles from the DerTALE output.