TALgetter

From Jstacs
Latest revision as of 19:58, 25 January 2017

by Jan Grau, Annett Wolf, Maik Reschke, Ulla Bonas, Stefan Posch and Jens Boch.

TALgetter allows you to scan input DNA sequences for putative target sites of a given TAL (transcription activator-like) effector, as typically expressed by many Xanthomonas bacteria. TALgetter uses a local mixture model, which assumes that the nucleotide at each position of a putative target site is determined either by the binding specificity of the RVD (repeat variable di-residue) at that position (if interaction occurs at that position) or by the genomic context (if no interaction takes place). The binding specificities and the importance of the individual RVDs have been trained on known pairs of TAL effectors and target sites. Nucleotide preferences of the genomic context are learned from promoter sequences of A. thaliana and O. sativa.

We provide TALgetter as a public web-server, a web-application that can be installed in a local Galaxy server, and as a command line program.

Paper

If you use TALgetter, please cite

J. Grau, A. Wolf, M. Reschke, U. Bonas, S. Posch, and J. Boch. Computational predictions provide insights into the biology of TAL effector target sites. PLOS Computational Biology 9 (3), 2013.

TALgetter web-server

TALgetter is available as a public web-server at galaxy.informatik.uni-halle.de.

Download

TALgetter is implemented in Java using Jstacs. You can download the command line application as a Jar (http://www.jstacs.de/download.php?which=TALgetter). In addition, we provide the Jar of the Galaxy web-application (http://www.jstacs.de/download.php?which=TALgetterWeb) for installing it in your local Galaxy (http://getgalaxy.org) server.

TALgetter is part of Jstacs release 2.1. You can find the TALgetter sources in the package projects.tals.

Running the command line application

For running the command line application, Java v1.6 or later is required.

The arguments of the command line application have the following meaning:

name | comment | type

input | Input sequences (The sequences to scan for TAL effector target sites, FastA) | String
uo | Upstream offset (Number of positions ignored at 5' end of each sequence, default = 0) | Integer
do | Downstream offset (Number of positions ignored at 3' end of each sequence, default = 0) | Integer
rvd | RVD sequence (Sequence of RVDs, separated by '-', default = NI-HD-HD-NG-NN-NK-NK) | String
model | Model type (TALgetter is the default model that uses individual binding specificities for each RVD. TALgetter13 uses binding specificities that only depend on amino acid 13, i.e., the second amino acid of the repeat. While TALgetter is recommended in most cases, the use of TALgetter13 may be beneficial if you search for target sites of TAL effectors with many rare RVDs, for instance YG, HH, or S*; range = {TALgetter, TALgetter13}, default = TALgetter) | {TALgetter, TALgetter13}
pval | PVals (Computation of p-values, range = {NONE, COARSE, FINE}, default = COARSE) | {NONE, COARSE, FINE}
pthresh | p-Value (Filter the reported hits by a maximum p-value. A value of 0 or 1 switches off the filter; valid range = [0.0, 1.0], default = 1.0) | Double
top | Maximum number of target sites (Limits the total number of reported target sites in all input sequences, valid range = [1, 10000], default = 100) | Integer
train | Training data (The input data to use for training the model, annotated FastA, OPTIONAL) | String
strand | Both strands (Search both strands of the input sequence for target sites, default = false) [since v1.1, default value restores previous behaviour] | Boolean

For instance, to scan the FastA file path/to/myPromoters.fa for the top 100 target sites of the TAL effector TalC, you start TALgetter with

java -jar TALgetter.jar input=path/to/myPromoters.fa rvd="NS-NG-NS-HD-NI-NG-NN-NG-HD-NI-NN-N*-NI-NN-HD-NG-NI-NN-N*-HD-NN-NG"
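The other parameters from the table are passed in the same way, as name=value pairs. The following sketch combines several of them. The concrete values (TALgetter13, FINE, 0.01, 50) are purely illustrative, and spelling the Boolean as strand=true is an assumption about the accepted syntax. The script only prints the command so that it can be inspected before it is run.

```shell
#!/bin/sh
# Illustrative values only; the paths are the placeholders used in the text.
JAR="TALgetter.jar"
INPUT="path/to/myPromoters.fa"
RVDS="NS-NG-NS-HD-NI-NG-NN-NG-HD-NI-NN-N*-NI-NN-HD-NG-NI-NN-N*-HD-NN-NG"

# Parameter names are taken from the table above; the values are examples.
# strand=true is an assumed spelling for the Boolean parameter.
CMD="java -jar $JAR input=$INPUT rvd=\"$RVDS\" model=TALgetter13 pval=FINE pthresh=0.01 top=50 strand=true"

# Print the command for inspection; run it with: eval "$CMD"
echo "$CMD"
```

Keeping the RVD sequence in double quotes matters because it may contain '*', which the shell would otherwise try to expand.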

If you analyze a large data set, for example all 1 kb upstream sequences of rice, TALgetter may require more memory than Java provides by default. You can increase the memory available to TALgetter by passing additional parameters to the Java virtual machine. If you want to start TALgetter with 512 MB of memory initially, which may be increased to at most 1 GB during execution, you call

java -Xms512M -Xmx1G -jar TALgetter.jar input=path/to/myPromoters.fa rvd="NS-NG-NS-HD-NI-NG-NN-NG-HD-NI-NN-N*-NI-NN-HD-NG-NI-NN-N*-HD-NN-NG"

Optionally, you can also train the TALgetter model on your own training data before making predictions. We provide an example file of training sequences at http://www.jstacs.de/downloads/trainingdata.fa. The input format is an annotated FastA file of the form

>seq:<RVD-sequence>; weight: <w>
<DNA-sequence including position 0>
...

for instance:

>seq:NI-NG-NN-NN-NI-HD-HD-NN-NG-NN-NG; weight:0.0476190476190476
TATGGACCGTGT

The specification of the weight is optional.
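In the example above, the DNA sequence is one nucleotide longer than the RVD sequence (11 RVDs, 12 nucleotides), which matches the remark that the sequence includes position 0. A minimal sketch of a consistency check based on this observation is given below. The function name check_training is hypothetical, and the check assumes that each target site occupies a single line; it is not part of TALgetter.

```shell
#!/bin/sh
# check_training FILE: succeed only if every non-empty sequence line is exactly
# one nucleotide longer than the number of RVDs in the preceding header.
check_training() {
  awk '
    /^>/ {
      rvds = $0
      sub(/^>seq:/, "", rvds)   # drop the ">seq:" prefix
      sub(/;.*$/, "", rvds)     # drop the optional "; weight: <w>" part
      n = split(rvds, parts, "-")
      next
    }
    NF > 0 && length($0) != n + 1 {
      printf "line %d: expected %d nucleotides, found %d\n", NR, n + 1, length($0)
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

For instance, check_training path/to/trainingdata.fa reports each sequence line whose length does not fit its header and returns a non-zero exit status if any such line was found.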

To train the TALgetter model on path/to/trainingdata.fa before making predictions for target sites of TalC in path/to/myPromoters.fa, you call

java -jar TALgetter.jar train=path/to/trainingdata.fa input=path/to/myPromoters.fa rvd="NS-NG-NS-HD-NI-NG-NN-NG-HD-NI-NN-N*-NI-NN-HD-NG-NI-NN-N*-HD-NN-NG"

After training, the estimated parameters are printed to STDERR and the predictions are printed to STDOUT, so you can redirect the two to different files.
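One way to separate the two streams is sketched below. The output file names predictions.txt and parameters.txt are hypothetical, and STDERR may contain further messages besides the estimated parameters. The guard around the call only avoids a Java error if the Jar is not in the current directory.

```shell
#!/bin/sh
JAR="TALgetter.jar"
RVDS="NS-NG-NS-HD-NI-NG-NN-NG-HD-NI-NN-N*-NI-NN-HD-NG-NI-NN-N*-HD-NN-NG"

if [ -f "$JAR" ]; then
  # ">" captures STDOUT (predictions), "2>" captures STDERR (estimated parameters).
  java -jar "$JAR" train=path/to/trainingdata.fa input=path/to/myPromoters.fa \
    rvd="$RVDS" > predictions.txt 2> parameters.txt
else
  echo "$JAR not found in the current directory" >&2
fi
```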

Large input data sets

Currently, TALgetter is restricted to input data of at most 90 Mb due to the memory required for computing empirical p-values. As an alternative, you can download a version of TALgetter with limited options (in particular, the computation of p-values is disabled) for scanning large input data sets such as complete genomes (http://www.jstacs.de/download.php?which=TALgetterLong).
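To decide in advance which version to use, the size of a FastA file can be estimated with standard tools. The sketch below counts sequence characters only. It reads "90 Mb" as 90 million bases, which is an assumption, since the text does not say whether bases or bytes are meant. The function names count_bases and fits_limit are hypothetical.

```shell
#!/bin/sh
# count_bases FILE: number of sequence characters, ignoring header lines
# and line breaks.
count_bases() {
  grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

LIMIT=90000000   # 90 Mb read as 90 million bases (an assumption)

# fits_limit FILE: succeed if the sequence content of FILE is within LIMIT.
fits_limit() {
  [ "$(count_bases "$1")" -le "$LIMIT" ]
}
```

For instance, fits_limit path/to/myPromoters.fa || echo "consider the version for large inputs" prints the hint only when the file exceeds the assumed limit.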

Installing the web-application

The command-line program behind the web-application is a Jar as well, so Java is required on the server running Galaxy. To install this command line program in Galaxy, copy it to the desired destination in the Galaxy tools directory.

The command line application writes its Galaxy tool definition file itself. If you are in the directory containing the command-line program for Galaxy, you can create the tool definition file by calling

java -jar TALgetterWeb.jar --create TALgetterWeb.xml

Afterwards, this directory contains the tool definition file TALgetterWeb.xml. You can now register TALgetter in the Galaxy tool_conf.xml file. For details, see the Galaxy tutorial for adding new tools (http://wiki.g2.bx.psu.edu/Admin/Tools/Add%20Tool%20Tutorial).