GeMoMa

Revision as of 20:42, 11 February 2016

by Jens Keilwagen, Michael Wenk, Jessica L. Erickson, Martin H. Schattat, Jan Grau, and Frank Hartung

Gene Model Mapper (GeMoMa) is a homology-based gene prediction program that uses the annotation of protein-coding genes in a reference genome to infer annotation of protein-coding genes in a target genome.

Abstract

Annotation of protein-coding genes is very important in many fields of research and application. Homology-based gene prediction programs allow knowledge to be transferred from an annotated organism to an organism of interest.

Here, we present a homology-based gene prediction program called GeMoMa. GeMoMa utilizes the intron position conservation of related genes in different organisms. We assess the performance of GeMoMa by comparing it with state-of-the-art competitors on plant and animal genomes using an extended best reciprocal hit approach. We find that it often makes more precise predictions than its competitors, yielding up to 622% more correct transcripts. Subsequently, we use RNA-seq data to compare the predictions of homology-based gene prediction programs, and find again that GeMoMa performs well.

Hence, we conclude that exploiting intron position conservation improves homology-based gene prediction, and we make GeMoMa freely available as a command-line tool and as a Galaxy integration.

Paper

We have submitted the paper for review.

Download

GeMoMa is implemented in Java using Jstacs. You can download a zip file containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for

  • running the command line interface (CLI) version
  • creating the XML file needed for the Galaxy integration.

Running the command line application

For running the command line application, Java v1.6 or later is required.

For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with
java -jar GeMoMa-1.1.2.jar CLI Extractor [<parameter>=<value> ...]
The parameters comprise:

name comment type

a annotation (Reference annotation file (GFF), which contains gene models annotated in the reference genome) FILE
g genome (Reference genome file (FASTA)) FILE
gc genetic code (optional user-specified genetic code, OPTIONAL) FILE
p proteins (whether the complete protein sequences should be returned as output, default = false) BOOLEAN
t transcripts (whether the complete transcript sequences should be returned as output, default = false) BOOLEAN
s selected (The path to a list file, which restricts predictions to the transcript ids it contains, OPTIONAL) FILE
v verbose (A flag that enables the output of a wealth of additional information, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
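
As a minimal sketch of how the Extractor parameters above combine into one call, the following script assembles the command line from `<parameter>=<value>` pairs. The file names `reference.gff` and `reference.fasta` and the directory `extractor_out` are hypothetical placeholders, not names prescribed by GeMoMa. The call is only executed if the jar and both input files exist, so the script can also be used as a dry run that just prints the command.

```shell
#!/bin/sh
# Hypothetical file names; replace them with your own reference data.
JAR="GeMoMa-1.1.2.jar"
ANNOTATION="reference.gff"    # parameter a: reference annotation (GFF)
GENOME="reference.fasta"      # parameter g: reference genome (FASTA)
OUTDIR="extractor_out"        # parameter outdir: output directory

# Assemble the call from the documented <parameter>=<value> pairs.
# Note: this simple form assumes that the paths contain no spaces.
EXTRACTOR_CMD="java -jar $JAR CLI Extractor a=$ANNOTATION g=$GENOME outdir=$OUTDIR"
echo "$EXTRACTOR_CMD"

# Run the tool only if the jar and both inputs are present (dry run otherwise).
if [ -f "$JAR" ] && [ -f "$ANNOTATION" ] && [ -f "$GENOME" ]; then
    mkdir -p "$OUTDIR"
    $EXTRACTOR_CMD
fi
```

Optional parameters such as `gc`, `p`, `t`, `s` or `v` can be appended to the command in the same `<parameter>=<value>` form.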

For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with
java -jar GeMoMa-1.1.2.jar CLI GeMoMa [<parameter>=<value> ...]
The parameters comprise:

name comment type

t tblastn results (The sorted tblastn results) FILE
tg target genome (The target genome file (FASTA), i.e., the target sequences in the blast run) FILE
c cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted) FILE
a assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL) FILE
q query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL) FILE
g genetic code (optional user-specified genetic code, OPTIONAL) FILE
s substitution matrix (optional user-specified substitution matrix, OPTIONAL) FILE
go gap opening (The gap opening cost in the alignment, default = 11) INT
ge gap extension (The gap extension cost in the alignment, default = 1) INT
m maximum intron length (The maximum length of an intron, default = 15000) INT
i intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25) INT
e e-value (The e-value for filtering blast results, default = 100.0) DOUBLE
ct contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.9) DOUBLE
r region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9) DOUBLE
h hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9) DOUBLE
p predictions (The (maximal) number of predictions per transcript, default = 1) INT
selected selected (The path to a list file, which restricts predictions to the transcript ids it contains, OPTIONAL) FILE
as avoid stop (A flag that avoids stop codons in a transcript (except at the last amino acid), default = true) BOOLEAN
approx approx (whether an approximation is used to compute the score for intron gain, default = true) BOOLEAN
align align (A flag that enables the output of a tab-delimited file containing the results in a blast-like format (deprecated), default = false) BOOLEAN
genomic genomic (A flag that enables the output of a FASTA file containing the genomic regions of the predictions, default = false) BOOLEAN
prefix prefix (A prefix to be used for naming the predictions, default = ) STRING
tag tag (A user-specified tag for transcript predictions in the third column of the returned GFF; setting this to a specific value might be beneficial for some genome browsers, default = prediction) STRING
v verbose (A flag that enables the output of a wealth of additional information per transcript, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
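
The GeMoMa parameters above can be combined in the same way. The following is a minimal sketch that passes the tblastn results, the target genome, the cds parts and the optional assignment file. All file names (`tblastn_sorted.txt`, `target_genome.fasta`, `cds-parts.fasta`, `assignment.tabular`) and the directory `gemoma_out` are assumptions made for illustration; use the names of your own files. As before, the tool is only started if the jar and all inputs exist.

```shell
#!/bin/sh
# Hypothetical file names; replace them with your own data.
JAR="GeMoMa-1.1.2.jar"
TBLASTN="tblastn_sorted.txt"      # parameter t: sorted tblastn results
TARGET="target_genome.fasta"      # parameter tg: target genome (FASTA)
CDS_PARTS="cds-parts.fasta"       # parameter c: cds parts that have been blasted
ASSIGNMENT="assignment.tabular"   # parameter a: assignment file (optional)
OUTDIR="gemoma_out"               # parameter outdir: output directory

# Assemble the call from the documented <parameter>=<value> pairs.
# Note: this simple form assumes that the paths contain no spaces.
GEMOMA_CMD="java -jar $JAR CLI GeMoMa t=$TBLASTN tg=$TARGET c=$CDS_PARTS a=$ASSIGNMENT outdir=$OUTDIR"
echo "$GEMOMA_CMD"

# Run the tool only if the jar and all inputs are present (dry run otherwise).
if [ -f "$JAR" ] && [ -f "$TBLASTN" ] && [ -f "$TARGET" ] \
    && [ -f "$CDS_PARTS" ] && [ -f "$ASSIGNMENT" ]; then
    mkdir -p "$OUTDIR"
    $GEMOMA_CMD
fi
```

Parameters that are left out, for example `m`, `e` or `p`, keep the defaults listed in the table.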

GeMoMa returns the predicted annotation as a GFF file and the predicted proteins as a FASTA file.
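
Because the value of the parameter tag appears in the third column of the returned GFF, transcript predictions can be selected with standard text tools. The following sketch counts such lines with awk, assuming the default tag `prediction`. The sample records, including the source name `GeMoMa` in column two, the `CDS` lines and the IDs, are made up for illustration and are not actual GeMoMa output.

```shell
#!/bin/sh
TAG="prediction"   # default value of the parameter tag

# Build a tiny, made-up GFF file (tab-separated, nine columns per line).
GFF="$(mktemp)"
printf 'chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0\n' > "$GFF"
printf 'chr1\tGeMoMa\tCDS\t100\t300\t.\t+\t0\tParent=gene1_R0\n' >> "$GFF"
printf 'chr1\tGeMoMa\tCDS\t500\t900\t.\t+\t1\tParent=gene1_R0\n' >> "$GFF"
printf 'chr2\tGeMoMa\tprediction\t50\t400\t.\t-\t.\tID=gene2_R0\n' >> "$GFF"

# Count the lines whose third column equals the tag.
N_PRED=$(awk -F '\t' -v tag="$TAG" '$3 == tag { n++ } END { print n + 0 }' "$GFF")
echo "$N_PRED transcript predictions tagged '$TAG'"

rm -f "$GFF"
```

If GeMoMa was run with a different value for tag, set `TAG` accordingly.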

Version history

GeMoMa 1.1.2 (05.02.2016)

  • GeMoMa bugfix (upstream, downstream sequence for splice site detection)
  • Extractor: new parameter s for selecting transcripts
  • improved Galaxy integration


GeMoMa 1.1.1 (01.02.2016)

  • initial release for paper