GeMoRNA: Difference between revisions

From Jstacs
== Tools ==


=== GeMoRNA ===
''GeMoRNA'' may be called with
 java -jar GeMoRNA-1.0.jar gemorna
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>Genome (Genome sequence as FastA, type = fa,fna,fasta)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td>Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">l</font></td>
<td>Longest intron length (Length of the longest intron reported, default = 100000)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">sil</font></td>
<td>Shortest intron length (Length of the shortest intron considered, default = 10)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">lr</font></td>
<td>Long reads (Long-read mode, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mnor</font></td>
<td>Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mfor</font></td>
<td>Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mnoir</font></td>
<td>Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mfoir</font></td>
<td>Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mrpg</font></td>
<td>Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mrpt</font></td>
<td>Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">pa</font></td>
<td>Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">sf</font></td>
<td>Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mrl</font></td>
<td>Maximum region length (Maximum length of a region considered before it is split, default = 750000)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mfgl</font></td>
<td>Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">q</font></td>
<td>Quality filter (Minimum mapping quality required for a read to be considered, default = 40)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mpl</font></td>
<td>Minimum protein length (Minimum length of protein in AA, default = 70)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td>The number of threads used for the tool, defaults to 1</td>
<td>INT</td>
</tr>
</table>
'''Example:'''
 java -jar GeMoRNA-1.0.jar gemorna g=&lt;Genome&gt; m=&lt;Mapped_reads&gt;
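A call that sets a few of the optional parameters from the table above might look like the following sketch. The file names and the output directory are hypothetical, and whether FR_FIRST_STRAND is the right value depends on the library protocol. The snippet only prints the command (a dry run), so it can be checked before anything is executed.

```shell
# Hypothetical input files; replace with your own paths.
GENOME=genome.fa
READS=reads.sorted.bam

# Collect the call as positional parameters so that each key=value pair
# stays a single argument.
set -- java -jar GeMoRNA-1.0.jar gemorna \
  "g=${GENOME}" "m=${READS}" \
  s=FR_FIRST_STRAND threads=4 outdir=gemorna_out

GEMORNA_CMD="$*"
echo "${GEMORNA_CMD}"   # dry run: print the command instead of executing it
# "$@"                  # uncomment to actually run GeMoRNA
```

All parameters not given on the command line keep the defaults listed in the table.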
=== Predict CDS from GFF ===
''Predict CDS from GFF'' may be called with
 java -jar GeMoRNA-1.0.jar predictCDS
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>Genome (Genome sequence as FastA, type = fa,fna,fasta)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>predicted annotation (GFF or GTF file containing the predicted annotation, type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>Minimum protein length (Minimum length of protein in AA, default = 70)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
'''Example:'''
 java -jar GeMoRNA-1.0.jar predictCDS g=&lt;Genome&gt; p=&lt;predicted_annotation&gt;
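As a sketch, the minimum protein length can be raised above its default of 70 AA with the m parameter. The file names, the threshold of 100 and the output directory are assumptions chosen for illustration. The command is printed only, not run.

```shell
# Hypothetical inputs; replace with your own paths.
GENOME=genome.fa
ANNOTATION=predictions.gff
MIN_AA=100   # stricter than the documented default of 70

set -- java -jar GeMoRNA-1.0.jar predictCDS \
  "g=${GENOME}" "p=${ANNOTATION}" "m=${MIN_AA}" outdir=cds_out

PREDICTCDS_CMD="$*"
echo "${PREDICTCDS_CMD}"   # dry run: print the command instead of executing it
# "$@"                     # uncomment to actually run the tool
```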
=== GeMoMa Annotation Filter ===
This tool combines and filters gene predictions from different sources, yielding a common gene prediction. The tool does not modify the predictions; it filters out redundant and low-quality predictions and selects the relevant ones. In addition, it adds attributes to the annotation of transcript predictions.

The algorithm does the following:
First, redundant predictions are identified, and the additional attributes evidence and sumWeight are introduced.
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.
Third, clusters of overlapping predictions are determined, the predictions are sorted within each cluster, and relevant predictions are extracted.

Optionally, annotation info can be added for each reference organism, enabling a functional prediction for predicted transcripts based on the function of the reference transcript.
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as tab-delimited files with at least the following columns: transcriptName, GO and .*defline.

Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you want to do so, we recommend running '''AnnotationEvidence''' on each of these input files to add attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.

For more information, please visit http://www.jstacs.de/index.php/GeMoMa
If you have questions, comments, or bug reports, please check the FAQs on our homepage and our GitHub page https://github.com/Jstacs/Jstacs/labels/GeMoMa or contact jens.keilwagen@julius-kuehn.de

''GeMoMa Annotation Filter'' may be called with
 java -jar GeMoRNA-1.0.jar GAF
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">w</font></td>
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used in sorting or filter criteria. It is especially useful if the gene annotation files were obtained from different software packages (e.g., GeMoMa, Braker, ...) that have different attributes, default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">k</font></td>
<td>kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">gc</font></td>
<td>good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">trend</font></td>
<td>trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;GLOBAL&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;LOCAL&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">margin</font></td>
<td>margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">q</font></td>
<td>quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</table></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>filter (A filter can be applied to transcript predictions to retain only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without a score (isNaN(score)) are used, because external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could, for instance, use the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>intermediate result (a switch to decide whether an intermediate result of filtered predictions that are not combined to genes should be returned, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">l</font></td>
<td>length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">atf</font></td>
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could combine several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">cbf</font></td>
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mnotpg</font></td>
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">aat</font></td>
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">tf</font></td>
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
'''Example:'''
 java -jar GeMoRNA-1.0.jar GAF g=&lt;gene_annotation_file&gt;
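The following sketch combines two prediction files with different weights and uses the more sophisticated filter quoted in the description of the f parameter. The file names and prefixes are hypothetical, and it is an assumption that each repeated group is given as consecutive p, w and g arguments. The filter contains spaces, quotes, parentheses and the characters * and >, so it is wrapped in double quotes to reach the tool as one argument. The snippet prints one argument per line instead of running the tool.

```shell
# Filter expression taken from the description of the f parameter.
FILTER="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Hypothetical prefixes and file names; one p/w/g group per input file.
set -- java -jar GeMoRNA-1.0.jar GAF \
  p=refA_ w=1.0 g=refA_predictions.gff \
  p=refB_ w=0.5 g=refB_predictions.gff \
  "f=${FILTER}" s=sumWeight,score,aa outdir=gaf_out

GAF_ARGC=$#
GAF_FILTER_ARG="${11}"

printf '%s\n' "$@"   # dry run: one argument per line shows the filter stays intact
# "$@"               # uncomment to actually run GAF
```

Without the double quotes, the shell would split the filter at spaces and treat > as a redirection.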
=== Analyzer ===
This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.

True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure equal to the absolute value of this entry, but that another true transcript matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are also listed. The table contains the attributes of the true and the predicted annotation, plus some additional columns that make it easy to filter interesting examples and to compute statistics.

The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.
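
The text above does not spell out how the nucleotide F1 measure is computed. Assuming the usual definition, with T the set of nucleotide positions covered by the true transcript and P the set covered by the predicted transcript (CDS or exon features, depending on the c parameter), it would read:

```latex
\[
\mathrm{precision} = \frac{|T \cap P|}{|P|}, \qquad
\mathrm{recall} = \frac{|T \cap P|}{|T|}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
    = \frac{2\,|T \cap P|}{|T| + |P|}
\]
```

Under this assumption, a true transcript of 900 nt and a predicted transcript of 1000 nt that share 800 nt give F1 = 1600/1900, or about 0.84.
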
For more information, please visit http://www.jstacs.de/index.php/GeMoMa
If you have questions, comments, or bug reports, please check the FAQs on our homepage and our GitHub page https://github.com/Jstacs/Jstacs/labels/GeMoMa or contact jens.keilwagen@julius-kuehn.de

''Analyzer'' may be called with
 java -jar GeMoRNA-1.0.jar Analyzer
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td>name (can be used to distinguish different predictions, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>CDS (if true CDS features are used otherwise exon features, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">o</font></td>
<td>only introns (if true only intron borders (=splice sites) are evaluated, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">w</font></td>
<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">ca</font></td>
<td>common attributes (Only GFF attributes of mRNAs that occur in the given portion of all mRNAs are included in the result table. Attributes and their portion are handled independently for truth and prediction. This parameter allows choosing between a more informative table and a more compact table, valid range = [0.0, 1.0], default = 0.5)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>filter (A filter for deciding which transcripts from the truth are reliable. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1), default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
'''Example:'''
 java -jar GeMoRNA-1.0.jar Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;
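To benchmark two sets of predictions against the same truth and also write the detailed comparison table, a call could be sketched as follows. The file names and labels are hypothetical, and it is an assumption that each repeated group is given as an n argument followed by its p argument. The ca parameter is only available because w=YES is selected. The command is printed only, not run.

```shell
# Hypothetical annotation files and labels.
set -- java -jar GeMoRNA-1.0.jar Analyzer \
  t=truth.gff \
  n=gemorna p=gemorna_predictions.gff \
  n=merged p=merged_predictions.gff \
  w=YES ca=0.8 outdir=analyzer_out

ANALYZER_CMD="$*"
echo "${ANALYZER_CMD}"   # dry run: print the command instead of executing it
# "$@"                   # uncomment to actually run the Analyzer
```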
=== Merge ===
''Merge'' may be called with
 java -jar GeMoRNA-1.0.jar merge
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>GeMoMa (GeMoMa predictions, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA</font></td>
<td>GeMoRNA (GeMoRNA predictions, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>Mode (range={intersect, union, intermediate, annotate}, default = intersect)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;intersect&quot;</b></td></tr>
<tr><td colspan=3><b>No parameters for selection &quot;union&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;intermediate&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoMa-strict</font></td>
<td>GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA-strict</font></td>
<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;annotate&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoMa-strict</font></td>
<td>GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA-strict</font></td>
<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
'''Example:'''
 java -jar GeMoRNA-1.0.jar merge g=&lt;GeMoMa&gt; GeMoRNA=&lt;GeMoRNA&gt;
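According to the parameter table, the modes intermediate and annotate take two further inputs, GeMoMa-strict and GeMoRNA-strict. A sketch of such a call, with hypothetical file names and the command printed instead of executed:

```shell
# Hypothetical prediction files from runs with default and strict settings.
set -- java -jar GeMoRNA-1.0.jar merge \
  g=gemoma.gff GeMoRNA=gemorna.gff \
  m=intermediate \
  GeMoMa-strict=gemoma_strict.gff GeMoRNA-strict=gemorna_strict.gff \
  outdir=merge_out

MERGE_CMD="$*"
echo "${MERGE_CMD}"   # dry run: print the command instead of executing it
# "$@"                # uncomment to actually run the merge
```

For the modes intersect and union, the two strict inputs are not listed as parameters and can be left out.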

Revision as of 16:18, 8 November 2024

Tools

GeMoRNA

GeMoRNA may be called with

java -jar GeMoRNA-1.0.jar gemorna

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna,fasta) FILE
m Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam) FILE
s Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) STRING
l Longest intron length (Length of the longest intron reported, default = 100000) INT
sil Shortest intron length (Length of the shortest intron considered, default = 10) INT
lr Long reads (Long-read mode, default = false) BOOLEAN
mnor Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0) DOUBLE
mfor Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01) DOUBLE
mnoir Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0) DOUBLE
mfoir Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01) DOUBLE
p Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9) DOUBLE
mrpg Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0) DOUBLE
mrpt Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0) DOUBLE
pa Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05) DOUBLE
sf Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0) DOUBLE
mrl Maximum region length (Maximum length of a region considered before it is split, default = 750000) INT
mfgl Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50) INT
q Quality filter (Minimum mapping quality required for a read to be considered, default = 40) INT
mpl Minimum protein length (Minimum length of protein in AA, default = 70) INT
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 INT

Example:

java -jar GeMoRNA-1.0.jar gemorna g=<Genome> m=<Mapped_reads>


Predict CDS from GFF

Predict CDS from GFF may be called with

java -jar GeMoRNA-1.0.jar predictCDS

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna.fasta) FILE
p predicted annotation ("GFF or GTF file containing the predicted annotation", type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz) FILE
m Minimum protein length (Minimum length of protein in AA, default = 70) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar predictCDS g=<Genome> p=<predicted_annotation>


GeMoMa Annotation Filter

This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.

The algorithm does the following: First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced). Second, the predictions are filtered using the user-specified criterium based on the attributes from the annotation. Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.

Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript. Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.

Initially, GAF was build to combine gene predictions obtained from GeMoMa. It allows to combine the predictions from multiple reference organisms, but works also using only one reference organism. However, GAF also allows to integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you like to do so, we recommend to run AnnotationEvidence for each of these input files to add additional attributes that can be used for sorting and filtering within GAF. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.

For more information please visit http://www.jstacs.de/index.php/GeMoMa If you have any questions, comments or bugs, please check FAQs on our homepage, our github page https://github.com/Jstacs/Jstacs/labels/GeMoMa or contact jens.keilwagen@julius-kuehn.de

GeMoMa Annotation Filter may be called with

java -jar GeMoRNA-1.0.jar GAF

and has the following parameters

name comment type

t tag (the tag used to read the GeMoMa annotations, default = mRNA) STRING
The following parameter(s) can be used multiple times:
p prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL) STRING
w weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL) DOUBLE
g gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3) FILE
a annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL) FILE
d default attributes (Comma-separated list of attributes that is set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received fom different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce) STRING
k kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
m minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000) INT
c cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2) INT
gc good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1) INT
trend trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL) STRING
No parameters for selection "GLOBAL"
Parameters for selection "LOCAL":
margin margin (the number of bp upstream and downstream of a predictions used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000) INT
q quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2) DOUBLE
f filter (A filter can be applied to transcript predictions to retain only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without a score (isNaN(score)) will be used, as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could, for instance, use the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL) STRING
s sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa) STRING
i intermediate result (a switch to decide whether an intermediate result of filtered predictions that are not combined to genes should be returned, default = false) BOOLEAN
l length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL) DOUBLE
atf alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could combine several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL) STRING
cbf common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75) DOUBLE
mnotpg maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647) INT
aat add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false) BOOLEAN
tf transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar GAF g=<gene_annotation_file>
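
The default value of the filter parameter can be read as an ordinary boolean expression over the GFF attributes of each prediction. The following Python sketch only illustrates that logic and is not the tool's implementation: the function name passes_default_filter and the dictionary layout are hypothetical, and it assumes that a missing score is treated as NaN, as described for the default attributes parameter.

```python
import math

def passes_default_filter(attrs):
    """Mirror of: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    # Assumption: a score that is absent from the GFF attributes counts as NaN.
    score = float(attrs.get("score", "nan"))
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    if not complete:
        return False
    if math.isnan(score):
        # Predictions without a score (e.g. external annotations) are kept.
        return True
    return score / float(attrs["aa"]) >= 0.75

# Hypothetical attribute sets of four predictions.
predictions = [
    {"ID": "t1", "start": "M", "stop": "*", "score": "400", "aa": "500"},  # relative score 0.80
    {"ID": "t2", "start": "M", "stop": "*", "score": "300", "aa": "500"},  # relative score 0.60
    {"ID": "t3", "start": "M", "stop": "*", "aa": "500"},                  # no score
    {"ID": "t4", "start": "L", "stop": "*", "score": "500", "aa": "500"},  # incomplete start
]
kept = [p["ID"] for p in predictions if passes_default_filter(p)]
print(kept)
```

In this toy input, only t1 and t3 pass: t2 falls below the relative score threshold and t4 does not start with a start codon.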


Analyzer

This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.

True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that a predicted transcript matches the true transcript with an F1 measure equal to the absolute value of this entry, but that another true transcript matches this predicted transcript with an even better F1. True transcripts that do not overlap any predicted transcript, and predicted transcripts that do not overlap any true transcript, are listed as well. Besides the attributes of the true and the predicted annotation, the table contains some additional columns that make it easy to filter interesting examples and to compute statistics.
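
A minimal sketch of how a nucleotide F1 measure between two transcripts could be computed is given below. It assumes that each transcript is a list of non-overlapping features (CDS or exon) with 1-based inclusive coordinates as in GFF, that precision and sensitivity are the shared nucleotides divided by the predicted and the true length, respectively, and that F1 is their harmonic mean. The function names are hypothetical, and the Analyzer itself may differ in details such as strand handling.

```python
def covered(features):
    """Set of nucleotide positions covered by (start, end) features, 1-based inclusive."""
    positions = set()
    for start, end in features:
        positions.update(range(start, end + 1))
    return positions

def nucleotide_f1(true_features, predicted_features):
    """Harmonic mean of nucleotide precision and sensitivity (0.0 without overlap)."""
    truth, pred = covered(true_features), covered(predicted_features)
    shared = len(truth & pred)
    if shared == 0:
        return 0.0
    precision = shared / len(pred)
    sensitivity = shared / len(truth)
    return 2 * precision * sensitivity / (precision + sensitivity)

def best_true_match(predicted_features, true_transcripts):
    """Return (name, F1) of the true transcript with the highest F1, or None."""
    scored = [(nucleotide_f1(feats, predicted_features), name)
              for name, feats in true_transcripts.items()]
    best_f1, best_name = max(scored, default=(0.0, None))
    return (best_name, best_f1) if best_f1 > 0 else None

# Hypothetical toy data: two true transcripts and one prediction.
truth = {"trueA": [(100, 199), (300, 399)], "trueB": [(100, 199)]}
prediction = [(100, 199), (300, 349)]
match = best_true_match(prediction, truth)
print(match)
```

Here the prediction shares 150 nt with trueA (200 nt) and 100 nt with trueB (100 nt), so trueA is the better match. Storing every position in a set keeps the sketch short; for whole-genome annotations, interval arithmetic would be more economical.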

The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.
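
Sensitivity and precision for a category can be sketched as follows. This is only an illustration under a loudly stated assumption: a predicted transcript is counted as correct if its chromosome, strand and feature coordinates are identical to those of a true transcript. The source does not define the matching criterion, so the Analyzer may count matches differently; the function name is hypothetical.

```python
def sensitivity_and_precision(true_items, predicted_items):
    """Sensitivity = TP / number of true items, precision = TP / number of predicted items."""
    true_set, pred_set = set(true_items), set(predicted_items)
    # Assumption: an item is a true positive only if it is identical in both sets.
    tp = len(true_set & pred_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    precision = tp / len(pred_set) if pred_set else 0.0
    return sensitivity, precision

# Hypothetical transcripts encoded as (chromosome, strand, features).
truth = {
    ("chr1", "+", ((100, 199), (300, 399))),
    ("chr1", "+", ((100, 199),)),
    ("chr2", "-", ((50, 149),)),
}
predicted = {
    ("chr1", "+", ((100, 199), (300, 399))),  # identical to a true transcript
    ("chr2", "-", ((50, 120),)),              # differs at one border
}
sn, pr = sensitivity_and_precision(truth, predicted)
print(sn, pr)
```

In this toy example, one of three true transcripts is recovered and one of two predictions is correct.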

For more information please visit http://www.jstacs.de/index.php/GeMoMa. If you have any questions or comments, or want to report bugs, please check the FAQs on our homepage and our GitHub page https://github.com/Jstacs/Jstacs/labels/GeMoMa, or contact jens.keilwagen@julius-kuehn.de

Analyzer may be called with

java -jar GeMoRNA-1.0.jar Analyzer

and has the following parameters

name comment type

t truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
The following parameter(s) can be used multiple times:
n name (can be used to distinguish different predictions, OPTIONAL) STRING
p predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
c CDS (if true, CDS features are used; otherwise exon features are used, default = true) BOOLEAN
o only introns (if true, only intron borders (=splice sites) are evaluated, default = false) BOOLEAN
w write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
ca common attributes (Only GFF attributes of mRNAs that are found in the given portion of all mRNAs are included in the result table. Attributes and their portion are handled independently for truth and prediction. This parameter allows choosing between a more informative table and a more compact table, valid range = [0.0, 1.0], default = 0.5) DOUBLE
r reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
f filter (A filter for deciding which transcripts from the truth are reliable. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1), default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar Analyzer t=<truth> p=<predicted_annotation>
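
The effect of the common attributes (ca) parameter can be illustrated with a short Python sketch. It assumes that an attribute is kept if the fraction of mRNAs carrying it is at least the given portion; whether the Analyzer uses "at least" or "more than" is not stated in the source, and the function name common_attributes is hypothetical.

```python
from collections import Counter

def common_attributes(mrna_attributes, portion=0.5):
    """Names of attributes present in at least `portion` of all mRNAs (assumed threshold)."""
    n = len(mrna_attributes)
    if n == 0:
        return []
    # Count in how many mRNAs each attribute name occurs.
    counts = Counter(key for attrs in mrna_attributes for key in attrs)
    return sorted(key for key, count in counts.items() if count / n >= portion)

# Hypothetical attributes of four mRNAs from one annotation.
mrnas = [
    {"ID": "t1", "score": "400", "tie": "1"},
    {"ID": "t2", "score": "300"},
    {"ID": "t3", "tpc": "1.0"},
    {"ID": "t4", "score": "250", "tie": "0.5"},
]
compact = common_attributes(mrnas, 0.75)
informative = common_attributes(mrnas, 0.25)
print(compact, informative)
```

A larger portion yields a more compact table with fewer attribute columns, while a smaller portion yields a more informative one. As the parameter description states, truth and prediction would each be processed independently in this way.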


Merge

Merge may be called with

java -jar GeMoRNA-1.0.jar merge

and has the following parameters

name comment type

g GeMoMa (GeMoMa predictions, type = gff,gff3) FILE
GeMoRNA GeMoRNA (GeMoRNA predictions, type = gff,gff3) FILE
m Mode (range={intersect, union, intermediate, annotate}, default = intersect) STRING
No parameters for selection "intersect"
No parameters for selection "union"
Parameters for selection "intermediate":
GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
GeMoRNA-strict GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3) FILE
Parameters for selection "annotate":
GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
GeMoRNA-strict GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar merge g=<GeMoMa> GeMoRNA=<GeMoRNA>