GeMoMa-Docs: Difference between revisions

From Jstacs
(1.7.1)
No edit summary
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page describes the parameters of all [[GeMoMa]] modules.<br />
This page describes the parameters of all [[GeMoMa]] modules.<br />
If you have any questions, comments or bugs, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].
If you have any questions, comments or bugs, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/issues?q=label%3AGeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].


= GeMoMa pipeline =
=== GeMoMa pipeline ===


This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores of one machine, but it cannot distribute work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores of one machine, but it cannot distribute work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.
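
As a rough illustration of a pipeline call for a single reference species, the following sketch only prints the command line instead of running it. It assumes that parameters are passed as key=value pairs after the tool name; the helper name <code>gemoma_pipeline_cmd</code> and all file names are hypothetical, and only the parameters t, a and g from the table on this page are used. A real run may require further parameters.

```shell
#!/bin/sh
# Sketch only: print (do not execute) a GeMoMaPipeline call.
# Assumption: parameters are given as key=value pairs after the tool name.
# gemoma_pipeline_cmd and all file names are hypothetical placeholders.
gemoma_pipeline_cmd() {
    target=$1      # t: target genome (FASTA)
    annotation=$2  # a: reference annotation (GFF or GTF)
    genome=$3      # g: reference genome (FASTA)
    printf 'java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline t=%s a=%s g=%s\n' \
        "$target" "$annotation" "$genome"
}

# Dry run with placeholder file names.
gemoma_pipeline_cmd target.fa reference.gff reference.fa
```

Printing the command first makes it easy to check the parameter names against the table before starting a long-running job.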
Line 8: Line 8:
''GeMoMa pipeline'' may be called with
''GeMoMa pipeline'' may be called with


  java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline
  java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline


and has the following parameters
and has the following parameters
Line 21: Line 21:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">t</font></td>
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td>target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 34: Line 34:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">i</font></td>
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td>
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">a</font></td>
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td>
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">g</font></td>
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 54: Line 54:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">ai</font></td>
<td><font color="green">ai</font></td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 60: Line 60:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">i</font></td>
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td>
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">c</font></td>
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td>
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">a</font></td>
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td>
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 80: Line 80:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">ai</font></td>
<td><font color="green">ai</font></td>
<td>annotation info (annotation information of the reference, tab-delimted file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 90: Line 90:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">ID</font></td>
<td><font color="green">ID</font></td>
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td>
<td>ID (ID to distinguish the different external annotations of the target organism, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">e</font></td>
<td><font color="green">e</font></td>
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td>
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, type = gff,gff3,gtf)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 112: Line 112:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">selected</font></td>
<td><font color="green">selected</font></td>
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, mime = tabular,txt, OPTIONAL)</td>
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, type = tabular,txt, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">gc</font></td>
<td><font color="green">gc</font></td>
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td>
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 146: Line 146:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">ERE.m</font></td>
<td><font color="green">ERE.m</font></td>
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td>
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 175: Line 175:
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td>
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.maximumcoverage</font></td>
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.f</font></td>
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.r</font></td>
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.n</font></td>
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td>
<td style="width:100px;">INT</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.e</font></td>
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.mil</font></td>
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">ERE.repositioning</font></td>
<td>repositioning (due to limitations in the BAM/SAM format, huge chromosomes need to be split before mapping. This parameter allows undoing the split by mapping back to the real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr>
Line 181: Line 219:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">introns</font></td>
<td><font color="green">introns</font></td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 196: Line 234:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td><font color="green">coverage_unstranded</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 202: Line 240:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td><font color="green">coverage_forward</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td><font color="green">coverage_reverse</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 279: Line 317:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.sm</font></td>
<td><font color="green">GeMoMa.sm</font></td>
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td>
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 305: Line 343:
<td><font color="green">GeMoMa.i</font></td>
<td><font color="green">GeMoMa.i</font></td>
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td>
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.rf</font></td>
<td>reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
Line 318: Line 361:
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.rt</font></td>
<td><font color="green">GeMoMa.h</font></td>
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td>
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td>
<td style="width:100px;">DOUBLE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.h</font></td>
<td><font color="green">GeMoMa.o</font></td>
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td>
<td>output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC)</td>
<td style="width:100px;">DOUBLE</td>
<td style="width:100px;">STRING</td></tr>
</tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;STATIC&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.p</font></td>
<td><font color="green">GeMoMa.p</font></td>
Line 332: Line 376:
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;DYNAMIC&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.f</font></td>
<td>factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.a</font></td>
<td><font color="green">GeMoMa.a</font></td>
Line 348: Line 399:
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.prefix</font></td>
<td><font color="green">GeMoMa.v</font></td>
<td>prefix (A prefix to be used for naming the predictions, default = )</td>
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 356: Line 407:
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td>
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td>
<td style="width:100px;">LONG</td>
<td style="width:100px;">LONG</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GeMoMa.ru</font></td>
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 361: Line 417:
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td>
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.c</font></td>
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.m</font></td>
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GAF.d</font></td>
<td><font color="green">GAF.d</font></td>
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td>
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GAF.f</font></td>
<td><font color="green">GAF.k</font></td>
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td>
<td>kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.m</font></td>
<td>minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000)</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GAF.s</font></td>
<td><font color="green">GAF.c</font></td>
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td>
<td>cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GAF.a</font></td>
<td><font color="green">GAF.g</font></td>
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td>
<td>good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.u</font></td>
<td><font color="green">GAF.t</font></td>
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td>
<td>trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>No parameters for selection &quot;GLOBAL&quot;</b></td></tr>
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;LOCAL&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.margin</font></td>
<td>margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.q</font></td>
<td>quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</table></td></tr>
</table></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.r</font></td>
<td><font color="green">GAF.f</font></td>
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td>
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">STRING</td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.p</font></td>
<td><font color="green">GAF.s</font></td>
<td>prefix (the prefix of the generic name)</td>
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.i</font></td>
<td><font color="green">GAF.l</font></td>
<td>infix (the infix of the generic name, default = G)</td>
<td>length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">GAF.a</font></td>
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.s</font></td>
<td><font color="green">GAF.cbf</font></td>
<td>suffix (the suffix of the generic name, default = 0)</td>
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.d</font></td>
<td><font color="green">GAF.mnotpg</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.di</font></td>
<td><font color="green">GAF.aat</font></td>
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td>
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.p</font></td>
<td><font color="green">GAF.tf</font></td>
<td>prefix (the prefix of the generic name)</td>
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.d</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.n</font></td>
<td><font color="green">AnnotationFinalizer.t</font></td>
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td>
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">AnnotationFinalizer.u</font></td>
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td>
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td></tr>
</tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">pc</font></td>
<td><font color="green">AnnotationFinalizer.a</font></td>
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td>
<td>additional source suffix (a suffix for source values of UTR features, default = )</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">pgr</font></td>
<td><font color="green">AnnotationFinalizer.r</font></td>
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td>
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td></tr>
</tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">o</font></td>
<td><font color="green">AnnotationFinalizer.p</font></td>
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td>
<td>prefix (the prefix of the generic name)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">debug</font></td>
<td><font color="green">AnnotationFinalizer.i</font></td>
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td>
<td>infix (the infix of the generic name, default = G)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">restart</font></td>
<td><font color="green">AnnotationFinalizer.s</font></td>
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td>
<td>suffix (the suffix of the generic name, default = 0)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">b</font></td>
<td><font color="green">AnnotationFinalizer.d</font></td>
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">AnnotationFinalizer.c</font></td>
<td>contig search pattern (search string, i.e., a regular expression used to search and replace parts of the contig/scaffold/chromosome names; the modified string is used as infix for the gene name, default = )</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">AnnotationFinalizer.crp</font></td>
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td>
<td>contig replace pattern (replace string, i.e., the replacement used when searching and replacing parts of the contig/scaffold/chromosome names; the modified string is used as infix for the gene name, default = )</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">AnnotationFinalizer.p</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>prefix (the prefix of the generic name)</td>
<td>STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td><font color="green">AnnotationFinalizer.d</font></td>
<td>The number of threads used for the tool, defaults to 1</td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td>INT</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
</table>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
 
</table></td></tr>
'''Example:'''
<tr style="vertical-align:top">
 
<td><font color="green">AnnotationFinalizer.n</font></td>
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline t=&lt;target_genome&gt; g=&lt;genome&gt; a=&lt;annotation&gt; AnnotationFinalizer.p=&lt;prefix&gt;
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td>
 
<td style="width:100px;">BOOLEAN</td>
 
= Extract RNA-seq Evidence =
 
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.
 
''Extract RNA-seq Evidence'' may be called with
 
java -jar GeMoMa-1.7.1.jar CLI ERE
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td><font color="green">sc</font></td>
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td>
<td>synteny check (run SyntenyChecker if possible, default = true)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">p</font></td>
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td>
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">v</font></td>
<td><font color="green">pc</font></td>
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td>
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">u</font></td>
<td><font color="green">pgr</font></td>
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td>
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">o</font></td>
<td>coverage (allows to output the coverage, default = true)</td>
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">mmq</font></td>
<td><font color="green">debug</font></td>
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td>
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">mc</font></td>
<td><font color="green">restart</font></td>
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td>
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">b</font></td>
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 566: Line 625:
<td>The output directory, defaults to the current working directory (.)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
<td>STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td>The number of threads used for the tool, defaults to 1</td>
<td>INT</td>
</tr>
</tr>
</table>
</table>
Line 571: Line 635:
'''Example:'''
'''Example:'''


  java -jar GeMoMa-1.7.1.jar CLI ERE m=&lt;mapped_reads_file&gt;
  java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline a=<reference_annotation> g=<reference_genome> t=&lt;target_genome&gt; AnnotationFinalizer.p=&lt;prefix&gt;
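
When the pipeline is started from a script, the call above can be assembled programmatically. The following is only a minimal sketch: it uses the parameter names listed on this page (<code>t</code>, <code>a</code>, <code>g</code>, <code>AnnotationFinalizer.p</code>, <code>threads</code>, <code>outdir</code>); the helper name <code>build_pipeline_command</code> and all file names are hypothetical placeholders, and the order of the key=value arguments is an assumption.

```python
import shlex


def build_pipeline_command(jar, target_genome, reference_genome,
                           reference_annotation, prefix,
                           threads=1, outdir="."):
    """Assemble the argument list for one GeMoMaPipeline call.

    Only parameters documented on this page are used; every value is
    passed in the key=value form shown in the example above.
    """
    return [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        f"t={target_genome}",
        f"a={reference_annotation}",
        f"g={reference_genome}",
        f"AnnotationFinalizer.p={prefix}",
    ]


# Hypothetical file names, for illustration only.
cmd = build_pipeline_command(
    "GeMoMa-1.9.jar", "target.fa", "reference.fa", "reference.gff",
    prefix="TARG", threads=4, outdir="results",
)
print(shlex.join(cmd))
```

The resulting list can be handed to <code>subprocess.run(cmd, check=True)</code>; passing a list instead of a single string avoids shell quoting problems with file names.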




= CheckIntrons =
=== Extract RNA-seq Evidence ===


The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.


''CheckIntrons'' may be called with
''Extract RNA-seq Evidence'' may be called with


  java -jar GeMoMa-1.7.1.jar CLI CheckIntrons
  java -jar GeMoMa-1.9.jar CLI ERE


and has the following parameters
and has the following parameters
Line 592: Line 656:
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">s</font></td>
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td>
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">m</font></td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td>
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 607: Line 671:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">v</font></td>
<td><font color="green">v</font></td>
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td>
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">u</font></td>
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>coverage (allows to output the coverage, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">mmq</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td>
<td>STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
</table>
<tr style="vertical-align:top">
 
<td><font color="green">mc</font></td>
'''Example:'''
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td>
 
<td style="width:100px;">INT</td>
java -jar GeMoMa-1.7.1.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;
 
 
= DenoiseIntrons =
 
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.
 
''DenoiseIntrons'' may be called with
 
java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">maximumcoverage</font></td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td>
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
</table>
</td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">f</font></td>
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td>
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td><font color="green">r</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td><font color="green">n</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td><font color="green">t</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table></td></tr>
</table></td></tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">e</font></td>
<td>maximum intron length (The maximum length of an intron, default = 15000)</td>
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">me</font></td>
<td><font color="green">mil</font></td>
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td>
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td>
<td style="width:100px;">DOUBLE</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">context</font></td>
<td><font color="green">repositioning</font></td>
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td>
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 699: Line 746:
'''Example:'''
'''Example:'''


  java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;
  java -jar GeMoMa-1.9.jar CLI ERE m=&lt;mapped_reads_file&gt;
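
The optional <code>repositioning</code> parameter expects a table with the three columns split_chr_name, original_chr_name, and offset_in_original_chr. The sketch below illustrates how such a table could be read and applied to a coordinate. It assumes that the file is tab-separated and that the offset is simply added to the position on the split chromosome; whether ERE treats the offset exactly like this is not stated on this page. The function names and chromosome names are hypothetical.

```python
import csv
import io


def load_repositioning(handle):
    """Read the 3-column table: split name, original name, offset."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        split_name, original_name, offset = row[0], row[1], int(row[2])
        table[split_name] = (original_name, offset)
    return table


def reposition(table, chrom, position):
    """Map a position on a split chromosome back to the original one.

    Assumption: the offset is added to the position. Names that are
    not listed in the table are returned unchanged.
    """
    if chrom not in table:
        return chrom, position
    original_name, offset = table[chrom]
    return original_name, position + offset


# Hypothetical example: chr1 was split into two parts before mapping.
example = "chr1_part1\tchr1\t0\nchr1_part2\tchr1\t500000000\n"
table = load_repositioning(io.StringIO(example))
print(reposition(table, "chr1_part2", 1234))
```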




= NCBI Reference Retriever =
=== CheckIntrons ===


This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.


''NCBI Reference Retriever'' may be called with
''CheckIntrons'' may be called with


  java -jar GeMoMa-1.7.1.jar CLI NRR
  java -jar GeMoMa-1.9.jar CLI CheckIntrons


and has the following parameters
and has the following parameters
Line 720: Line 767:
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td><font color="green">t</font></td>
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td>
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td><font color="green">i</font></td>
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">rl</font></td>
<td><font color="green">v</font></td>
<td>reference list (a list of reference organisms, mime = txt)</td>
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 743: Line 794:
'''Example:'''
'''Example:'''


  java -jar GeMoMa-1.7.1.jar CLI NRR rl=&lt;reference_list&gt;
  java -jar GeMoMa-1.9.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;
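
To illustrate what a strand distribution and a dinucleotide distribution at splice sites look like, here is a minimal sketch. It is not the CheckIntrons implementation: the in-memory data layout (a dictionary of sequences and tuples with 1-based inclusive intron coordinates) and all names are assumptions chosen for the example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def check_introns(genome, introns):
    """Count introns per strand and donor/acceptor dinucleotide pairs.

    genome:  dict mapping sequence name -> DNA string
    introns: iterable of (chrom, start, end, strand) with 1-based,
             inclusive coordinates (an assumption of this sketch)
    """
    strands = Counter()
    dinucleotides = Counter()
    for chrom, start, end, strand in introns:
        intron = genome[chrom][start - 1:end].upper()
        if strand == "-":
            # Read the intron in transcript orientation.
            intron = reverse_complement(intron)
        strands[strand] += 1
        dinucleotides[(intron[:2], intron[-2:])] += 1
    return strands, dinucleotides


# Toy data: one canonical GT..AG intron on each strand.
genome = {"chr1": "AAAGTCCCCAGTTT", "chr2": "TTCTGGGGACTT"}
introns = [("chr1", 4, 11, "+"), ("chr2", 3, 10, "-")]
strands, dinucleotides = check_introns(genome, introns)
print(strands, dinucleotides)
```

A strong excess of non-canonical pairs, or canonical pairs that only appear after switching the strand, may indicate a problem with the strand information of the introns.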




= Extractor =
=== DenoiseIntrons ===


This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.


''Extractor'' may be called with
''DenoiseIntrons'' may be called with


  java -jar GeMoMa-1.7.1.jar CLI Extractor
  java -jar GeMoMa-1.9.jar CLI DenoiseIntrons


and has the following parameters
and has the following parameters
Line 763: Line 814:
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">i</font></td>
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table>
</td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">coverage_unstranded</font></td>
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">gc</font></td>
<td><font color="green">coverage_forward</font></td>
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">coverage_reverse</font></td>
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table></td></tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">m</font></td>
<td>cds (whether the complete CDSs should be returned as output, default = false)</td>
<td>maximum intron length (The maximum length of an intron, default = 15000)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">genomic</font></td>
<td><font color="green">me</font></td>
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td>
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">context</font></td>
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td>
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">u</font></td>
<td><font color="green">outdir</font></td>
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td style="width:100px;">BOOLEAN</td>
<td>STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
</table>
<td><font color="green">r</font></td>
 
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td>
'''Example:'''
<td style="width:100px;">BOOLEAN</td>
 
java -jar GeMoMa-1.9.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;
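
The two filter criteria described above (intron size and relative expression) can be pictured with the following sketch. It is not the DenoiseIntrons implementation: the page does not specify how the relative expression is computed, so the ratio of supporting split reads to the mean coverage in the context windows is an assumption, and the class and field names are hypothetical. The defaults mirror the documented defaults of <code>m</code> (15000) and <code>me</code> (0.01).

```python
from dataclasses import dataclass


@dataclass
class Intron:
    name: str
    start: int               # 1-based, inclusive
    end: int                 # 1-based, inclusive
    reads: int               # split reads supporting the intron
    context_coverage: float  # hypothetical: mean coverage in the
                             # context windows around both splice sites


def denoise(introns, max_intron_length=15000, min_expression=0.01):
    """Keep introns that are short enough and sufficiently expressed."""
    kept = []
    for intron in introns:
        length = intron.end - intron.start + 1
        if length > max_intron_length:
            continue  # too long, possibly an artefact
        if intron.context_coverage > 0:
            # Assumed definition of relative expression.
            relative = intron.reads / intron.context_coverage
            if relative < min_expression:
                continue  # weak support compared to the surrounding region
        kept.append(intron)
    return kept


candidates = [
    Intron("well_supported", 1000, 1999, reads=50, context_coverage=100.0),
    Intron("too_long", 1000, 20999, reads=50, context_coverage=100.0),
    Intron("weak", 5000, 5499, reads=1, context_coverage=1000.0),
]
print([intron.name for intron in denoise(candidates)])
```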
 
 
=== NCBI Reference Retriever ===
 
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.
 
''NCBI Reference Retriever'' may be called with
 
java -jar GeMoMa-1.9.jar CLI NRR
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td><font color="green">r</font></td>
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td>
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">Ambiguity</font></td>
<td><font color="green">n</font></td>
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td>
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td><font color="green">rl</font></td>
<td>discard pre-mature stop (if *true*, transcripts with a pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td>
<td>reference list (a list of reference organisms, type = txt)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">sefc</font></td>
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">v</font></td>
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
'''Example:'''
'''Example:'''


  java -jar GeMoMa-1.7.1.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;
  java -jar GeMoMa-1.9.jar CLI NRR rl=&lt;reference_list&gt;




= GeneModelMapper =
=== Extractor ===


This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g. proteins, CDSs, ... . Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.
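
The two-column layout of a user-specific genetic code can be illustrated with a minimal Python sketch. The helper name <code>genetic_code_lines</code> is hypothetical, and two details are assumptions that this page does not state: that the triplets in the second column are separated by commas, and that the stop codons are listed under <code>*</code>. The example covers only three entries; a real file would need every amino acid.

```python
# Minimal sketch: serialise a genetic code as "amino acid<TAB>triplet list".
# ASSUMPTIONS (not confirmed by this page): triplets are comma-separated,
# and stop codons are listed under the symbol "*".

def genetic_code_lines(code, sep=","):
    """Return one tab-delimited line per amino acid (hypothetical helper)."""
    lines = []
    for amino_acid in sorted(code):
        triplets = code[amino_acid]
        for triplet in triplets:
            # Each codon must consist of exactly three unambiguous nucleotides.
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError("invalid triplet: " + triplet)
        lines.append(amino_acid + "\t" + sep.join(triplets))
    return lines


# Deliberately incomplete example with three entries only.
example = {"M": ["ATG"], "W": ["TGG"], "*": ["TAA", "TAG", "TGA"]}
lines = genetic_code_lines(example)
for line in lines:
    print(line)
```

If the separator assumption turns out to be wrong, only the <code>sep</code> argument needs to change.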


As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should set the parameter ''sort''.
''Extractor'' may be called with
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.


If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.
 java -jar GeMoMa-1.9.jar CLI Extractor


If you would like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.
and has the following parameters
 
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.
 
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.
 
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.
 
If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.
 
''GeneModelMapper'' may be called with
 
 java -jar GeMoMa-1.7.1.jar CLI GeMoMa
 
and has the following parameters


<table border=0 cellpadding=10 align="center" width="100%">
<table border=0 cellpadding=10 align="center" width="100%">
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td><font color="green">a</font></td>
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td>
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">g</font></td>
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td>
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">gc</font></td>
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td>
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">p</font></td>
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td>
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">c</font></td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td>
<td>cds (whether the complete CDSs should be returned as output, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td><font color="green">genomic</font></td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">splice</font></td>
<td><font color="green">i</font></td>
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td>
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage</font></td>
<td><font color="green">identical</font></td>
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td>
<td>identical (if the CDSs of several transcripts are identical, Extractor only uses one transcript. This parameter makes Extractor return a table that lists the used transcript in the first column and the discarded transcript in the second column. If no transcript is discarded, the list is empty., default = false)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">BOOLEAN</td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td><font color="green">u</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td><font color="green">r</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table></td></tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">s</font></td>
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td>
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">sm</font></td>
<td><font color="green">Ambiguity</font></td>
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td>
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">go</font></td>
<td><font color="green">d</font></td>
<td>gap opening (The gap opening cost in the alignment, default = 11)</td>
<td>discard pre-mature stop (if *true*, transcripts with a pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">ge</font></td>
<td><font color="green">sefc</font></td>
<td>gap extension (The gap extension cost in the alignment, default = 1)</td>
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">f</font></td>
<td>maximum intron length (The maximum length of an intron, default = 15000)</td>
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">sil</font></td>
<td><font color="green">l</font></td>
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td>
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">intron-loss-gain-penalty</font></td>
<td><font color="green">v</font></td>
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td>
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">e</font></td>
<td><font color="green">outdir</font></td>
<td>e-value (The e-value for filtering blast results, default = 100.0)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td style="width:100px;">DOUBLE</td>
<td>STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
</table>
<td><font color="green">ct</font></td>
 
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td>
'''Example:'''
<td style="width:100px;">DOUBLE</td>
 
</tr>
 java -jar GeMoMa-1.9.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;
<tr style="vertical-align:top">
 
<td><font color="green">rt</font></td>
 
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td>
=== GeneModelMapper ===
<td style="width:100px;">DOUBLE</td>
 
</tr>
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).
<tr style="vertical-align:top">
 
<td><font color="green">h</font></td>
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should set the parameter ''sort''.
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td>
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.
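
The clustering requirement described above (all hits of one query in adjacent lines) can be met outside of GeMoMa as well. The following Python sketch is only an illustration under the assumption that the search results are plain tab-delimited lines with the query ID in the first column; the function name <code>sort_by_query</code> and the example hits are hypothetical. The built-in ''sort'' parameter achieves the same goal without any preprocessing.

```python
# Minimal sketch: group search results by the query ID in the first column.
# ASSUMPTION: each hit is one tab-delimited line, query ID first.

def sort_by_query(lines):
    """Sort hits by first column; Python's sort is stable, so hits of the
    same query keep their original relative order (hypothetical helper)."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


# Made-up example hits with three columns each.
hits = [
    "geneB_0\tchr1\t90.0",
    "geneA_1\tchr2\t85.0",
    "geneB_0\tchr3\t70.0",
    "geneA_0\tchr1\t99.0",
]
sorted_hits = sort_by_query(hits)
for hit in sorted_hits:
    print(hit)
```

For large result files, sorting in memory like this may be impractical, in which case setting ''sort'' is the simpler choice.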
<td style="width:100px;">DOUBLE</td>
 
</tr>
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.
<tr style="vertical-align:top">
 
<td><font color="green">p</font></td>
If you would like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td>
 
<td style="width:100px;">INT</td>
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.
 
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.
 
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.
 
If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.
 
''GeneModelMapper'' may be called with
 
 java -jar GeMoMa-1.9.jar CLI GeMoMa
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">selected</font></td>
<td><font color="green">s</font></td>
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, mime = tabular,txt, OPTIONAL)</td>
<td>search results (The search results, e.g., from tblastn or mmseqs, type = tabular)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">as</font></td>
<td><font color="green">t</font></td>
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td>
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">approx</font></td>
<td><font color="green">c</font></td>
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td>
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">pa</font></td>
<td><font color="green">a</font></td>
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td>
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">prefix</font></td>
<td><font color="green">i</font></td>
<td>prefix (A prefix to be used for naming the predictions, default = )</td>
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">tag</font></td>
<td><font color="green">r</font></td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">v</font></td>
<td><font color="green">splice</font></td>
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td>
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">timeout</font></td>
<td><font color="green">coverage</font></td>
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td>
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td>
<td style="width:100px;">LONG</td>
<td style="width:100px;">STRING</td></tr>
</tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">sort</font></td>
<td><font color="green">coverage_unstranded</font></td>
<td>sort (A flag which allows to sort the search results, default = false)</td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">Score</font></td>
<td><font color="green">coverage_forward</font></td>
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">coverage_reverse</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td>STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table></td></tr>
</table>
</table>
 
</td></tr>
'''Example:'''
<tr style="vertical-align:top">
 
<td><font color="green">g</font></td>
 java -jar GeMoMa-1.7.1.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td>
 
<td style="width:100px;">FILE</td>
 
</tr>
= GeMoMa Annotation Filter =
<tr style="vertical-align:top">
 
<td><font color="green">sm</font></td>
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td>
 
<td style="width:100px;">FILE</td>
The algorithm does the following:
</tr>
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).
<tr style="vertical-align:top">
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.
<td><font color="green">go</font></td>
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.
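
To make the second step more concrete, the default filter given in the ''filter'' parameter description, <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code>, can be re-expressed in Python. This is only a sketch of the logic of that expression, not GAF's own filter parser; the function name and the dictionary representation of the GFF attributes are assumptions made for illustration.

```python
import math

# Minimal sketch of the default filter expression from the parameter table:
#   start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)
# ASSUMPTION: GFF attributes are available as a dict; a missing score is
# treated as NaN, mirroring predictions from external annotations.

def passes_default_filter(attrs):
    """Return True if a prediction would be kept (hypothetical helper)."""
    score = attrs.get("score", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    # Predictions without a score are kept if they are complete.
    return complete and (math.isnan(score) or score / attrs["aa"] >= 0.75)


# Made-up example predictions.
good = {"start": "M", "stop": "*", "score": 300.0, "aa": 350}
weak = {"start": "M", "stop": "*", "score": 100.0, "aa": 350}
partial = {"start": "L", "stop": "*", "score": 300.0, "aa": 350}
external = {"start": "M", "stop": "*"}

for prediction in (good, weak, partial, external):
    print(passes_default_filter(prediction))
```
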
<td>gap opening (The gap opening cost in the alignment, default = 11)</td>
 
<td style="width:100px;">INT</td>
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.
</tr>
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.
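
Before passing an annotation info table from another source to GAF, it can be useful to check the column names. The Python sketch below assumes that the required column names appear in the first line of the tab-delimited file and that matching is case-sensitive; neither detail is stated on this page, and the function name <code>missing_columns</code> and the example headers are hypothetical.

```python
import re

# Minimal sketch: check a header line for the columns named above:
# transcriptName, GO and one column matching the pattern .*defline.
# ASSUMPTIONS: the header is the first line and matching is case-sensitive.

def missing_columns(header_line):
    """Return the list of required columns that are absent (hypothetical)."""
    columns = header_line.rstrip("\n").split("\t")
    missing = [name for name in ("transcriptName", "GO") if name not in columns]
    # At least one column name must match the pattern ".*defline".
    if not any(re.fullmatch(r".*defline", column) for column in columns):
        missing.append(".*defline")
    return missing


# Made-up header lines.
complete_header = "transcriptName\tGO\tBest-hit-arabi-defline"
incomplete_header = "transcriptName\tPFAM"
print(missing_columns(complete_header))
print(missing_columns(incomplete_header))
```
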
<tr style="vertical-align:top">
 
<td><font color="green">ge</font></td>
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows integrating predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.
<td>gap extension (The gap extension cost in the alignment, default = 1)</td>
 
<td style="width:100px;">INT</td>
''GeMoMa Annotation Filter'' may be called with
 
 java -jar GeMoMa-1.7.1.jar CLI GAF
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">m</font></td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td>maximum intron length (The maximum length of an intron, default = 15000)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">sil</font></td>
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td>
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td>
<td style="width:100px;">DOUBLE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">intron-loss-gain-penalty</font></td>
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td>
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">rf</font></td>
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td>
<td>reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">w</font></td>
<td><font color="green">e</font></td>
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td>
<td>e-value (The e-value for filtering blast results, default = 100.0)</td>
<td style="width:100px;">DOUBLE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">ct</font></td>
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td>
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">h</font></td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td>
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td><font color="green">o</font></td>
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially useful if the gene annotation files come from different software packages (e.g., GeMoMa, Braker, ...) that provide different attributes, default = tie,tde,tae,iAA,pAA,score)</td>
<td>output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;STATIC&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;DYNAMIC&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td><font color="green">f</font></td>
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td>
<td>factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td><font color="green">selected</font></td>
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td>
<td>selected (The path to a list file that restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">atf</font></td>
<td><font color="green">as</font></td>
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td>
<td>avoid stop (A flag which allows to avoid (additional) premature stop codons in a transcript, default = true)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">approx</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td>
<td>STRING</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table>
'''Example:'''
java -jar GeMoMa-1.7.1.jar CLI GAF g=&lt;gene_annotation_file&gt;
= AnnotationFinalizer =
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions that are too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity, as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.
''AnnotationFinalizer'' may be called with
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">pa</font></td>
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so, two additional attributes (iAA, pAA) will be added to predictions in the GFF output. These might be used in GAF. However, since some transcripts are very long, this can increase the required runtime and memory (RAM), default = true)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">prefix</font></td>
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td>
<td>prefix (A prefix to be used for naming the predictions, default = )</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">tag</font></td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">u</font></td>
<td><font color="green">v</font></td>
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td>
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">BOOLEAN</td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
</tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">timeout</font></td>
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td>
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">LONG</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td><font color="green">sort</font></td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td>sort (A flag which allows to sort the search results, default = false)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">ru</font></td>
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td>
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">BOOLEAN</td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td><font color="green">Score</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td><font color="green">outdir</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td style="width:100px;">FILE</td>
<td>STRING</td>
</tr>
</tr>
</table></td></tr>
</table>
</table>
</td></tr>
 
</table></td></tr>
'''Example:'''
<tr style="vertical-align:top">
 
<td><font color="green">rename</font></td>
java -jar GeMoMa-1.9.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td>
 
<td style="width:100px;">STRING</td></tr>
 
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
=== GeMoMa Annotation Filter ===
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr>
 
<tr style="vertical-align:top">
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.
<td><font color="green">p</font></td>
 
<td>prefix (the prefix of the generic name)</td>
The algorithm does the following:
<td style="width:100px;">STRING</td>
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).
</tr>
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.
<tr style="vertical-align:top">
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.
<td><font color="green">infix</font></td>
 
<td>infix (the infix of the generic name, default = G)</td>
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.
<td style="width:100px;">STRING</td>
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.
 
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you want to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.
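
The filtering step can be illustrated with the default value of the ''filter'' parameter, start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75). The following Python sketch is not GeMoMa code; it only mirrors the logic of that expression. The function name and the representation of the GFF attributes as a dictionary are assumptions made for illustration.

```python
import math


def passes_default_filter(attrs):
    """Mirror the documented default filter expression:
    start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75).

    `attrs` is a hypothetical dict holding the GFF attributes of one
    transcript prediction; it is not a GeMoMa data structure.
    """
    # A missing score is treated as NaN, as for external annotations.
    score = float(attrs.get("score", "nan"))
    # Completeness: prediction starts with methionine and ends with a stop.
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    if not complete:
        return False
    if math.isnan(score):
        return True
    # Relative score: score per amino acid of the prediction.
    return score / float(attrs["aa"]) >= 0.75


# A complete prediction with relative score 300/350 passes the filter.
print(passes_default_filter({"start": "M", "stop": "*", "score": 300, "aa": 350}))
# An external annotation without a score passes as long as it is complete.
print(passes_default_filter({"start": "M", "stop": "*"}))
```

Under this reading, an incomplete prediction is rejected regardless of its score, while a complete prediction without a score is always kept.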
 
''GeMoMa Annotation Filter'' may be called with
 
java -jar GeMoMa-1.9.jar CLI GAF
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td><font color="green">t</font></td>
<td>suffix (the suffix of the generic name, default = 0)</td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td><font color="green">p</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">di</font></td>
<td><font color="green">w</font></td>
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td>
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">DOUBLE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">a</font></td>
<td>prefix (the prefix of the generic name)</td>
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td><font color="green">d</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially useful if the gene annotation files come from different software packages (e.g., GeMoMa, Braker, ...) that provide different attributes, default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<td><font color="green">k</font></td>
<td>kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
</table></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td><font color="green">m</font></td>
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td>
<td>minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">gc</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1)</td>
<td>STRING</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
</table>
<tr style="vertical-align:top">
 
<td><font color="green">trend</font></td>
'''Example:'''
<td>trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL)</td>
 
<td style="width:100px;">STRING</td></tr>
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
 
<tr><td colspan=3><b>No parameters for selection &quot;GLOBAL&quot;</b></td></tr>
 
<tr><td colspan=3><b>Parameters for selection &quot;LOCAL&quot;:</b></td></tr>
= Annotation evidence =
<tr style="vertical-align:top">
 
<td><font color="green">margin</font></td>
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.
<td>margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000)</td>
 
<td style="width:100px;">INT</td>
''Annotation evidence'' may be called with
 
java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">q</font></td>
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td>
<td>quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
</table></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td><font color="green">f</font></td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">s</font></td>
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td>
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td><font color="green">i</font></td>
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td>
<td>intermediate result (a switch to decide whether an intermediate result of filtered predictions that are not combined to genes should be returned, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td><font color="green">l</font></td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td>length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td><font color="green">atf</font></td>
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td>
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could combine several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td>
<td style="width:100px;">STRING</td></tr>
<td style="width:100px;">STRING</td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
</tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td><font color="green">cbf</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td><font color="green">mnotpg</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">INT</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td><font color="green">aat</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td>
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
</table>
<td><font color="green">tf</font></td>
</td></tr>
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td>
<tr style="vertical-align:top">
<td><font color="green">ao</font></td>
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">gc</font></td>
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
'''Example:'''
'''Example:'''


  java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;
  java -jar GeMoMa-1.9.jar CLI GAF g=&lt;gene_annotation_file&gt;




= Compare transcripts =
=== AnnotationFinalizer ===


This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.
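
As a reminder of how an F1 measure behaves, the following Python sketch computes it as the harmonic mean of precision and recall for two sets of positions. It is an illustration only: the unit of comparison (here, individual nucleotide positions) and the function name are assumptions, and the sketch is not the implementation used by ''Compare transcripts''.

```python
def f1_measure(predicted, annotated):
    """F1 measure between two collections of positions.

    Assumption: both transcripts are represented as collections of
    covered nucleotide positions; the unit actually used by the tool
    is not stated in this documentation.
    """
    predicted, annotated = set(predicted), set(annotated)
    overlap = len(predicted & annotated)
    if overlap == 0:
        return None  # corresponds to NA: no overlapping annotation
    precision = overlap / len(predicted)  # share of prediction that is confirmed
    recall = overlap / len(annotated)     # share of annotation that is recovered
    return 2 * precision * recall / (precision + recall)


# Identical intervals agree perfectly; half-overlapping intervals do not.
print(f1_measure(range(100, 200), range(100, 200)))
print(f1_measure(range(100, 200), range(150, 250)))
```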
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions that are too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity, as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.


''Compare transcripts'' may be called with
''AnnotationFinalizer'' may be called with


  java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts
  java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer


and has the following parameters
and has the following parameters
<tr><td colspan=3><hr></td></tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">g</font></td>
<td>prediction (The predicted annotation, mime = gff,gff3)</td>
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">a</font></td>
<td>annotation (The true annotation, mime = gff,gff3)</td>
<td>annotation (The predicted genome annotation file (GFF), type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">prefix</font></td>
<td><font color="green">t</font></td>
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">assignment</font></td>
<td><font color="green">tf</font></td>
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td>
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">u</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td>
<td>STRING</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
</table>
</table>
 
</td></tr>
'''Example:'''
<tr style="vertical-align:top">
 
<td><font color="green">r</font></td>
java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts p=&lt;prediction&gt; a=&lt;annotation&gt;
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
 
<td style="width:100px;">INT</td>
 
</tr>
= Synteny checker =
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
 
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
This tool can be used to determine syntenic regions between the target organism and a reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.
<tr style="vertical-align:top">
 
<td><font color="green">c</font></td>
''Synteny checker'' may be called with
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td>
 
<td style="width:100px;">STRING</td></tr>
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td><font color="green">coverage_unstranded</font></td>
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">coverage_forward</font></td>
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
</table></td></tr>
</table>
</table>
</td></tr>
</td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td><font color="green">ass</font></td>
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td>
<td>additional source suffix (a suffix for source values of UTR features, default = )</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">rename</font></td>
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td><font color="green">p</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>prefix (the prefix of the generic name)</td>
<td>STRING</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
</table>
<tr style="vertical-align:top">
 
<td><font color="green">infix</font></td>
'''Example:'''
<td>infix (the infix of the generic name, default = G)</td>
 
<td style="width:100px;">STRING</td>
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;
 
 
= AddAttribute =
 
This tool adds an additional attribute to specific features of an annotation.
 
Those additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values taken from a given tab-delimited table.
 
''AddAttribute'' may be called with
 
java -jar GeMoMa-1.7.1.jar CLI AddAttribute
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">s</font></td>
<td>annotation (annotation file, mime = gff,gff3)</td>
<td>suffix (the suffix of the generic name, default = 0)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">csp</font></td>
<td>contig search pattern (search string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">crp</font></td>
<td>contig replace pattern (replace string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td>
<td style="width:100px;">STRING</td>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>prefix (the prefix of the generic name)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td>
<td style="width:100px;">INT</td>
</tr>
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;
 
 
=== Annotation evidence ===
 
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, and stop. These attributes can be used, for instance, when the annotation is passed to '''GAF'''. All predictions of the annotation are used; they are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.
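
As a rough illustration of the kind of evidence attributes mentioned above, the following Python sketch computes two values for a single transcript: the fraction of its introns that occur in a set of RNA-seq introns, and the fraction of its exonic positions with coverage. It assumes that tie and tpc are defined in this way, which this section does not state. The function names and the data layout are hypothetical, and this is not the implementation used by GeMoMa.

```python
import math


def intron_evidence(exons, supported_introns):
    """Fraction of introns between consecutive exons found in supported_introns.

    exons: list of (start, end) tuples, 1-based inclusive (assumed layout).
    supported_introns: set of (start, end) tuples derived from RNA-seq.
    Returns NaN for single-exon transcripts, which have no introns.
    """
    exons = sorted(exons)
    introns = [(a_end + 1, b_start - 1)
               for (_, a_end), (b_start, _) in zip(exons, exons[1:])]
    if not introns:
        return math.nan
    hits = sum(1 for intron in introns if intron in supported_introns)
    return hits / len(introns)


def percentage_coverage(exons, covered_intervals):
    """Fraction of exonic positions that lie in at least one covered interval."""
    covered = set()
    for start, end in covered_intervals:
        covered.update(range(start, end + 1))
    positions = [p for start, end in exons for p in range(start, end + 1)]
    return sum(1 for p in positions if p in covered) / len(positions)


exons = [(100, 199), (300, 399), (500, 549)]
# One of the two introns is supported by RNA-seq.
print(intron_evidence(exons, {(200, 299)}))
# 150 of the 250 exonic positions are covered.
print(percentage_coverage(exons, [(100, 199), (300, 349)]))
```

Returning NaN for single-exon transcripts matches the way the default filter of the '''Analyzer''' treats a missing intron evidence value (<code>isNaN(tie)</code>).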
 
''Annotation evidence'' may be called with
 
java -jar GeMoMa-1.9.jar CLI AnnotationEvidence
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td>annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td style="width:100px;">INT</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
</table></td></tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">ao</font></td>
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">gc</font></td>
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;
 
 
=== Attribute2Table ===
 
This tool returns a table of best attribute per predicted final annotation.
 
''Attribute2Table'' may be called with
 
java -jar GeMoMa-1.9.jar CLI Attribute2Table
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td><font color="green">f</font></td>
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td>
<td>final gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td>
<td style="width:100px;">STRING</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<td><font color="green">attribute</font></td>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<td>attribute (the name of the attribute that is added to the annotation)</td>
<tr style="vertical-align:top">
<td style="width:100px;">STRING</td>
<td><font color="green">p</font></td>
</tr>
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td>
<tr style="vertical-align:top">
<td style="width:100px;">STRING</td>
<td><font color="green">t</font></td>
</tr>
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td>
<tr style="vertical-align:top">
<td style="width:100px;">FILE</td>
<td><font color="green">r</font></td>
</tr>
<td>raw gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td>
<tr style="vertical-align:top">
<td style="width:100px;">FILE</td>
<td><font color="green">i</font></td>
</tr>
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td>
</table>
<td style="width:100px;">INT</td>
</td></tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td><font color="green">type</font></td>
<td>add (add the prefix to the gene ID, default = true)</td>
<td>type (type of addition attribute, range={VALUES, BINARY}, default = VALUES)</td>
<td style="width:100px;">BOOLEAN</td>
<td style="width:100px;">STRING</td></tr>
</tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr>
<td><font color="green">attribute</font></td>
<tr style="vertical-align:top">
<td>attribute (the attribute to be checked, default = iAA)</td>
<td><font color="green">ac</font></td>
<td style="width:100px;">STRING</td>
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI Attribute2Table f=&lt;final_gene_annotation_file&gt; r=&lt;raw_gene_annotation_file&gt;
 
 
=== Synteny checker ===
 
This tool can be used to determine syntenic regions between the target organism and a reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.
 
''Synteny checker'' may be called with
 
java -jar GeMoMa-1.9.jar CLI SyntenyChecker
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>prefix (the prefix can be used to distinguish predictions from different input files (=reference organisms), OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td>assignment (the assignment file of this reference organism, which combines parts of the CDS to transcripts, type = tabular)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;
 
 
=== AddAttribute ===
 
This tool adds an additional attribute to specific features of an annotation.
 
Those additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values taken from a given tab-delimited table.
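
To make the two attribute types more concrete, here is a minimal Python sketch of the idea, not the code of ''AddAttribute'' itself. It assumes that features are matched to table rows via their <code>ID</code> attribute, that a binary attribute is <code>true</code> exactly when the ID occurs in the table, and that features without a table entry stay unchanged in the value case. None of these details are confirmed by this page, and the function name and arguments are hypothetical.

```python
def add_attribute(gff_lines, table_rows, feature="mRNA", attribute="score",
                  id_column=0, attribute_column=None):
    """Append attribute=value to the ninth GFF column of matching features.

    attribute_column=None sketches a binary attribute (true/false);
    otherwise the value is read from that column of the table.
    """
    values = {}
    for row in table_rows:
        cells = row.rstrip("\n").split("\t")
        values[cells[id_column]] = (
            "true" if attribute_column is None else cells[attribute_column])

    out = []
    for line in gff_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Comments, malformed lines and other feature types pass through.
        if line.startswith("#") or len(cols) != 9 or cols[2] != feature:
            out.append(line)
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        feature_id = attrs.get("ID")
        if attribute_column is None:
            value = values.get(feature_id, "false")
        else:
            value = values.get(feature_id)  # None: no table entry
        if value is not None:
            cols[8] = cols[8].rstrip(";") + f";{attribute}={value}"
        out.append("\t".join(cols))
    return out


gff = [
    "chr1\tGAF\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tGAF\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tGAF\tmRNA\t200\t300\t.\t+\t.\tID=t2;Parent=g2",
]
table = ["t1\t0.93"]
for line in add_attribute(gff, table, attribute="score", attribute_column=1):
    print(line)
```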
 
''AddAttribute'' may be called with
 
java -jar GeMoMa-1.9.jar CLI AddAttribute
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td>annotation (annotation file, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">attribute</font></td>
<td>attribute (the name of the attribute that is added to the annotation)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>table (a tab-delimited file containing IDs and additional attribute, type = tabular)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">type</font></td>
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">ac</font></td>
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td>
<td style="width:100px;">INT</td>
</tr>
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;
 
 
=== GAFComparison ===
 
This tool compares results from GAF based on the attributes ref-gene and alternative.
Hence, you can compare the annotations of different genomes or the effect of different parameters on the annotation of one genome.
 
''GAFComparison'' may be called with
 
java -jar GeMoMa-1.9.jar CLI GAFComparison
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>tag (the tag used to read the GAF annotations, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td>name (a simple name for the organism)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>gene annotation file (GFF file containing the gene annotations (predicted by GAF), type = gff,gff3,gff.gz,gff3.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td>split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">d</font></td>
<td>differences (a switch to decide whether only genes with differences should be returned, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI GAFComparison n=&lt;name&gt; g=&lt;gene_annotation_file&gt;
 
 
=== Analyzer ===
 
This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.
 
True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure equal to the absolute value of this entry, but that another true transcript matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are also listed. Besides the attributes of the true and the predicted annotation, the table contains some additional columns that make it easy to filter interesting examples and to compute statistics.
 
The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.
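
The nucleotide F1 measure described above can be sketched in a few lines of Python. The sketch treats a transcript as a list of 1-based inclusive CDS (or exon) intervals on one strand of one sequence and ignores strand and sequence names. These simplifications, the function names and the example coordinates are assumptions made for illustration and do not reproduce the implementation of the ''Analyzer''.

```python
def positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def nucleotide_f1(true_cds, predicted_cds):
    """F1 of the predicted versus the true positions of one transcript pair."""
    truth, pred = positions(true_cds), positions(predicted_cds)
    tp = len(truth & pred)  # positions present in both transcripts
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    sensitivity = tp / len(truth)
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_true_transcript(predicted_cds, true_transcripts):
    """Return (id, F1) of the true transcript with the highest F1."""
    return max(((tid, nucleotide_f1(cds, predicted_cds))
                for tid, cds in true_transcripts.items()),
               key=lambda item: item[1])


truth = {"T1": [(1, 100), (201, 300)], "T2": [(1, 100)]}
prediction = [(1, 100), (201, 250)]
print(best_true_transcript(prediction, truth))
```

In this example the prediction covers 150 of the 200 positions of T1 without any false positives, so T1 is preferred over T2, which is fully covered but only explains 100 of the 150 predicted positions.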
 
''Analyzer'' may be called with
 
java -jar GeMoMa-1.9.jar CLI Analyzer
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">t</font></td>
<td>truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">n</font></td>
<td>name (can be used to distinguish different predictions, OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">p</font></td>
<td>predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>CDS (if true CDS features are used otherwise exon features, default = true)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">o</font></td>
<td>only introns (if true only intron borders (=splice sites) are evaluated, default = false)</td>
<td style="width:100px;">BOOLEAN</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">w</font></td>
<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">ca</font></td>
<td>common attributes (Only GFF attributes of mRNAs that can be found in the given portion of all mRNAs are included in the result table. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table and a more compact table., valid range = [0.0, 1.0], default = 0.5)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr>
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>filter (A filter for deciding which transcripts from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td>
<td style="width:100px;">STRING</td>
</tr>
</table></td></tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;
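
The default value of the filter parameter <code>f</code> listed in the table above can be read as a simple predicate on the GFF attributes of one true transcript. The following Python sketch is a hand-written rendering of that expression only. How GeMoMa parses and evaluates filter expressions is not described here, so the helper names and the handling of missing or non-numeric values are assumptions.

```python
import math


def to_number(text):
    """Convert an attribute value to float; anything unparsable becomes NaN."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return math.nan


def is_reliable(attrs):
    """start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1))"""
    tie = to_number(attrs.get("tie"))
    return (attrs.get("start") == "M"
            and attrs.get("stop") == "*"
            and to_number(attrs.get("nps")) == 0
            and to_number(attrs.get("tpc")) == 1
            and (math.isnan(tie) or tie == 1))


# A complete, fully covered single-exon transcript without intron evidence value.
print(is_reliable({"start": "M", "stop": "*", "nps": "0", "tpc": "1", "tie": "NA"}))
# Only half of the introns are supported.
print(is_reliable({"start": "M", "stop": "*", "nps": "0", "tpc": "1", "tie": "0.5"}))
```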
 
 
=== BUSCORecomputer ===
 
This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with '''Extractor''' and then used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the '''Extractor''' can be used as input for this tool. Alternatively, a table generated from the annotation file can be used instead of the assignment file.
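
The idea of moving from transcript-level to gene-level statistics can be sketched as follows. The sketch assumes that each row of the BUSCO table provides a BUSCO ID, a status (Complete, Duplicated, Fragmented or Missing) and a transcript/protein ID, and that a BUSCO counts as duplicated only if its complete hits belong to more than one gene. Both the simplified row layout and the decision rule are assumptions for illustration. The function name is hypothetical and the subgenome parameter is not considered.

```python
from collections import Counter, defaultdict


def recompute_busco(rows, transcript_to_gene):
    """Summarize BUSCO statuses per gene instead of per transcript.

    rows: iterable of (busco_id, status, transcript_id) tuples.
    transcript_to_gene: dict mapping transcript/protein ID to gene ID.
    """
    complete_genes = defaultdict(set)
    fragmented = set()
    all_ids = set()
    for busco_id, status, transcript_id in rows:
        all_ids.add(busco_id)
        if status in ("Complete", "Duplicated"):
            complete_genes[busco_id].add(transcript_to_gene[transcript_id])
        elif status == "Fragmented":
            fragmented.add(busco_id)

    summary = Counter()
    for busco_id in all_ids:
        genes = complete_genes.get(busco_id, set())
        if len(genes) > 1:
            summary["Duplicated"] += 1   # complete hits in several genes
        elif len(genes) == 1:
            summary["Complete"] += 1     # several isoforms of one gene collapse
        elif busco_id in fragmented:
            summary["Fragmented"] += 1
        else:
            summary["Missing"] += 1
    return summary


rows = [
    ("B1", "Duplicated", "t1.1"), ("B1", "Duplicated", "t1.2"),  # two isoforms, one gene
    ("B2", "Duplicated", "t2.1"), ("B2", "Duplicated", "t3.1"),  # two different genes
    ("B3", "Fragmented", "t4.1"),
    ("B4", "Missing", None),
]
genes = {"t1.1": "g1", "t1.2": "g1", "t2.1": "g2", "t3.1": "g3", "t4.1": "g4"}
print(dict(recompute_busco(rows, genes)))
```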
 
''BUSCORecomputer'' may be called with
 
java -jar GeMoMa-1.9.jar CLI BUSCORecomputer
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">b</font></td>
<td>BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">s</font></td>
<td>subgenome (regex for contigs/chromosomes of this subgenome)</td>
<td style="width:100px;">STRING</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI BUSCORecomputer b=&lt;BUSCO&gt; i=&lt;IDs&gt;
 
 
=== GFFAttributes ===
 
Annotations that are built with '''GeMoMaPipeline''' or augmented with '''AnnotationEvidence''' have many attributes that might be interesting for the user. This module creates a simple table that can easily be parsed and used for visualizing statistics. However, the module can also be used for annotations that were not created or modified with GeMoMa modules.
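
What such an attribute table could look like is sketched below in Python. The sketch collects the attributes of all features of one type from the ninth GFF column and writes one tab-delimited row per feature, filling absent attributes with a user-defined value, analogous to the parameters <code>f</code> and <code>m</code> of this module. The exact column order and output layout of ''GFFAttributes'' are not specified on this page, so the layout and the function name are assumptions.

```python
def gff_attributes_table(gff_lines, feature="mRNA", missing=""):
    """Return tab-delimited rows (header first) of the attributes of one feature type."""
    records, columns = [], []
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != feature:
            continue
        attrs = {}
        for item in cols[8].split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                key = key.strip()
                attrs[key] = value.strip()
                if key not in columns:
                    columns.append(key)  # keep first-seen order of attributes
        records.append(attrs)
    rows = ["\t".join(columns)]
    rows += ["\t".join(rec.get(col, missing) for col in columns) for rec in records]
    return rows


gff = [
    "chr1\tGeMoMa\tmRNA\t1\t100\t.\t+\t.\tID=t1;tie=1;tpc=0.9",
    "chr1\tGeMoMa\tCDS\t1\t100\t.\t+\t0\tParent=t1",
    "chr1\tGeMoMa\tmRNA\t200\t300\t.\t-\t.\tID=t2;aa=33",
]
for row in gff_attributes_table(gff, missing="NA"):
    print(row)
```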
 
''GFFAttributes'' may be called with
 
java -jar GeMoMa-1.9.jar CLI GFFAttributes
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">a</font></td>
<td>annotation (GFF file containing the gene annotations, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>feature (the feature which is used to parse the attributes, default = mRNA)</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>missing (the value used for missing attributes of a feature, default = )</td>
<td style="width:100px;">STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
 
'''Example:'''
 
java -jar GeMoMa-1.9.jar CLI GFFAttributes a=&lt;annotation&gt;
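
To illustrate the idea, the following sketch turns the attribute column of all features of one type into a table and fills absent attributes with a missing value. It only mirrors the parameters <code>f</code> and <code>m</code> described above. The function <code>attribute_table</code>, the example GFF lines, and the column layout of the result are assumptions, not the module's actual output format.

```python
def attribute_table(gff_lines, feature="mRNA", missing=""):
    """Collect the attributes (GFF column 9) of all lines with the given feature type."""
    keys = []   # attribute names in order of first appearance
    rows = []   # one dict of attributes per selected feature
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        attrs = {}
        for entry in cols[8].split(";"):
            if "=" in entry:
                key, value = entry.split("=", 1)
                attrs[key.strip()] = value.strip()
        for key in attrs:
            if key not in keys:
                keys.append(key)
        rows.append(attrs)
    # Header first, then one row per feature with the missing value filled in.
    return [keys] + [[attrs.get(key, missing) for key in keys] for attrs in rows]


# Hypothetical annotation lines.
gff = [
    "##gff-version 3",
    "chr1\tGeMoMa\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1;score=420;aa=150",
    "chr1\tGeMoMa\tmRNA\t100\t700\t.\t+\t.\tID=g1.t2;Parent=g1;tie=1",
]
table = attribute_table(gff, missing="NA")
for row in table:
    print("\t".join(row))
```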
 
 
=== Transcribed Cluster ===
 
'''What it does'''
 
This tool computes ... .
 
 
 
''Transcribed Cluster'' may be called with
 
java -jar GeMoMa-1.9.jar CLI TranscribedCluster
 
and has the following parameters
 
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">g</font></td>
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff, OPTIONAL)</td>
<td style="width:100px;">FILE</td>
</tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">r</font></td>
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td>
<td style="width:100px;">INT</td>
</tr>
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr style="vertical-align:top">
<td><font color="green">c</font></td>
<td>coverage file (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td>
<td style="width:100px;">STRING</td></tr>
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%">
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_unstranded</font></td>
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_forward</font></td>
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">coverage_reverse</font></td>
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td>
<td style="width:100px;">FILE</td>
</tr>
</table></td></tr>
</table>
</td></tr>
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td>minimal gap (the minimal gap between two transcribed clusters, otherwise these will be merged, valid range = [0, 2147483647], default = 50)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>

'''Example:'''

 java -jar GeMoMa-1.9.jar CLI TranscribedCluster g=&lt;genome&gt; coverage_unstranded=&lt;coverage_unstranded&gt;
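
The parameter <code>m</code> (minimal gap) states that two transcribed clusters closer than the minimal gap are merged. A minimal sketch of such a merging step is given below. It is only an illustration: the function <code>merge_clusters</code> and the example intervals are hypothetical, intervals are assumed to be half-open as in bedGraph, and whether the module treats a gap of exactly the minimal size as separate is an assumption.

```python
def merge_clusters(intervals, minimal_gap=50):
    """Merge (start, end) intervals whose gap is smaller than minimal_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < minimal_gap:
            # Gap too small: extend the previous cluster.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical covered regions of one contig.
covered = [(0, 100), (120, 200), (260, 300), (1000, 1100)]
clusters = merge_clusters(covered, minimal_gap=50)
print(clusters)
```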

Latest revision as of 20:04, 16 July 2022

This page describes the parameters of all GeMoMa modules.
If you have any questions, comments or bugs, please check the FAQs, our github page or contact Jens Keilwagen.

GeMoMa pipeline

This tool runs the complete GeMoMa pipeline. It is multi-threaded and can use all compute cores of one machine, but it cannot distribute work across several machines, as for instance in a compute cluster. It basically runs: Extract RNA-seq evidence (ERE), DenoiseIntrons, Extractor, external search (tblastn or mmseqs), Gene Model Mapper (GeMoMa), GeMoMa Annotation Filter (GAF), and AnnotationFinalizer.

GeMoMa pipeline may be called with

java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline

and has the following parameters

name comment type

t target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) FILE
The following parameter(s) can be used zero or multiple times:
s species (data for reference species, range={own, pre-extracted}, default = own) STRING
Parameters for selection "own":
i ID (ID to distinguish the different reference species, OPTIONAL) STRING
a annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
g genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) FILE
w weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL) DOUBLE
ai annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL) FILE
Parameters for selection "pre-extracted":
i ID (ID to distinguish the different reference species, OPTIONAL) STRING
c cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna) FILE
a assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL) FILE
w weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL) DOUBLE
ai annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL) FILE
The following parameter(s) can be used zero or multiple times:
ID ID (ID to distinguish the different external annotations of the target organism, OPTIONAL) STRING
e external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, type = gff,gff3,gtf) FILE
weight weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL) DOUBLE
ae annotation evidence (run AnnotationEvidence on this external annotation, default = true) BOOLEAN
selected selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL) FILE
gc genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) FILE
tblastn tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false) BOOLEAN
tag tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) STRING
r RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "MAPPED":
ERE.s Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) STRING
The following parameter(s) can be used multiple times:
ERE.m mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam) FILE
ERE.v ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT) STRING
ERE.u use secondary alignments (allows to filter flags in the SAM or BAM, default = true) BOOLEAN
ERE.c coverage (allows to output the coverage, default = true) BOOLEAN
ERE.mmq minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40) INT
ERE.mc minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1) INT
ERE.maximumcoverage maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL) INT
ERE.f filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
ERE.r region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10) INT
ERE.n number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3) INT
ERE.e evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0) DOUBLE
ERE.mil minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0) INT
ERE.repositioning repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL) FILE
Parameters for selection "EXTRACTED":
The following parameter(s) can be used multiple times:
introns introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3) FILE
The following parameter(s) can be used zero or multiple times:
coverage coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED) STRING
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
d denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE) STRING
Parameters for selection "DENOISE":
DenoiseIntrons.m maximum intron length (The maximum length of an intron, default = 15000) INT
DenoiseIntrons.me minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01) DOUBLE
DenoiseIntrons.c context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10) INT
No parameters for selection "RAW"
Extractor.u upcase IDs (whether the IDs in the GFF should be upcased, default = false) BOOLEAN
Extractor.r repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) BOOLEAN
Extractor.a Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS) STRING
Extractor.d discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true) BOOLEAN
Extractor.s stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) BOOLEAN
Extractor.f full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) BOOLEAN
GeMoMa.r reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) INT
GeMoMa.s splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true) BOOLEAN
GeMoMa.sm substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL) FILE
GeMoMa.g gap opening (The gap opening cost in the alignment, default = 11) INT
GeMoMa.ge gap extension (The gap extension cost in the alignment, default = 1) INT
GeMoMa.m maximum intron length (The maximum length of an intron, default = 15000) INT
GeMoMa.sil static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true) BOOLEAN
GeMoMa.i intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25) INT
GeMoMa.rf reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10) INT
GeMoMa.e e-value (The e-value for filtering blast results, default = 100.0) DOUBLE
GeMoMa.c contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4) DOUBLE
GeMoMa.h hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9) DOUBLE
GeMoMa.o output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC) STRING
Parameters for selection "STATIC":
GeMoMa.p predictions (The (maximal) number of predictions per transcript, default = 10) INT
Parameters for selection "DYNAMIC":
GeMoMa.f factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8) DOUBLE
GeMoMa.a avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true) BOOLEAN
GeMoMa.approx approx (whether an approximation is used to compute the score for intron gain, default = true) BOOLEAN
GeMoMa.pa protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true) BOOLEAN
GeMoMa.v verbose (A flag which allows to output a wealth of additional information per transcript, default = false) BOOLEAN
GeMoMa.t timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600) LONG
GeMoMa.ru replace unknown (Replace unknown amino acid symbols by X, default = false) BOOLEAN
GeMoMa.Score Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign) STRING
GAF.d default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce) STRING
GAF.k kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
GAF.m minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000) INT
GAF.c cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2) INT
GAF.g good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1) INT
GAF.t trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL) STRING
No parameters for selection "GLOBAL"
Parameters for selection "LOCAL":
GAF.margin margin (the number of bp upstream and downstream of a predictions used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000) INT
GAF.q quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2) DOUBLE
GAF.f filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL) STRING
GAF.s sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa) STRING
GAF.l length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL) DOUBLE
GAF.a alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL) STRING
GAF.cbf common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75) DOUBLE
GAF.mnotpg maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647) INT
GAF.aat add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false) BOOLEAN
GAF.tf transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false) BOOLEAN
AnnotationFinalizer.t transfer features (if true other features than gene, <tag> (default: mRNA), and CDS of the input will be written in the output, default = false) BOOLEAN
AnnotationFinalizer.u UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
AnnotationFinalizer.a additional source suffix (a suffix for source values of UTR features, default = ) STRING
AnnotationFinalizer.r rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED) STRING
Parameters for selection "COMPOSED":
AnnotationFinalizer.p prefix (the prefix of the generic name) STRING
AnnotationFinalizer.i infix (the infix of the generic name, default = G) STRING
AnnotationFinalizer.s suffix (the suffix of the generic name, default = 0) STRING
AnnotationFinalizer.d digits (the number of informative digits, valid range = [4, 10], default = 5) INT
AnnotationFinalizer.c contig search pattern (search string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = ) STRING
AnnotationFinalizer.crp contig replace pattern (replace string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = ) STRING
Parameters for selection "SIMPLE":
AnnotationFinalizer.p prefix (the prefix of the generic name) STRING
AnnotationFinalizer.d digits (the number of informative digits, valid range = [4, 10], default = 5) INT
No parameters for selection "NO"
AnnotationFinalizer.n name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true) BOOLEAN
sc synteny check (run SyntenyChecker if possible, default = true) BOOLEAN
p predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true) BOOLEAN
pc predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false) BOOLEAN
pgr predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false) BOOLEAN
o output individual predictions (If *true*, returns the predictions for each reference species, default = false) BOOLEAN
debug debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true) BOOLEAN
restart restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false) BOOLEAN
b BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL) STRING
m MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL) STRING
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 INT

Example:

java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline a=<reference_annotation> g=<reference_genome> t=<target_genome> AnnotationFinalizer.p=<prefix>
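
The default value of GAF.f, start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), can be read as a predicate over the GFF attributes of a prediction. The sketch below re-expresses this default filter in Python to make its logic explicit. It is not GeMoMa's filter engine: the helper names and example attribute values are hypothetical, and treating unparsable values as NaN and guarding against aa = 0 are assumptions.

```python
import math


def to_number(value):
    """Convert an attribute value to float; absent or unparsable values become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


def passes_default_filter(attrs):
    """start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)"""
    score = to_number(attrs.get("score"))
    aa = to_number(attrs.get("aa"))
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    # Assumption: a non-positive or missing aa value never satisfies the relative score.
    relative_ok = aa > 0 and score / aa >= 0.75
    return complete and (math.isnan(score) or relative_ok)


# Hypothetical predictions described by their GFF attributes.
good = {"start": "M", "stop": "*", "score": "300", "aa": "350"}
weak = {"start": "M", "stop": "*", "score": "200", "aa": "350"}
external = {"start": "M", "stop": "*"}            # no score attribute
partial = {"start": "V", "stop": "*", "score": "300", "aa": "350"}

for name, attrs in [("good", good), ("weak", weak), ("external", external), ("partial", partial)]:
    print(name, passes_default_filter(attrs))
```

The third case shows why isNaN(score) is part of the default: external annotations carry no score attribute and would otherwise be discarded.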


Extract RNA-seq Evidence

This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool DenoiseIntrons. Introns and coverage results can be used in GeMoMa to improve the predictions and might help to select better gene models in GAF. In addition, introns and coverage can be used to predict UTRs by AnnotationFinalizer.

Extract RNA-seq Evidence may be called with

java -jar GeMoMa-1.9.jar CLI ERE

and has the following parameters

name comment type

s Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) STRING
The following parameter(s) can be used multiple times:
m mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam) FILE
v ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT) STRING
u use secondary alignments (allows to filter flags in the SAM or BAM, default = true) BOOLEAN
c coverage (allows to output the coverage, default = true) BOOLEAN
mmq minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40) INT
mc minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1) INT
maximumcoverage maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL) INT
f filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
r region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10) INT
n number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3) INT
t target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) FILE
e evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0) DOUBLE
mil minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0) INT
repositioning repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI ERE m=<mapped_reads_file>
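
Introns are derived from split reads, i.e., reads whose CIGAR string contains an N operation, and the parameter mc refers to the length of M stretches in that string. The following sketch shows how intron coordinates and M stretch lengths can be read from a single alignment. It is an illustration only and does not use ERE code: the function names are hypothetical, and requiring every M stretch of the read to reach the minimum context is one possible reading of the parameter.

```python
import re

# One CIGAR operation: a length followed by an operation character (SAM specification).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_split_read(pos, cigar):
    """Return (introns, match_lengths) for an alignment starting at 1-based position pos.

    Introns are reported as (start, end) in 1-based inclusive reference coordinates.
    """
    introns = []
    match_lengths = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        elif op == "M":
            match_lengths.append(length)
        if op in "MDN=X":
            ref += length  # only these operations consume the reference
    return introns, match_lengths


def has_minimum_context(match_lengths, minimum_context=1):
    """Assumption: every M stretch of the read must be at least minimum_context long."""
    return bool(match_lengths) and min(match_lengths) >= minimum_context


introns, matches = parse_split_read(101, "30M500N45M")
print(introns, matches, has_minimum_context(matches, 5))
```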


CheckIntrons

The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.

CheckIntrons may be called with

java -jar GeMoMa-1.9.jar CLI CheckIntrons

and has the following parameters

name comment type

t target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta) FILE
The following parameter(s) can be used multiple times:
i introns (Introns (GFF), which might be obtained from RNA-seq, type = gff) FILE
v verbose (A flag which allows to output a wealth of additional information per transcript, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI CheckIntrons t=<target_genome> i=<introns>
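
The dinucleotide statistics can be pictured as counting the first and last two bases of each intron on its own strand. The sketch below does this for a toy genome. The function names, the toy sequence, and the representation of introns as tuples are assumptions for illustration, and the real module may report its statistics differently.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def splice_site_dinucleotides(genome, introns):
    """Count donor-acceptor dinucleotide pairs.

    genome: dict mapping contig name to sequence.
    introns: tuples (contig, start, end, strand) with 1-based inclusive coordinates as in GFF.
    """
    counts = Counter()
    for contig, start, end, strand in introns:
        seq = genome[contig][start - 1:end].upper()
        if strand == "-":
            seq = reverse_complement(seq)
        counts[seq[:2] + "-" + seq[-2:]] += 1
    return counts


# Toy genome: one GT...AG intron on each strand.
genome = {"chr1": "AAAAA" + "GTCCCCAG" + "TTTTT" + "CTGGGGAC" + "AAAAA"}
introns = [("chr1", 6, 13, "+"), ("chr1", 19, 26, "-")]
counts = splice_site_dinucleotides(genome, introns)
print(counts)
```

A strong excess of pairs other than the canonical GT-AG, or a canonical pair that only appears on the opposite strand, would hint at a wrong strandedness setting during intron extraction.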


DenoiseIntrons

This module analyzes introns extracted by ERE. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules GeMoMa, AnnotationEvidence, and AnnotationFinalizer.

DenoiseIntrons may be called with

java -jar GeMoMa-1.9.jar CLI DenoiseIntrons

and has the following parameters

name comment type

The following parameter(s) can be used multiple times:
i introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3) FILE
The following parameter(s) can be used multiple times:
c coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED) STRING
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
m maximum intron length (The maximum length of an intron, default = 15000) INT
me minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01) DOUBLE
context context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded>
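
The two removal criteria, intron length and relative expression, can be sketched as follows. This page does not define how the relative expression is computed, so the formula used here is purely an assumption for illustration: split reads divided by the larger mean coverage of the two context windows. The function names and the example data are hypothetical.

```python
def mean_coverage(coverage, start, end):
    """Mean per-base coverage over the 1-based inclusive range; absent positions count as 0."""
    length = end - start + 1
    if length <= 0:
        return 0.0
    return sum(coverage.get(pos, 0) for pos in range(start, end + 1)) / length


def denoise(introns, coverage, max_length=15000, min_expression=0.01, context=10):
    """Keep introns (start, end, reads) that are short enough and sufficiently expressed."""
    kept = []
    for start, end, reads in introns:
        if end - start + 1 > max_length:
            continue  # longer than the maximum intron length
        upstream = mean_coverage(coverage, start - context, start - 1)
        downstream = mean_coverage(coverage, end + 1, end + context)
        flank = max(upstream, downstream)
        # Assumed definition of relative expression: split reads / flanking coverage.
        if flank > 0 and reads / flank < min_expression:
            continue
        kept.append((start, end, reads))
    return kept


# Hypothetical per-base coverage of two exons flanking an intron at 101..200.
coverage = {pos: 1000 for pos in list(range(91, 101)) + list(range(201, 211))}
introns = [(101, 200, 500), (101, 180, 5), (300, 20000, 50)]
kept = denoise(introns, coverage)
print(kept)
```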


NCBI Reference Retriever

This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start GeMoMaPipeline or Extractor.

NCBI Reference Retriever may be called with

java -jar GeMoMa-1.9.jar CLI NRR

and has the following parameters

name comment type

r reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/) STRING
n number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10) INT
rl reference list (a list of reference organisms, type = txt) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI NRR rl=<reference_list>


Extractor

This tool can be used to create input files for GeMoMa, i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, Extractor can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.
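
The two-column genetic code file described above can be pictured with the following sketch, which reads such a table and translates a short CDS. It is not Extractor code. In particular, the separator inside the triplet list is an assumption (commas or whitespace are accepted here), the function names and the tiny code table are hypothetical, and unknown triplets are written as X in the spirit of the AMBIGUOUS option.

```python
def read_genetic_code(lines):
    """Map each triplet to its amino acid from lines of the form 'amino_acid<TAB>triplets'."""
    codon_to_aa = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        amino_acid, triplets = line.split("\t", 1)
        # Assumption: triplets are separated by commas and/or whitespace.
        for triplet in triplets.replace(",", " ").split():
            codon_to_aa[triplet.upper()] = amino_acid
    return codon_to_aa


def translate(cds, codon_to_aa):
    """Translate complete triplets; triplets missing from the table become X."""
    usable = len(cds) - len(cds) % 3
    return "".join(codon_to_aa.get(cds[i:i + 3].upper(), "X") for i in range(0, usable, 3))


# A deliberately tiny, hypothetical code table.
table_lines = ["M\tATG", "K\tAAA,AAG", "*\tTAA,TAG,TGA"]
code = read_genetic_code(table_lines)
print(translate("ATGAAGTAA", code))
print(translate("ATGNNNTAA", code))
```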

Extractor may be called with

java -jar GeMoMa-1.9.jar CLI Extractor

and has the following parameters

name comment type

a annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
g genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) FILE
gc genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) FILE
p proteins (whether the complete protein sequences should be returned as output, default = false) BOOLEAN
c cds (whether the complete CDSs should be returned as output, default = false) BOOLEAN
genomic genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false) BOOLEAN
i introns (whether introns should be extracted from annotation, that might be used for test cases, default = false) BOOLEAN
identical identical (if the CDS is identical, Extractor only uses one transcript. This parameter allows to return a table that lists in the first column the used transcript and in the second column the discarded transcript. If no transcript is discarded, the list is empty., default = false) BOOLEAN
u upcase IDs (whether the IDs in the GFF should be upcased, default = false) BOOLEAN
r repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) BOOLEAN
s selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL) FILE
Ambiguity Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION) STRING
d discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true) BOOLEAN
sefc stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) BOOLEAN
f full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) BOOLEAN
l long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false) BOOLEAN
v verbose (A flag which allows to output a wealth of additional information, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI Extractor a=<annotation> g=<genome>


GeneModelMapper

This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).

As a first step, you should run Extractor to obtain cds parts and assignment. Second, you should run a search algorithm, e.g. tblastn or mmseqs, with cds parts as query. Finally, these search results are used in GeMoMa. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter sort. If you would like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in query cds parts and leave assignment unselected.
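
The clustering requirement above can be pictured with a small Python sketch that sorts tabular search results by their first column. The hit lines and IDs are hypothetical and the function name is an assumption for this example; in practice the parameter sort of GeMoMa can be used instead of sorting externally.

```python
def sort_search_results(lines):
    """Return tabular search result lines sorted by their first column (the query ID)."""
    records = [line for line in lines if line.strip()]
    # sorted() is stable, so hits of the same query keep their original order.
    return sorted(records, key=lambda line: line.split("\t", 1)[0])

# Hypothetical, shortened hit lines: query ID, target contig, identity.
hits = [
    "geneB_cds0\tchr1\t92.0",
    "geneA_cds1\tchr2\t88.5",
    "geneB_cds0\tchr3\t71.0",
    "geneA_cds0\tchr2\t95.1",
]
clustered = sort_search_results(hits)
```

After sorting, all hits of one query are adjacent, so hits belonging to the same reference gene end up in one block.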

If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run ERE on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run DenoiseIntrons to remove such spurious introns. Finally, you can use the obtained introns (and coverage) in GeMoMa.

If you like to obtain multiple predictions per gene model of the reference organism, you should set predictions accordingly. In addition, we suggest to decrease the value of contig threshold allowing GeMoMa to evaluate more candidate contigs/chromosomes.

If you change the values of contig threshold, region threshold and hit threshold, this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.

You can filter your predictions using GAF, which also allows for combining predictions from different reference organisms.

Finally, you can predict UTRs and rename predictions using AnnotationFinalizer.

If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module GeMoMaPipeline.

GeneModelMapper may be called with

java -jar GeMoMa-1.9.jar CLI GeMoMa

and has the following parameters

name comment type

s search results (The search results, e.g., from tblastn or mmseqs, type = tabular) FILE
t target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) FILE
c cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna) FILE
a assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL) FILE
The following parameter(s) can be used zero or multiple times:
i introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3) FILE
r reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) INT
splice splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true) BOOLEAN
The following parameter(s) can be used zero or multiple times:
coverage coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED) STRING
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
g genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) FILE
sm substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL) FILE
go gap opening (The gap opening cost in the alignment, default = 11) INT
ge gap extension (The gap extension cost in the alignment, default = 1) INT
m maximum intron length (The maximum length of an intron, default = 15000) INT
sil static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true) BOOLEAN
intron-loss-gain-penalty intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25) INT
rf reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10) INT
e e-value (The e-value for filtering blast results, default = 100.0) DOUBLE
ct contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4) DOUBLE
h hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9) DOUBLE
o output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC) STRING
Parameters for selection "STATIC":
p predictions (The (maximal) number of predictions per transcript, default = 10) INT
Parameters for selection "DYNAMIC":
f factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8) DOUBLE
selected selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, type = tabular,txt, OPTIONAL) FILE
as avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true) BOOLEAN
approx approx (whether an approximation is used to compute the score for intron gain, default = true) BOOLEAN
pa protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true) BOOLEAN
prefix prefix (A prefix to be used for naming the predictions, default = ) STRING
tag tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) STRING
v verbose (A flag which allows to output a wealth of additional information per transcript, default = false) BOOLEAN
timeout timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600) LONG
sort sort (A flag which allows to sort the search results, default = false) BOOLEAN
ru replace unknown (Replace unknown amino acid symbols by X, default = false) BOOLEAN
Score Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts>
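
The DYNAMIC output criterion listed above (score >= factor*Math.max(0,bestScore)) can be sketched in Python as follows. It is assumed here that bestScore is the highest score among the predictions of one reference transcript; the function name is hypothetical and this is not the GeMoMa implementation.

```python
def select_dynamic(scores, factor=0.8):
    """Keep predictions with score >= factor * max(0, best_score)."""
    if not scores:
        return []
    # Assumption: the best score is taken over all predictions of one reference transcript.
    best_score = max(scores)
    threshold = factor * max(0, best_score)
    return [score for score in scores if score >= threshold]
```

With the default factor of 0.8 and a best score of 100, only predictions scoring at least 80 are kept. If the best score is negative, the threshold is 0, so under this reading no prediction passes.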


GeMoMa Annotation Filter

This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.

The algorithm does the following: First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced). Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation. Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.

Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript. Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.
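
Whether a tab-delimited annotation info file provides the columns named above can be checked with a short Python sketch. Treating the three names as regular expressions that must match a whole column name, and matching case-sensitively, are assumptions of this example; the header in the usage note is hypothetical.

```python
import re

# Column names as given in the text; ".*defline" is a regular expression.
REQUIRED_COLUMN_PATTERNS = ["transcriptName", "GO", ".*defline"]

def missing_annotation_columns(header_line):
    """Return the required column patterns that no column of the header matches."""
    columns = header_line.rstrip("\n").split("\t")
    missing = []
    for pattern in REQUIRED_COLUMN_PATTERNS:
        # fullmatch: the whole column name has to match the pattern.
        if not any(re.fullmatch(pattern, column) for column in columns):
            missing.append(pattern)
    return missing
```

For a header such as "locusName, transcriptName, GO, Best-hit-arabi-defline" (tab-separated) the function returns an empty list, i.e. no required column is missing.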

Initially, GAF was built to combine gene predictions obtained from GeMoMa. It allows to combine the predictions from multiple reference organisms, but works also using only one reference organism. However, GAF also allows to integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running AnnotationEvidence for each of these input files to add additional attributes that can be used for sorting and filtering within GAF. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.

GeMoMa Annotation Filter may be called with

java -jar GeMoMa-1.9.jar CLI GAF

and has the following parameters

name comment type

t tag (the tag used to read the GeMoMa annotations, default = mRNA) STRING
The following parameter(s) can be used multiple times:
p prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL) STRING
w weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL) DOUBLE
g gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3) FILE
a annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL) FILE
d default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce) STRING
k kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
m minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000) INT
c cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2) INT
gc good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1) INT
trend trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL) STRING
No parameters for selection "GLOBAL"
Parameters for selection "LOCAL":
margin margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000) INT
q quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2) DOUBLE
f filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL) STRING
s sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa) STRING
i intermediate result (a switch to decide whether an intermediate result of filtered predictions that are not combined to genes should be returned, default = false) BOOLEAN
l length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL) DOUBLE
atf alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL) STRING
cbf common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75) DOUBLE
mnotpg maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647) INT
aat add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false) BOOLEAN
tf transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI GAF g=<gene_annotation_file>
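
To make the default filter of GAF more tangible, the expression start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75) is rendered below as a Python predicate over the GFF attributes of one prediction. This is only a sketch for reading the expression. It is not the parser used by GAF, and treating a missing score as NaN mirrors the "default attributes" parameter only by assumption.

```python
import math

def passes_default_filter(attributes):
    """Python rendering of: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    # Assumption: a missing score is treated as NaN, as for external annotations.
    score = float(attributes.get("score", "nan"))
    complete = attributes.get("start") == "M" and attributes.get("stop") == "*"
    return complete and (math.isnan(score) or score / float(attributes["aa"]) >= 0.75)
```

A complete prediction with score 400 and 500 amino acids has a relative score of 0.8 and is kept, whereas the same prediction with score 300 (relative score 0.6) is discarded.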


AnnotationFinalizer

This tool finalizes an annotation. It allows to predict UTRs for annotated coding sequences and to generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use ERE to preprocess the mapped reads.

AnnotationFinalizer may be called with

java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer

and has the following parameters

name comment type

g genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) FILE
a annotation (The predicted genome annotation file (GFF), type = gff,gff3) FILE
t tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) STRING
tf transfer features (if true other features than gene, <tag> (default: mRNA), and CDS of the input will be written in the output, default = false) BOOLEAN
u UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
The following parameter(s) can be used multiple times:
i introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3) FILE
r reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) INT
The following parameter(s) can be used multiple times:
c coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
ass additional source suffix (a suffix for source values of UTR features, default = ) STRING
rename rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED) STRING
Parameters for selection "COMPOSED":
p prefix (the prefix of the generic name) STRING
infix infix (the infix of the generic name, default = G) STRING
s suffix (the suffix of the generic name, default = 0) STRING
d digits (the number of informative digits, valid range = [4, 10], default = 5) INT
csp contig search pattern (search string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = ) STRING
crp contig replace pattern (replace string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = ) STRING
Parameters for selection "SIMPLE":
p prefix (the prefix of the generic name) STRING
d digits (the number of informative digits, valid range = [4, 10], default = 5) INT
No parameters for selection "NO"
n name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix>


Annotation evidence

This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in GAF. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, ... . Please use ERE to preprocess the mapped reads.

Annotation evidence may be called with

java -jar GeMoMa-1.9.jar CLI AnnotationEvidence

and has the following parameters

name comment type

a annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
t tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) STRING
g genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) FILE
The following parameter(s) can be used multiple times:
i introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3, OPTIONAL) FILE
r reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) INT
The following parameter(s) can be used multiple times:
c coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
ao annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true) BOOLEAN
gc genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI AnnotationEvidence a=<annotation> g=<genome>


Attribute2Table

This tool returns a table of best attribute per predicted final annotation.

Attribute2Table may be called with

java -jar GeMoMa-1.9.jar CLI Attribute2Table

and has the following parameters

name comment type

t tag (the tag used to read the GeMoMa annotations, default = mRNA) STRING
f final gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3) FILE
The following parameter(s) can be used multiple times:
p prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL) STRING
r raw gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3) FILE
a add (add the prefix to the gene ID, default = true) BOOLEAN
attribute attribute (the attribute to be checked, default = iAA) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI Attribute2Table f=<final_gene_annotation_file> r=<raw_gene_annotation_file>


Synteny checker

This tool can be used to determine syntenic regions between target organism and reference organism based on similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.

Synteny checker may be called with

java -jar GeMoMa-1.9.jar CLI SyntenyChecker

and has the following parameters

name comment type

t tag (the tag used to read the GeMoMa annotations, default = mRNA) STRING
The following parameter(s) can be used multiple times:
p prefix (the prefix can be used to distinguish predictions from different input files (=reference organisms), OPTIONAL) STRING
a assignment (the assignment file of this reference organism, which combines parts of the CDS to transcripts, type = tabular) FILE
g gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI SyntenyChecker a=<assignment> g=<gene_annotation_file>


AddAttribute

This tool allows to add an additional attribute to specific features of an annotation.

These additional attributes might be used in GAF for filtering or sorting or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.

AddAttribute may be called with

java -jar GeMoMa-1.9.jar CLI AddAttribute

and has the following parameters

name comment type

a annotation (annotation file, type = gff,gff3) FILE
f feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA) STRING
attribute attribute (the name of the attribute that is added to the annotation) STRING
t table (a tab-delimited file containing IDs and additional attribute, type = tabular) FILE
i ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647]) INT
type type (type of additional attribute, range={VALUES, BINARY}, default = VALUES) STRING
Parameters for selection "VALUES":
ac attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647]) INT
No parameters for selection "BINARY"
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI AddAttribute a=<annotation> attribute=<attribute> t=<table> i=<ID_column> ac=<attribute_column>
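
The effect of the VALUES mode can be sketched in Python as follows. The sketch assumes that column indices are 0-based (suggested by the valid range starting at 0) and that table IDs are matched against the ID attribute of the selected features; both points, as well as the function name and the sample data, are assumptions and not taken from the AddAttribute implementation.

```python
def add_attribute(gff_lines, table_lines, feature, attribute, id_column, attribute_column):
    """Append attribute=value to matching features of a GFF3 annotation."""
    # Read the tab-delimited table into an ID -> value lookup (0-based columns assumed).
    values = {}
    for line in table_lines:
        fields = line.rstrip("\n").split("\t")
        values[fields[id_column]] = fields[attribute_column]

    result = []
    for line in gff_lines:
        fields = line.rstrip("\n").split("\t")
        if not line.startswith("#") and len(fields) == 9 and fields[2] == feature:
            # Column 9 holds the attributes as key=value pairs separated by ";".
            pairs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
            feature_id = pairs.get("ID")
            if feature_id in values:
                fields[8] = fields[8].rstrip(";") + f";{attribute}={values[feature_id]}"
        result.append("\t".join(fields))
    return result
```

Features whose ID does not occur in the table, comment lines and features of other types are passed through unchanged in this sketch.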


GAFComparison

This tool allows to compare results from GAF based on the attributes ref-gene and alternative. Hence, you can compare the annotation of different genomes or the effect of different parameters on the annotation of one genome.

GAFComparison may be called with

java -jar GeMoMa-1.9.jar CLI GAFComparison

and has the following parameters

name comment type

t tag (the tag used to read the GAF annotations, default = mRNA) STRING
The following parameter(s) can be used multiple times:
n name (a simple name for the organism) STRING
g gene annotation file (GFF file containing the gene annotations (predicted by GAF), type = gff,gff3,gff.gz,gff3.gz) FILE
s split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false) BOOLEAN
d differences (a switch to decide whether only genes with difference should be returned, default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI GAFComparison n=<name> g=<gene_annotation_file>


Analyzer

This tool allows to compare true annotation with predicted annotation as it is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing true annotation and predicted annotation which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.

True and predicted transcripts are evaluated based on nucleotide F1 measure. For each predicted transcript, the true transcript with highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure value that is the absolute value of this entry, but there is another true transcript that matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are listed as well. The table contains the attributes of the true and the predicted annotation besides some additional columns allowing to easily filter interesting examples and to do statistics.
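
A nucleotide F1 measure of the kind mentioned above can be computed as the harmonic mean of nucleotide precision and nucleotide sensitivity. The following Python sketch assumes that both transcripts lie on the same contig and strand and are given as 1-based, inclusive (start, end) intervals of their CDS or exon parts; the function names are hypothetical and the exact definition used by Analyzer may differ in details.

```python
def covered_positions(parts):
    """Expand 1-based, inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in parts:
        positions.update(range(start, end + 1))
    return positions

def nucleotide_f1(true_parts, predicted_parts):
    """Harmonic mean of nucleotide precision and sensitivity of one transcript pair."""
    true_positions = covered_positions(true_parts)
    predicted_positions = covered_positions(predicted_parts)
    shared = len(true_positions & predicted_positions)
    if shared == 0:
        return 0.0
    precision = shared / len(predicted_positions)
    sensitivity = shared / len(true_positions)
    return 2 * precision * sensitivity / (precision + sensitivity)
```

For a true transcript with parts 1-100 and 201-300 and a prediction with parts 1-100 and 201-250, precision is 1.0, sensitivity is 0.75 and the F1 measure is 6/7, i.e. about 0.857.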

The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.

Analyzer may be called with

java -jar GeMoMa-1.9.jar CLI Analyzer

and has the following parameters

name comment type

t truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
The following parameter(s) can be used multiple times:
n name (can be used to distinguish different predictions, OPTIONAL) STRING
p predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
c CDS (if true CDS features are used otherwise exon features, default = true) BOOLEAN
o only introns (if true only intron borders (=splice sites) are evaluated, default = false) BOOLEAN
w write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
ca common attributes (Only gff attributes of mRNAs are included in the result table, that can be found in the given portion of all mRNAs. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table or compact table., valid range = [0.0, 1.0], default = 0.5) DOUBLE
r reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO) STRING
No parameters for selection "NO"
Parameters for selection "YES":
f filter (A filter for deciding which transcript from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI Analyzer t=<truth> p=<predicted_annotation>
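
To make the default value of the filter parameter f easier to read, the following Python sketch restates it as a predicate over the attributes of a single transcript. It only mirrors the documented expression and is not how GeMoMa evaluates filters; the function name `is_reliable`, the dictionary representation, and the treatment of a missing `tie` attribute as NaN are assumptions.

```python
import math


def is_reliable(attrs):
    """Restate the documented default filter for one transcript's attributes."""
    # Assumption: a transcript without a tie attribute is treated like tie = NaN.
    tie = attrs.get("tie", float("nan"))
    return (
        attrs.get("start") == "M"              # starts with a start codon
        and attrs.get("stop") == "*"           # ends with a stop codon
        and attrs.get("nps") == 0              # no premature stop codons
        and attrs.get("tpc") == 1              # fully covered by RNA-seq
        and (math.isnan(tie) or tie == 1)      # no introns, or all introns supported
    )


complete = {"start": "M", "stop": "*", "nps": 0, "tpc": 1.0, "tie": float("nan")}
truncated = {"start": "M", "stop": "*", "nps": 1, "tpc": 1.0, "tie": 1.0}
```

Here `is_reliable(complete)` is true, while `is_reliable(truncated)` is false because of the premature stop codon.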


BUSCORecomputer

This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with Extractor, and these proteins can then be used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the Extractor can be used as input for this tool. Alternatively, a table can be generated from the annotation file that can be used instead of the assignment file.

BUSCORecomputer may be called with

java -jar GeMoMa-1.9.jar CLI BUSCORecomputer

and has the following parameters

name comment type

b BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular) FILE
i IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular) FILE
The following parameter(s) can be used zero or multiple times:
s subgenome (regex for contigs/chromosomes of this subgenome) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI BUSCORecomputer b=<BUSCO> i=<IDs>
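
If no assignment file from the Extractor is available, the two-column table for parameter i can be derived from the annotation. The following Python sketch shows one possible way for a GFF3 file. It assumes that transcripts are `mRNA` features carrying `ID` and `Parent` attributes and that the transcript IDs match the protein IDs in the BUSCO full table; the function name `gene_transcript_table` is hypothetical.

```python
def gene_transcript_table(gff_lines, feature="mRNA"):
    """Collect (gene ID, transcript ID) pairs from GFF3 lines."""
    rows = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and empty lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # keep only transcript features
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs and "Parent" in attrs:
            rows.append((attrs["Parent"], attrs["ID"]))
    return rows


gff = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1",
    "chr1\tsrc\tmRNA\t1\t800\t.\t+\t.\tID=g1.t2;Parent=g1",
]
# First column: gene ID, second column: transcript/protein ID.
table = "\n".join(f"{gene}\t{tx}" for gene, tx in gene_transcript_table(gff))
```

The string `table` can be written to a tab-separated file and passed as i. GTF files store attributes differently and would need a different parser.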


GFFAttributes

Annotations that are built with GeMoMaPipeline or augmented with AnnotationEvidence have many attributes that might be interesting for the user. This module creates a simple table that can easily be parsed and used for visualization of statistics. The module can also be used for annotations that were not created or modified with GeMoMa modules.

GFFAttributes may be called with

java -jar GeMoMa-1.9.jar CLI GFFAttributes

and has the following parameters

name comment type

a annotation (GFF file containing the gene annotations, type = gff,gff3) FILE
f feature (the feature which is used to parse the attributes, default = mRNA) STRING
m missing (the value used for missing attributes of a feature, default = ) STRING
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI GFFAttributes a=<annotation>
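
The idea behind the module can be sketched in a few lines of Python: collect the attributes of all features of one type and fill gaps with a placeholder. This is only an illustration of the concept, not the module's code; the column order, the header row, and the function name `attribute_table` are assumptions, and the actual output format of GFFAttributes may differ.

```python
def attribute_table(gff_lines, feature="mRNA", missing=""):
    """Return a header row plus one row of attribute values per matching feature."""
    keys, records = [], []
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue  # only the selected feature type is parsed
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        for key in attrs:
            if key not in keys:
                keys.append(key)  # remember attribute names in order of appearance
        records.append(attrs)
    # Attributes absent from a feature are replaced by the 'missing' value.
    return [keys] + [[rec.get(k, missing) for k in keys] for rec in records]


gff = [
    "chr1\tGeMoMa\tmRNA\t1\t900\t.\t+\t.\tID=t1;Parent=g1;tie=1",
    "chr1\tGeMoMa\tmRNA\t950\t1500\t.\t-\t.\tID=t2;Parent=g2",
]
rows = attribute_table(gff, missing="NA")
```

In this example the second transcript has no `tie` attribute, so its row contains `NA` in that column.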


Transcribed Cluster

What it does

This tool computes ... .


Transcribed Cluster may be called with

java -jar GeMoMa-1.9.jar CLI TranscribedCluster

and has the following parameters

name comment type

g genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta) FILE
The following parameter(s) can be used multiple times:
i introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff, OPTIONAL) FILE
r reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) INT
The following parameter(s) can be used multiple times:
c coverage file (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED) STRING
Parameters for selection "UNSTRANDED":
coverage_unstranded coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
Parameters for selection "STRANDED":
coverage_forward coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
coverage_reverse coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph) FILE
m minimal gap (the minimal gap between two transcribed clusters, otherwise these will be merged, valid range = [0, 2147483647], default = 50) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoMa-1.9.jar CLI TranscribedCluster g=<genome> coverage_unstranded=<coverage_unstranded>
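
The effect of the minimal gap parameter m can be pictured with a small Python sketch that merges covered intervals on one contig. It is a guess at the general idea and not the implementation of TranscribedCluster: the function name `merge_clusters`, the half-open bedGraph-style coordinates, and the rule that intervals are merged when their gap is strictly smaller than the minimal gap are all assumptions.

```python
def merge_clusters(intervals, min_gap=50):
    """Merge (start, end) intervals whose gap is smaller than min_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < min_gap:
            # Gap to the previous cluster is too small: extend that cluster.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Hypothetical covered intervals (coverage > 0) on a single contig.
covered = [(0, 100), (120, 200), (400, 450)]
clusters = merge_clusters(covered, min_gap=50)
```

With the default of 50, the first two intervals are separated by only 20 bases and end up in one cluster, while the third stays separate.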