GeMoMa-Docs (https://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1163), revision of 2022-07-16T20:04:10Z<p>Keilwagen: </p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br />
If you have questions or comments, or want to report a bug, please check the [[GeMoMa#Frequently asked questions|FAQs]] and [https://github.com/Jstacs/Jstacs/issues?q=label%3AGeMoMa our GitHub page], or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool runs the complete GeMoMa pipeline. It is multi-threaded and can use all compute cores of a single machine, but it cannot distribute the work across several machines, as for instance in a compute cluster. It essentially runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
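<br />
As a rough illustration of how the modules listed above fit together, the following Python sketch derives a plausible stage list from three of the pipeline parameters documented below (<code>r</code>, <code>d</code> and <code>tblastn</code>). The stage names come from the description above. The function name <code>planned_stages</code> and the selection rules are assumptions made for illustration only: the sketch assumes a reference given as <code>s=own</code>, that ERE only runs for <code>r=MAPPED</code>, and that DenoiseIntrons only runs when RNA-seq evidence is supplied and <code>d=DENOISE</code>. It does not reproduce GeMoMa's actual control flow.<br />

```python
# Hypothetical helper: sketches which modules might run for a configuration.
# The selection rules are assumptions, not taken from the GeMoMa source code.

VALID_R = {"NO", "MAPPED", "EXTRACTED"}   # range of parameter "r"
VALID_D = {"DENOISE", "RAW"}              # range of parameter "d"


def planned_stages(r="NO", d="DENOISE", tblastn=False):
    """Return an ordered list of stage names for the given parameter values."""
    if r not in VALID_R:
        raise ValueError(f"r must be one of {sorted(VALID_R)}, got {r!r}")
    if d not in VALID_D:
        raise ValueError(f"d must be one of {sorted(VALID_D)}, got {d!r}")

    stages = []
    # Assumption: mapped reads (BAM/SAM) must first be processed by ERE.
    if r == "MAPPED":
        stages.append("ERE")
    # Assumption: denoising needs RNA-seq evidence and d=DENOISE.
    if r != "NO" and d == "DENOISE":
        stages.append("DenoiseIntrons")
    stages.append("Extractor")
    # Documented switch: tblastn=true selects tblastn, otherwise mmseqs.
    stages.append("tblastn" if tblastn else "mmseqs")
    stages.extend(["GeMoMa", "GAF", "AnnotationFinalizer"])
    return stages


if __name__ == "__main__":
    # Defaults as documented: r=NO, d=DENOISE, tblastn=false.
    print(" -> ".join(planned_stages()))
    print(" -> ".join(planned_stages(r="MAPPED", tblastn=True)))
```

With the documented defaults, the sketch omits the two RNA-seq modules and uses mmseqs for the external search. Passing <code>r="MAPPED"</code> adds ERE and DenoiseIntrons in front of Extractor.<br />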
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, type = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.repositioning</font></td><br />
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rf</font></td><br />
<td>reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.o</font></td><br />
<td>output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;STATIC&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;DYNAMIC&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.f</font></td><br />
<td>factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.k</font></td><br />
<td>kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.g</font></td><br />
<td>good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.t</font></td><br />
<td>trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;GLOBAL&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;LOCAL&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.margin</font></td><br />
<td>margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.q</font></td><br />
<td>quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.l</font></td><br />
<td>length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.cbf</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.tf</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.t</font></td><br />
<td>transfer features (if true, features of the input other than gene, &lt;tag&gt; (default: mRNA), and CDS will be written to the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.a</font></td><br />
<td>additional source suffix (a suffix for source values of UTR features, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.c</font></td><br />
<td>contig search pattern (search string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.crp</font></td><br />
<td>contig replace pattern (replace string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>synteny check (run SyntenyChecker if possible, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
 java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline a=&lt;reference_annotation&gt; g=&lt;reference_genome&gt; t=&lt;target_genome&gt; AnnotationFinalizer.p=&lt;prefix&gt;<br />
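<br />
Filter expressions such as the default of ''GAF.a'' contain spaces and the character >, which a POSIX shell would otherwise treat as word separators and as an output redirection. The whole key=value argument therefore needs quoting. The following is a minimal sketch of such a call, not an official recipe: the input file names, the prefix, the thread count, and the output directory are hypothetical placeholders, and the helper ''run'' is not part of GeMoMa. If the jar is not found, the helper only prints the arguments, one per line, so the quoting can be inspected.<br />
<br />
```shell
#!/bin/sh
# Hypothetical paths; replace them with your own files.
JAR=GeMoMa-1.9.jar

# Helper (not part of GeMoMa): execute the command if the jar exists,
# otherwise print one argument per line so the quoting can be inspected.
run() {
    if [ -f "$JAR" ]; then
        "$@"
    else
        printf '%s\n' "$@"
    fi
}

# The filter is passed as ONE quoted argument. Without the quotes the shell
# would split it at the spaces and treat '>1' as a redirection to a file named 1.
run java -jar "$JAR" CLI GeMoMaPipeline \
    threads=4 outdir=gemoma_out \
    t=target.fa a=reference.gff g=reference.fa \
    AnnotationFinalizer.p=GENE \
    "GAF.a=tie==1 or sumWeight>1"
```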
<br />
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">repositioning</font></td><br />
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI ERE m=&lt;mapped_reads_file&gt;<br />
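<br />
The ''repositioning'' file listed in the table has three columns: split_chr_name, original_chr_name, and offset_in_original_chr. The following is a minimal sketch that writes such a file for one chromosome that was split into two parts before mapping. The chromosome names, the offsets, and the file names are made up, and the use of tabs without a header line is an assumption based on the type being given as tabular.<br />
<br />
```shell
#!/bin/sh
# Columns: split_chr_name, original_chr_name, offset_in_original_chr
# (names and offsets are made up; tab-separated without header is an assumption)
printf '%s\t%s\t%s\n' \
    chr1_part1 chr1 0 \
    chr1_part2 chr1 500000000 \
    > repositioning.tabular

# Sanity check: every line must have exactly three tab-separated fields.
if awk -F '\t' 'NF != 3 { bad = 1 } END { exit bad }' repositioning.tabular; then
    echo "repositioning.tabular has three columns per line"
fi

# Hypothetical call (reads.bam is a placeholder):
# java -jar GeMoMa-1.9.jar CLI ERE m=reads.bam repositioning=repositioning.tabular
```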
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;<br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;<br />
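<br />
Since intervals with coverage 0 can be left out of the coverage files, removing them beforehand keeps the input small. The following is a minimal sketch, assuming a plain four-column bedGraph (chromosome, start, end, value) without track or comment lines. The file names and values are hypothetical.<br />
<br />
```shell
#!/bin/sh
# Tiny made-up bedGraph: chromosome, start, end, coverage (tab-separated).
printf '%s\t%s\t%s\t%s\n' \
    chr1 0 100 0 \
    chr1 100 250 12 \
    chr1 250 300 0 \
    chr1 300 420 7 \
    > coverage.bedgraph

# Keep only intervals whose coverage (4th column) is not zero.
awk -F '\t' '$4 != 0' coverage.bedgraph > coverage.nonzero.bedgraph

# Hypothetical call (introns.gff is a placeholder):
# java -jar GeMoMa-1.9.jar CLI DenoiseIntrons i=introns.gff coverage_unstranded=coverage.nonzero.bedgraph
```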
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This way it allows to easily collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, type = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI NRR rl=&lt;reference_list&gt;<br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">identical</font></td><br />
<td>identical (if the CDSs of several transcripts are identical, Extractor only uses one transcript. This parameter allows to return a table that lists in the first column the used transcript and in the second column the discarded transcript. If no transcript is discarded, the list is empty., default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true*, transcripts with a premature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;<br />
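<br />
The user-specific genetic code mentioned in the description of '''Extractor''' is a tab-delimited file with the amino acid in the first column and a list of triplets in the second. The following is a minimal sketch that writes the standard genetic code in this layout. Two details are assumptions that are not stated on this page and should be checked before use: the triplets within one list are separated by commas, and the stop codons are assigned to the symbol *. The file names are hypothetical.<br />
<br />
```shell
#!/bin/sh
# Standard genetic code, one amino acid per line:
# column 1 = one-letter code, column 2 = list of triplets.
# ASSUMPTIONS: comma as separator inside the list, '*' for stop codons.
printf '%s\t%s\n' \
    A GCT,GCC,GCA,GCG \
    R CGT,CGC,CGA,CGG,AGA,AGG \
    N AAT,AAC \
    D GAT,GAC \
    C TGT,TGC \
    Q CAA,CAG \
    E GAA,GAG \
    G GGT,GGC,GGA,GGG \
    H CAT,CAC \
    I ATT,ATC,ATA \
    L TTA,TTG,CTT,CTC,CTA,CTG \
    K AAA,AAG \
    M ATG \
    F TTT,TTC \
    P CCT,CCC,CCA,CCG \
    S TCT,TCC,TCA,TCG,AGT,AGC \
    T ACT,ACC,ACA,ACG \
    W TGG \
    Y TAT,TAC \
    V GTT,GTC,GTA,GTG \
    '*' TAA,TAG,TGA \
    > genetic_code.tabular

# Sanity check: the table must cover all 64 distinct triplets.
n=$(cut -f2 genetic_code.tabular | tr ',' '\n' | sort -u | wc -l)
echo "distinct triplets: $n"

# Hypothetical call (annotation.gff and genome.fa are placeholders):
# java -jar GeMoMa-1.9.jar CLI Extractor a=annotation.gff g=genome.fa gc=genetic_code.tabular
```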
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
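<br />
Sorting the search results by the first column can also be done with standard tools before calling GeMoMa. The following is a minimal sketch, assuming tab-separated search results whose first column holds the query identifier, as in BLAST tabular output. The file names and the three truncated example rows are made up.<br />
<br />
```shell
#!/bin/sh
# Made-up, truncated rows: query ID, target contig, percent identity.
printf '%s\t%s\t%s\n' \
    GENE2_0 contig5 81.2 \
    GENE1_0 contig1 93.5 \
    GENE2_0 contig9 64.0 \
    > search.tabular

# Sort by the first column only, so that all hits of one query are adjacent.
# LC_ALL=C gives a byte-wise order that does not depend on the locale.
LC_ALL=C sort -t "$(printf '\t')" -k1,1 search.tabular > search.sorted.tabular

# Hypothetical call (all file names are placeholders):
# java -jar GeMoMa-1.9.jar CLI GeMoMa s=search.sorted.tabular t=target.fa c=cds-parts.fasta a=assignment.tabular
```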
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rf</font></td><br />
<td>reduction factor (Factor for reducing the allowed intron length when searching for missing marginal exons, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output (criterion to determine the number of predictions per reference transcript, range={STATIC, DYNAMIC}, default = STATIC)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;STATIC&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;DYNAMIC&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>factor (a prediction is used if: score >= factor*Math.max(0,bestScore), valid range = [0.0, 1.0], default = 0.8)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;<br />
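<br />
The DYNAMIC output criterion from the parameter table (score >= factor*Math.max(0,bestScore)) can be sketched as follows. This is a minimal illustration of the stated rule, not the GeMoMa implementation. The function name and the example scores are hypothetical.<br />
<br />
```python
def select_dynamic(scores, factor=0.8):
    """Keep scores that satisfy score >= factor * max(0, best_score)."""
    if not scores:
        return []
    best_score = max(scores)
    # A negative best score is clamped to 0, as in the documented rule.
    threshold = factor * max(0, best_score)
    return [score for score in scores if score >= threshold]


# Hypothetical scores of four candidate predictions for one reference transcript.
print(select_dynamic([100, 85, 79, 40]))
```

With the default factor of 0.8 and a best score of 100, only candidates scoring at least 80 are kept. If the best score is negative, the threshold becomes 0 and no candidate passes.<br />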
<br />
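The list file for the parameter ''selected'' could be written as in the following sketch. It assumes a tab-delimited layout with the transcript ID in column 1 and, optionally, chromosome, strand, start and end in columns 2 to 5, as described in the parameter table. The strand symbols, the coordinate convention, the transcript IDs and the function name are assumptions for illustration only.<br />
<br />
```python
def selected_line(transcript_id, region=None):
    """Return one tab-delimited line for the 'selected' list file."""
    fields = [transcript_id]
    if region is not None:
        # Optional target region: chromosome, strand, start, end (columns 2 to 5).
        chromosome, strand, start, end = region
        fields += [chromosome, strand, str(start), str(end)]
    return "\t".join(fields)


# Hypothetical transcript IDs; the second entry restricts the target region.
lines = [
    selected_line("transcriptA.1"),
    selected_line("transcriptB.1", ("Chr1", "+", 5000, 9000)),
]
print("\n".join(lines))
```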
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
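<br />
The third step can be pictured with the following sketch. It is not the GAF implementation: it assumes that predictions overlap if their intervals on the same contig and strand intersect, and that the sorting attributes (default sumWeight, score, aa) are ordered from high to low. The dictionary keys contig, strand, start, end and id are hypothetical names chosen for this example.<br />
<br />
```python
def cluster_and_rank(predictions, sort_keys=("sumWeight", "score", "aa")):
    """Group overlapping predictions and sort each group, best first (assumed order)."""
    ordered = sorted(predictions, key=lambda p: (p["contig"], p["strand"], p["start"]))
    clusters = []
    current_key, current_end = None, None
    for p in ordered:
        key = (p["contig"], p["strand"])
        if clusters and key == current_key and p["start"] <= current_end:
            # Overlaps the running cluster: add it and extend the cluster end.
            clusters[-1].append(p)
            current_end = max(current_end, p["end"])
        else:
            clusters.append([p])
            current_key, current_end = key, p["end"]
    for cluster in clusters:
        # Assumption: larger attribute values rank first.
        cluster.sort(key=lambda p: tuple(p[k] for k in sort_keys), reverse=True)
    return clusters


# Hypothetical predictions; a and b overlap, c lies elsewhere on the contig.
example = [
    {"id": "a", "contig": "chr1", "strand": "+", "start": 100, "end": 500,
     "sumWeight": 1.0, "score": 300, "aa": 100},
    {"id": "b", "contig": "chr1", "strand": "+", "start": 400, "end": 900,
     "sumWeight": 2.0, "score": 250, "aa": 120},
    {"id": "c", "contig": "chr1", "strand": "+", "start": 2000, "end": 2500,
     "sumWeight": 1.0, "score": 200, "aa": 90},
]
ranked = cluster_and_rank(example)
print([[p["id"] for p in cluster] for cluster in ranked])
```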
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotations. If you would like to do so, we recommend running '''AnnotationEvidence''' on each of these input files to add attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">k</font></td><br />
<td>kmeans (whether kmeans should be performed for each input file and clusters with large mean distance to the origin will be discarded, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>minimal number of predictions (only gene sets with at least this number of predictions will be used for clustering, valid range = [0, 100000000], default = 1000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cluster (the number of clusters to be used for kmeans, valid range = [2, 100], default = 2)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>good cluster (the number of good clusters, good clusters are those with small mean, all members of a good cluster are further used, valid range = [1, 99], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">trend</font></td><br />
<td>trend (whether a local component should be used for the cluster attribute (might be helpful for regions with different conservation (e.g. introgressions in chromosomes)), range={GLOBAL, LOCAL}, default = GLOBAL)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;GLOBAL&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;LOCAL&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">margin</font></td><br />
<td>margin (the number of bp upstream and downstream of a prediction used to identify neighboring predictions for the statistics, valid range = [0, 100000000], default = 1000000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>quantile (the quantile used for the local trend, valid range = [0.0, 1.0], default = 0.2)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score,aa)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intermediate result (a switch to decide whether an intermediate result of filtered predictions that are not combined to genes should be returned, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>length difference (maximal percentage of length difference between the representative transcript and an alternative transcript, alternative transcripts with a higher percentage are discarded, valid range = [0.0, 10000.0], OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">cbf</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI GAF g=&lt;gene_annotation_file&gt;<br />
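<br />
The logic of the default filter, start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), can be mirrored in a few lines. This sketch only re-expresses the documented default expression on a dictionary of GFF attributes. It is not the filter parser used by GAF, and the function name and example values are hypothetical.<br />
<br />
```python
import math


def passes_default_filter(attrs):
    """Evaluate the documented default GAF filter on a dict of attributes."""
    # A missing score is treated as NaN, as for external annotations.
    score = attrs.get("score", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    return complete and (math.isnan(score) or score / attrs["aa"] >= 0.75)


# Hypothetical attribute sets: a good GeMoMa prediction, a weak one,
# and an external annotation without a score.
print(passes_default_filter({"start": "M", "stop": "*", "score": 400.0, "aa": 500}))
print(passes_default_filter({"start": "M", "stop": "*", "score": 300.0, "aa": 500}))
print(passes_default_filter({"start": "M", "stop": "*"}))
```

The first call has a relative score of 0.8 and passes, the second has 0.6 and fails, and the third passes because predictions without a score are accepted.<br />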
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e., too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ass</font></td><br />
<td>additional source suffix (a suffix for source values of UTR features, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">csp</font></td><br />
<td>contig search pattern (search string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">crp</font></td><br />
<td>contig replace pattern (replace string, i.e., a regular expression for search-and-replace parts of the contig/scaffold/chromosome names, the modified string is used as infix for the gene name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;<br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;<br />
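<br />
If RNA-seq evidence is available, the intron and coverage parameters from the table above can be combined in one call. The following is a minimal sketch, not a definitive recipe: all file names and the output directory are hypothetical placeholders, and the command is only printed unless the jar file is found in the working directory.<br />

```shell
#!/bin/sh
# Sketch only: all file names below are hypothetical placeholders.
JAR="GeMoMa-1.9.jar"

# Parameter names are taken from the table above. c=UNSTRANDED selects the
# coverage mode that expects the coverage_unstranded file; r=1 is the default.
ARGS="CLI AnnotationEvidence a=annotation.gff g=genome.fa i=introns.gff r=1 c=UNSTRANDED coverage_unstranded=coverage.bedgraph outdir=evidence_out"

# Print the command for inspection; run it only if the jar is available.
echo "java -jar $JAR $ARGS"
if [ -f "$JAR" ]; then
  # $ARGS is intentionally unquoted so that it splits into separate arguments.
  java -jar "$JAR" $ARGS
fi
```

For stranded data, <code>c=STRANDED</code> together with <code>coverage_forward</code> and <code>coverage_reverse</code> would take the place of the unstranded pair.<br />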
<br />
<br />
=== Attribute2Table ===<br />
<br />
This tool returns a table of the best attribute value for each prediction in the final annotation.<br />
<br />
''Attribute2Table'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI Attribute2Table<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>final gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>raw gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>add (add the prefix to the gene ID, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the attribute to be checked, default = iAA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI Attribute2Table f=&lt;final_gene_annotation_file&gt; r=&lt;raw_gene_annotation_file&gt;<br />
<br />
<br />
=== Synteny checker ===<br />
<br />
This tool can be used to determine syntenic regions between the target organism and a reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files (=reference organisms), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (the assignment file of this reference organism, which combines parts of the CDS to transcripts, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== AddAttribute ===<br />
<br />
This tool adds an additional attribute to specific features of an annotation.<br />
<br />
These additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose between binary attributes (true or false) and attributes with values taken from a given tab-delimited table.<br />
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;<br />
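<br />
As an illustration of the tab-delimited table, the following sketch writes a two-column file and builds a matching call. The file names, transcript IDs, values and the attribute name <code>expr</code> are hypothetical. It also assumes that columns are counted from 0, which the valid range [0, 2147483647] suggests but this page does not state explicitly.<br />

```shell
#!/bin/sh
# Sketch only: file names, IDs and attribute values are hypothetical.
JAR="GeMoMa-1.9.jar"
TABLE="expression.tsv"

# Tab-delimited table: transcript ID in the first column, value in the second.
printf 'transcript_1\t12.5\n' >  "$TABLE"
printf 'transcript_2\t0.0\n'  >> "$TABLE"

# Assumption: i and ac are 0-based column indices.
ARGS="CLI AddAttribute a=annotation.gff f=mRNA attribute=expr t=$TABLE i=0 type=VALUES ac=1"

# Print the command for inspection; run it only if the jar is available.
echo "java -jar $JAR $ARGS"
if [ -f "$JAR" ]; then
  java -jar "$JAR" $ARGS
fi
```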
<br />
<br />
=== GAFComparison ===<br />
<br />
This tool compares results from GAF based on the attributes ref-gene and alternative.<br />
Hence, you can compare the annotation of different genomes or the effect of different parameters on the annotation of one genome.<br />
<br />
''GAFComparison'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI GAFComparison<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GAF annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (a simple name for the organism)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GAF), type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>differences (a switch to decide whether only genes with differences should be returned, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI GAFComparison n=&lt;name&gt; g=&lt;gene_annotation_file&gt;<br />
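<br />
Since <code>n</code> and <code>g</code> can be used multiple times, a comparison of two annotations repeats the pair once per annotation. The sketch below uses hypothetical organism names and file names and only prints the command unless the jar file is present.<br />

```shell
#!/bin/sh
# Sketch only: organism names and file names are hypothetical.
JAR="GeMoMa-1.9.jar"

# One n/g pair per GAF annotation that should be compared.
SET_A="n=speciesA g=speciesA_final.gff"
SET_B="n=speciesB g=speciesB_final.gff"

ARGS="CLI GAFComparison $SET_A $SET_B d=true outdir=comparison"

# Print the command for inspection; run it only if the jar is available.
echo "java -jar $JAR $ARGS"
if [ -f "$JAR" ]; then
  java -jar "$JAR" $ARGS
fi
```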
<br />
<br />
=== Analyzer ===<br />
<br />
This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.<br />
<br />
True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that a predicted transcript matches the true transcript with an F1 value equal to the absolute value of this entry, but that another true transcript matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are listed as well. The table contains the attributes of the true and the predicted annotation plus some additional columns that make it easy to filter interesting examples and to compute statistics.<br />
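<br />
To illustrate the nucleotide F1 measure, the sketch below computes it for one pair of transcripts. It assumes the usual definition of F1 as the harmonic mean of nucleotide precision and recall; the exact computation inside Analyzer is not described on this page, and the numbers are hypothetical.<br />

```shell
#!/bin/sh
# Hypothetical numbers for one pair of true and predicted transcript.
TRUE_NT=300      # nucleotides covered by the true transcript
PRED_NT=240      # nucleotides covered by the predicted transcript
OVERLAP_NT=210   # nucleotides covered by both

# precision = overlap/pred, recall = overlap/true,
# F1 = 2*precision*recall/(precision+recall) = 2*overlap/(true+pred)
F1=$(LC_ALL=C awk -v t="$TRUE_NT" -v p="$PRED_NT" -v o="$OVERLAP_NT" \
  'BEGIN { printf "%.4f", 2 * o / (t + p) }')

echo "nucleotide F1 = $F1"
```

With these numbers, precision is 210/240 and recall is 210/300, so F1 is 420/540.<br />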
<br />
The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.<br />
<br />
''Analyzer'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI Analyzer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>truth (the true annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (can be used to distinguish different predictions, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted annotation (GFF/GTF file containing the predicted annotation, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>CDS (if true CDS features are used otherwise exon features, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>only introns (if true only intron borders (=splice sites) are evaluated, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ca</font></td><br />
<td>common attributes (Only gff attributes of mRNAs are included in the result table, that can be found in the given portion of all mRNAs. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table or compact table., valid range = [0.0, 1.0], default = 0.5)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter for deciding which transcript from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;<br />
<br />
<br />
=== BUSCORecomputer ===<br />
<br />
This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with '''Extractor'''. These proteins can then be used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the '''Extractor''' can be used as input for this tool. Alternatively, a table can be generated from the annotation file and used instead of the assignment file.<br />
<br />
''BUSCORecomputer'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI BUSCORecomputer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>subgenome (regex for contigs/chromosomes of this subgenome)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI BUSCORecomputer b=&lt;BUSCO&gt; i=&lt;IDs&gt;<br />
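<br />
If no assignment file is at hand, the ID table might be derived from the annotation roughly as sketched below. The sketch assumes a GFF3 file in which transcripts are <code>mRNA</code> features carrying <code>ID</code> and <code>Parent</code> attributes, and it assumes that the protein IDs in the BUSCO table match the mRNA IDs. The tiny input file and all file names are hypothetical.<br />

```shell
#!/bin/sh
# Sketch only: a tiny hypothetical GFF3 file stands in for a real annotation.
GFF="example.gff3"
IDS="ids.tabular"

printf 'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=gene1\n' > "$GFF"
printf 'chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=gene1.t1;Parent=gene1\n' >> "$GFF"
printf 'chr1\tsrc\tmRNA\t1\t600\t.\t+\t.\tID=gene1.t2;Parent=gene1\n' >> "$GFF"

# For each mRNA feature, write "gene ID <tab> transcript ID".
awk -F '\t' '$3 == "mRNA" {
  id = ""; parent = ""
  n = split($9, attr, ";")
  for (k = 1; k <= n; k++) {
    if (attr[k] ~ /^ID=/)     id = substr(attr[k], 4)
    if (attr[k] ~ /^Parent=/) parent = substr(attr[k], 8)
  }
  if (id != "" && parent != "") print parent "\t" id
}' "$GFF" > "$IDS"

cat "$IDS"
```

The resulting file has the gene ID in the first column and the transcript ID in the second, as required for parameter <code>i</code>.<br />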
<br />
<br />
=== GFFAttributes ===<br />
<br />
Annotations that are built with '''GeMoMaPipeline''' or augmented with '''AnnotationEvidence''' have many attributes that might be interesting for the user. This module creates a simple table that can easily be parsed and used for visualization of statistics. However, the module can also be used for annotations that were not created or modified with GeMoMa modules.<br />
<br />
''GFFAttributes'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI GFFAttributes<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (GFF file containing the gene annotations, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (the feature which is used to parse the attributes, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing (the value used for missing attributes of a feature, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI GFFAttributes a=&lt;annotation&gt;<br />
<br />
<br />
=== Transcribed Cluster ===<br />
<br />
'''What it does'''<br />
<br />
This tool computes ... .<br />
<br />
<br />
<br />
''Transcribed Cluster'' may be called with<br />
<br />
java -jar GeMoMa-1.9.jar CLI TranscribedCluster<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>minimal gap (the minimal gap between two transcribed clusters, otherwise these will be merged, valid range = [0, 2147483647], default = 50)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.9.jar CLI TranscribedCluster g=&lt;genome&gt; coverage_unstranded=&lt;coverage_unstranded&gt;</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1162GeMoMa2022-07-16T20:02:57Z<p>Keilwagen: /* In a nutshell */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
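<br />
For several references, the repeated block of <code>s=own</code>, <code>i</code>, <code>a</code> and <code>g</code> described above could look like the following sketch. The thread count, directory, reference IDs and all file names are hypothetical, and the command is only printed unless the jar file is found in the working directory.<br />

```shell
#!/bin/sh
# Sketch only: thread count, IDs and all file names are hypothetical.
JAR="GeMoMa-1.9.jar"

# Settings shared by the whole run, including the recommended parameters.
COMMON="threads=8 outdir=gemoma_out GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=target.fa"

# One block of s, i, a and g per reference species.
REF1="s=own i=ref1 a=ref1.gff g=ref1.fa"
REF2="s=own i=ref2 a=ref2.gff g=ref2.fa"

ARGS="CLI GeMoMaPipeline $COMMON $REF1 $REF2"

# Print the command for inspection; run it only if the jar is available.
echo "java -jar $JAR $ARGS"
if [ -f "$JAR" ]; then
  java -jar "$JAR" $ARGS
fi
```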
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the predicted protein<br />
|-<br />
| raa || reference amino acids || GeMoMa || || mRNA || the number of amino acids in the reference protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| maxScore || maximal GeMoMa score || GeMoMa || || mRNA || maximal score which will be obtained by a prediction that is identical to the reference transcript<br />
|-<br />
| bestScore || best GeMoMa score || GeMoMa || || mRNA || score of the best GeMoMa prediction of this transcript and this target organism<br />
|-<br />
| maxGap || maximal gap || GeMoMa || || mRNA || length of the longest gap in the alignment between predicted and reference protein<br />
|-<br />
| lpm || longest positive match || GeMoMa || || mRNA || length of the longest positive scoring match in the alignment between predicted and reference protein, i.e., each pair of amino acids in the match has a positive score<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || protein alignment || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || protein alignment || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
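<br />
If you want to filter predictions by hand, the attributes can be read from the ninth GFF column. The following sketch uses two made-up mRNA records (not real GeMoMa output) and assumes that tie is reported as a fraction between 0 and 1; it keeps only transcripts whose introns are all supported by RNA-seq evidence. If you changed the parameter "tag", replace <code>mRNA</code> accordingly.<br />

```shell
#!/bin/sh
set -eu

gff="$(mktemp)"
# Two made-up, tab-separated mRNA records for illustration only.
printf 'chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;aa=250;raa=260;score=1100;tie=1;iAA=0.92\n' > "$gff"
printf 'chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t-\t.\tID=t2;aa=120;raa=300;score=310;tie=0.5;iAA=0.41\n' >> "$gff"

# Print ID, tie and iAA of all mRNA features with tie equal to 1.
# Missing or non-numeric values (e.g. NA for single coding exon genes) count as 0 here.
kept="$(awk -F'\t' '
  function attr(key,   i, n, kv, parts) {
    n = split($9, parts, ";")
    for (i = 1; i <= n; i++) {
      split(parts[i], kv, "=")
      if (kv[1] == key) return kv[2]
    }
    return "NA"
  }
  $3 == "mRNA" && attr("tie") + 0 == 1 { print attr("ID"), attr("tie"), attr("iAA") }
' "$gff")"

echo "$kept"
rm -f "$gff"
```

For anything beyond a quick look, GAF offers the same kind of filtering without custom scripts.<br />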
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. Ideally, the fasta comments should contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) In addition, check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
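:<br />
:The name check described above can be done with standard command line tools. The following sketch builds two toy input files (hypothetical content) and lists sequence names that occur in the annotation but not in the genome; for your own data, point <code>fasta</code> and <code>gff</code> to your files instead.<br />

```shell
#!/bin/sh
set -eu
LC_ALL=C; export LC_ALL   # same sort order for sort and comm

workdir="$(mktemp -d)"
fasta="$workdir/genome.fa"
gff="$workdir/annotation.gff"
# Toy inputs for illustration only.
printf '>chr1 some description\nACGT\n>chr2\nGGCC\n' > "$fasta"
printf '##gff-version 3\nchr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1\nChr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2\n' > "$gff"

# Sequence names in the genome: first word of each FASTA header.
grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*$//' | sort -u > "$workdir/fasta.names"
# Sequence names in the annotation: first column of all non-comment lines.
grep -v '^#' "$gff" | cut -f1 | sort -u > "$workdir/gff.names"

# Names that are used in the annotation but are missing from the genome.
missing="$(comm -13 "$workdir/fasta.names" "$workdir/gff.names")"
echo "in annotation but not in genome: ${missing:-none}"
rm -rf "$workdir"
```

:In the toy data, the annotation refers to <code>Chr3</code>, which differs from every name in the genome file, so genes on it could not be extracted.<br />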
<br />
; How can I force GeMoMa to make more predictions?<br />
:Several parameters affect the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses it to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts may overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:The Extractor removes transcripts for several reasons. In at least two cases, you can try to retain more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too strongly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:GeMoMa can in principle predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you provide a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which returns the same predictions but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (before version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which returns the same predictions but fewer statistics.<br />
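:<br />
:As a sketch of where the VM options belong on the command line, the following script assembles a call and only prints it. The heap sizes repeat the examples above; the remaining parameter values are hypothetical placeholders.<br />

```shell
#!/bin/sh
set -eu

# Heap sizes from the text above; choose values that fit your machine.
initial_heap="5G"
max_heap="50G"

# JVM options must appear before -jar; everything after the jar file is passed to GeMoMa.
# The GeMoMaPipeline arguments are hypothetical placeholders.
memory_cmd="java -Xms$initial_heap -Xmx$max_heap -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=4 outdir=gemoma_out t=target_genome.fa i=ref1 a=ref1_annotation.gff g=ref1_genome.fa"

echo "$memory_cmd"
```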
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
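:<br />
:Before passing your own table to GeMoMa, you can check that it follows the two-column layout described above. The following sketch is not part of GeMoMa: the function <code>check_genetic_code</code> and the three example rows are illustrative assumptions, and a real table must list every amino acid and codon (cf. the linked default file).<br />

```shell
#!/bin/sh
set -eu

# Succeeds if every line is <one letter><TAB><comma-separated ACGT triplets>.
check_genetic_code() {
  awk -F'\t' '
    NF != 2 || length($1) != 1 || $2 == "" {
      print "line " NR ": expected <letter><TAB><triplets>"; bad = 1; next
    }
    {
      n = split($2, codons, ",")
      for (i = 1; i <= n; i++)
        if (codons[i] !~ /^[ACGT][ACGT][ACGT]$/) {
          print "line " NR ": invalid triplet " codons[i]; bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

dna_table="$(mktemp)"
rna_table="$(mktemp)"
# Three illustrative rows only, written with the DNA alphabet.
printf 'M\tATG\nW\tTGG\nF\tTTT,TTC\n' > "$dna_table"
# The first row written with U, as in a table that uses the RNA alphabet.
printf 'M\tAUG\n' > "$rna_table"

if check_genetic_code "$dna_table"; then dna_status=ok; else dna_status=invalid; fi
if check_genetic_code "$rna_table" > /dev/null; then rna_status=ok; else rna_status=invalid; fi
echo "DNA table: $dna_status, RNA table: $rna_status"
rm -f "$dna_table" "$rna_table"
```

:The second table is rejected because it uses U instead of T, which is the typical problem when copying a code from a table written in the RNA alphabet.<br />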
<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was used by default as the search algorithm (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast.<br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.<br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.<br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
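:<br />
:A minimal sketch of such a restart, assuming the hypothetical arguments below were those of the failed run: the command line stays the same and only <code>restart=true</code> is appended. The script prints the command instead of executing it.<br />

```shell
#!/bin/sh
set -eu

# Hypothetical arguments of the run that failed.
previous_args="threads=4 outdir=gemoma_out t=target_genome.fa i=ref1 a=ref1_annotation.gff g=ref1_genome.fa"

# Same command line as before, with restart=true added at the end.
restart_cmd="java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline $previous_args restart=true"

echo "$restart_cmd"
```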
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.9] (15.07.2022)<br />
* improved handling of warnings in Galaxy<br />
* new modules: Attribute2Table, GFFAttributes and TranscribedCluster<br />
* AnnotationFinalizer:<br />
** new parameters: transfer features, additional source suffix<br />
** changed renaming using regex<br />
** adding oldID if renaming IDs <br />
** do not re-sort transcripts of a gene<br />
* BUSCORecomputer:<br />
** add FileExistsValidator<br />
** extend to polyploid organisms<br />
** new result: BUSCO parsed full table<br />
** bugfix last duplicated <br />
* CombineCoverageFiles:<br />
** make it more memory efficient <br />
* CombineIntronFiles:<br />
** make it more memory efficient<br />
* GAF:<br />
** new parameters allowing gene set specific kmeans using global or local detrending<br />
** new parameter "intermediate result" allowing to retrieve intermediate results<br />
** new parameter "length difference" allowing to discard predictions that deviate too much from the representative transcript at a locus<br />
* GeMoMa:<br />
** new parameter options for the amount of predictions per reference transcript: STATIC(=default) or DYNAMIC<br />
** delete unnecessary parameter "region threshold"<br />
** improved verbose output<br />
** improve fasta header parsing<br />
* GeMoMaPipeline:<br />
** removed long fasta comment parameter <br />
** improved behaviour if errors occur if restart=true <br />
** shifted prefix from GAF to GeMoMa module where possible <br />
** bug fix: SyntenyChecker if assignment is not used <br />
* Extractor:<br />
** new category for discarded annotation: non-linear transcripts <br />
** bugfix longest intron=0<br />
* ExtractRNAseqEvidence:<br />
** improved protocol if errors with the repositioning occur <br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.8.zip GeMoMa 1.8] (07.10.2021)<br />
* extended manual<br />
* new module Analyzer: for benchmarking<br />
* new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates<br />
* AnnotationEvidence<br />
** bugfix: gene borders if only one gene is on the contig<br />
** discard genes that do not code for a protein<br />
* AnnotationFinalizer:<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** implemented check of regular expression for prefix<br />
** bugfix if score==NA<br />
* CheckIntrons: introns are not optional<br />
* ERE: <br />
** new parameters for handling spurious split reads<br />
** new parameter for repositioning that is needed for genomes with huge chromosomes due to limitations of BAM/SAM<br />
** bugfix last intron<br />
** improved protocol<br />
* Extractor:<br />
** new parameter for long fasta comment<br />
** new parameter identical<br />
** more verbose output in case of problems<br />
** finding errors if CDS parts have different strands<br />
** changed optional intron output<br />
** bugfix for exons with DNA but no AA<br />
* GAF:<br />
** new parameter allowing to output the transcript names of redundant predictions as GFF attribute<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** bugfix: missing entries for alternative<br />
** changed default value for atf and sorting<br />
** implemented check of regular expression for prefix<br />
** changed handling of transcript within clusters<br />
** changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features<br />
* GeMoMa: <br />
** new parameter for replacing unknown AA by X<br />
** handling missing GeMoMa.ini.xml<br />
** additional GFF attributes: lpm, maxScore, maxGap, bestScore<br />
** improved error handling and protocol<br />
** changed heuristic for identifying multiple transcripts predictions on one contig/chromosome<br />
* GeMoMaPipeline:<br />
** new parameter "check synteny" allowing to run SyntenyChecker<br />
** implemented check of regular expression for prefix<br />
** removed unnecessary parameter<br />
** improved handling of exceptions<br />
** bugfix for stranded RNA-seq evidence<br />
** allow re-start only for same version<br />
** improved protocol if threads==1<br />
* SyntenyChecker: implemented check of regular expression for prefix<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7,1.zip GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1161GeMoMa2022-07-16T20:02:24Z<p>Keilwagen: /* Version history */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated by '''&lt;'''foo'''&gt;''', need to be set. You can specify:<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* the reference ID (optional), annotation, and genome. If you have several references, repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file, so that the combination step ('''GAF''') can be rerun easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
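<br />
The call above with several references can be assembled programmatically. The following Python snippet is only a minimal sketch: the helper name <code>build_pipeline_command</code> and the dictionary keys <code>id</code>, <code>annotation</code> and <code>genome</code> are assumptions made for illustration, while the parameter tags are the ones described above.<br />

```python
from typing import Dict, List


def build_pipeline_command(
    jar: str,
    target_genome: str,
    references: List[Dict[str, str]],
    threads: int = 1,
    outdir: str = "results",
) -> List[str]:
    """Assemble a GeMoMaPipeline call as an argument list (hypothetical helper)."""
    cmd = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        "GeMoMa.Score=ReAlign",      # recompute the mmseqs score
        "AnnotationFinalizer.r=NO",  # do not rename genes and transcripts
        "o=true",                    # keep individual predictions per reference
        f"t={target_genome}",
    ]
    # Repeat s=own and the tags i, a, g once per reference organism.
    for ref in references:
        cmd.append("s=own")
        if ref.get("id"):            # the reference ID is optional
            cmd.append(f"i={ref['id']}")
        cmd.append(f"a={ref['annotation']}")
        cmd.append(f"g={ref['genome']}")
    return cmd


if __name__ == "__main__":
    # File names below are placeholders, not files shipped with GeMoMa.
    refs = [
        {"id": "ref1", "annotation": "ref1.gff", "genome": "ref1.fa"},
        {"id": "ref2", "annotation": "ref2.gff", "genome": "ref2.fa"},
    ]
    print(" ".join(build_pipeline_command("GeMoMa-1.8.jar", "target.fa", refs, threads=4)))
```

The resulting list can be passed to <code>subprocess.run</code> if the call should be started from Python; printing it first makes it easy to check the parameters.<br />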
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the predicted protein<br />
|-<br />
| raa || reference amino acids || GeMoMa || || mRNA || the number of amino acids in the reference protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| maxScore || maximal GeMoMa score || GeMoMa || || mRNA || maximal score which will be obtained by a prediction that is identical to the reference transcript<br />
|-<br />
| bestScore || best GeMoMa score || GeMoMa || || mRNA || score of the best GeMoMa prediction of this transcript and this target organism<br />
|-<br />
| maxGap || maximal gap || GeMoMa || || mRNA || length of the longest gap in the alignment between predicted and reference protein<br />
|-<br />
| lpm || longest positive match || GeMoMa || || mRNA || length of the longest positive scoring match in the alignment between predicted and reference protein, i.e., each pair of amino acids in the match has a positive score<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || protein alignment || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || protein alignment || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
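<br />
To work with these attributes outside of GeMoMa, the ninth GFF column can be split into key/value pairs. The following Python snippet is a minimal sketch, not part of GeMoMa: the function names and the example line are made up for illustration, and it assumes the usual <code>key=value;key=value</code> layout of GFF3 attributes. Values reported as NA are returned as <code>None</code>.<br />

```python
from typing import Dict, Optional


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the attribute column of a GFF line into a dictionary."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def numeric(attrs: Dict[str, str], key: str) -> Optional[float]:
    """Return an attribute as float, or None if it is missing, NA or not a number."""
    value = attrs.get(key)
    if value is None or value == "NA":
        return None
    try:
        return float(value)
    except ValueError:
        return None


def transcript_attributes(gff_line: str, tag: str = "mRNA") -> Optional[Dict[str, str]]:
    """Return the attributes of a transcript feature, or None for other lines.

    `tag` is the feature name of a transcript prediction
    ("mRNA" by default, "prediction" before version 1.7).
    """
    if gff_line.startswith("#"):
        return None
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] != tag:
        return None
    return parse_attributes(columns[8])


if __name__ == "__main__":
    # Hypothetical line; all values are invented for this example.
    line = "chr1\tGAF\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1;aa=120;raa=125;tie=NA;evidence=2"
    attrs = transcript_attributes(line)
    print(numeric(attrs, "aa"), numeric(attrs, "tie"), numeric(attrs, "evidence"))
```

Returning <code>None</code> for NA keeps single coding exon genes (which have no introns and hence no tie) distinguishable from transcripts with a numeric value.<br />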
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
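:The name check described above can be done with a few lines of Python. This is only a sketch with hypothetical function names; it assumes that the sequence name is the first whitespace-separated token of each fasta comment and the first column of each annotation line.<br />

```python
from typing import Iterable, Set


def fasta_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the first token of every fasta comment line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_seqids(gff_lines: Iterable[str]) -> Set[str]:
    """Collect the contig/chromosome names used in the first GFF column."""
    ids = set()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        ids.add(line.split("\t", 1)[0].strip())
    return ids


def missing_sequences(fasta_lines: Iterable[str], gff_lines: Iterable[str]) -> Set[str]:
    """Names that occur in the annotation but not in the genome file."""
    return gff_seqids(gff_lines) - fasta_names(fasta_lines)


if __name__ == "__main__":
    fasta = [">chr1 some description\n", "ACGT\n", ">chr2\n", "GGCC\n"]
    gff = [
        "##gff-version 3\n",
        "chr1\t.\tCDS\t1\t3\t.\t+\t0\tID=cds1\n",
        "Chr3\t.\tCDS\t1\t3\t.\t+\t0\tID=cds2\n",
    ]
    print(sorted(missing_sequences(fasta, gff)))
```

:For real files, pass open file handles instead of lists (for compressed files, for instance, handles opened with <code>gzip.open(path, "rt")</code>). A non-empty result points to names that differ between annotation and genome.<br />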
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions lets GeMoMa output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are computed internally. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
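:To illustrate what inferring phases means, the following Python sketch derives GFF phases from the lengths of the CDS parts. It is not the implementation used by the Extractor; the function name is hypothetical, and it assumes a complete transcript whose CDS parts are given in transcription order and whose first part starts with a complete codon.<br />

```python
from typing import List


def infer_phases(cds_lengths: List[int]) -> List[int]:
    """Infer the GFF phase of each CDS part from the part lengths.

    The phase is the number of bases at the start of a CDS part that
    still belong to the codon begun in the previous part.
    """
    phases = []
    phase = 0  # assumption: the first part starts with a complete codon
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon of this part
        # determine how many bases the next part has to contribute.
        phase = (3 - (length - phase) % 3) % 3
    return phases


if __name__ == "__main__":
    # Three CDS parts with 10, 5 and 6 bases (21 bases = 7 codons).
    print(infer_phases([10, 5, 6]))
```

:For transcripts on the reverse strand, the CDS parts have to be sorted by decreasing genomic position before calling the function.<br />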
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics.<br />
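:The placement of the VM options matters: they have to appear before <code>-jar</code>, otherwise they are passed to GeMoMa as ordinary arguments. The following Python sketch builds such a call; the helper name <code>java_command</code> and its default values are assumptions for illustration, taken from the example values above.<br />

```python
from typing import List, Sequence


def java_command(
    jar: str,
    module: str,
    parameters: Sequence[str],
    initial_heap: str = "5G",
    max_heap: str = "50G",
) -> List[str]:
    """Build a java call with heap settings (hypothetical helper).

    -Xms and -Xmx are JVM options and therefore precede -jar.
    """
    return [
        "java",
        f"-Xms{initial_heap}",  # initially used RAM
        f"-Xmx{max_heap}",      # maximally used RAM
        "-jar", jar,
        "CLI", module,
        *parameters,
    ]


if __name__ == "__main__":
    # The jar name and the parameter values are placeholders.
    print(" ".join(java_command("GeMoMa-1.9.jar", "GeMoMaPipeline", ["threads=4", "t=target.fa"])))
```

:Choose the heap sizes according to the memory that is actually available on your machine.<br />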
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
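:Before passing a custom table to GeMoMa, the format described above can be checked with a short script. The following Python snippet is a sketch under assumptions: the function name is hypothetical, the file is assumed to contain no header line, and the example lines are an illustrative excerpt rather than the content of the default file.<br />

```python
from typing import Dict, Iterable


def parse_genetic_code(lines: Iterable[str]) -> Dict[str, str]:
    """Read a two-column, tab-delimited genetic code table.

    Column 1: one-letter amino acid code.
    Column 2: comma-separated list of DNA triplets (A, C, G, T only).
    Returns a mapping from triplet to amino acid.
    """
    codon_to_aa = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid = parts[0].strip()
        for triplet in parts[1].split(","):
            codon = triplet.strip().upper()
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {number}: invalid DNA triplet '{codon}'")
            if codon in codon_to_aa:
                raise ValueError(f"line {number}: triplet '{codon}' listed twice")
            codon_to_aa[codon] = amino_acid
    return codon_to_aa


if __name__ == "__main__":
    example = ["M\tATG\n", "W\tTGG\n", "F\tTTT,TTC\n"]
    print(parse_genetic_code(example))
```

:A table written with the RNA alphabet (U instead of T) is rejected by this check, which is the most likely mistake when copying from the NCBI page.<br />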
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can<br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast.<br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.<br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.<br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.9] (15.07.2022)<br />
* improved handling of warnings in Galaxy<br />
* new modules: Attribute2Table, GFFAttributes and TranscribedCluster<br />
* AnnotationFinalizer:<br />
** new parameters: transfer features, additional source suffix<br />
** changed renaming using regex<br />
** adding oldID if renaming IDs <br />
** do not re-sort transcripts of a gene<br />
* BUSCORecomputer:<br />
** add FileExistsValidator<br />
** extend to polyploid organisms<br />
** new result: BUSCO parsed full table<br />
** bugfix last duplicated <br />
* CombineCoverageFiles:<br />
** make it more memory efficient <br />
* CombineIntronFiles:<br />
** make it more memory efficient<br />
* GAF:<br />
** new parameters allowing gene set specific kmeans using global or local detrending<br />
** new parameter "intermediate result" allowing to retrieve intermediate results<br />
** new parameter "length difference" allowing to discard predictions that deviate too much from the representative transcript at a locus<br />
* GeMoMa:<br />
** new parameter options for the amount of predictions per reference transcript: STATIC(=default) or DYNAMIC<br />
** delete unnecessary parameter "region threshold"<br />
** improved verbose output<br />
** improve fasta header parsing<br />
* GeMoMaPipeline:<br />
** removed long fasta comment parameter <br />
** improved behaviour if errors occur if restart=true <br />
** shifted prefix from GAF to GeMoMa module where possible<br />
** bug fix: SyntenyChecker if assignment is not used <br />
* Extractor:<br />
** new category for discarded annotation: non-linear transcripts <br />
** bugfix longest intron=0<br />
* ExtractRNAseqEvidence:<br />
** improved protocol if errors with the repositioning occur <br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.8.zip GeMoMa 1.8] (07.10.2021)<br />
* extended manual<br />
* new module Analyzer: for benchmarking<br />
* new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates<br />
* AnnotationEvidence<br />
** bugfix: gene borders if only one gene is on the contig<br />
** discard genes that do not code for a protein<br />
* AnnotationFinalizer:<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** implemented check of regular expression for prefix<br />
** bugfix if score==NA<br />
* CheckIntrons: introns are not optional<br />
* ERE: <br />
** new parameters for handling spurious split reads<br />
** new parameter for repositioning that is needed for genomes with huge chromosomes due to limitations of BAM/SAM<br />
** bugfix last intron<br />
** improved protocol<br />
* Extractor:<br />
** new parameter for long fasta comment<br />
** new parameter identical<br />
** more verbose output in case of problems<br />
** finding errors if CDS parts have different strands<br />
** changed optional intron output<br />
** bugfix for exons with DNA but no AA<br />
* GAF:<br />
** new parameter allowing to output the transcript names of redundant predictions as GFF attribute<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** bugfix: missing entries for alternative<br />
** changed default value for atf and sorting<br />
** implemented check of regular expression for prefix<br />
** changed handling of transcript within clusters<br />
** changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features<br />
* GeMoMa: <br />
** new parameter for replacing unknown AA by X<br />
** handling missing GeMoMa.ini.xml<br />
** additional GFF attributes: lpm, maxScore, maxGap, bestScore<br />
** improved error handling and protocol<br />
** changed heuristic for identifying multiple transcript predictions on one contig/chromosome<br />
* GeMoMaPipeline:<br />
** new parameter "check synteny" allowing to run SyntenyChecker<br />
** implemented check of regular expression for prefix<br />
** removed unnecessary parameter<br />
** improved handling of exceptions<br />
** bugfix for stranded RNA-seq evidence<br />
** allow re-start only for same version<br />
** improved protocol if threads==1<br />
* SyntenyChecker: implemented check of regular expression for prefix<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7,1.zip GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein IDs are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins were used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1160
GeMoMa-Docs
2021-10-11T12:27:52Z
<p>Keilwagen: /* GeMoMa pipeline */</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br />
If you have any questions or comments, or want to report a bug, please check the [[GeMoMa#Frequently asked questions|FAQs]] and [https://github.com/Jstacs/Jstacs/issues?q=label%3AGeMoMa our GitHub page], or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
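As a rough sketch, a run with a single reference species and without RNA-seq evidence could be assembled as shown here. The sketch assumes that parameters are passed as <code>key=value</code> pairs and uses only parameter names from the table on this page; all file names and the ID <code>ref1</code> are hypothetical placeholders. The script only builds and prints the command, so it can be inspected before anything is executed.<br />

```shell
#!/bin/sh
# Hypothetical input files; replace them with your own data.
target=target.fasta
ref_annotation=reference.gff3
ref_genome=reference.fasta

# Assumption: GeMoMa's CLI accepts parameters as key=value pairs.
# t = target genome, s = species selection, i = reference ID,
# a = reference annotation, g = reference genome, tblastn=false selects mmseqs.
cmd="java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline t=$target s=own i=ref1 a=$ref_annotation g=$ref_genome tblastn=false"

# Print the assembled call instead of running it.
echo "$cmd"
```

To actually start the run, the printed line can be pasted into a shell. With <code>tblastn=false</code>, mmseqs needs to be installed.<br />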
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, type = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.repositioning</font></td><br />
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with a pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.l</font></td><br />
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or as filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.t</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;NO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.t</font></td><br />
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>synteny check (run SyntenyChecker if possible, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
 java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline a=&lt;reference_annotation&gt; g=&lt;reference_genome&gt; t=&lt;target_genome&gt; AnnotationFinalizer.p=&lt;prefix&gt;<br />
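<br />
Module-specific parameters from the table above are passed with their module prefix (for instance GeMoMa.m or GAF.f). The following is a minimal sketch rather than an official recipe: the file names, the thread count, the output directory, and the prefix value are placeholders invented for illustration, and the jar name has to match the installed version. The script only assembles and prints the call, so the quoting of the filter expression can be checked before anything is run.

```shell
#!/bin/sh
# Placeholder inputs (assumptions for illustration only).
jar="${GEMOMA_JAR:-GeMoMa-1.9.jar}"   # adjust to the installed version
target="target.fa"
ref_annotation="reference.gff"
ref_genome="reference.fa"

# The GAF filter contains spaces and quotes, so it is kept in one variable
# and always expanded inside double quotes. This is the documented default filter.
gaf_filter="start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)"

# Collect the call as positional parameters; each word stays one argument.
set -- java -jar "$jar" CLI GeMoMaPipeline \
    threads=4 outdir=results \
    t="$target" a="$ref_annotation" g="$ref_genome" \
    GeMoMa.m=15000 \
    "GAF.f=$gaf_filter" \
    AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE_

# Print one argument per line for inspection.
printf '%s\n' "$@"

# To actually run the pipeline, replace the printf line by:  "$@"
```

Keeping the filter in a single quoted argument matters because the shell would otherwise split it at the spaces and interpret the parentheses.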
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">repositioning</font></td><br />
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
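<br />
The repositioning file described in the table can be written by hand. A minimal sketch, assuming a hypothetical chromosome chr1 that was cut into two parts before mapping, with the second part starting 500000000 bases into the original sequence. The chromosome names, the offsets, and the file name are invented for illustration. The columns are assumed to be tab-separated because the parameter type is tabular, and the exact offset convention should be checked against your own split.

```shell
#!/bin/sh
# Hypothetical split: chr1 was cut into chr1_part1 and chr1_part2 before mapping.
# Columns: split_chr_name, original_chr_name, offset_in_original_chr
repos="repositioning.tabular"
{
    printf '%s\t%s\t%s\n' chr1_part1 chr1 0
    printf '%s\t%s\t%s\n' chr1_part2 chr1 500000000
} > "$repos"

# Sanity check: count the lines that do not have exactly three tab-separated columns.
bad_lines=$(awk -F '\t' 'NF != 3 { n++ } END { print n + 0 }' "$repos")
echo "lines with wrong column count: $bad_lines"
```

The resulting file would then be passed to ERE as repositioning=repositioning.tabular.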
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI ERE m=&lt;mapped_reads_file&gt;<br />
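<br />
Since the parameter m can be used multiple times, several mapped reads files can be handed over in one call. The sketch below shows one way to build such a call in a loop. The BAM file names and the output directory are hypothetical, the choice of FR_FIRST_STRAND assumes a stranded library as described in the table, and the jar name has to match the installed version. The script only prints the assembled arguments.

```shell
#!/bin/sh
jar="${GEMOMA_JAR:-GeMoMa-1.8.jar}"   # adjust to the installed version

# Start with the fixed part of the call.
set -- java -jar "$jar" CLI ERE s=FR_FIRST_STRAND mmq=40 c=true outdir=ere_out

# Append one m=<file> argument per mapped reads file (hypothetical names).
for bam in leaf.bam root.bam flower.bam; do
    set -- "$@" "m=$bam"
done

# Print one argument per line for inspection.
printf '%s\n' "$@"

# To actually run the tool, replace the printf line by:  "$@"
```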
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;<br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;<br />
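<br />
For stranded RNA-seq data the table lists separate forward and reverse coverage files. The following sketch assembles such a call. It assumes that the stranded variant is selected with c=STRANDED, as the parameter table suggests, and that the remaining values are the documented defaults written out explicitly. All file names and the output directory are hypothetical.

```shell
#!/bin/sh
jar="${GEMOMA_JAR:-GeMoMa-1.8.jar}"   # adjust to the installed version

# Hypothetical input names; the bedgraph files must refer to the same genome as the introns.
introns="introns.gff"
cov_fwd="coverage_forward.bedgraph"
cov_rev="coverage_reverse.bedgraph"

# Collect the call as positional parameters; each word stays one argument.
set -- java -jar "$jar" CLI DenoiseIntrons \
    i="$introns" \
    c=STRANDED coverage_forward="$cov_fwd" coverage_reverse="$cov_rev" \
    m=15000 me=0.01 context=10 \
    outdir=denoised

# Print one argument per line for inspection.
printf '%s\n' "$@"

# To actually run the tool, replace the printf line by:  "$@"
```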
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, type = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI NRR rl=&lt;reference_list&gt;<br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins, CDSs, and genomic regions. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">identical</font></td><br />
<td>identical (if the CDS of several transcripts is identical, Extractor uses only one of them. This parameter allows returning a table that lists the used transcript in the first column and the discarded transcript in the second column. If no transcript is discarded, the list is empty., default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true*, transcripts with a pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;<br />
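<br />
The call above can also be assembled programmatically, for instance when Extractor is run for several reference species. The following is only a minimal sketch: the helper <code>build_gemoma_command</code>, the file names and the output directory are hypothetical, and writing BOOLEAN values as <code>true</code>/<code>false</code> is an assumption based on the defaults listed in the table.<br />

```python
import shlex


def build_gemoma_command(jar, module, params):
    """Assemble a GeMoMa CLI call as an argument list of name=value pairs."""
    args = ["java", "-jar", jar, "CLI", module]
    for name, value in params.items():
        # Assumption: booleans are passed in lower case, as in the table defaults.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"{name}={value}")
    return args


command = build_gemoma_command(
    "GeMoMa-1.8.jar",
    "Extractor",
    {
        "a": "reference.gff",       # hypothetical annotation file
        "g": "reference.fa",        # hypothetical genome file
        "p": True,                  # also return the complete protein sequences
        "outdir": "extractor_out",  # hypothetical output directory
    },
)

# Print the command line; it could be passed to subprocess.run(command).
print(shlex.join(command))
```

Only parameter names from the table above are used; the list form avoids quoting problems if a path contains spaces.<br />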
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
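<br />
Clustering the search results by sorting on the first column can be sketched as follows. This is an illustration only and not GeMoMa's own implementation; the function name and the example rows are hypothetical, and in practice the parameter ''sort'' does this for you.<br />

```python
def sort_search_results(lines):
    """Return tab-separated search results sorted by their first column."""
    records = [line for line in lines if line.strip()]
    # sorted() is stable: hits of the same query keep their original order.
    return sorted(records, key=lambda line: line.split("\t", 1)[0])


# Hypothetical, shortened rows: query ID, target contig, score.
hits = [
    "geneB_0\tcontig2\t55",
    "geneA_1\tcontig1\t80",
    "geneB_0\tcontig1\t42",
    "geneA_0\tcontig1\t97",
]

sorted_hits = sort_search_results(hits)
for row in sorted_hits:
    print(row)
```

After sorting, all hits of one query appear consecutively.<br />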
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so, two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long, this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;<br />
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
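<br />
The filtering in the second step can be illustrated with the default value of the parameter ''filter'' from the table below, <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code>. The following sketch re-expresses that criterion for a single prediction; the function name and the dictionary representation of the GFF attributes are assumptions made for illustration and do not reflect how GAF evaluates filters internally.<br />

```python
import math


def passes_default_filter(attrs):
    """Apply start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    # A missing score is treated as NaN, as for external annotations.
    score = attrs.get("score", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    return complete and (math.isnan(score) or score / attrs["aa"] >= 0.75)


# Hypothetical attribute values of four predictions.
good = {"start": "M", "stop": "*", "score": 300.0, "aa": 350}
weak = {"start": "M", "stop": "*", "score": 200.0, "aa": 400}
external = {"start": "M", "stop": "*", "aa": 120}
partial = {"start": "L", "stop": "*", "score": 300.0, "aa": 350}

for name, prediction in [("good", good), ("weak", weak),
                         ("external", external), ("partial", partial)]:
    print(name, passes_default_filter(prediction))
```

The prediction without a score passes because the default filter accepts <code>isNaN(score)</code>, which keeps external annotations that do not carry this attribute.<br />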
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
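<br />
Whether a table provides the required columns can be checked before running GAF. The sketch below assumes that the first line of the file is a tab-delimited header and that <code>.*defline</code> is meant as a pattern matching any column name ending in "defline"; the function name and the example header are hypothetical.<br />

```python
import re

# Required column names; the last entry is interpreted as a regular expression.
REQUIRED_COLUMNS = ["transcriptName", "GO", ".*defline"]


def missing_columns(header_line):
    """Return the required column patterns that no header column matches."""
    columns = header_line.rstrip("\n").split("\t")
    return [
        pattern
        for pattern in REQUIRED_COLUMNS
        if not any(re.fullmatch(pattern, column) for column in columns)
    ]


# Hypothetical header line of an annotation info table.
header = "locusName\ttranscriptName\tGO\tBest-hit-defline\n"
print(missing_columns(header))
print(missing_columns("transcriptName\tGO\n"))
```

An empty list means that all three required columns were found.<br />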
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAF g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;NO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;<br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;<br />
<br />
<br />
=== Synteny checker ===<br />
<br />
This tool can be used to determine syntenic regions between the target organism and the reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files (=reference organisms), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (the assignment file of this reference organism, which combines parts of the CDS to transcripts, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== AddAttribute ===<br />
<br />
This tool adds an additional attribute to specific features of an annotation.<br />
<br />
These additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values taken from a given tab-delimited table.<br />
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;<br />
<br />
<br />
=== GAFComparison ===<br />
<br />
This tool compares results from GAF based on the attributes ref-gene and alternative.<br />
Hence, you can compare the annotation of different genomes or the effect of different parameters on the annotation of one genome.<br />
<br />
''GAFComparison'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAFComparison<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GAF annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (a simple name for the organism)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GAF), type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>differences (a switch to decide whether only genes with differences should be returned, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAFComparison n=&lt;name&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== Analyzer ===<br />
<br />
This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.<br />
<br />
True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure equal to the absolute value of this entry, but there is another true transcript that matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are also listed. The table contains the attributes of the true and the predicted annotation as well as some additional columns that make it easy to filter interesting examples and to compute statistics.<br />
<br />
The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.<br />
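<br />
The following is a minimal sketch of a nucleotide F1 computation for one true and one predicted transcript. It assumes the usual definition of F1 as the harmonic mean of nucleotide sensitivity and precision; the file names and intervals are hypothetical, and the code is not taken from the '''Analyzer''' implementation.<br />
<br />
```shell
# Hypothetical CDS intervals (start end, 1-based, inclusive) of one transcript each.
printf '100 199\n300 399\n' > true.txt   # true transcript: 200 nt
printf '100 199\n320 399\n' > pred.txt   # predicted transcript: 180 nt

awk '
    # First file: remember every nucleotide position covered by the truth.
    FNR == NR {
        for (p = $1; p <= $2; p++) if (!(p in truth)) { truth[p] = 1; nTrue++ }
        next
    }
    # Second file: count predicted positions and those shared with the truth.
    {
        for (p = $1; p <= $2; p++) if (!(p in pred)) {
            pred[p] = 1; nPred++
            if (p in truth) tp++
        }
    }
    END {
        if (tp == 0) { print "0.0000"; exit }   # no overlap at all
        sn = tp / nTrue                          # nucleotide sensitivity
        pr = tp / nPred                          # nucleotide precision
        printf "%.4f\n", 2 * sn * pr / (sn + pr)
    }' true.txt pred.txt > f1.txt

cat f1.txt
```
<br />
In this toy case all 180 predicted nucleotides lie within the 200 true nucleotides, so sensitivity is 0.9 and precision is 1.0.<br />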
<br />
''Analyzer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI Analyzer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>truth (the true annotation, type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (can be used to distinguish different predictions, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted annotation (GFF file containing the predicted annotation, type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>CDS (if true CDS features are used otherwise exon features, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ca</font></td><br />
<td>common attributes (Only gff attributes of mRNAs are included in the result table, that can be found in the given portion of all mRNAs. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table or compact table., valid range = [0.0, 1.0], default = 0.5)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter for deciding which transcript from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;<br />
<br />
<br />
=== BUSCORecomputer ===<br />
<br />
This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with '''Extractor'''. These proteins can be used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the '''Extractor''' can be used as input for this tool. Alternatively, a table can be generated from the annotation file that can be used instead of the assignment file.<br />
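<br />
As a rough sketch, such a table could be derived from a GFF3 file as shown here. The sketch assumes that transcripts are annotated as mRNA features whose ID and Parent attributes hold the transcript ID and the gene ID, and that the transcript IDs match the protein IDs in the BUSCO table; the file names and the demo annotation are hypothetical.<br />
<br />
```shell
# Hypothetical input file name; replace it with your own annotation.
gff=annotation.gff

# Tiny demo annotation (tab-separated GFF3) so that the sketch is self-contained.
printf 'chr1\tGAF\tgene\t1\t900\t.\t+\t.\tID=gene1\n' > "$gff"
printf 'chr1\tGAF\tmRNA\t1\t900\t.\t+\t.\tID=gene1.t1;Parent=gene1\n' >> "$gff"
printf 'chr1\tGAF\tmRNA\t1\t600\t.\t+\t.\tID=gene1.t2;Parent=gene1\n' >> "$gff"

# For every mRNA feature, write the gene ID (Parent) and the transcript ID (ID),
# separated by a tab: first column gene, second column transcript.
awk -F '\t' '$3 == "mRNA" {
    id = ""; parent = ""
    n = split($9, attr, ";")
    for (k = 1; k <= n; k++) {
        if (attr[k] ~ /^ID=/)     id = substr(attr[k], 4)
        if (attr[k] ~ /^Parent=/) parent = substr(attr[k], 8)
    }
    if (id != "" && parent != "") print parent "\t" id
}' "$gff" > ids.tabular

cat ids.tabular
```
<br />
The resulting file ids.tabular could then be passed to the tool via the parameter <code>i</code>.<br />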
<br />
''BUSCORecomputer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer b=&lt;BUSCO&gt; i=&lt;IDs&gt;</div>Keilwagenhttps://www.jstacs.de/index.php?title=File:GeMoMa-manual.pdf&diff=1159File:GeMoMa-manual.pdf2021-10-08T13:17:44Z<p>Keilwagen: Keilwagen uploaded a new version of File:GeMoMa-manual.pdf</p>
<hr />
<div></div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1155GeMoMa2021-10-07T12:23:45Z<p>Keilwagen: /* GFF attributes */ 1.8</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
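<br />
For several references, the repeated parameter groups described above can be assembled with a small shell loop. This is only a sketch: the reference list, the file names, and the values for threads and outdir are hypothetical placeholders, and the command is printed instead of executed.<br />
<br />
```shell
# Hypothetical reference list: ID, annotation file and genome file per line.
cat > references.txt <<'EOF'
ref1 ref1.gff ref1.fa
ref2 ref2.gff ref2.fa
EOF

# Append one "s=own i=... a=... g=..." group per reference.
refs=""
while read -r id ann gen; do
    refs="$refs s=own i=$id a=$ann g=$gen"
done < references.txt

# Placeholder values for threads, outdir and the target genome.
cmd="java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=4 outdir=results GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=target.fa$refs"

# Inspect the command first; it could then be started with: eval "$cmd"
echo "$cmd"
```
<br />
Printing the command before running it makes it easy to check that each reference contributes exactly one group of <code>s</code>, <code>i</code>, <code>a</code> and <code>g</code>.<br />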
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the predicted protein<br />
|-<br />
| raa || reference amino acids || GeMoMa || || mRNA || the number of amino acids in the reference protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| maxScore || maximal GeMoMa score || GeMoMa || || mRNA || maximal score which will be obtained by a prediction that is identical to the reference transcript<br />
|-<br />
| bestScore || best GeMoMa score || GeMoMa || || mRNA || score of the best GeMoMa prediction of this transcript and this target organism<br />
|-<br />
| maxGap || maximal gap || GeMoMa || || mRNA || length of the longest gap in the alignment between predicted and reference protein<br />
|-<br />
| lpm || longest positive match || GeMoMa || || mRNA || length of the longest positive scoring match in the alignment between predicted and reference protein, i.e., each pair of amino acids in the match has a positive score<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || protein alignment || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || protein alignment || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
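<br />
The attributes from the table can be used for filtering by hand with standard tools. The following sketch assumes that the attributes are stored as key=value pairs separated by semicolons in the ninth GFF column, and it keeps transcripts whose tie equals 1; the file names and the two demo lines are hypothetical.<br />
<br />
```shell
# Hypothetical prediction file with two demo mRNA features (tab-separated GFF).
gff=predictions.gff
printf 'chr1\tGeMoMa\tmRNA\t1\t900\t.\t+\t.\tID=t1;aa=300;raa=310;score=1200;tie=1\n' > "$gff"
printf 'chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t-\t.\tID=t2;aa=200;raa=400;score=350;tie=0.5\n' >> "$gff"

# Write ID, aa and raa of all mRNA features whose tie attribute equals 1.
awk -F '\t' '$3 == "mRNA" {
    split("", val)                          # clear values from the previous line
    n = split($9, attr, ";")
    for (k = 1; k <= n; k++) {
        eq = index(attr[k], "=")
        if (eq > 0) val[substr(attr[k], 1, eq - 1)] = substr(attr[k], eq + 1)
    }
    # "+ 0" forces a numeric comparison; missing or non-numeric values become 0.
    if (val["tie"] + 0 == 1) print val["ID"] "\t" val["aa"] "\t" val["raa"]
}' "$gff" > supported.tsv

cat supported.tsv
```
<br />
For combining and filtering predictions automatically, '''GAF''' remains the intended tool; this sketch is only meant for quick inspection.<br />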
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it lacks an initial start codon, a final stop codon, or both. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
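:As a rough illustration of filtering by hand, the following Python sketch ranks predictions by a numeric GFF attribute. It is not a replacement for GAF; the function names, the toy GFF lines and the threshold are assumptions for illustration only, and the feature type (here mRNA) depends on the parameter "tag".<br />

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def rank_predictions(gff_lines, attribute="avgCov", minimum=0.0, feature="mRNA"):
    """Return (value, ID) pairs of all features whose attribute is >= minimum,
    sorted by decreasing value. Entries without a numeric value are skipped."""
    ranked = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        try:
            value = float(attributes[attribute])
        except (KeyError, ValueError):
            continue  # attribute missing or not numeric (e.g. NA)
        if value >= minimum:
            ranked.append((value, attributes.get("ID", "")))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


# Toy lines; the IDs and values are made up.
example = [
    "chr1\tGeMoMa\tmRNA\t1\t900\t.\t+\t.\tID=t1;score=410;avgCov=12.5",
    "chr1\tGeMoMa\tmRNA\t1\t900\t.\t+\t.\tID=t2;score=395;avgCov=3.0",
    "chr1\tGeMoMa\tmRNA\t50\t700\t.\t-\t.\tID=t3;score=120;avgCov=NA",
    "chr1\tGeMoMa\tCDS\t1\t300\t.\t+\t0\tParent=t1",
]
for value, identifier in rank_predictions(example, "avgCov", minimum=5.0):
    print(identifier, value)
```

:The same function can rank by any other numeric attribute, for instance <code>score</code> or <code>tie</code>.<br />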
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too strongly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
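:The length condition mentioned above can be checked with a tiny Python sketch. It assumes GFF-style coordinates (1-based, start and end inclusive); the function names are hypothetical and not part of GeMoMa.<br />

```python
def intron_length(start, end):
    """Length of an intron given 1-based, inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


def preserves_reading_frame(start, end):
    """True if gaining or losing this intron keeps the reading frame,
    i.e. if the intron length is a multiple of 3."""
    return intron_length(start, end) % 3 == 0


# An intron from 101 to 190 has length 90, so the frame is preserved.
print(preserves_reading_frame(101, 190))
# An intron from 101 to 191 has length 91, so the frame would shift.
print(preserves_reading_frame(101, 191))
```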
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but not all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics. <br />
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
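:Before passing a custom genetic code to GeMoMa, it can help to validate the table. The following Python sketch checks the two-column, tab-delimited format described above. It is a hypothetical helper, not part of GeMoMa; the option <code>convert_rna</code> and the stop symbol <code>*</code> in the toy data are assumptions, so please compare with the template file linked above.<br />

```python
def parse_genetic_code(lines, convert_rna=False):
    """Parse 'amino acid<TAB>comma-separated triplets' lines into a dict
    mapping each DNA triplet to its amino acid symbol.

    With convert_rna=True, U is replaced by T, which is handy for tables
    written in the RNA alphabet; otherwise U is reported as an error.
    """
    code = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid = parts[0].strip()
        for triplet in parts[1].split(","):
            triplet = triplet.strip().upper()
            if convert_rna:
                triplet = triplet.replace("U", "T")
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet '{triplet}'")
            if triplet in code:
                raise ValueError(f"line {number}: triplet '{triplet}' listed twice")
            code[triplet] = amino_acid
    return code


# Toy excerpt of a table, not a complete genetic code.
table = ["M\tATG\n", "W\tTGG\n", "*\tTAA,TAG,TGA\n"]
code = parse_genetic_code(table)
print(len(code), "triplets parsed")
```

:A complete table assigns all 64 triplets, which can be verified with <code>len(code) == 64</code>.<br />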
<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn was the default search algorithm in GeMoMa until version 1.6.4. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or<br />
: (B) use s=pre-extracted, provide a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.8] (07.10.2021)<br />
* extended manual<br />
* new module Analyzer: for benchmarking<br />
* new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates<br />
* AnnotationEvidence<br />
** bugfix: gene borders if only one gene is on the contig<br />
** discard genes that do not code for a protein<br />
* AnnotationFinalizer:<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** implemented check of regular expression for prefix<br />
** bugfix if score==NA<br />
* CheckIntrons: introns are not optional<br />
* ERE: <br />
** new parameters for handling spurious split reads<br />
** new parameter for repositioning that is needed for genomes with huge chromosomes due to limitations of BAM/SAM<br />
** bugfix last intron<br />
** improved protocol<br />
* Extractor:<br />
** new parameter for long fasta comment<br />
** new parameter identical<br />
** more verbose output in case of problems<br />
** finding errors if CDS parts have different strands<br />
** changed optional intron output<br />
** bugfix for exons with DNA but no AA<br />
* GAF:<br />
** new parameter allowing to output the transcript names of redundant predictions as GFF attribute<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** bugfix: missing entries for alternative<br />
** changed default value for atf and sorting<br />
** implemented check of regular expression for prefix<br />
** changed handling of transcript within clusters<br />
** changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features<br />
* GeMoMa: <br />
** new parameter for replacing unknown AA by X<br />
** handling missing GeMoMa.ini.xml<br />
** additional GFF attributes: lpm, maxScore, maxGap, bestScore<br />
** improved error handling and protocol<br />
** changed heuristic for identifying multiple transcripts predictions on one contig/chromosome<br />
* GeMoMaPipeline:<br />
** new parameter "check synteny" allowing to run SyntenyChecker<br />
** implemented check of regular expression for prefix<br />
** removed unnecessary parameter<br />
** improved handling of exceptions<br />
** bugfix for stranded RNA-seq evidence<br />
** allow re-start only for same version<br />
** improved protocol if threads==1<br />
* SyntenyChecker: implemented check of regular expression for prefix<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1154GeMoMa2021-10-07T12:19:22Z<p>Keilwagen: /* Version history */ 1.8</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you would like to run GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
There are several parameters that need to be set, indicated by '''&lt;'''foo'''&gt;'''. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
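The attributes in the table can be read with a few lines of code. The following Python sketch assumes a standard nine-column GFF with <code>key=value</code> pairs separated by semicolons in the last column; the function names and the example lines are made up for illustration. Values that are reported as NA are returned as <code>None</code>.<br />

```python
def parse_attributes(column9):
    """Split the ninth GFF column into a dict of attribute -> string value."""
    attributes = {}
    for field in column9.strip().split(";"):
        key, sep, value = field.partition("=")
        if sep:  # skip empty or malformed fields
            attributes[key.strip()] = value.strip()
    return attributes


def as_number(value):
    """Convert an attribute value to float; missing values and 'NA' become None."""
    if value is None or value == "NA":
        return None
    return float(value)


def transcript_attributes(gff_lines, tag="mRNA"):
    """Yield (ID, attributes) for every feature whose type equals `tag`."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == tag:
            attributes = parse_attributes(columns[8])
            yield attributes.get("ID"), attributes


if __name__ == "__main__":
    example = [  # made-up lines for illustration only
        "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1;aa=120;score=450;tie=1;evidence=2",
        "chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t-\t.\tID=t2;Parent=g2;aa=80;score=210;tie=NA;evidence=1",
    ]
    for transcript_id, attrs in transcript_attributes(example):
        print(transcript_id, as_number(attrs.get("aa")), as_number(attrs.get("tie")))
```

If the parameter "tag" was changed, or the GFF was created before version 1.7, pass the corresponding feature name, e.g. <code>tag="prediction"</code>.<br />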
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) In addition, check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing the contig threshold increases the number of predictions that are computed internally, while increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
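Ranking by hand can be sketched as follows. This Python example is a simplified stand-in and not the algorithm used by GAF: it visits the predictions from the best to the worst value of one attribute and keeps a prediction only if it does not overlap an already kept prediction on the same contig and strand. The dictionary keys and the example values are hypothetical; in practice the ranking value would be taken from a GFF attribute such as score or avgCov.<br />

```python
def overlaps(a, b):
    """True if two predictions share a contig, a strand and at least one base."""
    return (a["contig"] == b["contig"] and a["strand"] == b["strand"]
            and a["start"] <= b["end"] and b["start"] <= a["end"])


def keep_best_per_locus(predictions, key="score"):
    """Greedy selection: visit predictions from best to worst and keep each
    one that does not overlap an already kept prediction."""
    kept = []
    for prediction in sorted(predictions, key=lambda p: p[key], reverse=True):
        if not any(overlaps(prediction, other) for other in kept):
            kept.append(prediction)
    return kept


if __name__ == "__main__":
    predictions = [  # made-up values for illustration only
        {"id": "a", "contig": "c1", "strand": "+", "start": 100, "end": 500, "score": 300.0},
        {"id": "b", "contig": "c1", "strand": "+", "start": 400, "end": 900, "score": 450.0},
        {"id": "c", "contig": "c1", "strand": "-", "start": 450, "end": 800, "score": 100.0},
    ]
    for prediction in keep_best_per_locus(predictions):
        print(prediction["id"], prediction["score"])
```

Note that this sketch discards alternative transcripts of the same gene, which GAF is able to retain. It is therefore only suitable for a quick first look.<br />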
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:The Extractor removes transcripts for several reasons. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
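To illustrate what inferring phases means, the following Python sketch recomputes the phase column from the lengths of the CDS parts, following the GFF3 definition of phase (the number of bases to skip at the start of a part to reach the next complete codon). It is not the Extractor's implementation, and it assumes that the first CDS part in transcription order starts with a complete codon. The function name is hypothetical.<br />

```python
def infer_phases(cds_parts, strand):
    """Return (start, end, phase) per CDS part in transcription order.

    cds_parts: list of (start, end) with 1-based inclusive coordinates.
    Assumes the first part begins with a complete codon (phase 0).
    """
    # On the minus strand, transcription runs from high to low coordinates.
    ordered = sorted(cds_parts, reverse=(strand == "-"))
    result = []
    coding_length = 0  # coding bases contained in the preceding parts
    for start, end in ordered:
        phase = (3 - coding_length % 3) % 3  # bases that complete the open codon
        result.append((start, end, phase))
        coding_length += end - start + 1
    return result


if __name__ == "__main__":
    for part in infer_phases([(1, 10), (21, 30), (41, 50)], "+"):
        print(part)
```

A GFF in which all CDS entries have phase 0, although the part lengths are not multiples of 3, would disagree with the values computed this way.<br />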
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron-structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns, adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics. <br />
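The VM options have to be placed directly after <code>java</code> and before <code>-jar</code>; otherwise they are passed to GeMoMa as program arguments. A minimal Python sketch, assuming the command is held as a list of arguments (the helper name is hypothetical):<br />

```python
def with_heap(command, initial="5G", maximum="50G"):
    """Return a copy of a java command with -Xms/-Xmx placed right after 'java'."""
    if not command or command[0] != "java":
        raise ValueError("expected a command starting with 'java'")
    return [command[0], f"-Xms{initial}", f"-Xmx{maximum}"] + list(command[1:])


if __name__ == "__main__":
    base = ["java", "-jar", "GeMoMa.jar", "CLI", "GeMoMaPipeline"]
    print(" ".join(with_heap(base)))
```

The default values mirror the examples in the answer above and should be adapted to the memory that is available on your machine.<br />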
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
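The format described above can be read as follows. This Python sketch assumes one amino acid per line, a tab between the two columns and a comma-separated list of triplets; it also converts triplets written in the RNA alphabet (U) to the DNA alphabet (T), which GeMoMa expects. The function name is hypothetical, and the symbol used for stop codons in the example (<code>*</code>) is an assumption; please check the linked default file.<br />

```python
def parse_genetic_code(lines):
    """Read a two-column, tab-delimited genetic code into a codon -> amino acid dict."""
    codon_to_aa = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            codon = triplet.strip().upper().replace("U", "T")  # RNA -> DNA alphabet
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"invalid triplet: {triplet!r}")
            codon_to_aa[codon] = amino_acid.strip()
    return codon_to_aa


if __name__ == "__main__":
    table = parse_genetic_code(["M\tATG\n", "W\tUGG\n", "*\tTAA,TAG,TGA\n"])
    print(sorted(table.items()))
```

Such a check is useful before a long run, since a triplet with a typo or a missing tab is reported immediately.<br />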
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was used by default as search algorithm (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline, you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.8] (07.10.2021)<br />
* extended manual<br />
* new module Analyzer: for benchmarking<br />
* new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates<br />
* AnnotationEvidence<br />
** bugfix: gene borders if only one gene is on the contig<br />
** discard genes that do not code for a protein<br />
* AnnotationFinalizer:<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** implemented check of regular expression for prefix<br />
** bugfix if score==NA<br />
* CheckIntrons: introns are not optional<br />
* ERE: <br />
** new parameters for handling spurious split reads<br />
** new parameter for repositioning that is needed for genomes with huge chromosomes due to limitations of BAM/SAM<br />
** bugfix last intron<br />
** improved protocol<br />
* Extractor:<br />
** new parameter for long fasta comment<br />
** new parameter identical<br />
** more verbose output in case of problems<br />
** finding errors if CDS parts have different strands<br />
** changed optional intron output<br />
** bugfix for exons with DNA but no AA<br />
* GAF:<br />
** new parameter allowing to output the transcript names of redundant predictions as GFF attribute<br />
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...<br />
** bugfix: missing entries for alternative<br />
** changed default value for atf and sorting<br />
** implemented check of regular expression for prefix<br />
** changed handling of transcripts within clusters<br />
** changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features<br />
* GeMoMa: <br />
** new parameter for replacing unknown AA by X<br />
** handling missing GeMoMa.ini.xml<br />
** additional GFF attributes: lpm, maxScore, maxGap, bestScore<br />
** improved error handling and protocol<br />
** changed heuristic for identifying multiple transcript predictions on one contig/chromosome<br />
* GeMoMaPipeline:<br />
** new parameter "check synteny" allowing to run SyntenyChecker<br />
** implemented check of regular expression for prefix<br />
** removed unnecessary parameter<br />
** improved handling of exceptions<br />
** bugfix for stranded RNA-seq evidence<br />
** allow re-start only for same version<br />
** improved protocol if threads==1<br />
* SyntenyChecker: implemented check of regular expression for prefix<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1153GeMoMa-Docs2021-10-07T12:16:19Z<p>Keilwagen: 1.8</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.</br><br />
If you have any questions, comments or bugs, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/issues?q=label%3AGeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool runs the complete GeMoMa pipeline. It is multi-threaded and can use all compute cores of one machine, but it cannot distribute work across several machines, for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
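The following Python sketch illustrates how one pipeline run ties a target genome to one or more reference species. It only assembles and prints a command line; it does not start GeMoMa. It assumes that parameters are passed as <code>name=value</code> pairs and that <code>s</code> opens the parameter block of one reference species; both are assumptions, as are the helper name <code>build_pipeline_command</code> and all file names. The parameter names themselves are taken from the table on this page.<br />

```python
import shlex


def build_pipeline_command(jar, target, references, options=None):
    """Return the argument list for one GeMoMaPipeline run (dry run only).

    references: list of dicts holding the per-species parameters (s, i, a, g, ...).
    options: dict of further parameters, e.g. {"tblastn": "false"}.
    """
    args = ["java", "-jar", jar, "CLI", "GeMoMaPipeline", f"t={target}"]
    for ref in references:
        # Assumption: "s" opens a reference block, so it is emitted first.
        args.append(f"s={ref.get('s', 'own')}")
        args.extend(f"{key}={value}" for key, value in ref.items() if key != "s")
    for key, value in (options or {}).items():
        # Assumption: every parameter is written as name=value.
        args.append(f"{key}={value}")
    return args


command = build_pipeline_command(
    jar="GeMoMa-1.8.jar",
    target="target.fa",  # hypothetical file names throughout
    references=[{"s": "own", "i": "ref1", "a": "ref1.gff", "g": "ref1.fa"}],
    options={"r": "MAPPED", "ERE.m": "rnaseq.bam", "tblastn": "false"},
)
print(shlex.join(command))
```

Adding a second dictionary to <code>references</code> would repeat the <code>s</code> block, which matches the note in the table that these parameters can be used zero or multiple times.<br />
<br />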
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, type = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.repositioning</font></td><br />
<td>repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.l</font></td><br />
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially useful if the gene annotation files were produced by different software packages (e.g., GeMoMa, Braker, ...) that report different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.t</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;NO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.t</font></td><br />
<td>transfer features (if true other features than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written in the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sc</font></td><br />
<td>synteny check (run SyntenyChecker if possible, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows setting a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline t=&lt;target_genome&gt; AnnotationFinalizer.p=&lt;prefix&gt;<br />
<br />
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">maximumcoverage</font></td><br />
<td>maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mil</font></td><br />
<td>minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">repositioning</font></td><br />
<td>repositioning (due to limitations in the BAM/SAM format, huge chromosomes need to be split before mapping. This parameter allows undoing the split by mapping back to the real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI ERE m=&lt;mapped_reads_file&gt;<br />
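<br />
The three-column layout of the ''repositioning'' file can be pictured with a short Python sketch. It is an illustration only, not GeMoMa code: it assumes that the offset is simply added to a position on the split sequence to obtain the position on the original chromosome, and the function names and chromosome names are made up.<br />

```python
import csv
import io


def load_repositioning(handle):
    """Read a tab-separated table with the columns
    split_chr_name, original_chr_name, offset_in_original_chr."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        split_name, original_name, offset = row[0], row[1], int(row[2])
        table[split_name] = (original_name, offset)
    return table


def reposition(table, chrom, position):
    """Map a position on a split sequence back to the original chromosome.

    Assumption for this sketch: original position = split position + offset.
    Sequences that are not listed in the table are returned unchanged.
    """
    if chrom not in table:
        return chrom, position
    original_name, offset = table[chrom]
    return original_name, position + offset


# Hypothetical example: chr1 was split into two parts before mapping.
example = "chr1_part1\tchr1\t0\nchr1_part2\tchr1\t500000000\n"
table = load_repositioning(io.StringIO(example))
print(reposition(table, "chr1_part2", 1200))
```

Whether coordinates are 0-based or 1-based in a given file does not matter for this sketch, since the offset is only added.<br />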
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;<br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;<br />
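<br />
The effect of the two thresholds ''maximum intron length'' and ''minimum expression'' can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: the relative expression is assumed to be precomputed (the module derives it from the coverage in the ''context'' around the splice sites), the handling of values exactly on a threshold is an assumption, and the class and function names are hypothetical.<br />

```python
from dataclasses import dataclass


@dataclass
class Intron:
    start: int
    end: int
    relative_expression: float  # assumed to be precomputed, in [0.0, 1.0]

    @property
    def length(self):
        # inclusive coordinates, as in GFF
        return self.end - self.start + 1


def denoise(introns, max_intron_length=15000, min_expression=0.01):
    """Keep introns that are neither too long nor too weakly expressed.

    The defaults mirror the documented defaults of the parameters m and me.
    """
    return [
        intron
        for intron in introns
        if intron.length <= max_intron_length
        and intron.relative_expression >= min_expression
    ]


introns = [
    Intron(100, 350, 0.8),     # kept
    Intron(100, 90000, 0.9),   # removed: longer than the maximum intron length
    Intron(500, 900, 0.001),   # removed: relative expression below the threshold
]
kept = denoise(introns)
print(len(kept))
```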
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, type = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI NRR rl=&lt;reference_list&gt;<br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
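<br />
As an illustration of the two-column genetic code table, the following Python sketch parses a small hypothetical excerpt into a codon lookup. It assumes that the triplets in the second column are separated by commas and that the stop codons are listed under &quot;*&quot;; neither detail is stated on this page, so please compare with a genetic code file that is known to work before relying on it.<br />

```python
import csv
import io

# Hypothetical excerpt of a user-specified genetic code table.
# Column 1: amino acid in one-letter code; column 2: list of triplets.
# Assumptions: triplets are comma-separated, "*" denotes stop codons.
EXAMPLE = "M\tATG\nW\tTGG\nK\tAAA,AAG\n*\tTAA,TAG,TGA\n"


def read_genetic_code(handle):
    """Return a dict that maps each triplet to its amino acid."""
    codon_to_aa = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(row)}")
        amino_acid, triplets = row
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3:
                raise ValueError(f"not a triplet: {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"triplet assigned twice: {triplet}")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa


code = read_genetic_code(io.StringIO(EXAMPLE))
print(code["AAG"])
```

A check like this catches malformed rows and triplets that are assigned to two amino acids before the file is passed to the ''genetic code'' parameter.<br />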
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from the annotation, which might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">identical</font></td><br />
<td>identical (if the CDS is identical, Extractor only uses one transcript. This parameter allows returning a table that lists in the first column the used transcript and in the second column the discarded transcript. If no transcript is discarded, the list is empty., default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">l</font></td><br />
<td>long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;<br />
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
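<br />
What &quot;sorted according to the first column&quot; means for tabular search results can be shown with a minimal Python sketch. It only illustrates the expected ordering and is no replacement for the parameter ''sort''; the query IDs, contig names and values are made up.<br />

```python
def sort_by_query(lines):
    """Stable sort of tabular search results by the first column (query ID).

    Hits of the same query end up next to each other and keep their
    original relative order.
    """
    records = [line for line in lines if line.strip()]
    return sorted(records, key=lambda line: line.split("\t", 1)[0])


# Hypothetical, truncated search results (query, target, identity).
hits = [
    "geneB_cds0\tcontig2\t71.4\n",
    "geneA_cds0\tcontig1\t88.0\n",
    "geneB_cds0\tcontig1\t65.2\n",
    "geneA_cds1\tcontig1\t90.1\n",
]

for line in sort_by_query(hits):
    print(line, end="")
```

For large result files, an external sort on the first column achieves the same ordering without loading everything into memory.<br />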
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids (additional) premature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so, two additional attributes (iAA, pAA) will be added to predictions in the GFF output. These might be used in GAF. However, since some transcripts are very long, this can increase the required runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ru</font></td><br />
<td>replace unknown (Replace unknown amino acid symbols by X, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;<br />
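<br />
For scripted runs, a call like the one above can be assembled by a small wrapper. The following Python sketch is not part of GeMoMa: the function name <code>gemoma_command</code> and all file names are hypothetical, and only parameter names listed in the table above are used. It assumes that boolean parameters are passed as lowercase true/false.<br />
<br />
```python
import shlex


def gemoma_command(search_results, target_genome, cds_parts,
                   options=None, jar="GeMoMa-1.8.jar"):
    """Build the argument list for a GeMoMa CLI call (hypothetical helper).

    `options` maps parameter names from the table above (e.g. "m", "p",
    "as", "outdir") to values; they are appended as key=value pairs.
    """
    args = ["java", "-jar", jar, "CLI", "GeMoMa",
            f"s={search_results}", f"t={target_genome}", f"c={cds_parts}"]
    for key, value in (options or {}).items():
        if isinstance(value, bool):
            # Assumption: boolean flags are written as lowercase true/false.
            value = "true" if value else "false"
        args.append(f"{key}={value}")
    return args


# Hypothetical input files; replace with your own paths.
command = gemoma_command(
    "search_results.txt", "target_genome.fa", "cds-parts.fasta",
    options={"m": 20000, "p": 5, "as": True, "outdir": "results"},
)
print(shlex.join(command))
```
<br />
The returned list can be handed to <code>subprocess.run(command, check=True)</code>. A dictionary is used for the options because some parameter names (for example <code>as</code> or <code>intron-loss-gain-penalty</code>) are not valid Python keyword arguments.<br />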
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
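<br />
The three steps can be illustrated with a strongly simplified Python sketch. It is not GAF's implementation. It assumes that predictions are redundant when chromosome, strand and CDS parts are identical, that evidence counts the merged predictions, that sumWeight adds up their weights, and that only the top-ranked prediction of each cluster is extracted. The dictionary layout and the function name <code>gaf_sketch</code> are hypothetical.<br />
<br />
```python
def gaf_sketch(predictions, keep, sort_keys=("sumWeight", "score")):
    """Merge redundant predictions, filter them, and pick one per cluster."""
    # Step 1: merge redundant predictions and introduce evidence / sumWeight.
    merged = {}
    for p in predictions:
        key = (p["chr"], p["strand"], tuple(p["cds"]))
        weight = p.get("weight", 1.0)
        if key in merged:
            m = merged[key]
            m["evidence"] += 1
            m["sumWeight"] += weight
            m["score"] = max(m["score"], p["score"])
        else:
            merged[key] = {**p, "evidence": 1, "sumWeight": weight}

    # Step 2: apply the user-specified filter criterion.
    kept = [p for p in merged.values() if keep(p)]

    # Step 3: cluster overlapping predictions on the same chromosome and strand.
    def span(p):
        return p["cds"][0][0], p["cds"][-1][1]

    kept.sort(key=lambda p: (p["chr"], p["strand"], span(p)[0]))
    clusters, current, current_end, current_key = [], [], None, None
    for p in kept:
        start, end = span(p)
        key = (p["chr"], p["strand"])
        if current and key == current_key and start <= current_end:
            current.append(p)
            current_end = max(current_end, end)
        else:
            if current:
                clusters.append(current)
            current, current_end, current_key = [p], end, key
    if current:
        clusters.append(current)

    # Sort within each cluster (descending) and extract the best prediction.
    best = []
    for cluster in clusters:
        cluster.sort(key=lambda p: tuple(p[k] for k in sort_keys), reverse=True)
        best.append(cluster[0])
    return best


# Hypothetical toy data: CDS parts are (start, end) pairs in ascending order.
predictions = [
    {"id": "a", "chr": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)], "score": 150.0},
    {"id": "b", "chr": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)], "score": 140.0},
    {"id": "c", "chr": "chr1", "strand": "+", "cds": [(150, 200), (300, 400)], "score": 160.0},
    {"id": "d", "chr": "chr1", "strand": "+", "cds": [(1000, 1200)], "score": 10.0},
]
result = gaf_sketch(predictions, keep=lambda p: p["score"] >= 50)
print([(p["id"], p["evidence"], p["sumWeight"]) for p in result])
```
<br />
In this toy input, predictions a and b are merged, d is removed by the filter, and a outranks c because sorting by sumWeight takes precedence over score.<br />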
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as a tab-delimited file with at least the following columns: transcriptName, GO and .*defline.<br />
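<br />
Before passing an annotation info table to GAF, it can be useful to check that the required columns are present. The following Python sketch does this for the header line. The function names and the example header are hypothetical, and it assumes that each column name must match the given pattern completely and case-sensitively, which may differ from how GAF matches columns.<br />
<br />
```python
import csv
import io
import re

# Column patterns named in the text above.
REQUIRED_PATTERNS = ["transcriptName", "GO", ".*defline"]


def missing_columns(header):
    """Return the required patterns that no column name in `header` matches."""
    return [
        pattern for pattern in REQUIRED_PATTERNS
        if not any(re.fullmatch(pattern, column) for column in header)
    ]


def check_annotation_info(handle):
    """Read the first line of a tab-delimited table and report missing columns."""
    header = next(csv.reader(handle, delimiter="\t"))
    return missing_columns(header)


# Hypothetical header line for illustration.
example = io.StringIO("locusName\ttranscriptName\tGO\tbest-hit-defline\n")
print(check_annotation_info(example))
```
<br />
An empty list means that all three required columns were found.<br />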
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could combine several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">aat</font></td><br />
<td>add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAF g=&lt;gene_annotation_file&gt;<br />
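<br />
The default value of the filter parameter can be read as an ordinary boolean expression over the GFF attributes of a prediction. The following Python sketch re-implements only this default expression for illustration. It is not GAF's filter parser, and the helper names <code>parse_attributes</code> and <code>default_filter</code> as well as the example attribute strings are hypothetical.<br />
<br />
```python
import math


def parse_attributes(column9):
    """Split a GFF attribute column (key=value pairs separated by ';')."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def default_filter(attributes):
    """start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)"""
    # Missing numeric attributes are treated as NaN, mirroring the
    # "default attributes" parameter described above.
    score = float(attributes.get("score", "nan"))
    aa = float(attributes.get("aa", "nan"))
    complete = attributes.get("start") == "M" and attributes.get("stop") == "*"
    return complete and (math.isnan(score) or score / aa >= 0.75)


# Hypothetical attribute strings for illustration.
examples = [
    "ID=t1;aa=400;score=320;start=M;stop=*",  # complete, relative score 0.8
    "ID=t2;aa=400;score=200;start=M;stop=*",  # complete, relative score 0.5
    "ID=t3;aa=400;start=M;stop=*",            # external annotation without score
    "ID=t4;aa=400;score=390;start=L;stop=*",  # does not start with M
]
for line in examples:
    attrs = parse_attributes(line)
    print(attrs["ID"], default_filter(attrs))
```
<br />
Predictions without a score pass the relative-score test, which is why external annotations are kept by the default filter as long as they are complete.<br />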
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions that are too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity, as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;NO&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tf</font></td><br />
<td>transfer features (if true, features other than gene, &lt;tag&gt; (default: mRNA), and CDS of the input will be written to the output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (generates generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true, the new name is added as a new attribute &quot;Name&quot;; otherwise, the &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;<br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, type = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., type = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, type = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;<br />
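<br />
The coverage parameters expect bedgraph files in which intervals with coverage 0 may be omitted. The following Python sketch shows how such a sparse file can be interpreted, following the common bedGraph convention of 0-based, half-open intervals. Whether GeMoMa handles coordinates in exactly this way is an assumption, and the function names and example lines are hypothetical.<br />
<br />
```python
def read_bedgraph(lines):
    """Collect (start, end, value) intervals per sequence from bedGraph lines."""
    coverage = {}
    for line in lines:
        line = line.strip()
        # Skip empty lines, comments and track/browser header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        coverage.setdefault(chrom, []).append((int(start), int(end), float(value)))
    return coverage


def coverage_at(coverage, chrom, position):
    """Coverage at a 0-based position; omitted intervals count as 0."""
    for start, end, value in coverage.get(chrom, []):
        if start <= position < end:
            return value
    return 0.0


# Hypothetical file content for illustration.
example_lines = [
    "track type=bedGraph",
    "chr1\t0\t10\t3",
    "chr1\t20\t30\t7.5",
]
cov = read_bedgraph(example_lines)
print(coverage_at(cov, "chr1", 5), coverage_at(cov, "chr1", 15))
```
<br />
The linear scan keeps the sketch short; for genome-scale files, sorted intervals with binary search would be a more suitable design.<br />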
<br />
<br />
=== Synteny checker ===<br />
<br />
This tool can be used to determine syntenic regions between the target organism and the reference organism based on similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files (=reference organisms), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (the assignment file of this reference organism, which combines parts of the CDS to transcripts, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== AddAttribute ===<br />
<br />
This tool adds an additional attribute to specific features of an annotation.<br />
<br />
These additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.<br />
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, type = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;<br />
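<br />
What the VALUES mode does conceptually can be sketched in a few lines of Python. This is not the tool's implementation: it assumes that the ID and attribute columns are 0-based, that table IDs are matched against the ID attribute of the selected feature, and that the new attribute is appended to the attribute column. The function name and example data are hypothetical.<br />
<br />
```python
def add_attribute(gff_lines, table_rows, attribute,
                  id_column=0, attribute_column=1, feature="mRNA"):
    """Append attribute=value to matching features of a GFF (sketch)."""
    values = {row[id_column]: row[attribute_column] for row in table_rows}
    output = []
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        is_feature = (not line.startswith("#")
                      and len(columns) == 9 and columns[2] == feature)
        if is_feature:
            fields = dict(f.split("=", 1) for f in columns[8].split(";") if "=" in f)
            feature_id = fields.get("ID")
            if feature_id in values:
                columns[8] = columns[8].rstrip(";") + f";{attribute}={values[feature_id]}"
        output.append("\t".join(columns))
    return output


# Hypothetical annotation and table for illustration.
gff = [
    "##gff-version 3",
    "chr1\tGAF\tmRNA\t100\t400\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tGAF\tCDS\t100\t400\t.\t+\t0\tParent=t1",
]
table = [["t1", "0.93"], ["t2", "0.10"]]
annotated = add_attribute(gff, table, "conservation")
print("\n".join(annotated))
```
<br />
Only the mRNA line receives the new attribute; header lines and other features are passed through unchanged.<br />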
<br />
<br />
=== GAFComparison ===<br />
<br />
This tool compares results from GAF based on the attributes ref-gene and alternative.<br />
Hence, you can compare the annotation of different genomes or the effect of different parameters on the annotation of one genome.<br />
<br />
''GAFComparison'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAFComparison<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GAF annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (a simple name for the organism)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GAF), type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>differences (a switch to decide whether only genes with differences should be returned, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI GAFComparison n=&lt;name&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
=== Analyzer ===<br />
<br />
This tool compares a true annotation with a predicted annotation, as is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing the true and the predicted annotation, which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.<br />
<br />
True and predicted transcripts are evaluated based on the nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure equal to the absolute value of this entry, but that another true transcript matches this predicted transcript with an even better F1. True and predicted transcripts that do not overlap with any transcript from the predicted and true annotation, respectively, are also listed. Besides the attributes of the true and the predicted annotation, the table contains some additional columns that make it easy to filter interesting examples and to compute statistics.<br />
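<br />
As an illustration of the measure, the sketch below computes a nucleotide F1 value from three counts. It assumes the standard definition of F1 as the harmonic mean of precision and sensitivity; how Analyzer counts nucleotides internally is not described here, so the function <code>nucleotide_f1</code> and its arguments are hypothetical.<br />

```shell
# nucleotide_f1 OVERLAP TRUE_LEN PRED_LEN
#   OVERLAP : nucleotides shared by the true and the predicted transcript
#   TRUE_LEN: nucleotides annotated in the true transcript
#   PRED_LEN: nucleotides annotated in the predicted transcript
nucleotide_f1() {
    awk -v o="$1" -v t="$2" -v p="$3" 'BEGIN {
        if (t + p == 0) { print "NaN"; exit }
        # F1 = 2PR/(P+R) with P = o/p and R = o/t simplifies to 2o/(t+p)
        printf "%.4f\n", 2 * o / (t + p)
    }'
}

# 90 shared nucleotides, 100 true and 120 predicted nucleotides
nucleotide_f1 90 100 120
```

With these numbers, precision is 90/120 and sensitivity is 90/100, which gives an F1 of 180/220, i.e. about 0.82.<br />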
<br />
The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.<br />
<br />
''Analyzer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI Analyzer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>truth (the true annotation, type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name (can be used to distinguish different predictions, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted annotation (GFF file containing the predicted annotation, type = gff,gff3,gff.gz,gff3.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>CDS (if true CDS features are used otherwise exon features, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ca</font></td><br />
<td>common attributes (Only gff attributes of mRNAs are included in the result table, that can be found in the given portion of all mRNAs. Attributes and their portion are handled independently for truth and prediction. This parameter allows to choose between a more informative table or compact table., valid range = [0.0, 1.0], default = 0.5)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter for deciding which transcript from the truth are reliable or not. The filter is applied to the GFF attributes of the truth. You probably need to run AnnotationEvidence on the truth GFF. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*'), no premature stop codons (nps==0), RNA-seq coverage (tpc==1) and intron evidence (isNaN(tie) or tie==1)., default = start=='M' and stop=='*' and nps==0 and (tpc==1 and (isNaN(tie) or tie==1)), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI Analyzer t=&lt;truth&gt; p=&lt;predicted_annotation&gt;<br />
<br />
<br />
=== BUSCORecomputer ===<br />
<br />
This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with '''Extractor''', and these proteins can be used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the '''Extractor''' can be used as input for this tool. Alternatively, a table can be generated from the annotation file and used instead of the assignment file.<br />
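<br />
A table of gene and transcript IDs could be derived from a GFF3 file roughly as sketched below. The sketch assumes that transcripts are annotated as <code>mRNA</code> features carrying <code>ID</code> and <code>Parent</code> attributes; the function name <code>gff_to_ids</code> and the sample records are hypothetical, and other annotation styles (e.g. GTF) would need a different parser.<br />

```shell
# gff_to_ids: read GFF3 on stdin, print "gene ID <TAB> transcript ID" per mRNA.
# Assumption: transcripts are "mRNA" features with ID=... and Parent=... attributes.
gff_to_ids() {
    awk -F'\t' '$3 == "mRNA" {
        id = ""; parent = ""
        n = split($9, attrs, ";")
        for (k = 1; k <= n; k++) {
            if (attrs[k] ~ /^ID=/)     id = substr(attrs[k], 4)
            if (attrs[k] ~ /^Parent=/) parent = substr(attrs[k], 8)
        }
        if (id != "" && parent != "") print parent "\t" id
    }'
}

# Tiny made-up annotation with one gene and two transcripts
printf 'chr1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1\nchr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1\nchr1\tsrc\tmRNA\t1\t600\t.\t+\t.\tID=g1.t2;Parent=g1\n' | gff_to_ids
```

The first column of the result is the gene ID and the second the transcript ID, which is the column order the parameter <code>i</code> expects.<br />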
<br />
''BUSCORecomputer'' may be called with<br />
<br />
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer b=&lt;BUSCO&gt; i=&lt;IDs&gt;</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1152GeMoMa2021-10-07T12:15:01Z<p>Keilwagen: /* In a nutshell */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
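<br />
For several references, the repeated parameter block can be assembled from a small table, as in the following sketch. The helper <code>build_reference_args</code>, the reference IDs, the file names and the values for threads and outdir are hypothetical; only the parameter tags <code>s</code>, <code>i</code>, <code>a</code>, <code>g</code> and the recommended settings are taken from the description above. The call is printed rather than executed.<br />

```shell
# build_reference_args: read "id <TAB> annotation <TAB> genome" lines on stdin
# and print the repeated block "s=own i=... a=... g=..." on a single line.
build_reference_args() {
    awk -F'\t' 'NF == 3 { printf "%ss=own i=%s a=%s g=%s", sep, $1, $2, $3; sep = " " }
                END     { print "" }'
}

# Two made-up references
refs=$(printf 'ref1\tref1.gff\tref1.fa\nref2\tref2.gff\tref2.fa\n' | build_reference_args)

# Print the call instead of running it.
echo "java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline threads=4 outdir=results GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=target.fa $refs"
```

Keeping the references in a table makes it easy to add or remove a reference without editing a long command line by hand.<br />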
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
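<br />
:The name check can be sketched with standard tools as follows. The function names are hypothetical, the inputs are assumed to be uncompressed, and the contig name is taken to be the first word of each FASTA header line.<br />

```shell
# gff_contigs FILE: contig names used in column 1 of a GFF (comment lines skipped).
gff_contigs() {
    awk -F'\t' '!/^#/ && NF >= 9 { print $1 }' "$1" | sort -u
}

# fasta_contigs FILE: first word of every FASTA header line.
fasta_contigs() {
    awk '/^>/ { name = substr($1, 2); if (name != "") print name }' "$1" | sort -u
}

# missing_contigs GFF FASTA: names present in the GFF but absent from the FASTA.
missing_contigs() {
    gff_tmp=$(mktemp) && fa_tmp=$(mktemp) || return 1
    gff_contigs "$1" > "$gff_tmp"
    fasta_contigs "$2" > "$fa_tmp"
    comm -23 "$gff_tmp" "$fa_tmp"
    rm -f "$gff_tmp" "$fa_tmp"
}
```

:Calling <code>missing_contigs annotation.gff genome.fa</code> (hypothetical file names) prints nothing if every contig name of the annotation also occurs in the genome file.<br />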
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
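<br />
:Filtering by hand could look like the following sketch, which keeps only transcript lines whose average coverage reaches a threshold. It assumes GFF3-style attributes separated by semicolons and the default tag <code>mRNA</code>; the function name and the threshold are hypothetical, and child features such as CDS lines are not carried along, so the result is meant for inspection rather than as a finished annotation.<br />

```shell
# filter_by_avgcov MIN: keep mRNA lines from a GFF on stdin
# whose avgCov attribute is at least MIN.
filter_by_avgcov() {
    awk -F'\t' -v min="$1" '$3 == "mRNA" {
        n = split($9, attrs, ";")
        for (k = 1; k <= n; k++)
            if ((attrs[k] ~ /^avgCov=/) && (substr(attrs[k], 8) + 0 >= min + 0)) { print; break }
    }'
}

# Two made-up predictions; only the first passes a threshold of 1
printf 'chr1\tGeMoMa\tmRNA\t1\t900\t.\t+\t.\tID=p1;avgCov=12.5\nchr1\tGeMoMa\tmRNA\t1\t800\t.\t+\t.\tID=p2;avgCov=0.3\n' | filter_by_avgcov 1
```

:For routine use, GAF remains the recommended way to filter and rank predictions.<br />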
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics. <br />
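<br />
:The sketch below shows where the memory options go on the command line: JVM options have to be placed before <code>-jar</code>. The helper <code>gemoma_with_memory</code> and the module arguments are hypothetical, and the call is printed rather than executed.<br />

```shell
# gemoma_with_memory XMS XMX MODULE [PARAMETERS...]
# Print a GeMoMa call with the JVM memory options placed before -jar.
gemoma_with_memory() {
    xms=$1; xmx=$2; shift 2
    echo "java -Xms$xms -Xmx$xmx -jar GeMoMa-1.8.jar CLI $*"
}

gemoma_with_memory 5G 50G GeMoMaPipeline threads=4 t=target.fa
```

:Options written after the jar file name would be passed to GeMoMa as ordinary arguments instead of being interpreted by the Java VM.<br />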
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
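<br />
:A custom table can be checked against the layout described above with a sketch like the following. The function name is hypothetical, and the check covers only what is stated here (two tab-separated columns, a one-letter code, comma-separated triplets over A, C, G and T); it does not verify that the code is biologically complete.<br />

```shell
# check_genetic_code FILE: report lines that break the expected layout
# and return a non-zero exit status if any were found.
check_genetic_code() {
    awk -F'\t' '
        NF != 2 || length($1) != 1 {
            printf "line %d: expected one-letter code <TAB> triplets\n", NR
            bad = 1
            next
        }
        {
            n = split($2, codons, ",")
            for (k = 1; k <= n; k++)
                if (codons[k] !~ /^[ACGT][ACGT][ACGT]$/) {
                    printf "line %d: invalid triplet %s\n", NR, codons[k]
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

:A table written with the RNA alphabet, e.g. a line containing UUU, is reported as invalid, which mirrors the requirement that T is used instead of U.<br />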
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was the default search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein IDs are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1120GeMoMa-Docs2020-10-14T05:27:47Z<p>Keilwagen: github link</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br />
If you have any questions, comments or bugs, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/issues?q=label%3AGeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
= GeMoMa pipeline =<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
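As a rough sketch of how the stages listed above map onto a single call, the following dry run only assembles and prints a command line for one reference species with mapped RNA-seq reads; it does not execute anything. The parameter names (t, s, i, a, g, r, ERE.m, tblastn) are taken from the parameter table on this page, whereas all file names and the ID are hypothetical placeholders, and the name=value syntax is an assumption that should be checked against the help output of the tool.<br />
<br />

```shell
#!/bin/sh
# Dry run: assemble and print a GeMoMaPipeline call without executing it.
# All file names and the reference ID below are hypothetical placeholders.
JAR="GeMoMa-1.7.1.jar"
TARGET="target.fa"        # t: target genome (FASTA)
REF_ID="ref1"             # i: ID to distinguish the reference species
REF_GFF="reference.gff"   # a: reference annotation (GFF or GTF)
REF_FA="reference.fa"     # g: reference genome (FASTA)
BAM="mapped.bam"          # ERE.m: mapped RNA-seq reads (BAM/SAM)

# Assumption: parameters are passed as name=value pairs.
# s=own selects a reference given by annotation and genome,
# r=MAPPED selects RNA-seq evidence from mapped reads,
# tblastn=false keeps mmseqs as the search algorithm (the default).
CMD="java -jar $JAR CLI GeMoMaPipeline t=$TARGET s=own i=$REF_ID a=$REF_GFF g=$REF_FA r=MAPPED ERE.m=$BAM tblastn=false"

# Print the command for inspection instead of running it.
echo "$CMD"
```

Printing the command first makes it easy to check the arguments before starting a long run. Since mmseqs is used in this sketch, it needs to be installed for the real call to work.<br />
<br />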
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline t=&lt;target_genome&gt; g=&lt;genome&gt; a=&lt;annotation&gt; AnnotationFinalizer.p=&lt;prefix&gt;<br />
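<br />
The default ''filter'' of '''GAF''' (parameter ''GAF.f'' in the table above) can be read as a simple predicate on the GFF attributes of a prediction. The following Python sketch only illustrates that logic and is not the GeMoMa implementation. The helper names ''parse_attributes'' and ''passes_default_filter'' are hypothetical, and the attributes are assumed to be given as semicolon-separated key=value pairs as in the GFF attribute column.<br />

```python
import math


def parse_attributes(column):
    """Split a GFF attribute column such as 'ID=t1;start=M;stop=*' into a dict (assumed layout)."""
    attrs = {}
    for field in column.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attrs[key] = value
    return attrs


def passes_default_filter(attrs):
    """Mimic: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    if attrs.get("start") != "M" or attrs.get("stop") != "*":
        return False  # incomplete prediction
    score = float(attrs.get("score", "nan"))
    if math.isnan(score):
        return True  # e.g., external annotation without a score attribute
    return score / float(attrs["aa"]) >= 0.75


if __name__ == "__main__":
    example = parse_attributes("ID=t1;start=M;stop=*;score=300;aa=350")
    print(passes_default_filter(example))
```

In this sketch, a complete prediction without the attribute ''score'' passes the filter, which corresponds to the treatment of external annotations described for the parameter.<br />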
<br />
<br />
= Extract RNA-seq Evidence =<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI ERE m=&lt;mapped_reads_file&gt;<br />
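<br />
The parameters ''minimum mapping quality'' and ''minimum context'' can be pictured as a per-read check. The Python sketch below shows one possible reading of the parameter descriptions and is not the code used by '''ERE''': it assumes that the "minimal M stretch" is the shortest M operation in the CIGAR string of a split read, and the function names are hypothetical.<br />

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def min_match_stretch(cigar):
    """Return the shortest M stretch of a split read, or None if the read is not split."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if not any(op == "N" for _, op in ops):
        return None  # no N operation, so the read does not span an intron
    return min((length for length, op in ops if op == "M"), default=None)


def supports_intron(cigar, mapq, mmq=40, mc=1):
    """Apply the assumed mapping quality (mmq) and context (mc) checks to one read."""
    if mapq < mmq:
        return False  # reads below the minimum mapping quality are ignored
    stretch = min_match_stretch(cigar)
    return stretch is not None and stretch >= mc


if __name__ == "__main__":
    print(supports_intron("50M1000N25M", 60, mc=5))
```

The defaults of the sketch (''mmq''=40, ''mc''=1) are taken from the table above.<br />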
<br />
<br />
= CheckIntrons =<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
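<br />
To illustrate what such a check involves, the following Python sketch counts introns per strand and the dinucleotide pairs at the donor and acceptor site. It is a simplified illustration and not the code of '''CheckIntrons'''. The function names are hypothetical, and it assumes that intron coordinates are 1-based and inclusive and cover exactly the intron.<br />

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def check_introns(genome, introns):
    """Count introns per strand and (donor, acceptor) dinucleotide pairs.

    genome:  dict mapping sequence name -> sequence string
    introns: iterable of (name, start, end, strand), 1-based inclusive (assumed)
    """
    strands = Counter()
    dinucleotides = Counter()
    for name, start, end, strand in introns:
        intron = genome[name][start - 1:end].upper()
        if strand == "-":
            intron = reverse_complement(intron)
        strands[strand] += 1
        # first two bases = donor site, last two bases = acceptor site
        dinucleotides[(intron[:2], intron[-2:])] += 1
    return strands, dinucleotides


if __name__ == "__main__":
    genome = {"chr1": "AAAGTCCCCAGTTT", "chr2": "TTCTGGGGACTT"}
    introns = [("chr1", 4, 11, "+"), ("chr2", 3, 10, "-")]
    print(check_introns(genome, introns))
```

A strong excess of canonical GT-AG pairs on both strands is what one would usually expect from well-mapped, correctly stranded RNA-seq data.<br />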
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;<br />
<br />
<br />
= DenoiseIntrons =<br />
<br />
This module allows to analyze introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;<br />
<br />
<br />
= NCBI Reference Retriever =<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This way it allows to easily collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, mime = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI NRR rl=&lt;reference_list&gt;<br />
<br />
<br />
= Extractor =<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
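<br />
The two-column layout of a user-specific genetic code can be illustrated with a small Python sketch that reads such a file into a codon table. This is not part of GeMoMa. The function name ''parse_genetic_code'' is hypothetical, and the sketch assumes that the triplets in the second column are separated by commas or whitespace, which the description above does not specify.<br />

```python
def parse_genetic_code(lines):
    """Read 'amino acid<TAB>list of triplets' lines into a codon -> amino acid dict."""
    codon_to_aa = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t", 1)
        # assumed separators inside the second column: comma or whitespace
        for codon in triplets.replace(",", " ").split():
            codon_to_aa[codon.upper()] = amino_acid
    return codon_to_aa


if __name__ == "__main__":
    table = parse_genetic_code(["M\tATG\n", "*\tTAA,TAG,TGA\n"])
    print(table["TAG"])
```

Reading the file this way makes it easy to check that each of the 64 triplets is assigned before passing the file to '''Extractor''' via the parameter ''gc''.<br />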
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from the annotation, which might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;<br />
<br />
<br />
= GeneModelMapper =<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' obtaining ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are then used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
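<br />
If you prefer to inspect or pre-sort a search result file yourself, the clustering requirement can be sketched in a few lines of Python. This is only an illustration with hypothetical function names. It assumes a tab-separated file whose first column holds the query ID, and it does not replace the parameter ''sort''.<br />

```python
def sort_by_query(lines):
    """Stable sort of tab-separated search hits by their first column."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


def is_clustered(lines):
    """Return True if every query ID occupies one contiguous block of lines."""
    seen = set()
    previous = None
    for line in lines:
        query = line.split("\t", 1)[0]
        if query != previous:
            if query in seen:
                return False  # this query ID re-appears after another one
            seen.add(query)
            previous = query
    return True


if __name__ == "__main__":
    hits = [
        "geneB_0\tchr1\t90\n",
        "geneA_0\tchr2\t80\n",
        "geneB_1\tchr1\t70\n",
        "geneA_0\tchr1\t60\n",
    ]
    print(is_clustered(hits), is_clustered(sort_by_query(hits)))
```

Because the sort is stable, hits of the same query keep their original relative order.<br />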
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user-given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain the chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag to avoid (additional) premature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;<br />
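<br />
The list file for the parameter <code>selected</code> is a plain tab-delimited table. The following minimal sketch writes such a file with Python. The transcript IDs and coordinates are made up for illustration, the strand notation (+/-) is an assumption, and the helper <code>write_selected</code> is hypothetical and not part of GeMoMa.<br />

```python
import csv
import io

# Hypothetical transcript IDs and regions, purely for illustration.
rows = [
    ("AT1G01010.1",),                              # ID only: no target region
    ("AT1G01020.1", "Chr1", "-", "6700", "9100"),  # ID, chromosome, strand, start, end
]

def write_selected(rows, handle):
    """Write one tab-delimited line per transcript; column 1 is the transcript ID."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Assumption of this sketch: either the ID alone or the ID plus four region columns.
        if len(row) not in (1, 5):
            raise ValueError(f"expected 1 or 5 columns, got {len(row)}: {row}")
        writer.writerow(row)

buffer = io.StringIO()
write_selected(rows, buffer)
selected_text = buffer.getvalue()
print(selected_text, end="")
```

Writing to a real file instead of the in-memory buffer yields a table that can be passed as <code>selected=&lt;list_file&gt;</code>.<br />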
<br />
<br />
= GeMoMa Annotation Filter =<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
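<br />
The first step can be pictured with the following minimal Python sketch. It is not the GAF implementation. It assumes that two predictions are redundant if contig, strand and all CDS parts agree, that <code>evidence</code> counts the distinct input sources, and that <code>sumWeight</code> adds up their weights. The function <code>merge_redundant</code> and the dictionary keys are hypothetical.<br />

```python
from collections import defaultdict

def merge_redundant(predictions):
    """Collapse identical predictions and attach evidence and sumWeight.

    Each prediction is a dict with the keys 'source', 'weight', 'contig',
    'strand' and 'cds' (a list of (start, end) tuples).
    Assumption: predictions are redundant if contig, strand and all CDS parts agree.
    """
    groups = defaultdict(list)
    for p in predictions:
        key = (p["contig"], p["strand"], tuple(sorted(p["cds"])))
        groups[key].append(p)

    merged = []
    for (contig, strand, cds), members in groups.items():
        # Count each input source once, even if it yields the same structure twice.
        weight_by_source = {m["source"]: m["weight"] for m in members}
        merged.append({
            "contig": contig, "strand": strand, "cds": list(cds),
            "evidence": len(weight_by_source),
            "sumWeight": sum(weight_by_source.values()),
        })
    return merged

# Two references support the first gene model, one supports a variant of it.
preds = [
    {"source": "refA", "weight": 1.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"source": "refB", "weight": 2.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"source": "refA", "weight": 1.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (300, 450)]},
]
result = merge_redundant(preds)
```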
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used in sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GAF g=&lt;gene_annotation_file&gt;<br />
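<br />
To get a feeling for which predictions survive the default filter, the expression <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code> can be mirrored outside of GAF. The following Python sketch only illustrates the logic of the default expression. It does not parse arbitrary filter strings, and the functions <code>parse_attributes</code> and <code>passes_default_filter</code> as well as the example attribute strings are hypothetical.<br />

```python
import math

def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (field.split("=", 1) for field in column9.strip().split(";") if "=" in field)
    return {key.strip(): value.strip() for key, value in pairs}

def passes_default_filter(attrs):
    """Mirror of: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    # A missing score is treated as NaN, in the spirit of the default attributes.
    score = float(attrs.get("score", "nan"))
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    if not complete:
        return False
    if math.isnan(score):
        return True  # e.g. external annotation without a score
    return score / float(attrs["aa"]) >= 0.75

examples = [
    "ID=t1;aa=200;score=180;start=M;stop=*",  # complete, relative score 0.9
    "ID=t2;aa=200;score=100;start=M;stop=*",  # complete, relative score 0.5
    "ID=t3;aa=200;start=M;stop=*",            # complete, no score
    "ID=t4;aa=200;score=180;start=L;stop=*",  # does not start with M
]
kept = [parse_attributes(e)["ID"] for e in examples if passes_default_filter(parse_attributes(e))]
```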
<br />
<br />
= AnnotationFinalizer =<br />
<br />
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., yield too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity, as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;<br />
<br />
<br />
= Annotation evidence =<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;<br />
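<br />
The coverage files are expected in bedgraph format, where intervals with coverage 0 can be left out. The following Python sketch shows how such a file can be read so that omitted intervals count as zero. It follows the general bedGraph convention of 0-based, half-open intervals and assumes non-overlapping intervals. It is not GeMoMa's own parser, and the functions <code>load_bedgraph</code> and <code>coverage_at</code> are hypothetical.<br />

```python
import bisect

def load_bedgraph(lines):
    """Return {chrom: sorted list of (start, end, value)} from bedGraph lines."""
    coverage = {}
    for line in lines:
        # Skip empty lines and optional header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        coverage.setdefault(chrom, []).append((int(start), int(end), float(value)))
    for intervals in coverage.values():
        intervals.sort()
    return coverage

def coverage_at(coverage, chrom, pos):
    """Coverage at 0-based position pos; intervals that were left out count as 0."""
    intervals = coverage.get(chrom, [])
    # Index of the last interval whose start is <= pos.
    i = bisect.bisect_right(intervals, (pos, float("inf"), float("inf"))) - 1
    if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
        return intervals[i][2]
    return 0.0

# Positions 10..19 are not listed and therefore have coverage 0.
cov = load_bedgraph(["chr1\t0\t10\t3", "chr1\t20\t30\t5"])
```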
<br />
<br />
= Compare transcripts =<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
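<br />
The F1 measure is the harmonic mean of precision and recall. The following Python sketch computes it for one predicted and one annotated transcript. It assumes a nucleotide-level comparison of the coding exons with inclusive coordinates. The level at which '''CompareTranscripts''' evaluates the overlap is not stated here, so this is an illustration of the measure and not the tool's implementation. The functions <code>covered_positions</code> and <code>f1_measure</code> are hypothetical.<br />

```python
def covered_positions(exons):
    """Expand (start, end) pairs with inclusive coordinates into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions

def f1_measure(predicted_exons, annotated_exons):
    """Nucleotide-level F1 between two transcripts; None if they do not overlap."""
    predicted = covered_positions(predicted_exons)
    annotated = covered_positions(annotated_exons)
    shared = len(predicted & annotated)
    if shared == 0:
        return None  # corresponds to the NA case
    precision = shared / len(predicted)  # fraction of the prediction that is annotated
    recall = shared / len(annotated)     # fraction of the annotation that is predicted
    return 2 * precision * recall / (precision + recall)

identical = f1_measure([(1, 100), (201, 300)], [(1, 100), (201, 300)])
shifted = f1_measure([(1, 100)], [(51, 150)])
disjoint = f1_measure([(1, 100)], [(500, 600)])
```

In this sketch, identical exon structures give 1, a prediction that shares half of its positions with an equally long annotation gives 0.5, and non-overlapping transcripts give no value.<br />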
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts p=&lt;prediction&gt; a=&lt;annotation&gt;<br />
<br />
<br />
= Synteny checker =<br />
<br />
This tool can be used to determine syntenic regions between target organism and reference organism based on similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
= AddAttribute =<br />
<br />
This tool allows to add an additional attribute to specific features of an annotation.<br />
<br />
Those additional attributes might be used in '''GAF''' for filtering or sorting or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.<br />
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1112GeMoMa2020-09-09T12:21:00Z<p>Keilwagen: /* In a nutshell */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
There are several parameters, indicated with '''&lt;'''foo'''&gt;''', that need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you like to specify the maximum intron length please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
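<br />
As a minimal sketch of the multi-reference case described above, the following snippet assembles a GeMoMaPipeline call for two references by repeating <code>s=own</code> together with <code>i</code>, <code>a</code> and <code>g</code>. All file names, reference IDs and the thread count are hypothetical placeholders; the command is only printed (a dry run), not executed.<br />

```shell
# Hypothetical inputs: target.fa, ref1.gff/ref1.fa, ref2.gff/ref2.fa; adjust to your data.
GEMOMA_CMD="java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=4 outdir=results \
GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=target.fa \
s=own i=ref1 a=ref1.gff g=ref1.fa \
s=own i=ref2 a=ref2.gff g=ref2.fa"

# Dry run: print the assembled command so it can be inspected before running it.
echo "$GEMOMA_CMD"
```

Once the printed command looks right, it can be pasted into the terminal to start the run.<br />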
<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of predictions of different reference transcripts overlap or are identical especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
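:Filtering by hand can be sketched with standard command line tools. The example below is an illustration only, not part of GeMoMa: it writes two made-up mRNA lines to a hypothetical file <code>predictions_example.gff</code> and keeps those whose <code>avgCov</code> attribute reaches a chosen minimum. The helper name <code>filter_by_avgcov</code> and the threshold are assumptions.<br />

```shell
# Two made-up GFF lines (tab-delimited) carrying the attributes ID, aa and avgCov.
printf 'chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;aa=120;avgCov=12.5\n' > predictions_example.gff
printf 'chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t-\t.\tID=t2;aa=80;avgCov=0.4\n' >> predictions_example.gff

# usage: filter_by_avgcov <gff file> <minimum avgCov>
# Prints every mRNA line whose avgCov attribute is at least the given minimum.
filter_by_avgcov() {
  awk -F'\t' -v min="$2" '
    $3 == "mRNA" {
      n = split($9, a, ";")                 # split the attribute column into key=value pairs
      for (i = 1; i <= n; i++)
        if (a[i] ~ /^avgCov=/) {
          v = substr(a[i], 8)               # text after "avgCov="
          if (v + 0 >= min + 0) print       # numeric comparison
        }
    }
  ' "$1"
}

filter_by_avgcov predictions_example.gff 5
```

:The same pattern works for other attributes from the table above, e.g. <code>aa</code> or <code>score</code>, by changing the attribute name and the offset passed to <code>substr</code>.<br />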
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron-structure in the target species and does not rely too strictly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically you should set: -Xms the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter "query proteins" (below version 1.7) or "protein alignment" (since version 1.7). Again you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics. <br />
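:A minimal sketch of such a call is shown below. The VM options must be placed before <code>-jar</code>. The memory values are the ones mentioned above; the file names and thread count are hypothetical placeholders, and the command is only printed rather than executed.<br />

```shell
# -Xms sets the initial heap size, -Xmx the maximum heap size of the Java VM.
JAVA_MEM_OPTS="-Xms5G -Xmx50G"

# Hypothetical inputs: target.fa, ref1.gff, ref1.fa; adjust to your data.
OOM_CMD="java $JAVA_MEM_OPTS -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=4 outdir=results t=target.fa a=ref1.gff g=ref1.fa"

# Dry run: print the command for inspection.
echo "$OOM_CMD"
```

:Choose <code>-Xmx</code> so that it stays below the physical memory of the machine.<br />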
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
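:The expected layout can be checked with a few lines of shell before passing a custom table to GeMoMa. The sketch below is not part of GeMoMa: the file name <code>genetic_code_example.txt</code> and the helper <code>check_genetic_code</code> are assumptions, and the three rows are only an excerpt of the standard code, so please use the linked default file as the actual template.<br />

```shell
# Tiny excerpt in the described format: amino acid <TAB> comma-separated triplets.
printf 'M\tATG\nW\tTGG\nF\tTTT,TTC\n' > genetic_code_example.txt

# usage: check_genetic_code <file>
# Counts lines that do not have exactly two columns and triplets that are not
# three bases from A, C, G, T (e.g. triplets written with U as in RNA).
check_genetic_code() {
  awk -F'\t' '
    NF != 2 { bad++; next }
    {
      n = split($2, t, ",")
      for (i = 1; i <= n; i++)
        if (t[i] !~ /^[ACGT][ACGT][ACGT]$/) bad++
    }
    END { if (bad > 0) print "invalid entries: " bad; else print "ok" }
  ' "$1"
}

check_genetic_code genetic_code_example.txt
```

:A table copied from a source that uses the RNA alphabet would be reported as invalid until every U is replaced by T.<br />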
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Up to version 1.6.4, tblastn was the default search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
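:As a sketch, a restart reuses the previous command line and appends <code>restart=true</code>. The original command below is a hypothetical placeholder with made-up file names; both commands are only stored and printed, not executed.<br />

```shell
# Hypothetical command line of the run that broke; adjust to your own call.
ORIGINAL_CMD="java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline threads=4 outdir=results t=target.fa a=ref1.gff g=ref1.fa"

# Same command line as before, with restart=true appended.
RESTART_CMD="$ORIGINAL_CMD restart=true"

# Dry run: print the restart command for inspection.
echo "$RESTART_CMD"
```

:Keeping the original command in a variable or a small script makes it easy to repeat it unchanged for the restart.<br />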
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
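
The flexible GAF filter introduced in this release combines GFF attributes with 'and' and 'or'. As a rough illustration, the following Python sketch re-implements the default filter expression listed for the GAF ''filter'' parameter on this page (start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)). It is not GAF's own parser; the function name and the dictionary layout are assumptions made for this example.

```python
import math

def passes_default_filter(attrs):
    """Hand-written equivalent of the default GAF filter expression:
    start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75).

    attrs is a hypothetical dict of GFF attributes of one prediction.
    """
    # Missing attributes are treated as NaN, mirroring the idea of the
    # "default attributes" parameter (external annotations have no score).
    score = attrs.get("score", float("nan"))
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    if not complete:
        return False
    return math.isnan(score) or score / attrs["aa"] >= 0.75

# A complete prediction with a relative score of 300/350 (about 0.86) is kept.
print(passes_default_filter({"start": "M", "stop": "*", "score": 300.0, "aa": 350}))
# An external annotation without a score is kept as well.
print(passes_default_filter({"start": "M", "stop": "*", "aa": 350}))
```

Writing the expression out by hand like this can help to check which predictions a custom filter would keep before running GAF on a full annotation.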
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)
* Extractor: bugfix (files that are not zipped)
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)
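
The two tie bugfixes refer to a general property of Java's Arrays.binarySearch: if the sorted array contains the key several times, there is no guarantee which of the matching indices is returned. The following Python sketch shows the usual remedy, a lower-bound search that always yields the first match. It illustrates the problem only and is not the code used in GeMoMa; the function name and the example values are hypothetical.

```python
from bisect import bisect_left

def first_match(sorted_values, key):
    """Return the index of the first occurrence of key, or -1 if absent."""
    # bisect_left returns the leftmost insertion point, i.e. the index of
    # the first element that is >= key.
    i = bisect_left(sorted_values, key)
    if i < len(sorted_values) and sorted_values[i] == key:
        return i
    return -1

# Hypothetical sorted intron start positions with duplicates.
starts = [100, 250, 250, 250, 900]
print(first_match(starts, 250))  # 1
print(first_match(starts, 300))  # -1
```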
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
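
The NA convention for single coding exon genes can be illustrated with a small Python sketch. It assumes that tie (transcript intron evidence) is the fraction of a transcript's introns that are supported by at least one split read; the function name and the tuple layout of an intron are hypothetical and chosen only for this example.

```python
def transcript_intron_evidence(introns, supported_introns):
    """Fraction of introns with split-read support, or "NA" without introns.

    introns: list of (chromosome, strand, start, end) tuples of one transcript
    supported_introns: set of such tuples extracted from RNA-seq data
    """
    if not introns:
        # Single coding exon genes have no introns, so tie is undefined.
        return "NA"
    hits = sum(1 for intron in introns if intron in supported_introns)
    return hits / len(introns)

rna_seq = {("chr1", "+", 101, 200)}
print(transcript_intron_evidence([], rna_seq))
print(transcript_intron_evidence(
    [("chr1", "+", 101, 200), ("chr1", "+", 301, 400)], rna_seq))
```

Returning NA instead of a number keeps single coding exon genes distinguishable from multi-exon predictions without any intron support, which would have a value of 0.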
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
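
To make the RNA-seq attributes more concrete, the following Python sketch computes a value in the spirit of tpc (transcript percentage coverage). It assumes that tpc is the fraction of bases in the CDS parts of a prediction that are covered by at least one read, and that covered regions are given as intervals in which coverage 0 is left out. The function name and the 1-based inclusive interval layout are assumptions for this example, not GeMoMa's implementation.

```python
def transcript_percentage_coverage(cds_parts, covered_intervals):
    """Fraction of CDS bases that lie in a covered interval.

    cds_parts, covered_intervals: lists of (start, end), 1-based, inclusive,
    all on the same sequence (hypothetical simplification).
    """
    covered_positions = set()
    for start, end in covered_intervals:
        covered_positions.update(range(start, end + 1))
    total = sum(end - start + 1 for start, end in cds_parts)
    if total == 0:
        return float("nan")
    hit = sum(1 for start, end in cds_parts
              for pos in range(start, end + 1) if pos in covered_positions)
    return hit / total

# Two CDS parts of 10 bases each; the second one is only half covered.
print(transcript_percentage_coverage([(1, 10), (21, 30)], [(1, 10), (21, 25)]))
```

The set of positions keeps the sketch short; for whole genomes an interval-based comparison would be needed to limit memory use.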
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen https://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1111 GeMoMa-Docs 2020-09-09T12:16:46Z<p>Keilwagen: 1.7.1</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br /><br />
If you have any questions, comments or bug reports, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
= GeMoMa pipeline =<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute the work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
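
When the pipeline is run for several reference species, it can be convenient to assemble the call in a script. The following Python sketch builds such an argument list. It assumes that parameters are passed as key=value pairs after the tool name and that the parameters of each reference species directly follow its ''s'' entry; the parameter names (t, s, i, a, g, r, ERE.m) are taken from the parameter table on this page, while the function name and all file names are hypothetical.

```python
def pipeline_command(jar, target, references, mapped_reads=()):
    """Build an argument list for a GeMoMaPipeline call (sketch only)."""
    cmd = ["java", "-jar", jar, "CLI", "GeMoMaPipeline", f"t={target}"]
    for ref in references:
        # Assumption: each reference species starts with s=own, followed by
        # its ID, annotation and genome.
        cmd += ["s=own", f"i={ref['id']}",
                f"a={ref['annotation']}", f"g={ref['genome']}"]
    if mapped_reads:
        # RNA-seq evidence from mapped reads; ERE.m can be used multiple times.
        cmd.append("r=MAPPED")
        cmd += [f"ERE.m={bam}" for bam in mapped_reads]
    return cmd

cmd = pipeline_command(
    "GeMoMa-1.7.1.jar", "target.fa",
    [{"id": "ref1", "annotation": "ref1.gff", "genome": "ref1.fa"}],
    ["rnaseq.bam"])
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run; please compare the generated call with the help output of your GeMoMa version before relying on it.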
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMaPipeline t=&lt;target_genome&gt; g=&lt;genome&gt; a=&lt;annotation&gt; AnnotationFinalizer.p=&lt;prefix&gt;<br />
<br />
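The default value of the ''filter'' parameter listed above, <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code>, can be illustrated with a small stand-alone sketch. It is not GeMoMa code: it mimics the same decision with <code>awk</code> on a made-up tab-separated table whose columns (id, start, stop, score, aa) and file names are assumptions chosen for illustration only.

```shell
# Toy table with columns: id, start, stop, score, aa (invented values, not GeMoMa output).
printf 't1\tM\t*\t300\t350\n' >  predictions.tsv   # complete, 300/350 is about 0.86: kept
printf 't2\tM\t*\t200\t350\n' >> predictions.tsv   # complete, 200/350 is about 0.57: dropped
printf 't3\tV\t*\t340\t350\n' >> predictions.tsv   # does not start with M: dropped
printf 't4\tM\t*\tNaN\t350\n' >> predictions.tsv   # no score (external annotation): kept

# Same logic as the default filter: complete prediction and
# (missing score or relative score of at least 0.75).
awk -F '\t' '$2 == "M" && $3 == "*" && ($4 == "NaN" || $4 / $5 >= 0.75) { print $1 }' \
    predictions.tsv > kept_ids.txt
cat kept_ids.txt
```

In this toy input only <code>t1</code> and <code>t4</code> pass. Stricter filters, such as the RNA-seq based example given in the parameter description, add further conditions joined with 'and' and 'or' in the same way.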
<br />
= Extract RNA-seq Evidence =<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI ERE m=&lt;mapped_reads_file&gt;<br />
<br />
<br />
= CheckIntrons =<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CheckIntrons t=&lt;target_genome&gt; i=&lt;introns&gt;<br />
<br />
<br />
= DenoiseIntrons =<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI DenoiseIntrons i=&lt;introns&gt; coverage_unstranded=&lt;coverage_unstranded&gt;<br />
<br />
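The coverage parameters above state that intervals with coverage 0 can be left out of the bedgraph files. A minimal sketch of such a clean-up step is shown here, assuming the usual four-column bedGraph layout (chromosome, start, end, value). The file names and values are invented for illustration.

```shell
# Toy bedGraph file: chromosome, start, end, coverage (tab-separated, invented values).
printf 'chr1\t0\t100\t0\n'    >  coverage.bedgraph
printf 'chr1\t100\t250\t12\n' >> coverage.bedgraph
printf 'chr1\t250\t300\t0\n'  >> coverage.bedgraph
printf 'chr1\t300\t420\t3\n'  >> coverage.bedgraph

# Keep only intervals whose coverage (column 4) is non-zero.
awk -F '\t' '$4 != 0' coverage.bedgraph > coverage.nonzero.bedgraph
cat coverage.nonzero.bedgraph
```

The reduced file would then be passed as <code>coverage_unstranded</code>. For stranded data the same step would apply to the forward and reverse files separately.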
<br />
= NCBI Reference Retriever =<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This way it allows to easily collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, mime = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI NRR rl=&lt;reference_list&gt;<br />
<br />
<br />
= Extractor =<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins, CDSs, etc. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
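The two-column layout of a user-specific genetic code file can be sketched as follows. This is only an illustration under assumptions: the description does not state how the triplets within the list are separated, so the comma used here is a guess, and the file name is hypothetical. Only three amino acids of the standard code are shown.

```shell
# Fragment of a genetic code table: <amino acid><TAB><list of triplets>.
# ASSUMPTION: triplets are comma-separated; the documentation only says "a list of triplets".
printf 'M\tATG\n'     >  genetic_code.txt
printf 'W\tTGG\n'     >> genetic_code.txt
printf 'F\tTTT,TTC\n' >> genetic_code.txt

# Sanity check: every line must consist of exactly two tab-separated columns.
awk -F '\t' 'NF != 2 { bad++ } END { exit (bad > 0) }' genetic_code.txt && echo "format ok"
```

A complete file would list every amino acid once. Such a file would be handed to '''Extractor''' via the parameter <code>gc</code>.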
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true*, transcripts with a pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI Extractor a=&lt;annotation&gt; g=&lt;genome&gt;<br />
<br />
<br />
= GeneModelMapper =<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g., tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g., '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g., mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
<br />
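Sorting the search results by their first column, as recommended above, can also be done outside of GeMoMa with standard tools. The following sketch uses an invented three-column table. Real tblastn or mmseqs output has more columns, and the file names are hypothetical.

```shell
# Toy tab-separated search results; column 1 holds the query (CDS part) ID.
# IDs and values are invented for illustration.
printf 'geneB_0\tchr1\t95.0\n' >  search_results.tabular
printf 'geneA_0\tchr2\t88.5\n' >> search_results.tabular
printf 'geneB_1\tchr1\t91.2\n' >> search_results.tabular
printf 'geneA_1\tchr2\t79.9\n' >> search_results.tabular

# Sort on the first column only, so that hits of the same reference gene become adjacent.
TAB=$(printf '\t')
sort -t "$TAB" -k1,1 search_results.tabular > search_results.sorted.tabular
cut -f1 search_results.sorted.tabular
```

The sorted file would then be supplied to '''GeMoMa''' as the search results (parameter <code>s</code>).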
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows avoiding (additional) premature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
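The difference between static and dynamic intron length (parameter <code>sil</code>) can be illustrated with a minimal Python sketch. It is not GeMoMa code; the function and argument names are hypothetical, and only the rule stated in the table above is modeled.<br />
<br />
```python
def effective_max_intron_length(user_max, static=True, reference_gene_max=0):
    """Return the maximum intron length allowed for one gene.

    user_max           -- the user-given maximum intron length
    static             -- True: one limit for all genes; False: gene-specific limit
    reference_gene_max -- longest intron of this gene in the reference organism
                          (hypothetical input, only used in dynamic mode)
    """
    if static:
        # Static mode: the user-given value applies identically to all genes.
        return user_max
    # Dynamic mode: gene-specific maximum in the reference plus the user-given value.
    return reference_gene_max + user_max


print(effective_max_intron_length(15000))
print(effective_max_intron_length(15000, static=False, reference_gene_max=4200))
```
<br />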
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GeMoMa s=&lt;search_results&gt; t=&lt;target_genome&gt; c=&lt;cds_parts&gt;<br />
<br />
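A list file for the parameter <code>selected</code> could be written as in the following Python sketch. The transcript IDs, the contig name and the coordinates are made up for illustration; only the column layout (ID, then optionally chromosome, strand, start, end) is taken from the parameter description.<br />
<br />
```python
import csv
import io


def write_selected(rows, handle):
    """Write rows as a tab-delimited list file.

    Each row is either (transcript_id,) or
    (transcript_id, chromosome, strand, start, end).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if len(row) not in (1, 5):
            raise ValueError("expected 1 or 5 columns, got %d" % len(row))
        writer.writerow(row)


# Hypothetical example content.
rows = [
    ("transcriptA",),                          # no target region
    ("transcriptB", "chr1", "+", 1000, 5000),  # prediction should overlap this region
]

buffer = io.StringIO()
write_selected(rows, buffer)
print(buffer.getvalue(), end="")
```
<br />
To create an actual file, an opened file handle can be passed instead of the in-memory buffer.<br />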
<br />
= GeMoMa Annotation Filter =<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
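The first step can be sketched in Python as follows. This is a simplified illustration, not the GAF implementation: it assumes that two predictions are redundant if contig, strand and all CDS coordinates are identical, that <code>evidence</code> counts the distinct input sources supporting a prediction, and that <code>sumWeight</code> adds up the weights of these sources. The dictionary keys are hypothetical.<br />
<br />
```python
from collections import defaultdict


def merge_redundant(predictions):
    """Collapse identical predictions and derive evidence and sumWeight.

    predictions -- list of dicts with the (hypothetical) keys
                   'source', 'weight', 'contig', 'strand', 'cds'
                   where 'cds' is a list of (start, end) tuples
    """
    groups = defaultdict(list)
    for prediction in predictions:
        # Assumption: identical contig, strand and CDS parts mean "redundant".
        key = (prediction["contig"], prediction["strand"], tuple(prediction["cds"]))
        groups[key].append(prediction)

    merged = []
    for (contig, strand, cds), members in groups.items():
        # Count every source only once, even if it reports the structure twice.
        weights = {member["source"]: member["weight"] for member in members}
        merged.append({
            "contig": contig,
            "strand": strand,
            "cds": list(cds),
            "evidence": len(weights),
            "sumWeight": sum(weights.values()),
        })
    return merged


predictions = [
    {"source": "refA", "weight": 1.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"source": "refB", "weight": 2.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"source": "refB", "weight": 2.0, "contig": "chr1", "strand": "+", "cds": [(100, 200), (310, 400)]},
]
for entry in merge_redundant(predictions):
    print(entry["cds"], entry["evidence"], entry["sumWeight"])
```
<br />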
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI GAF g=&lt;gene_annotation_file&gt;<br />
<br />
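To make the semantics of the default filter <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code> concrete, here is a minimal Python sketch. It does not reproduce GAF's filter parser; it only evaluates this one expression on a dictionary of GFF attributes, and treating a missing <code>score</code> as NaN mirrors the role of the <code>d</code> parameter. The attribute values are made up.<br />
<br />
```python
import math


def passes_default_filter(attrs):
    """Evaluate the default GAF filter on a dict of GFF attributes."""
    # A missing score is treated as NaN (cf. the default attributes parameter).
    score = attrs.get("score", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    # 'or' short-circuits, so 'aa' is only needed when a score is present.
    return complete and (math.isnan(score) or score / attrs["aa"] >= 0.75)


examples = [
    {"start": "M", "stop": "*", "score": 300, "aa": 350},  # complete, high relative score
    {"start": "M", "stop": "*", "score": 200, "aa": 350},  # complete, low relative score
    {"start": "M", "stop": "*", "aa": 350},                # external annotation without score
    {"start": "L", "stop": "*", "score": 300, "aa": 350},  # incomplete at the start
]
for attrs in examples:
    print(passes_default_filter(attrs))
```
<br />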
<br />
= AnnotationFinalizer =<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions that are too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationFinalizer g=&lt;genome&gt; a=&lt;annotation&gt; p=&lt;prefix&gt;<br />
<br />
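The effect of the parameter <code>n</code> (name attribute) on a single feature can be sketched in Python as follows. This is an illustration under assumptions, not AnnotationFinalizer code: the old IDs and the generic names are hypothetical, the attributes are assumed to be already parsed into a dictionary, and multiple comma-separated parents are not handled.<br />
<br />
```python
def apply_generic_name(attributes, new_names, name_attribute=True):
    """Return a copy of the GFF attributes with the generic name applied.

    attributes     -- dict of the attributes of one feature, e.g. ID and Parent
    new_names      -- dict mapping old IDs to generic names (hypothetical)
    name_attribute -- True: add a new attribute "Name";
                      False: modify the "ID" and "Parent" values instead
    """
    updated = dict(attributes)
    if name_attribute:
        updated["Name"] = new_names[attributes["ID"]]
    else:
        updated["ID"] = new_names[attributes["ID"]]
        if "Parent" in attributes:
            updated["Parent"] = new_names[attributes["Parent"]]
    return updated


# Hypothetical IDs and generic names.
new_names = {"gene_7": "XY1G00010", "gene_7_R0": "XY1G00010.1"}
transcript = {"ID": "gene_7_R0", "Parent": "gene_7"}

print(apply_generic_name(transcript, new_names, name_attribute=True))
print(apply_generic_name(transcript, new_names, name_attribute=False))
```
<br />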
<br />
= Annotation evidence =<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AnnotationEvidence a=&lt;annotation&gt; g=&lt;genome&gt;<br />
<br />
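The coverage files are expected in bedgraph format, where intervals with coverage 0 can be left out. The following Python sketch shows how such a file can be read and how the covered fraction of a region can be derived from it. It does not reproduce how AnnotationEvidence computes its attributes; the function names and the example data are hypothetical, and the sketch assumes 0-based, half-open, non-overlapping intervals as is usual for bedgraph.<br />
<br />
```python
def read_bedgraph(lines):
    """Parse bedgraph lines into (contig, start, end, value) tuples."""
    intervals = []
    for line in lines:
        # Skip empty lines and optional header or comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        contig, start, end, value = line.split()[:4]
        intervals.append((contig, int(start), int(end), float(value)))
    return intervals


def covered_fraction(intervals, contig, start, end):
    """Fraction of the region [start, end) on contig with coverage > 0."""
    covered = 0
    for c, s, e, value in intervals:
        if c == contig and value > 0:
            # Length of the overlap between the interval and the region.
            covered += max(0, min(e, end) - max(s, start))
    return covered / (end - start)


# Hypothetical coverage: positions 50..79 are missing and therefore count as 0.
lines = [
    "track type=bedGraph",
    "chr1\t0\t50\t3",
    "chr1\t80\t100\t1",
    "chr2\t0\t100\t7",
]
intervals = read_bedgraph(lines)
print(covered_fraction(intervals, "chr1", 0, 100))
```
<br />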
<br />
= Compare transcripts =<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
<br />
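As a reminder, the F1 measure is the harmonic mean of precision and recall. The following Python sketch computes it for two transcripts on the level of nucleotide positions. Comparing nucleotide positions and returning <code>None</code> as an analogue of NA are assumptions made for this illustration; the exact unit used by CompareTranscripts is not stated on this page.<br />
<br />
```python
def positions(exons):
    """Expand (start, end) pairs with inclusive coordinates into a set of positions."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def f1_measure(predicted_exons, annotated_exons):
    """F1 between a predicted and an annotated transcript, or None without overlap."""
    predicted = positions(predicted_exons)
    annotated = positions(annotated_exons)
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return None  # no overlapping annotation
    precision = true_positives / len(predicted)
    recall = true_positives / len(annotated)
    return 2 * precision * recall / (precision + recall)


print(f1_measure([(1, 100)], [(1, 100)]))      # identical transcripts
print(f1_measure([(1, 100)], [(51, 150)]))     # half of each transcript is shared
print(f1_measure([(1, 100)], [(500, 600)]))    # no overlap
```
<br />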
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI CompareTranscripts p=&lt;prediction&gt; a=&lt;annotation&gt;<br />
<br />
<br />
= Synteny checker =<br />
<br />
This tool can be used to determine syntenic regions between the target organism and a reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI SyntenyChecker a=&lt;assignment&gt; g=&lt;gene_annotation_file&gt;<br />
<br />
<br />
= AddAttribute =<br />
<br />
This tool allows to add an additional attribute to specific features of an annotation.<br />
<br />
Those additional attributes might be used in '''GAF''' for filtering or sorting or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.<br />
<br />
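The two attribute types can be sketched in Python as follows. This is a simplified illustration, not the AddAttribute implementation. It assumes 0-based column indices, that a binary attribute is true exactly for the IDs listed in the table, and that features without a table entry stay unchanged for value attributes. The IDs, the attribute name and the values are made up.<br />
<br />
```python
def load_table(lines, id_column, attribute_column=None):
    """Map IDs to attribute values (or to None if no attribute column is used)."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        value = None if attribute_column is None else fields[attribute_column]
        table[fields[id_column]] = value
    return table


def add_attribute(gff_line, feature, attribute, table, binary=False):
    """Append attribute=value to a GFF line if its type matches the feature."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] != feature:
        return gff_line.rstrip("\n")  # comments and other features stay untouched
    attrs = dict(item.split("=", 1) for item in columns[8].split(";") if item)
    feature_id = attrs.get("ID")
    if binary:
        value = "true" if feature_id in table else "false"
    elif feature_id in table:
        value = table[feature_id]
    else:
        return "\t".join(columns)
    columns[8] = columns[8].rstrip(";") + ";" + attribute + "=" + value
    return "\t".join(columns)


# Hypothetical table and annotation line.
table = load_table(["t1\thigh", "t2\tlow"], id_column=0, attribute_column=1)
line = "chr1\tGAF\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1"
print(add_attribute(line, "mRNA", "expression", table))
print(add_attribute(line, "mRNA", "listed", table, binary=True))
```
<br />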
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.1.jar CLI AddAttribute a=&lt;annotation&gt; attribute=&lt;attribute&gt; t=&lt;table&gt; i=&lt;ID_column&gt; ac=&lt;attribute_column&gt;</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1110GeMoMa2020-09-09T11:49:20Z<p>Keilwagen: /* Version history */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
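<br />
The call above can also be assembled programmatically, which helps when several reference species are involved. The following Python sketch is not part of GeMoMa; the function name <code>pipeline_command</code> and all file names are made up for illustration, while the parameter tags are the ones described above.<br />

```python
def pipeline_command(jar, target, references, threads=4, outdir="results"):
    """Return the argument list for one GeMoMaPipeline run.

    references: list of dicts with the keys 'annotation' and 'genome'
    and an optional key 'id' (the reference ID is optional).
    """
    cmd = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        "GeMoMa.Score=ReAlign",       # recompute the score of the search hits
        "AnnotationFinalizer.r=NO",   # do not rename genes and transcripts
        "o=true",                     # keep individual predictions per reference
        f"t={target}",
    ]
    for ref in references:
        cmd.append("s=own")           # repeated once per reference species
        if ref.get("id"):
            cmd.append(f"i={ref['id']}")
        cmd.append(f"a={ref['annotation']}")
        cmd.append(f"g={ref['genome']}")
    return cmd


# Hypothetical file names, for illustration only.
command = pipeline_command(
    "GeMoMa-1.7.1.jar",
    "target.fasta",
    [
        {"id": "refA", "annotation": "refA.gff", "genome": "refA.fasta"},
        {"annotation": "refB.gff", "genome": "refB.fasta"},
    ],
)
print(" ".join(command))
```

The resulting list can be passed to <code>subprocess.run</code> once Java and the search tool are installed.<br />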
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
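<br />
If you would like to filter or rank predictions by hand, the attributes can be read from the ninth GFF column. The following Python sketch is not part of GeMoMa; the function names and the example lines are made up, and it assumes that <code>tie</code> is stored on the same scale as the threshold you pass in. If you changed the parameter "tag", pass that value as <code>feature</code>.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def select_transcripts(gff_lines, min_tie, feature="mRNA"):
    """Yield (ID, attributes) for transcript features with tie >= min_tie.

    Transcripts without a numeric tie (e.g. 'NA' for single coding exon
    genes) are kept, because they have no intron that could be checked.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        try:
            if float(attributes.get("tie", "NA")) < min_tie:
                continue
        except ValueError:
            pass  # 'NA' or missing tie: keep the transcript
        yield attributes.get("ID"), attributes


# Made-up example lines, not real GeMoMa output.
example = [
    "##gff-version 3",
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;aa=120;tie=1;ce=3",
    "chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t-\t.\tID=t2;aa=80;tie=0.5;ce=3",
    "chr1\tGeMoMa\tmRNA\t5000\t5300\t.\t+\t.\tID=t3;aa=100;tie=NA;ce=1",
]
kept = [transcript_id for transcript_id, _ in select_transcripts(example, min_tie=1.0)]
print(kept)
```

For routine filtering and combining of predictions, '''GAF''' remains the intended tool; a script like this is mainly useful for quick inspection.<br />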
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
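:The name check described above can be done with a few lines of Python before running the Extractor. This is a minimal sketch that is not part of GeMoMa; the function names and the example data are made up. It compares the first column of the annotation with the FASTA header up to the first whitespace.<br />

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_names(gff_lines):
    """Collect the sequence names used in the first GFF column."""
    names = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop reading features
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def missing_in_genome(gff_lines, fasta_lines):
    """Return annotation sequence names that do not occur in the genome file."""
    return sorted(gff_names(gff_lines) - fasta_names(fasta_lines))


# Made-up example data.
genome = [">chr1 some description", "ACGT", ">scaffold_2", "ACGT"]
annotation = [
    "##gff-version 3",
    "chr1\t.\tCDS\t1\t3\t.\t+\t0\tParent=a",
    "Chr2\t.\tCDS\t1\t3\t.\t+\t0\tParent=b",
]
print(missing_in_genome(annotation, genome))
```

Any name reported by <code>missing_in_genome</code> points to genes that the Extractor cannot extract from the given genome file.<br />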
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it lacks an initial start codon or a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
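:To illustrate what inferring phases means, the following Python sketch recomputes the GFF phase column for the CDS parts of one transcript. It is not the Extractor's implementation; the function name and the coordinates are made up, and it assumes that the first CDS part starts with a complete codon.<br />

```python
def infer_phases(cds_parts, strand):
    """Return (start, end, phase) tuples in transcription order.

    cds_parts: list of (start, end) pairs, 1-based and inclusive.
    The phase is the number of bases to skip at the beginning of a CDS part
    to reach the first base of the next complete codon.
    """
    # On the minus strand, transcription starts at the highest coordinate.
    ordered = sorted(cds_parts, reverse=(strand == "-"))
    phase = 0  # assumption: the first CDS part starts with a complete codon
    result = []
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last complete codon of this part
        # determine how many bases the next part has to contribute.
        phase = (3 - (length - phase) % 3) % 3
    return result


# Made-up coordinates: parts of length 10, 8 and 6 (24 coding bases in total).
parts = [(1, 10), (21, 28), (41, 46)]
print(infer_phases(parts, "+"))
print(infer_phases(parts, "-"))
```

A GFF in which every CDS entry has phase 0 although the part lengths are not multiples of three, as in this example, is a typical candidate for r=true.<br />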
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but not all statistics. Another cause could be the protein alignment, if you use the optional parameter "query proteins" (before version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics. <br />
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
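:Before passing a custom genetic code to GeMoMa, it can be worth checking that the file follows the format described above. The following Python sketch is not part of GeMoMa; the function name is made up, and the three example lines (including <code>*</code> as the symbol for stop) are an illustrative fragment, not the complete default table.<br />

```python
def parse_genetic_code(lines):
    """Parse a two-column, tab-delimited genetic code into {triplet: amino acid}.

    Raises ValueError for malformed lines, for triplets that are not made of
    A, C, G and T (e.g. RNA triplets containing U) and for duplicated triplets.
    """
    codon_to_aa = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid, triplets = columns
        for triplet in triplets.split(","):
            triplet = triplet.strip()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: triplet {triplet} listed twice")
            codon_to_aa[triplet] = amino_acid.strip()
    return codon_to_aa


# Illustrative fragment only; a real table has to cover all 64 triplets.
fragment = ["M\tATG", "W\tTGG", "*\tTAA,TAG,TGA"]
code = parse_genetic_code(fragment)
print(len(code), code["TGA"])
```

A table copied from a source that uses the RNA alphabet fails this check until every U has been replaced by T.<br />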
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons, until version 1.6.4). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.1] (07.09.2020)<br />
*GeMoMa: <br />
**bugfix if assignment == null<br />
**bugfix remove toUpperCase<br />
*GeMoMaPipeline<br />
**Galaxy integration bugfix for hidden parameter restart<br />
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration<br />
**improved protocol output if threads=1<br />
**add additional test to GeMoMaPipeline<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1109GeMoMa2020-09-03T05:46:22Z<p>Keilwagen: /* In a nutshell */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows RNA-seq evidence to be incorporated for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references, just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file, which allows the combination step ('''GAF''') to be rerun easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
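<br />
If several references are used, the call shown above can also be assembled programmatically. The following Python sketch is only an illustration: the function name <code>build_pipeline_command</code> and the example file names are hypothetical, and only the parameter tags mentioned above are used.<br />

```python
def build_pipeline_command(jar, target, references, threads, outdir):
    """Assemble a GeMoMaPipeline call as an argument list.

    references is a list of (ref_id, annotation, genome) tuples; ref_id may
    be None because the reference ID is optional.
    """
    cmd = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO",
        "o=true",
        f"t={target}",
    ]
    # Repeat s=own together with i, a and g once per reference.
    for ref_id, annotation, genome in references:
        cmd.append("s=own")
        if ref_id is not None:
            cmd.append(f"i={ref_id}")
        cmd.append(f"a={annotation}")
        cmd.append(f"g={genome}")
    return cmd


# Hypothetical file names, for illustration only.
example = build_pipeline_command(
    "GeMoMa-1.7.jar", "target.fa",
    [("ref1", "ref1.gff", "ref1.fa"), (None, "ref2.gff", "ref2.fa")],
    threads=4, outdir="results",
)
print(" ".join(example))
```

The resulting list could be handed to <code>subprocess.run</code>; building it as a list avoids quoting problems with file names that contain spaces.<br />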
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
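<br />
As an illustration of how these attributes can be read from a GFF, here is a minimal Python sketch. The function names and the example lines are invented, and the scale used for tie in the example (0 to 1) is an assumption; the sketch is not part of GeMoMa.<br />

```python
def parse_attributes(column9):
    """Split the ninth GFF column ("key=value;key=value") into a dict."""
    attributes = {}
    for entry in column9.strip().split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def transcripts_with_min_tie(gff_lines, min_tie, feature="mRNA"):
    """Yield IDs of transcript features whose tie is at least min_tie.

    Transcripts with tie=NA (single coding exon genes) or without a tie
    attribute are skipped; that is a choice of this sketch.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != feature:
            continue
        attributes = parse_attributes(fields[8])
        tie = attributes.get("tie", "NA")
        if tie != "NA" and float(tie) >= min_tie:
            yield attributes.get("ID")


# Invented example records.
example_gff = [
    "##gff-version 3",
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;aa=120;score=310;tie=1",
    "chr1\tGeMoMa\tmRNA\t2000\t2600\t.\t+\t.\tID=t2;aa=80;score=150;tie=0.5",
    "chr1\tGeMoMa\tmRNA\t5000\t5300\t.\t-\t.\tID=t3;aa=99;score=200;tie=NA",
]
print(list(transcripts_with_min_tie(example_gff, 1.0)))
```

For files written before version 1.7, or with a modified "tag", pass the corresponding feature name (e.g. <code>feature="prediction"</code>).<br />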
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
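:The name check described above can be sketched in Python as follows. The function names and the example data are hypothetical; the sketch treats the text up to the first whitespace of a fasta comment as the sequence name, which matches the behaviour described for GeMoMa 1.4 and later.<br />

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_sequence_names(gff_lines):
    """Collect the names in the first column of all non-comment GFF lines."""
    names = set()
    for line in gff_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


def names_missing_in_fasta(gff_lines, fasta_lines):
    """Return annotation sequence names that have no fasta record."""
    return sorted(gff_sequence_names(gff_lines) - fasta_names(fasta_lines))


# Invented example data; real input would be read from the two files.
fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
gff = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1",
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2",
]
print(names_missing_in_fasta(gff, fasta))
```

:A non-empty result points to naming differences between the annotation and the genome file.<br />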
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
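:To check whether the phases in an annotation are plausible, the expected phases can be recomputed from the lengths of the CDS parts. The following Python sketch uses the general GFF phase definition; it is not the repair procedure of the Extractor, and the function names are hypothetical.<br />

```python
def expected_phases(cds_lengths, first_phase=0):
    """Compute the GFF phase of each CDS part of one transcript.

    cds_lengths lists the CDS part lengths in transcription order (5' to 3').
    The phase is the number of bases to skip at the start of a CDS part to
    reach the first base of the next complete codon.
    """
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over at the end of this part after full codons.
        remainder = (length - phase) % 3
        # The next part must contribute the missing bases of that codon.
        phase = (3 - remainder) % 3
    return phases


def has_inconsistent_phases(cds_lengths, annotated_phases):
    """Return True if the annotated phases differ from the expected ones."""
    expected = expected_phases(cds_lengths, annotated_phases[0])
    return expected != list(annotated_phases)


print(expected_phases([100, 50, 30]))
print(has_inconsistent_phases([100, 50, 30], [0, 0, 0]))
```

:For transcripts on the reverse strand, the CDS parts have to be ordered by decreasing genomic position before calling these functions.<br />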
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query proteins" (before version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics.<br />
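:The VM options have to be placed directly after <code>java</code> and before <code>-jar</code>. The following Python sketch shows this placement for a command given as an argument list; the helper name <code>with_heap_options</code> is hypothetical, and the remaining GeMoMaPipeline parameters are omitted.<br />

```python
def with_heap_options(java_command, initial_heap="5G", max_heap="50G"):
    """Insert -Xms/-Xmx directly after the "java" executable."""
    if not java_command or java_command[0] != "java":
        raise ValueError("expected a command starting with 'java'")
    options = [f"-Xms{initial_heap}", f"-Xmx{max_heap}"]
    return [java_command[0]] + options + list(java_command[1:])


# Further pipeline parameters would follow after "GeMoMaPipeline".
command = with_heap_options(
    ["java", "-jar", "GeMoMa-1.7.jar", "CLI", "GeMoMaPipeline"]
)
print(" ".join(command))
```

:The heap sizes should be adapted to the memory that is actually available on your machine.<br />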
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
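:A file in this format can be checked before it is passed to GeMoMa. The following Python sketch is a hypothetical validator, not part of GeMoMa; the use of "*" for stop codons in the example is an assumption, so please consult the linked template for the actual representation.<br />

```python
def parse_genetic_code(lines):
    """Parse "amino acid<TAB>triplet,triplet,..." lines into {triplet: amino acid}."""
    codon_table = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid, triplets = fields[0].strip(), fields[1].strip()
        if len(amino_acid) != 1:
            raise ValueError(f"line {number}: expected a one-letter code")
        for triplet in triplets.split(","):
            triplet = triplet.strip()
            # Genomic DNA alphabet: A, C, G and T, but not U.
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_table:
                raise ValueError(f"line {number}: duplicate triplet {triplet}")
            codon_table[triplet] = amino_acid
    return codon_table


# Deliberately incomplete example table.
example = ["M\tATG", "W\tTGG", "*\tTAA,TAG,TGA"]
table = parse_genetic_code(example)
print(len(table), table["TGA"])
```

:A complete genetic code should yield 64 triplets, which can be verified with <code>len(table) == 64</code>.<br />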
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn was the default search algorithm until version 1.6.4. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline, you can<br />
: (A) use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or<br />
: (B) use s=pre-extracted, provide a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast.<br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.<br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.<br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
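<br />
To illustrate the RNA-seq attributes introduced in 1.3, the following minimal sketch computes a value in the spirit of the transcript intron evidence (tie). It assumes that tie is the fraction of introns of a predicted transcript that are supported by at least a minimum number of split reads, and it returns NaN for transcripts without introns, mirroring the NA reported for single coding exon genes. The class name, method name, and input representation are hypothetical and do not reflect GeMoMa's internal implementation.<br />

```java
// Hypothetical helper for illustration only; not part of GeMoMa.
final class IntronEvidence {

    /**
     * Fraction of introns supported by at least minSplitReads split reads.
     * Returns NaN if the transcript has no introns (single coding exon).
     */
    static double tie(int[] splitReadsPerIntron, int minSplitReads) {
        if (splitReadsPerIntron.length == 0) {
            return Double.NaN; // no introns, so the fraction is undefined
        }
        int supported = 0;
        for (int reads : splitReadsPerIntron) {
            if (reads >= minSplitReads) {
                supported++;
            }
        }
        return (double) supported / splitReadsPerIntron.length;
    }

    public static void main(String[] args) {
        // Four introns, one of them without any supporting split read.
        int[] splitReads = {5, 0, 2, 1};
        System.out.println("tie = " + tie(splitReads, 1));
    }
}
```

A filter such as <code>tie==1</code> would then keep only predictions in which every intron is supported by RNA-seq data.<br />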
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1108GeMoMa-Docs2020-08-27T06:59:45Z<p>Keilwagen: </p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.</br><br />
If you have any questions, comments or bugs, please check the [[GeMoMa#Frequently asked questions|FAQs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
= GeMoMa pipeline =<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain the chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline t=<target_genome> AnnotationFinalizer.p=<prefix><br />
<br />
<br />
= Extract RNA-seq Evidence =<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE m=<mapped_reads_file><br />
<br />
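Because ''m'' can be used multiple times, several BAM/SAM files can be passed in one call. The following is a minimal sketch that only assembles and prints such a command line; the function name <code>ere_command</code> and the file names are hypothetical, and only the parameters ''s'' and ''m'' from the table above are used.<br />
<br />
```shell
#!/bin/sh
# Sketch: assemble an ERE call for several mapped-read files.
# ere_command and the file names are placeholders, not part of GeMoMa.
JAR=GeMoMa-1.7.jar

ere_command() {
    stranded=$1
    shift
    cmd="java -jar $JAR CLI ERE s=$stranded"
    # The parameter m is repeated once per BAM/SAM file.
    for reads in "$@"; do
        cmd="$cmd m=$reads"
    done
    printf '%s\n' "$cmd"
}

# Print the command line for inspection instead of running it.
ere_command FR_FIRST_STRAND sample1.bam sample2.bam
```
<br />
Printing the command first makes it easy to check that the strandedness and every input file are set as intended before starting a long run.<br />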
<br />
= CheckIntrons =<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons t=<target_genome><br />
<br />
<br />
= DenoiseIntrons =<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
<br />
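The coverage parameter ''c'' selects which of the nested file parameters are expected. The following sketch assembles a call for either selection and prints it; the function name <code>denoise_command</code> and the file names are hypothetical, and passing <code>c=STRANDED</code> together with ''coverage_forward'' and ''coverage_reverse'' is an assumption derived from the table above that should be checked against the tool's help output.<br />
<br />
```shell
#!/bin/sh
# Sketch: assemble a DenoiseIntrons call for unstranded or stranded coverage.
# denoise_command and all file names are placeholders.
JAR=GeMoMa-1.7.jar

denoise_command() {
    introns=$1
    mode=$2
    shift 2
    case $mode in
        UNSTRANDED)
            cov="c=UNSTRANDED coverage_unstranded=$1" ;;
        STRANDED)
            # Assumption: the selection is passed as c=STRANDED plus both files.
            cov="c=STRANDED coverage_forward=$1 coverage_reverse=$2" ;;
        *)
            echo "unknown coverage mode: $mode" >&2
            return 1 ;;
    esac
    printf '%s\n' "java -jar $JAR CLI DenoiseIntrons i=$introns $cov"
}

# Print the command line for inspection instead of running it.
denoise_command introns.gff STRANDED coverage_forward.bedgraph coverage_reverse.bedgraph
```
<br />
The coverage selection should match the ''Stranded'' setting that was used in '''ERE''' for the same reads.<br />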
<br />
= NCBI Reference Retriever =<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, mime = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR rl=<reference_list><br />
<br />
<br />
= Extractor =<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
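<br />
The two-column layout of a user-specific genetic code can be illustrated with a short sketch that writes the first rows of such a file. The file name <code>genetic_code.tabular</code> and the comma as delimiter within the triplet list are assumptions; only three amino acids are shown, so the file is an illustration of the layout and not a complete genetic code.<br />
<br />
```shell
#!/bin/sh
# Sketch: write the first rows of a tab-delimited genetic code table.
# Column 1: amino acid in one-letter code; column 2: list of triplets.
# Assumptions: the file name and the comma as triplet delimiter.
code_file=genetic_code.tabular

# printf reuses the format for each pair of arguments, giving one row per amino acid.
printf '%s\t%s\n' \
    A 'GCT,GCC,GCA,GCG' \
    C 'TGT,TGC' \
    D 'GAT,GAC' > "$code_file"

# Sanity check: every line must have exactly two tab-separated columns.
awk -F '\t' 'NF != 2 { print "line " NR ": expected 2 columns"; bad = 1 } END { exit bad + 0 }' "$code_file"
```
<br />
Such a file would then be passed to '''Extractor''' with the parameter ''gc''.<br />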
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor a=<annotation> g=<genome><br />
<br />
<br />
= GeneModelMapper =<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences, feed the results into ''query cds parts'', and leave ''assignment'' unselected.<br />
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
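<br />
The first three steps described above ('''Extractor''', external search, '''GeMoMa''') can be sketched as a dry run that only prints the commands. All file and directory names are placeholders. The names of the '''Extractor''' outputs (<code>cds-parts.fasta</code>, <code>assignment.tabular</code>) and the column list passed to <code>-outfmt</code> are assumptions that should be checked against the GeMoMa manual before use; the e-value of 100.0 mirrors the default of the parameter ''e''.<br />
<br />
```shell
#!/bin/sh
# Dry-run sketch of the chain Extractor -> tblastn -> GeMoMa.
# Nothing is executed; the commands are only printed for review.
# Assumptions: all file names, the Extractor output names, and the -outfmt columns.
JAR=GeMoMa-1.7.jar
REF_ANNOTATION=reference.gff
REF_GENOME=reference.fasta
TARGET_GENOME=target.fasta
OUT=work

print_workflow() {
    cat <<EOF
java -jar $JAR CLI Extractor a=$REF_ANNOTATION g=$REF_GENOME outdir=$OUT
makeblastdb -in $TARGET_GENOME -dbtype nucl -out $OUT/blastdb
tblastn -query $OUT/cds-parts.fasta -db $OUT/blastdb -evalue 100.0 -out $OUT/tblastn.txt -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles"
java -jar $JAR CLI GeMoMa s=$OUT/tblastn.txt t=$TARGET_GENOME c=$OUT/cds-parts.fasta a=$OUT/assignment.tabular outdir=$OUT
EOF
}

print_workflow
```
<br />
Each printed line corresponds to one step of the description; filtering with '''GAF''' and finalizing with '''AnnotationFinalizer''' would follow on the resulting predictions.<br />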
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
<br />
<br />
= GeMoMa Annotation Filter =<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It allows to combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows to integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
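<br />
To make the default filter more concrete, the following sketch re-implements its logic with awk on the GFF attribute column. It is only an illustration and not the code used by '''GAF'''. The function name <code>default_filter</code> and the assumption that the attributes are stored as <code>key=value</code> pairs separated by semicolons in column 9 are ours.<br />
<br />
```shell
# Illustration only: mimic the default GAF filter
#   start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)
# Assumption: attributes are key=value pairs separated by ';' in column 9.
default_filter() {
  awk -F '\t' '
    $3 == "mRNA" {
      start = ""; stop = ""; score = ""; aa = ""
      n = split($9, kv, ";")
      for (i = 1; i <= n; i++) {
        split(kv[i], p, "=")
        if (p[1] == "start") start = p[2]
        else if (p[1] == "stop") stop = p[2]
        else if (p[1] == "score") score = p[2]
        else if (p[1] == "aa") aa = p[2]
      }
      # a missing score is treated like NaN, as for external annotations
      no_score = (score == "" || score == "NaN")
      if (start == "M" && stop == "*" && (no_score || (aa > 0 && score / aa >= 0.75))) print
    }'
}

# usage (file names are placeholders):
# default_filter < predictions.gff > kept.gff
```
<br />
Such a pre-check can help to estimate how many predictions a modified filter would keep before running '''GAF''' itself.<br />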
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF g=<gene_annotation_file><br />
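<br />
Since the parameters <code>p</code>, <code>w</code>, <code>g</code> and <code>a</code> can be used multiple times, predictions from several references can be combined in one call. The following dry-run sketch only prints such a call instead of executing it. The prefixes, the file names and the order in which the repeated parameters are given are assumptions made for illustration.<br />
<br />
```shell
# Dry run: print a GAF call for several prediction files instead of executing it.
# Prefixes (ref1_, ref2_, ...) and file names are hypothetical placeholders.
gaf_cmd() {
  cmd="java -jar GeMoMa-1.7.jar CLI GAF"
  n=1
  for gff in "$@"; do
    cmd="$cmd p=ref${n}_ g=$gff"
    n=$((n + 1))
  done
  printf '%s\n' "$cmd"
}

gaf_cmd reference1/predictions.gff reference2/predictions.gff
```
<br />
If a filter is passed via <code>f</code>, it needs to be quoted on the shell, because it contains spaces, quotes and parentheses.<br />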
<br />
<br />
= AnnotationFinalizer =<br />
<br />
This tool finalizes an annotation. It allows to predict UTRs for annotated coding sequences and to generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
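<br />
If UTRs should be predicted, the selection <code>u=YES</code> requires introns and, optionally, coverage files. The following dry-run sketch prints a call that combines the parameters from the table above for stranded RNA-seq data. All file names, the prefix and the order of the parameters are assumptions made for illustration.<br />
<br />
```shell
# Dry run: print an AnnotationFinalizer call with UTR prediction enabled.
# All file names and the prefix are hypothetical placeholders.
finalizer_cmd() {
  genome=$1; annotation=$2; introns=$3; fwd=$4; rev=$5; prefix=$6
  printf '%s\n' "java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer g=$genome a=$annotation u=YES i=$introns c=STRANDED coverage_forward=$fwd coverage_reverse=$rev rename=COMPOSED p=$prefix"
}

finalizer_cmd genome.fa annotation.gff introns.gff forward.bedgraph reverse.bedgraph ABC
```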
<br />
<br />
= Annotation evidence =<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
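<br />
If several external annotations (e.g., from an ab-initio predictor and from manual curation) should be prepared for '''GAF''', the tool can be run once per file. The following dry-run sketch prints one call per annotation and uses a separate output directory for each. The file names and the naming scheme of the output directories are assumptions made for illustration.<br />
<br />
```shell
# Dry run: print one AnnotationEvidence call per external annotation file.
# File names and the outdir naming scheme are hypothetical placeholders.
evidence_cmds() {
  genome=$1; introns=$2; shift 2
  for annotation in "$@"; do
    name=$(basename "$annotation" .gff)
    printf '%s\n' "java -jar GeMoMa-1.7.jar CLI AnnotationEvidence a=$annotation g=$genome i=$introns outdir=evidence_$name"
  done
}

evidence_cmds genome.fa introns.gff braker.gff curated.gff
```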
<br />
<br />
= Compare transcripts =<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
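<br />
As a reminder, the F1 measure is the harmonic mean of precision and recall, i.e., 2·TP / (2·TP + FP + FN). The following sketch computes it from counts of true positives, false positives and false negatives. It is a generic illustration: which units the tool counts is not specified here, and printing NA for an empty comparison is our choice to avoid a division by zero.<br />
<br />
```shell
# Generic F1 computation from TP, FP and FN counts (illustration only).
f1() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    d = 2 * tp + fp + fn
    # nothing to compare: report NA instead of dividing by zero
    if (d == 0) print "NA"; else printf "%.3f\n", 2 * tp / d
  }'
}

f1 100 0 0   # perfect agreement, prints 1.000
f1 90 10 10  # prints 0.900
```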
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
<br />
= Synteny checker =<br />
<br />
This tool can be used to determine syntenic regions between target organism and reference organism based on similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker a=<assignment> g=<gene_annotation_file><br />
<br />
<br />
= AddAttribute =<br />
<br />
This tool allows to add an additional attribute to specific features of an annotation.<br />
<br />
Those additional attributes might be used in '''GAF''' for filtering or sorting or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.<br />
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
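<br />
For the selection <code>VALUES</code>, the table needs one column with IDs and one column with the attribute values. The following dry-run sketch writes such a table and prints a matching call. The IDs, the values, the file names and the attribute name are hypothetical, and the column indices <code>i=0</code> and <code>ac=1</code> assume that columns are counted from zero, which the valid range suggests but does not prove.<br />
<br />
```shell
# Write a small tab-delimited table: column 0 = ID, column 1 = attribute value.
# IDs and values are hypothetical placeholders.
write_table() {
  printf 'transcript_1\t0.93\ntranscript_2\t0.41\n' > "$1"
}

# Dry run: print the AddAttribute call instead of executing it.
# Assumption: columns are counted from zero (i=0, ac=1).
add_attribute_cmd() {
  printf '%s\n' "java -jar GeMoMa-1.7.jar CLI AddAttribute a=$1 attribute=$2 t=$3 i=0 ac=1"
}

# usage (file and attribute names are placeholders):
# write_table values.tsv
# add_attribute_cmd annotation.gff myValue values.tsv
```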
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute a=<annotation> attribute=<attribute> t=&lt;table&gt; i=<ID_column> ac=<attribute_column></div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1107GeMoMa2020-08-27T06:47:27Z<p>Keilwagen: /* Frequently asked questions */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
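<br />
Whether the required programs are available can be checked on the shell. The following sketch only tests whether the executables are found on the <code>PATH</code>; it does not check the Java version. The function names are ours, and <code>tblastn</code> and <code>mmseqs</code> are assumed to be the names of the installed executables.<br />
<br />
```shell
# Return success if the given executable is found on the PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Report missing requirements; the Java version itself is not checked here.
check_requirements() {
  missing=0
  have java || { echo "missing: java (v1.8 or later)"; missing=1; }
  have tblastn || have mmseqs || { echo "missing: tblastn or mmseqs"; missing=1; }
  return "$missing"
}

check_requirements || echo "please install the missing software first"
```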
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
There are several parameters that need to be set, indicated with '''&lt;'''foo'''&gt;'''. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
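<br />
For several references, the block <code>s=own i=... a=... g=...</code> is repeated once per reference. The following dry-run sketch prints such a call instead of executing it. The number of threads, the output directory and all file names and reference IDs are hypothetical placeholders.<br />
<br />
```shell
# Dry run: print a GeMoMaPipeline call for any number of references.
# Arguments: threads, outdir, target genome, then (id, annotation, genome) triples.
pipeline_cmd() {
  threads=$1; outdir=$2; target=$3; shift 3
  cmd="java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=$threads outdir=$outdir GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=$target"
  while [ "$#" -ge 3 ]; do
    cmd="$cmd s=own i=$1 a=$2 g=$3"
    shift 3
  done
  printf '%s\n' "$cmd"
}

pipeline_cmd 4 results target.fa ref1 ref1.gff ref1.fa ref2 ref2.gff ref2.fa
```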
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
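<br />
If you want to inspect these attributes outside of GAF, the ninth GFF column can be split into key-value pairs. The following Python sketch is only an illustration and not part of GeMoMa: the function names are hypothetical, and it assumes that attributes are written as <code>key=value</code> pairs separated by semicolons and that missing values appear as <code>NA</code>.<br />

```python
def parse_attributes(column9):
    """Split the ninth GFF column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcripts(gff_lines, feature="mRNA"):
    """Yield (columns, attributes) for every transcript feature."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == feature:
            yield cols, parse_attributes(cols[8])


def as_float(attrs, key):
    """Return a numeric attribute, or None if it is absent or 'NA'."""
    value = attrs.get(key)
    if value is None or value == "NA":
        return None
    return float(value)


def filter_by_min(gff_lines, key, minimum, feature="mRNA"):
    """Return the IDs of transcripts whose attribute `key` is at least `minimum`."""
    kept = []
    for _cols, attrs in transcripts(gff_lines, feature):
        value = as_float(attrs, key)
        if value is not None and value >= minimum:
            kept.append(attrs.get("ID"))
    return kept
```

For files written before version 1.7, or with a changed "tag", pass the corresponding feature name, e.g. <code>feature="prediction"</code>.<br />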
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
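<br />
:The name check described above can be done with a few lines of Python. This is a minimal sketch and not part of GeMoMa: the function names are hypothetical, and the FASTA name is assumed to be the header up to the first whitespace.<br />

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the FASTA header up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_seqids(gff_lines):
    """Collect the entries of the first GFF column (contig/chromosome names)."""
    names = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows, no more features
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def missing_in_genome(gff_lines, fasta_lines):
    """Names used in the annotation that do not occur in the genome file."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))
```

:With plain text files, the functions can be called on open file handles, e.g. <code>missing_in_genome(open("annotation.gff"), open("genome.fa"))</code>, where both file names are placeholders. An empty list means that all names in the annotation were found in the genome file.<br />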
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
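<br />
:To illustrate what inferring phases means, the following Python sketch recomputes the phase column for the CDS parts of one transcript. It follows the GFF3 definition of the phase and assumes a complete CDS that begins with the start codon. It is not the implementation used by the Extractor, and the function name is hypothetical.<br />

```python
def infer_phases(cds_parts, strand):
    """Infer GFF3 phases for the CDS parts of one transcript.

    cds_parts: list of (start, end) tuples, 1-based and inclusive.
    strand: '+' or '-'.
    Returns a dict mapping (start, end) to the phase 0, 1 or 2.
    """
    # walk through the parts in transcription order
    ordered = sorted(cds_parts, reverse=(strand == "-"))
    phases = {}
    phase = 0  # assumption: the first part starts with a complete codon
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # bases left over after the last complete codon decide how many
        # bases of the next part are needed to finish the split codon
        phase = (3 - (length - phase) % 3) % 3
    return phases
```

:For example, for the parts (1,10), (21,25) and (31,40) on the forward strand, the first part leaves one base of an unfinished codon, so the second part gets phase 2, and the third part starts again with phase 0.<br />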
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron-structure in the target species and does not stick too much with RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter "query proteins" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics. <br />
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
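<br />
:Before passing a custom table to GeMoMa, it can help to check the format described above. The following Python sketch is an illustration under that description only (two tab-separated columns, triplets over A, C, G and T); the function names are hypothetical and the check is not part of GeMoMa.<br />

```python
def rna_to_dna(triplet):
    """Convert a triplet from the RNA alphabet (U) to the DNA alphabet (T)."""
    return triplet.upper().replace("U", "T")


def parse_genetic_code(lines):
    """Read 'amino acid<TAB>triplet,triplet,...' lines into {triplet: amino acid}.

    Raises ValueError for malformed lines, triplets that are not made of
    A, C, G, T, and triplets that are assigned more than once.
    """
    codon_to_aa = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid, triplets = fields
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: triplet {triplet!r} assigned twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa
```

:A complete table should assign each of the 64 triplets exactly once, so <code>len(parse_genetic_code(lines)) == 64</code> is a useful final check. Triplets copied from the NCBI page can be converted with <code>rna_to_dna</code> first.<br />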
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn was the default search algorithm in GeMoMa until version 1.6.4 (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
If you have any further questions, comments or bugs, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1106GeMoMa-Docs2020-08-27T06:34:58Z<p>Keilwagen: /* AddAttribute */</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br />
<br />
= GeMoMa pipeline =<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, provided columns 2 to 5 contain chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline t=<target_genome> AnnotationFinalizer.p=<prefix><br />
<br />
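A slightly fuller call can be sketched as follows, using one reference species and mapped RNA-seq reads. All file names, the ID <code>ref1</code>, the prefix <code>gene</code>, and the values for <code>threads</code> and <code>outdir</code> are hypothetical placeholders. That the parameters of the selection "own" follow <code>s=own</code> as plain key=value pairs is an assumption read from the table above. The script only prints the command (dry run).<br />

```shell
#!/bin/sh
# Hypothetical input files; replace them with your own paths.
TARGET_GENOME="target.fa"
REF_ID="ref1"
REF_ANNOTATION="reference.gff"
REF_GENOME="reference.fa"
MAPPED_READS="rnaseq.bam"

# Assemble key=value arguments taken from the parameter table:
# t (target genome), s/i/a/g (reference species), r + ERE.m (RNA-seq evidence).
CMD="java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline \
threads=4 outdir=results \
t=$TARGET_GENOME \
s=own i=$REF_ID a=$REF_ANNOTATION g=$REF_GENOME \
r=MAPPED ERE.m=$MAPPED_READS \
AnnotationFinalizer.p=gene"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

To actually run the pipeline, the printed line can be pasted into a shell once the search tool (mmseqs by default, or tblastn with <code>tblastn=true</code>) is installed.<br />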
<br />
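Filter expressions such as the one described for <code>GAF.f</code> contain spaces, quotes, parentheses and <code>&gt;</code>, so they must reach the tool as a single shell word. The following sketch shows one way to do this with POSIX positional parameters. The filter string is the "more sophisticated" example from the table, while <code>target.fa</code> and the prefix <code>gene</code> are hypothetical.<br />

```shell
#!/bin/sh
# Filter expression copied from the description of GAF.f in the table.
FILTER="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Store the command in the positional parameters so that the whole
# filter stays one argument (no word splitting, globbing or redirection).
set -- java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline \
  t=target.fa AnnotationFinalizer.p=gene "GAF.f=$FILTER"

# Dry run: print one argument per line to check the quoting.
printf '%s\n' "$@"

# To execute the command, run the positional parameters directly:
# "$@"
```

Double quotes are used around the filter because the expression itself contains single quotes.<br />
<br />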
= Extract RNA-seq Evidence =<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE m=<mapped_reads_file><br />
<br />
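Since <code>m</code> can be used multiple times, several BAM files can be passed in one call. The sketch below builds such a call in a loop. The file names <code>sample1.bam</code> and <code>sample2.bam</code> and the output directory <code>ere_out</code> are hypothetical, and repeating <code>m=</code> once per file is an assumption based on the note "can be used multiple times" in the table. The script only prints the command.<br />

```shell
#!/bin/sh
# Fixed part of the call: stranded library, default mapping quality threshold.
set -- java -jar GeMoMa-1.7.jar CLI ERE s=FR_FIRST_STRAND mmq=40 outdir=ere_out

# Append one m=<file> argument per (hypothetical) BAM file.
for bam in sample1.bam sample2.bam; do
  set -- "$@" "m=$bam"
done

# Dry run: print the assembled command on one line.
echo "$*"

# To execute the command, run the positional parameters directly:
# "$@"
```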
<br />
= CheckIntrons =<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons t=<target_genome><br />
<br />
<br />
= DenoiseIntrons =<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression may be artefacts and are removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
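<br />
For strand-specific libraries, the selection ''c'' is switched to STRANDED and the two stranded coverage files are passed instead of the unstranded one. The following is a minimal sketch with hypothetical file names; the values of ''m'' and ''me'' simply repeat the defaults from the table. The command is only printed, not executed.<br />
<br />
```shell
# Dry run: print the command (remove "echo" to execute it).
# All file names are placeholders for your own intron and coverage files.
denoise_cmd=$(echo java -jar GeMoMa-1.7.jar CLI DenoiseIntrons \
    i=introns.gff \
    c=STRANDED \
    coverage_forward=coverage_forward.bedgraph \
    coverage_reverse=coverage_reverse.bedgraph \
    m=15000 me=0.01)
echo "$denoise_cmd"
```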
<br />
<br />
= NCBI Reference Retriever =<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, mime = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR rl=<reference_list><br />
<br />
<br />
= Extractor =<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a FASTA file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and of CDS parts to transcripts. In addition, '''Extractor''' can create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns: the first column contains the amino acid in one-letter code, the second a list of triplets.<br />
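<br />
The two-column layout of a user-specified genetic code can be checked before running '''Extractor'''. The following sketch writes a deliberately incomplete example (only methionine and tryptophan, which have a single triplet each in the standard code) and verifies the layout with awk. The file name is hypothetical, and the separator between several triplets in the second column is not specified here, so it is not shown.<br />
<br />
```shell
# Hypothetical, incomplete example: amino acid (one-letter code) <TAB> triplet list.
printf 'M\tATG\nW\tTGG\n' > genetic_code.tabular

# Check: every line has exactly two tab-separated columns
# and the first column is a single character.
if awk -F '\t' 'NF != 2 || length($1) != 1 { bad = 1 } END { exit bad + 0 }' genetic_code.tabular
then
    gc_check="format ok"
else
    gc_check="format broken"
fi
echo "$gc_check"
```
<br />
A complete file of this shape would then be passed to '''Extractor''' via the parameter ''gc''.<br />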
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from the annotation, which might be useful for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor a=<annotation> g=<genome><br />
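<br />
To restrict the extraction to a few transcripts and to additionally write proteins and CDSs, the parameters ''s'', ''p'', and ''c'' from the table can be combined. A minimal sketch, assuming hypothetical transcript IDs and file names. The command is only printed; remove ''echo'' to execute it.<br />
<br />
```shell
# Hypothetical transcript IDs; they must match the IDs in the annotation.
# Only the first column of this list is read.
printf 'transcript_1\ntranscript_2\n' > selected.txt

# Dry run: print the command instead of executing it.
extractor_cmd=$(echo java -jar GeMoMa-1.7.jar CLI Extractor \
    a=reference.gff g=reference.fasta \
    s=selected.txt p=true c=true \
    outdir=extractor_out)
echo "$extractor_cmd"
```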
<br />
<br />
= GeneModelMapper =<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g., tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g., '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g., mmseqs), you should set the parameter ''sort''.<br />
If you would like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
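<br />
Sorting by the first column can also be done with standard tools before calling '''GeMoMa'''. The following is a minimal sketch on a made-up four-line table; the IDs and the second column are placeholders, not real search output.<br />
<br />
```shell
# Made-up stand-in for tabular search results (query ID in column 1).
printf 'geneB_1\tchr2\ngeneA_0\tchr1\ngeneB_0\tchr2\ngeneA_1\tchr1\n' > search.tabular

# Sort on the first tab-separated column only; LC_ALL=C gives a byte-wise,
# locale-independent order so that identical IDs end up adjacent.
TAB=$(printf '\t')
LC_ALL=C sort -t "$TAB" -k1,1 search.tabular > search.sorted.tabular

# Show the resulting order of the first column on one line.
sorted_ids=$(cut -f1 search.sorted.tabular | tr '\n' ' ')
echo "$sorted_ids"
```
<br />
Alternatively, unsorted results can be passed to '''GeMoMa''' directly with ''sort=true''.<br />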
<br />
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
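<br />
The RNA-seq steps described above can be chained as sketched below. All file names, including the names of the intermediate intron and coverage files, are assumptions for illustration; check the output directory of each module for the actual names. The commands are only printed, not executed.<br />
<br />
```shell
JAR=GeMoMa-1.7.jar

# Dry-run helper: prints the command instead of running it.
# To execute, replace the body with:  java -jar "$JAR" CLI "$@"
gemoma() { echo java -jar "$JAR" CLI "$@"; }

# 1) Extract introns and coverage from the mapped reads.
step1=$(gemoma ERE m=mapped_reads.bam outdir=ere)

# 2) Remove spurious introns (input file names are assumed).
step2=$(gemoma DenoiseIntrons i=ere/introns.gff \
    coverage_unstranded=ere/coverage.bedgraph outdir=denoise)

# 3) Use the denoised introns and the coverage in GeMoMa (file names are assumed).
step3=$(gemoma GeMoMa s=search_results.tabular t=target.fasta \
    c=cds-parts.fasta a=assignment.tabular \
    i=denoise/denoised_introns.gff \
    coverage=UNSTRANDED coverage_unstranded=ere/coverage.bedgraph)

printf '%s\n' "$step1" "$step2" "$step3"
```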
<br />
If you would like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag that switches between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user-given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag that makes GeMoMa sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A selection that determines whether the search results are trusted as they are, re-scored or re-aligned, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
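<br />
A variant for unsorted search results (for example from mmseqs) that also lowers the ''contig threshold'' so that more candidate contigs are evaluated is sketched below. The value ''ct=0.2'' is an arbitrary illustration, not a recommendation, ''p=10'' repeats the default, and the file names are hypothetical. The command is only printed; remove ''echo'' to execute it.<br />
<br />
```shell
# Dry run: print the command instead of executing it.
# sort=true lets GeMoMa sort the search results itself;
# ct below the default of 0.4 evaluates more contigs at the cost of runtime.
gemoma_cmd=$(echo java -jar GeMoMa-1.7.jar CLI GeMoMa \
    s=mmseqs_results.tabular t=target.fasta \
    c=cds-parts.fasta a=assignment.tabular \
    sort=true p=10 ct=0.2 \
    outdir=gemoma_out)
echo "$gemoma_cmd"
```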
<br />
<br />
= GeMoMa Annotation Filter =<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF g=<gene_annotation_file><br />
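<br />
The default filter described above can be illustrated with a small Python sketch. It is not GeMoMa's implementation; it only mirrors the documented expression <code>start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)</code> on hypothetical attribute values, assuming that a missing score is treated as NaN.<br />

```python
import math

def passes_default_filter(attrs):
    """Evaluate the documented default GAF filter on a dict of GFF attributes.

    Assumption: a missing 'score' attribute is treated as NaN, as suggested by
    the 'default attributes' parameter.
    """
    score = attrs.get("score", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    return complete and (math.isnan(score) or score / attrs["aa"] >= 0.75)

# Hypothetical predictions, not taken from any real annotation.
predictions = [
    {"id": "p1", "start": "M", "stop": "*", "score": 400.0, "aa": 500},  # score/aa = 0.80
    {"id": "p2", "start": "M", "stop": "*", "score": 300.0, "aa": 500},  # score/aa = 0.60
    {"id": "p3", "start": "M", "stop": "*", "aa": 500},                  # external, no score
    {"id": "p4", "start": "L", "stop": "*", "score": 480.0, "aa": 500},  # no start codon
]

kept = [p["id"] for p in predictions if passes_default_filter(p)]
print(kept)
```

In this sketch, p2 is rejected because of its low relative score and p4 because it does not start with a methionine, whereas p3 is kept although it has no score.<br />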
<br />
<br />
= AnnotationFinalizer =<br />
<br />
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
= Annotation evidence =<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, ... . Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
<br />
= Compare transcripts =<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
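<br />
As a worked illustration of the F1 measure, the following Python sketch compares two transcripts. It assumes that agreement is measured on the sets of coding nucleotide positions; the text above does not state the unit of comparison, so this is an assumption rather than a description of CompareTranscripts. The coordinates are hypothetical.<br />

```python
def positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end + 1)}

def f1_measure(predicted, annotated):
    """F1 of two position sets; None mirrors 'NA' when there is no overlap."""
    shared = len(predicted & annotated)
    if shared == 0:
        return None
    precision = shared / len(predicted)  # fraction of the prediction that is annotated
    recall = shared / len(annotated)     # fraction of the annotation that is predicted
    return 2 * precision * recall / (precision + recall)

annotated = positions([(100, 199), (300, 399)])  # 200 nt
predicted = positions([(100, 199), (300, 349)])  # 150 nt, second exon too short

f1 = f1_measure(predicted, annotated)
print(round(f1, 3))
```

Here the precision is 1.0 and the recall is 0.75, so the F1 measure is 2 * 0.75 / 1.75, i.e., about 0.857.<br />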
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
<br />
= Synteny checker =<br />
<br />
This tool can be used to determine syntenic regions between the target organism and the reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker a=<assignment> g=<gene_annotation_file><br />
<br />
<br />
= AddAttribute =<br />
<br />
This tool adds an additional attribute to specific features of an annotation.<br />
<br />
These additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.<br />
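<br />
To make the role of the ID column and the attribute column concrete, here is a minimal Python sketch of how such a tab-delimited table could be turned into a lookup. It is not the AddAttribute implementation. Two points are assumptions: column indices are taken as 0-based (the valid range of both parameters starts at 0), and in BINARY mode an ID listed in the table is taken to mean "true" and any other ID "false". The IDs and values are hypothetical.<br />

```python
import csv
import io

def load_values(handle, id_column, attribute_column):
    """VALUES mode sketch: map each ID to the value in the attribute column."""
    return {row[id_column]: row[attribute_column]
            for row in csv.reader(handle, delimiter="\t") if row}

def load_binary(handle, id_column):
    """BINARY mode sketch: collect the IDs that are listed in the table."""
    return {row[id_column] for row in csv.reader(handle, delimiter="\t") if row}

# Hypothetical table: ID in column 0, value in column 2.
table = "t1\tfoo\t0.93\nt2\tbar\t0.41\n"

values = load_values(io.StringIO(table), id_column=0, attribute_column=2)
listed = load_binary(io.StringIO(table), id_column=0)

print(values["t1"])                           # value that would be attached to t1
print("true" if "t3" in listed else "false")  # t3 is not listed in the table
```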
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute a=<annotation> attribute=<attribute> t=&lt;table&gt; i=<ID_column> ac=<attribute_column></div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1105GeMoMa2020-08-24T13:49:07Z<p>Keilwagen: /* Frequently asked questions */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
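<br />
The repetition of <code>s=own</code>, <code>i</code>, <code>a</code> and <code>g</code> for several references can be sketched as a small Python helper that only assembles the argument list. It uses only the parameters named above; the file names and reference IDs are hypothetical, and the helper itself (<code>pipeline_args</code>) is not part of GeMoMa. It does not start Java; the resulting list could be passed to <code>subprocess.run</code>.<br />

```python
def pipeline_args(jar, target, references, threads, outdir):
    """Assemble a GeMoMaPipeline command line for one or more references.

    references: list of (reference_id, annotation_file, genome_file) tuples.
    """
    args = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}", f"outdir={outdir}",
        "GeMoMa.Score=ReAlign", "AnnotationFinalizer.r=NO", "o=true",
        f"t={target}",
    ]
    for ref_id, annotation, genome in references:
        # One block of s=own, i, a, g per reference organism.
        args += ["s=own", f"i={ref_id}", f"a={annotation}", f"g={genome}"]
    return args

# Hypothetical file names and IDs.
args = pipeline_args(
    jar="GeMoMa-1.7.jar",
    target="target.fa",
    references=[("ref1", "ref1.gff", "ref1.fa"), ("ref2", "ref2.gff", "ref2.fa")],
    threads=4,
    outdir="results",
)
print(" ".join(args))
```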
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
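<br />
If you want to inspect these attributes by hand, the ninth GFF column can be split into key/value pairs. The following Python sketch assumes the usual GFF3 <code>key=value;key=value</code> layout; the example line and its values are hypothetical and not taken from real GeMoMa output.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

# Hypothetical mRNA line; columns are tab-separated.
line = "chr1\tGeMoMa\tmRNA\t1000\t2500\t.\t+\t.\tID=t1;Parent=g1;aa=500;score=400;tie=1;evidence=2"

columns = line.split("\t")
attrs = parse_attributes(columns[8])

# Relative score as used in the default filter of GAF (score/aa).
relative_score = float(attrs["score"]) / int(attrs["aa"])
print(columns[2], attrs["ID"], relative_score)
```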
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
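<br />
:The name check can be done with a few lines of Python. This is only a sketch: it takes the first whitespace-delimited token of each FASTA comment as the sequence name and the first column of each non-comment GFF line, and it reports names that occur in the annotation but not in the genome. The in-memory example data are hypothetical; for real files, pass open file handles instead.<br />

```python
import io

def fasta_names(handle):
    """Collect sequence names: first token after '>' in each FASTA comment."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names

def gff_names(handle):
    """Collect sequence names from the first column of a GFF file."""
    names = set()
    for line in handle:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names

# Hypothetical data: the annotation uses 'Chr2', the genome uses 'chr2'.
fasta = io.StringIO(">chr1 length=10\nACGTACGTAC\n>chr2\nACGT\n")
gff = io.StringIO("##gff-version 3\nchr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=g1\n"
                  "Chr2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2\n")

missing = gff_names(gff) - fasta_names(fasta)
print(sorted(missing))
```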
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions for different reference transcripts can overlap or be identical, especially in gene families. Typically, you will want to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
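:Ranking by hand can be sketched with a few lines of Python. This is only an illustration, not part of GeMoMa: the function names and the three example lines (IDs, coordinates, values) are made up, and the only assumptions are the standard nine-column GFF layout with <code>key=value</code> attributes separated by semicolons and the default feature name "mRNA".<br />

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def rank_predictions(gff_lines, feature="mRNA", key="avgCov"):
    """Return (ID, value) pairs of transcript predictions, highest value first."""
    ranked = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue  # only look at transcript-level features
        attributes = parse_attributes(columns[8])
        try:
            value = float(attributes[key])
        except (KeyError, ValueError):
            continue  # attribute missing or not numeric (e.g. NA)
        ranked.append((attributes.get("ID", "?"), value))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up example lines, for illustration only
example = [
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;avgCov=12.5;tie=1",
    "chr1\tGeMoMa\tmRNA\t120\t900\t.\t+\t.\tID=t2;avgCov=30.0;tie=0.5",
    "chr1\tGeMoMa\tCDS\t100\t300\t.\t+\t0\tParent=t1",
]
print(rank_predictions(example))
```

:Passing <code>key="tie"</code> would rank by transcript intron evidence instead; if the parameter "tag" was changed, the <code>feature</code> argument has to be adapted accordingly.<br />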
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons why the Extractor removes transcripts. In at least two cases, you can try to retain more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumption that amino acid sequences and intron positions are conserved between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely solely on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the length of the additional/missed intron is divisible by 3, which preserves the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you are providing a large coverage file (extracted from RNA-seq data). If you do not have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query proteins" (before version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions, but fewer statistics.<br />
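:As a sketch of where these options go, the following Python helper assembles such a call as an argument list. The helper name, the jar file name and the parameter values are placeholders chosen for illustration; the only fixed point is that VM options such as -Xms and -Xmx must appear before <code>-jar</code>.<br />

```python
def gemoma_command(jar, module, parameters, xms="5G", xmx="50G"):
    """Build an argument list that starts the JVM with explicit heap settings."""
    # JVM options have to precede -jar, otherwise they are passed to the program
    command = ["java", "-Xms" + xms, "-Xmx" + xmx, "-jar", jar, "CLI", module]
    command += ["{}={}".format(key, value) for key, value in parameters]
    return command


# Placeholder values; replace them with your own files and settings
command = gemoma_command(
    "GeMoMa-1.7.jar",
    "GeMoMaPipeline",
    [("threads", 4), ("t", "target.fa")],
)
print(" ".join(command))
```

:The resulting list can be handed to <code>subprocess.run</code> once the remaining pipeline parameters have been added.<br />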
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
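:A table in this format can be checked before use with a short script. The sketch below is not part of GeMoMa; it assumes only the two-column, tab-delimited layout described above, converts U to T so that tables written in the RNA alphabet can be reused, and reports triplets that are not assigned. The function names and the three toy input lines, including the use of <code>*</code> for stop, are assumptions made for illustration.<br />

```python
from itertools import product


def parse_genetic_code(lines):
    """Read 'amino acid<TAB>comma-separated triplets' lines into a codon -> amino acid dict."""
    code = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            # tables in the RNA alphabet use U; genomic DNA uses T
            codon = triplet.strip().upper().replace("U", "T")
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError("not a DNA triplet: " + triplet)
            if codon in code:
                raise ValueError("triplet listed twice: " + codon)
            code[codon] = amino_acid
    return code


def missing_triplets(code):
    """Return all of the 64 possible triplets that are absent from the table."""
    all_triplets = ("".join(bases) for bases in product("ACGT", repeat=3))
    return sorted(t for t in all_triplets if t not in code)


# Toy input for illustration; a real table lists all 64 triplets
toy_table = ["M\tATG", "W\tUGG", "*\tTAA,TAG,TGA"]
toy_code = parse_genetic_code(toy_table)
print(toy_code["TGG"], len(missing_triplets(toy_code)))
```

:For a complete table, <code>missing_triplets</code> should return an empty list.<br />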
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn was the default search algorithm until version 1.6.4. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects by using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those obtained with the default parameters.<br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline, you can<br />
: (A) use the parameter "selected" to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or<br />
: (B) use s=pre-extracted, provide a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast.<br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.<br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.<br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
For any further questions or comments please check [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page].<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1104GeMoMa2020-07-30T12:34:24Z<p>Keilwagen: /* GFF attributes */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you want to run GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you like to specify the maximum intron length please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
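For several references, the repetition of <code>s=own</code>, <code>i</code>, <code>a</code> and <code>g</code> can be generated rather than typed. The following Python sketch only assembles the parameter list shown above; the function name and all file names and IDs are placeholders, and it is not an official wrapper.<br />

```python
def pipeline_arguments(target, outdir, threads, references):
    """Assemble GeMoMaPipeline parameters, repeating s, i, a, g for every reference."""
    arguments = [
        "threads={}".format(threads),
        "outdir=" + outdir,
        "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO",
        "o=true",
        "t=" + target,
    ]
    for reference_id, annotation, genome in references:
        # one block of s=own, i, a, g per reference organism
        arguments += ["s=own", "i=" + reference_id, "a=" + annotation, "g=" + genome]
    return arguments


# Placeholder IDs and file names, for illustration only
arguments = pipeline_arguments(
    "target.fa",
    "results",
    4,
    [("refA", "refA.gff", "refA.fa"), ("refB", "refB.gff", "refB.fa")],
)
print(" ".join(["java", "-jar", "GeMoMa-1.7.jar", "CLI", "GeMoMaPipeline"] + arguments))
```

The printed line can be copied to the shell or the list can be passed to <code>subprocess.run</code>.<br />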
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || mRNA || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
<br />
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".<br />
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa first makes a preliminary prediction and uses it to decide whether a contig is promising and should be considered for the final predictions. Decreasing ct increases the number of contigs, and hence predictions, that are computed internally. Increasing p allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct therefore does not help.<br />
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
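Ranking by hand could look like the following minimal sketch. It is not GAF and does not reproduce its clustering or filtering; the function names are made up for illustration. It assumes a standard nine-column GFF whose last column holds <code>key=value</code> pairs separated by semicolons, and it ranks transcript features by one numeric attribute such as avgCov. Entries where the attribute is missing or not numeric (e.g. NA) are skipped.

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for entry in column9.strip().split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def rank_predictions(gff_lines, feature="mRNA", attribute="avgCov"):
    """Return (value, ID) pairs of all features of the given type, best value first."""
    ranked = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        try:
            value = float(attributes[attribute])
        except (KeyError, ValueError):
            # attribute not present or not numeric (e.g. NA)
            continue
        ranked.append((value, attributes.get("ID", "")))
    ranked.sort(reverse=True)
    return ranked
```

If the predictions were created with a different value of the parameter "tag" (e.g. "prediction" before version 1.7), pass that value as <code>feature</code>.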
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
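To illustrate what a consistent phase column looks like, here is a minimal sketch based on the GFF3 definition of phase (the number of bases to skip at the start of a CDS part to reach the first complete codon). It is not the Extractor's repair code; the function name is made up, and it assumes a complete CDS whose parts are given in transcription order, starting with phase 0.

```python
def infer_phases(cds_lengths, first_phase=0):
    """Compute the GFF3 phase of each CDS part from the part lengths.

    cds_lengths: lengths of the CDS parts in transcription order
    (for the reverse strand, start with the part that has the highest coordinates).
    """
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # bases left over after the last complete codon of this part
        remainder = (length - phase) % 3
        # bases of the next part needed to complete that codon
        phase = (3 - remainder) % 3
    return phases
```

For example, CDS parts of lengths 10, 5 and 6 yield the phases 0, 2 and 0: the first part ends with one base of an unfinished codon, so the first two bases of the second part complete it.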
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron-structure in the target species and does not rely too strictly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set: -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again you can run GeMoMa without protein alignment, which will return the same predictions, but less statistics. <br />
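The VM options have to be placed before <code>-jar</code>, otherwise Java passes them to GeMoMa as program arguments. The following minimal sketch assembles such a call as an argument list, e.g. for use with Python's <code>subprocess.run</code>. The helper name and the jar file name are assumptions for illustration; the memory values are the examples from the text above and should be adapted to your machine.

```python
def gemoma_command(jar, module, parameters, xms=None, xmx=None):
    """Build the argument list for a GeMoMa CLI call with optional JVM memory settings."""
    command = ["java"]
    # JVM options must precede -jar
    if xms:
        command.append("-Xms" + xms)
    if xmx:
        command.append("-Xmx" + xmx)
    command += ["-jar", jar, "CLI", module]
    # GeMoMa parameters are given as key=value pairs
    command += [f"{key}={value}" for key, value in parameters.items()]
    return command
```

For instance, <code>gemoma_command("GeMoMa-1.7.jar", "GeMoMaPipeline", {"threads": "4"}, xms="5G", xmx="50G")</code> corresponds to <code>java -Xms5G -Xmx50G -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=4</code>; the remaining pipeline parameters have to be added as usual.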
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
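A genetic code table in the format described above can be checked before it is handed to GeMoMa. The following is a minimal sketch, not the parser used by GeMoMa; the function name is made up, and the checks (exactly two tab-separated columns, triplets over A, C, G, T, no triplet assigned twice) are assumptions derived from the description above.

```python
def parse_genetic_code(lines):
    """Read a two-column, tab-delimited genetic code table into a codon -> amino acid dict."""
    codon_to_aa = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid, triplets = columns
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # genomic DNA alphabet: A, C, G, T (no U)
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: triplet {triplet!r} assigned twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa
```

Tables copied from the NCBI page use the RNA alphabet in some views; such a table is rejected by this check until U is replaced by T.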
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false, which is default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
For any further questions or comments please contact jens.keilwagen@julius-kuehn.de <br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=File:GeMoMa-manual.pdf&diff=1103File:GeMoMa-manual.pdf2020-07-30T04:42:18Z<p>Keilwagen: Keilwagen uploaded a new version of File:GeMoMa-manual.pdf</p>
<hr />
<div></div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1102GeMoMa2020-07-30T04:33:15Z<p>Keilwagen: release 1.7</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
For running GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline allowing to use GeMoMa easily. If you like to start GeMoMa for the first time, we recommend to use the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you like to specify the maximum intron length please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
<br />
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].<br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| nps || number of premature stops || GeMoMa || || prediction || the number of premature stop codons in the prediction<br />
|-<br />
| ce || coding exons || GeMoMa || assignment || prediction || the number of coding exons of the prediction<br />
|-<br />
| rce || reference coding exons || GeMoMa || assignment || prediction || the number of coding exons of the reference transcript<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
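
The attributes in the table above are written to the last column of the GFF. As a minimal sketch (not part of GeMoMa), the following Python snippet splits such an attribute column into a dictionary and applies a simple filter. The example line, the thresholds, the 0 to 1 value scale and the helper names <code>parse_attributes</code> and <code>passes</code> are assumptions made for illustration only.

```python
# Attribute names taken from the table above that hold numeric values.
NUMERIC = {
    "rce", "minCov", "avgCov", "tpc", "tae", "tde", "tie", "minSplitReads",
    "iAA", "pAA", "evidence", "sumWeight", "maxTie", "maxEvidence",
}


def parse_attributes(column9):
    """Split the GFF attribute column into a dict.

    Numeric attributes become float, the value 'NA' becomes None,
    everything else stays a string.
    """
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        if key in NUMERIC:
            attributes[key] = None if value == "NA" else float(value)
        else:
            attributes[key] = value
    return attributes


def passes(attributes, min_tie=1.0, min_iaa=0.5):
    """Hypothetical filter: full intron evidence and a minimal identity."""
    tie = attributes.get("tie")
    iaa = attributes.get("iAA")
    # A missing or NA tie (e.g., single coding exon) is not held against the prediction.
    tie_ok = tie is None or tie >= min_tie
    iaa_ok = iaa is not None and iaa >= min_iaa
    return tie_ok and iaa_ok


# Made-up attribute column of a single prediction
example = "ID=gene1_R1;ref-gene=gene1;rce=3;tie=1;tde=1;tae=1;iAA=0.82;pAA=0.9;minSplitReads=NA"
parsed = parse_attributes(example)
print(parsed["tie"], parsed["minSplitReads"], passes(parsed))
```

For routine filtering, the module GAF is the intended tool. A script like this is only meant for quick inspection of individual attributes.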
<br />
== Frequently asked questions ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. Ideally, the fasta comments contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.) In addition, check the statistics reported by the Extractor, which list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
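
The name check described above can be scripted. The following is a minimal sketch, assuming plain-text input lines; the function names and the toy data are made up for illustration and are not part of GeMoMa.

```python
def fasta_names(lines):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gff_names(lines):
    """Collect the names in the first column of a GFF/GTF, skipping comments."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names


def missing_in_genome(gff_lines, fasta_lines):
    """Names used in the annotation that do not occur in the genome file."""
    return sorted(gff_names(gff_lines) - fasta_names(fasta_lines))


# Toy input; with real files, pass open file handles instead of lists
fasta = [">chr1 some description\n", "ACGT\n", ">chr2\n", "GGCC\n"]
gff = [
    "##gff-version 3\n",
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tID=a\n",
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tID=b\n",
]
print(missing_in_genome(gff, fasta))
```

An empty result means that every contig/chromosome named in the annotation is present in the genome file. For gzipped input, <code>gzip.open(path, "rt")</code> from the Python standard library yields the same kind of text lines.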
<br />
; How can I force GeMoMa to make more predictions?<br />
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are computed internally. Increasing p to a very large number without decreasing ct does not help, because p only limits how many of the computed predictions are reported.<br />
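
The interplay of p and ct can be illustrated with a toy model. This is a simplified reading of the description above and not GeMoMa's actual implementation: it assumes that a contig counts as promising if its preliminary score reaches ct times the best preliminary score, and that at most p predictions are reported. The scores and the function name are invented.

```python
def reported_predictions(contig_scores, p, ct):
    """Toy model: keep contigs scoring at least ct * best, report the top p.

    contig_scores must be non-empty and contain positive scores.
    """
    best = max(contig_scores.values())
    promising = {c: s for c, s in contig_scores.items() if s >= ct * best}
    ranked = sorted(promising, key=promising.get, reverse=True)
    return ranked[:p]


# Invented preliminary scores of one reference transcript on four contigs
scores = {"contigA": 100.0, "contigB": 55.0, "contigC": 45.0, "contigD": 10.0}

print(reported_predictions(scores, p=1, ct=0.9))   # strict threshold, one prediction
print(reported_predictions(scores, p=10, ct=0.9))  # larger p alone changes nothing
print(reported_predictions(scores, p=10, ct=0.4))  # lower ct lets more contigs through
```

In this toy model, raising p from 1 to 10 has no effect as long as ct stays at 0.9, because only one contig passes the threshold.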
<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
<br />
; Is it mandatory to use RNA-seq data?<br />
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
<br />
; Is it possible to use multiple reference organisms?<br />
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
:If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
:We call a gene model partial if it lacks an initial start codon or a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
:GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
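
Ranking by hand could look like the following sketch. It groups predictions that overlap on the same contig and strand and keeps the one with the highest value of a chosen attribute. This is not the algorithm used by GAF; the records, IDs and the function name are hypothetical, and each record is assumed to carry a numeric value for the chosen attribute.

```python
def best_per_cluster(predictions, key="avgCov"):
    """Group overlapping predictions per contig and strand, keep the best of each group."""
    ordered = sorted(predictions, key=lambda p: (p["contig"], p["strand"], p["start"]))
    clusters = []
    for pred in ordered:
        current = clusters[-1] if clusters else None
        same_locus = (
            current is not None
            and current["contig"] == pred["contig"]
            and current["strand"] == pred["strand"]
            and pred["start"] <= current["end"]
        )
        if same_locus:
            # Extend the current cluster to cover this prediction as well.
            current["end"] = max(current["end"], pred["end"])
            current["members"].append(pred)
        else:
            clusters.append({"contig": pred["contig"], "strand": pred["strand"],
                             "end": pred["end"], "members": [pred]})
    return [max(c["members"], key=lambda p: p[key]) for c in clusters]


# Invented predictions from four reference transcripts
predictions = [
    {"id": "refA_R1", "contig": "chr1", "strand": "+", "start": 100, "end": 900, "avgCov": 12.5},
    {"id": "refB_R1", "contig": "chr1", "strand": "+", "start": 150, "end": 950, "avgCov": 30.0},
    {"id": "refC_R1", "contig": "chr1", "strand": "-", "start": 200, "end": 800, "avgCov": 5.0},
    {"id": "refD_R1", "contig": "chr1", "strand": "+", "start": 2000, "end": 2600, "avgCov": 8.0},
]
kept = [p["id"] for p in best_per_cluster(predictions)]
print(kept)
```

Any other numeric attribute, e.g. tie or iAA, can be passed as <code>key</code> instead of avgCov.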
<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
:There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
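
To illustrate what inferring phases means, here is a minimal sketch that derives the GFF phase of each CDS part from the part lengths. It assumes a complete CDS that begins with the start codon and part lengths given in transcription order; it is not the repair procedure implemented in the Extractor, and the function name is made up.

```python
def infer_phases(cds_lengths):
    """Phase of each CDS part: bases to skip to reach the next complete codon."""
    phases = []
    consumed = 0  # coding bases in all previous parts
    for length in cds_lengths:
        phases.append((3 - consumed % 3) % 3)
        consumed += length
    return phases


# Three CDS parts of 100, 50 and 30 bp (180 bp in total, i.e. 60 codons)
print(infer_phases([100, 50, 30]))
```

A GFF that lists phase 0 for every CDS part of such a transcript would be inconsistent with these lengths, which is the kind of error the option r=true is meant to address.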
<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
:No, currently not.<br />
<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
:GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
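
The role of the intron length can be checked with a few lines of Python. The sketch below compares the codons read downstream of a segment when the segment is spliced out and when it is read as coding sequence. The sequences and function names are invented for illustration.

```python
def downstream_codons(prefix_length, downstream):
    """Codons read from `downstream` when `prefix_length` coding bases precede it."""
    offset = (3 - prefix_length % 3) % 3  # bases needed to finish a started codon
    body = downstream[offset:]
    return [body[i:i + 3] for i in range(0, len(body) - len(body) % 3, 3)]


def frame_preserved(upstream, segment, downstream):
    """Compare the downstream codons with the segment spliced out and retained."""
    spliced = downstream_codons(len(upstream), downstream)
    retained = downstream_codons(len(upstream) + len(segment), downstream)
    return spliced == retained


upstream, downstream = "ATGGCC", "AAATGA"
print(frame_preserved(upstream, "GTAAGT", downstream))   # segment of 6 bp
print(frame_preserved(upstream, "GTAAGTA", downstream))  # segment of 7 bp
```

For a segment whose length is a multiple of 3, both readings use the same offset, so the downstream codons are always identical. For other lengths the frame shifts, which usually changes the downstream codons.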
<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
:In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used. <br />
<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g. to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but not all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics. <br />
<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
:The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
:Alternative genetic codes are described here using the RNA alphabet:<br />
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
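
As a sketch of the format described above, the following Python snippet converts a compact NCBI-style amino acid string into the two-column table and checks a table for consistency. The snippet makes several assumptions that are not confirmed by this page: stop codons are written as <code>*</code>, all 64 triplets are expected to be listed, and the function names are made up.

```python
from itertools import product

BASES = "TCAG"  # base order of the compact notation on the NCBI page
# Standard code (NCBI translation table 1), one amino acid per triplet in TCAG order
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def to_table(aas):
    """Build the two-column, tab-delimited table using the DNA alphabet."""
    by_aa = {}
    triplets = ("".join(t) for t in product(BASES, repeat=3))
    for codon, aa in zip(triplets, aas):
        by_aa.setdefault(aa, []).append(codon)
    return "\n".join(f"{aa}\t{','.join(codons)}" for aa, codons in by_aa.items())


def parse_table(text):
    """Parse the table and reject non-DNA triplets, duplicates and gaps."""
    codon_to_aa = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        aa, triplets = line.split("\t")
        for codon in triplets.split(","):
            codon = codon.strip()
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"not a DNA triplet: {codon!r}")
            if codon in codon_to_aa:
                raise ValueError(f"triplet listed twice: {codon}")
            codon_to_aa[codon] = aa
    if len(codon_to_aa) != 64:
        raise ValueError(f"expected 64 triplets, found {len(codon_to_aa)}")
    return codon_to_aa


table = to_table(STANDARD_AAS)
code = parse_table(table)
print(table.splitlines()[0])
print(code["ATG"], code["TGA"])
```

For an alternative code, replace <code>STANDARD_AAS</code> by the corresponding amino acid string from the NCBI page. The base order T, C, A, G already yields triplets in the DNA alphabet, so no U occurs in the output.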
<br />
; I like to accelerate GeMoMa. What can I do?<br />
:You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn was the default search algorithm until version 1.6.4. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters. <br />
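
The options discussed above (Java VM memory settings, threads, search algorithm) can be combined in a single call. The following sketch only assembles and prints such a command line; it assumes the key=value syntax and the parameter names t, s, a and g from the GeMoMaPipeline parameter table. All file names, the thread count and the helper name are placeholders, and a real run may need further parameters.

```python
import shlex


def pipeline_command(jar, params, xms="5G", xmx="50G"):
    """Assemble a GeMoMaPipeline call with JVM memory options and key=value parameters."""
    command = ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar, "CLI", "GeMoMaPipeline"]
    command += [f"{key}={value}" for key, value in params]
    return command


# Placeholder values; adapt the paths and numbers to your data and machine
params = [
    ("threads", 8),
    ("tblastn", "false"),
    ("t", "target.fa"),
    ("s", "own"),
    ("a", "reference.gff"),
    ("g", "reference.fa"),
]
command = pipeline_command("GeMoMa-1.7.jar", params)
print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be passed to <code>subprocess.run(command, check=True)</code>.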
<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
:There are at least two ways to do this. If you use GeMoMaPipeline, you can <br />
: (A) use the parameter "selected" to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or<br />
: (B) use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
; Can I determine synteny based on GeMoMa predictions?<br />
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast. <br />
<br />
; How can I add additional attributes to the annotation?<br />
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo. <br />
<br />
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?<br />
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion. <br />
<br />
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?<br />
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart. <br />
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.<br />
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.<br />
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.<br />
<br />
For any further questions or comments please contact jens.keilwagen@julius-kuehn.de <br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.7.] (29.07.2020)<br />
*improved manual including new module and runtime <br />
*check whether input files exist before execution<br />
*partially checking MIME types in CLI before execution<br />
*changed homepage from http to https<br />
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed in genome browsers like IGV or WebApollo<br />
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism<br />
*changed default value of parameter "tag" from "prediction" to "mRNA"<br />
*AnnotationEvidence:<br />
**additional attributes: avgCov, minCov, nps, ce<br />
**changed default value of "annotation output" to true<br />
**bugfix: transcript start and end<br />
*ERE: <br />
**changed default value of coverage to "true" <br />
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts<br />
*Extractor: <br />
**bugfix splitAA if coding exon is very short<br />
**improved verbose mode<br />
**new parameter "upcase IDs"<br />
**new parameter "introns" allowing to extract introns from the reference (only for test cases)<br />
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop<br />
**improved handling of corrupt annotations<br />
*GAF: <br />
**bugfix missing transcripts<br />
**slightly changed the default value of "filter" <br />
*GeMoMa:<br />
**replaced parameter "query proteins" by "protein alignment"<br />
**using splitAA for scoring predictions <br />
**new gff attributes: <br />
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively<br />
*** nps for the number of premature stop codons (if avoid stop is false)<br />
**slightly changed the meaning of the parameter "avoid stop"<br />
*GeMoMaPipeline:<br />
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm<br />
**changed the default value of score to ReAlign<br />
**remove "--dont-split-seq-by-len" from mmseqs createdb<br />
**new optional parameter BLAST_PATH<br />
**new optional parameter MMSEQS_PATH<br />
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction<br />
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa-Docs&diff=1101
GeMoMa-Docs
2020-07-30T04:13:23Z
<p>Keilwagen: initial commit: new documentation since version 1.7</p>
<hr />
<div>This page describes the parameters of all [[GeMoMa]] modules.<br />
<br />
= GeMoMa pipeline =<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines as, for instance, in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fas,fa,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ID</font></td><br />
<td>ID (ID to distinguish the different external annotations of the target organism, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>external annotation (External annotation file (GFF,GTF) of the target organism, which contains gene models from an external source (e.g., ab initio gene prediction) and will be integrated in the module GAF, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">weight</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files in the module GAF; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ae</font></td><br />
<td>annotation evidence (run AnnotationEvidence on this external annotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap; in that case, columns 2 to 5 contain chromosome, strand, start and end of the region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">restart</font></td><br />
<td>restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">b</font></td><br />
<td>BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline t=<target_genome> AnnotationFinalizer.p=<prefix><br />
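
The example above can be extended with RNA-seq evidence using the parameters <code>r</code>, <code>ERE.m</code> and <code>ERE.s</code> from the table. The following is only a sketch: the helper <code>gemoma_pipeline_cmd</code>, the file names and the output directory are hypothetical, <code>ERE.s=FR_FIRST_STRAND</code> assumes a stranded library, and the reference species parameters are omitted just as in the example above. The helper prints the command line instead of running it, so it can be checked first.

```shell
#!/bin/sh
# Hypothetical helper: prints a GeMoMaPipeline command line instead of running it.
# Arguments: target genome (FASTA), mapped reads (BAM/SAM), name prefix, threads.
gemoma_pipeline_cmd() {
    target="$1"; bam="$2"; prefix="$3"; threads="${4:-1}"
    # r=MAPPED selects the ERE.* parameters listed in the table above;
    # FR_FIRST_STRAND is an assumption about the library, not a recommendation.
    printf '%s\n' "java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline t=$target r=MAPPED ERE.m=$bam ERE.s=FR_FIRST_STRAND AnnotationFinalizer.p=$prefix threads=$threads outdir=pipeline_out"
}

# Placeholder file names; replace them with real paths.
gemoma_pipeline_cmd target.fa rnaseq.bam Gm 4
```

Once the printed line looks right, it can be copied to the shell and executed.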
<br />
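
Filter expressions such as the one described for <code>GAF.f</code> contain spaces, quotes, <code>*</code>, <code>&gt;</code> and parentheses, all of which are special to most shells. A minimal sketch of one way to pass such a filter as a single argument is to wrap the whole value in double quotes. The helper <code>show_args</code> is hypothetical and only prints how the shell splits the arguments; <code>target.fa</code> is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: prints every argument on its own line in brackets,
# which makes visible what the Java program would receive.
show_args() {
    for a in "$@"; do
        printf '[%s]\n' "$a"
    done
}

# Filter taken from the description of GAF.f above.
FILTER="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Double quotes keep the complete key=value pair together as one argument.
show_args java -jar GeMoMa-1.7.jar CLI GeMoMaPipeline t=target.fa "GAF.f=$FILTER"
```

Without the quotes, the shell would split the filter at each space and interpret <code>&gt;</code> as an output redirection.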
<br />
= Extract RNA-seq Evidence =<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads, mime = bam,sam)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mc</font></td><br />
<td>minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI ERE m=<mapped_reads_file><br />
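
Since the parameter <code>m</code> can be used multiple times, several BAM/SAM files can be given in one call. The following sketch assumes that <code>m</code> is simply repeated once per file; the helper <code>ere_cmd</code>, the file names and the output directory are hypothetical, and the values for <code>s</code> and <code>mmq</code> are the defaults from the table written out explicitly. The helper prints the command line instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: prints an ERE command line for one or more BAM/SAM files.
ere_cmd() {
    # s and mmq are set to the documented defaults to make them visible.
    cmd="java -jar GeMoMa-1.7.jar CLI ERE s=FR_UNSTRANDED mmq=40"
    # Append one m=<file> per input file.
    for bam in "$@"; do
        cmd="$cmd m=$bam"
    done
    printf '%s\n' "$cmd outdir=ere_out"
}

# Placeholder file names; replace them with real paths.
ere_cmd sample1.bam sample2.bam
```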
<br />
<br />
= CheckIntrons =<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CheckIntrons t=<target_genome><br />
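
The example above does not pass any intron file. A sketch that additionally sets <code>i</code>, <code>v</code> and <code>outdir</code> from the table could look as follows; the helper <code>check_introns_cmd</code>, the file names and the output directory are hypothetical, and the helper only prints the command line.

```shell
#!/bin/sh
# Hypothetical helper: prints a CheckIntrons command line.
# Arguments: target genome (FASTA), introns (GFF).
check_introns_cmd() {
    printf '%s\n' "java -jar GeMoMa-1.7.jar CLI CheckIntrons t=$1 i=$2 v=true outdir=checkintrons_out"
}

# Placeholder file names; replace them with real paths.
check_introns_cmd target.fa introns.gff
```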
<br />
<br />
= DenoiseIntrons =<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
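<br />
For strand-specific RNA-seq libraries, the call might look like the following sketch. It assumes that the stranded option is selected with <code>c=STRANDED</code> and that both coverage files are then passed by the parameter names listed in the table above. All file names are hypothetical, and <code>m</code> and <code>me</code> are simply set to their documented defaults. The snippet only prints the command (dry run).<br />
<br />
```shell
# Hypothetical input files; replace with your own data.
introns=introns.gff
forward=coverage_forward.bedgraph
reverse=coverage_reverse.bedgraph

# Assumption: c=STRANDED selects the stranded coverage option, so that
# coverage_forward and coverage_reverse are expected instead of coverage_unstranded.
cmd="java -jar GeMoMa-1.7.jar CLI DenoiseIntrons i=$introns c=STRANDED coverage_forward=$forward coverage_reverse=$reverse m=15000 me=0.01"

# Dry run: print the command instead of executing it.
echo "$cmd"
```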
<br />
<br />
= NCBI Reference Retriever =<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms, mime = txt)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI NRR rl=<reference_list><br />
<br />
<br />
= Extractor =<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a FASTA file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and of CDS parts to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins or CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA), mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (whether introns should be extracted from annotation, that might be used for test cases, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>upcase IDs (whether the IDs in the GFF should be upcased, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI Extractor a=<annotation> g=<genome><br />
<br />
<br />
= GeneModelMapper =<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you would like to run GeMoMa ignoring intron position conservation, you should blast protein sequences, feed the results in ''query cds parts'', and leave ''assignment'' unselected.<br />
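<br />
As an alternative to the ''sort'' parameter, the search results can be sorted by their first column before they are passed to GeMoMa. The following is a minimal sketch that assumes tab-delimited search results with the query ID in the first column; the file names and the three example rows are made up for illustration, and whether this order is identical to the one produced by ''sort'' is not claimed here.<br />
<br />
```shell
# Made-up, tab-delimited example rows (query ID, contig, identity).
printf 'geneB_0\tchr1\t90.0\ngeneA_1\tchr2\t85.0\ngeneA_0\tchr1\t99.0\n' > search.tabular

# -t sets the tab as field delimiter, -k1,1 restricts the sort key to column 1,
# -s (supported by GNU and BSD sort) keeps the input order of lines with equal keys.
# LC_ALL=C gives a plain byte-wise, locale-independent order.
LC_ALL=C sort -t "$(printf '\t')" -k1,1 -s search.tabular > search.sorted.tabular

# Show the resulting order of the first column.
cut -f1 search.sorted.tabular
```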
<br />
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
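<br />
A GeMoMa call with RNA-seq evidence might then look like the following sketch. It uses only parameter names from the table below (<code>i</code>, <code>coverage</code>, <code>coverage_unstranded</code>); all file names are hypothetical placeholders for the outputs of the earlier steps, and the snippet only prints the command (dry run).<br />
<br />
```shell
# Hypothetical file names standing in for the results of the previous steps.
search=search.tabular
target=target_genome.fasta
cds=cds-parts.fasta
assignment=assignment.tabular
introns=denoised_introns.gff
coverage=coverage.bedgraph

# Assumption: coverage=UNSTRANDED selects the unstranded option, which expects
# the file in coverage_unstranded.
cmd="java -jar GeMoMa-1.7.jar CLI GeMoMa s=$search t=$target c=$cds a=$assignment i=$introns coverage=UNSTRANDED coverage_unstranded=$coverage"

# Dry run: print the command instead of executing it.
echo "$cmd"
```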
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted, mime = fasta,fa,fas,fna)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, mime = tabular,txt, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pa</font></td><br />
<td>protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
<br />
<br />
= GeMoMa Annotation Filter =<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
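<br />
Before passing an annotation info table to GAF, its header can be checked for the required columns. The following is a minimal sketch; the file name, the column name <code>Best-hit-arabi-defline</code> and the data row are made up for illustration, and the check only looks at the first line of the file.<br />
<br />
```shell
# Made-up example table with a header and one data row.
printf 'transcriptName\tGO\tBest-hit-arabi-defline\ntx1\tGO:0005515\texample protein\n' > annotation_info.txt

# Split the tab-delimited header into one column name per line.
header=$(head -n 1 annotation_info.txt | tr '\t' '\n')

# Required columns: transcriptName, GO, and a column whose name ends with "defline".
missing=0
for pattern in '^transcriptName$' '^GO$' 'defline$'; do
  if ! printf '%s\n' "$header" | grep -q -E "$pattern"; then
    echo "missing column matching: $pattern"
    missing=1
  fi
done

if [ "$missing" -eq 0 ]; then
  echo "header ok"
fi
```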
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It allows combining the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows integrating predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations (predicted by GeMoMa), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI GAF g=<gene_annotation_file><br />
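<br />
When a custom filter is passed on the command line, it has to be quoted so that the shell hands it to GAF as one argument. The following sketch uses the more sophisticated filter from the table above and two hypothetical prediction files. Instead of calling java, it prints one argument per line, so the effect of the quoting is visible.<br />
<br />
```shell
# The filter contains spaces, quotes, parentheses and '>' characters, so it is
# wrapped in double quotes to stay a single argument.
filter="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Helper that prints one argument per line (dry run instead of calling java).
show_args() { printf '%s\n' "$@"; }

# Hypothetical prediction files, e.g. from two reference organisms.
args=$(show_args java -jar GeMoMa-1.7.jar CLI GAF g=predictions_ref1.gff g=predictions_ref2.gff "f=$filter")
echo "$args"
```
<br />
The last printed line holds the complete filter expression, which shows that it reaches the program unsplit.<br />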
<br />
<br />
= AnnotationFinalizer =<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF), mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (whether to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (whether to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
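
The conditional parameters from the table can be combined in one call. The following is a minimal sketch, assuming that the parameters of a selected option (here <code>u=YES</code> and <code>rename=SIMPLE</code>) are passed as further <code>key=value</code> pairs in the same way as in the example above. All file names, the prefix <code>ABC</code> and the output directory are hypothetical. The script only assembles and prints the command, so it can be inspected before it is run.

```shell
#!/bin/sh
# Hypothetical input files; replace them with your own data.
genome="target_genome.fa"
annotation="filtered_predictions.gff"
introns="introns.gff"          # introns, e.g. extracted with ERE
coverage="coverage.bedgraph"   # unstranded coverage, e.g. extracted with ERE

# Assemble the call: predict UTRs from unstranded RNA-seq evidence and
# assign simple generic names with the (hypothetical) prefix ABC.
cmd="java -jar GeMoMa-1.7.jar CLI AnnotationFinalizer"
cmd="$cmd g=$genome a=$annotation"
cmd="$cmd u=YES i=$introns c=UNSTRANDED coverage_unstranded=$coverage"
cmd="$cmd rename=SIMPLE p=ABC d=5 outdir=finalized"

# Print the command for inspection; it can then be run with: eval "$cmd"
echo "$cmd"
```

Keeping the command in a variable makes it easy to log the exact call next to the results.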
<br />
<br />
= Annotation evidence =<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used; they are not filtered for internal stop codons, missing start or stop codons, frame-shifts, or similar defects. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF,GTF), mime = gff,gff3,gtf)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, mime = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, mime = gff,gff3, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out., mime = bedgraph)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, mime = tabular, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
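
RNA-seq evidence can be added to this call with the parameters <code>i</code>, <code>r</code> and <code>c</code> from the table. The sketch below assumes stranded coverage and that the nested parameters are given as plain <code>key=value</code> pairs; all file names and the threshold of two split reads are hypothetical. The command is only assembled and printed.

```shell
#!/bin/sh
# Hypothetical input files; replace them with your own data.
annotation="annotation.gff"
genome="genome.fa"
introns="introns.gff"
forward="coverage_forward.bedgraph"
reverse="coverage_reverse.bedgraph"

cmd="java -jar GeMoMa-1.7.jar CLI AnnotationEvidence a=$annotation g=$genome"
# Only use introns supported by at least two split reads.
cmd="$cmd i=$introns r=2"
# Stranded coverage needs one file per strand.
cmd="$cmd c=STRANDED coverage_forward=$forward coverage_reverse=$reverse"
cmd="$cmd outdir=evidence"

# Print the command for inspection; it can then be run with: eval "$cmd"
echo "$cmd"
```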
<br />
<br />
= Compare transcripts =<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
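
For orientation, the F1 measure is the harmonic mean of precision and recall, which can be written as F1 = 2*TP / (2*TP + FP + FN). The sketch below evaluates this formula with awk for made-up counts. Which units the tool counts (for example nucleotides shared by the predicted and the given transcript) is an assumption here and is not specified on this page.

```shell
#!/bin/sh
# Hypothetical counts for one predicted transcript.
tp=900   # units shared by prediction and given annotation
fp=100   # units only in the prediction
fn=200   # units only in the given annotation

# F1 = 2*TP / (2*TP + FP + FN); print NA if nothing overlaps,
# mirroring the NA value described above.
f1=$(awk -v tp="$tp" -v fp="$fp" -v fn="$fn" 'BEGIN {
    if (tp == 0) { print "NA" } else { printf "%.3f", 2*tp / (2*tp + fp + fn) }
}')
echo "F1 = $f1"   # prints "F1 = 0.857"
```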
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
<br />
= Synteny checker =<br />
<br />
This tool can be used to determine syntenic regions between the target organism and the reference organism based on the similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.<br />
<br />
''Synteny checker'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF file containing the gene annotations predicted by GAF, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI SyntenyChecker a=<assignment> g=<gene_annotation_file><br />
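
Since <code>p</code> and <code>a</code> can be used multiple times, several reference organisms can be passed in one call. The following sketch assumes that each prefix is given directly before its assignment file; the species names, file paths and output directory are hypothetical. The command is only assembled and printed.

```shell
#!/bin/sh
cmd="java -jar GeMoMa-1.7.jar CLI SyntenyChecker t=mRNA"

# One prefix and one assignment file per (hypothetical) reference organism.
for ref in speciesA speciesB; do
    cmd="$cmd p=${ref}_ a=${ref}/assignment.tabular"
done

# The gene annotation predicted by GAF and the output directory.
cmd="$cmd g=filtered_predictions.gff outdir=synteny"

# Print the command for inspection; it can then be run with: eval "$cmd"
echo "$cmd"
```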
<br />
<br />
= AddAttribute =<br />
<br />
This tool adds an additional attribute to specific features of an annotation.<br />
<br />
These additional attributes might be used in '''GAF''' for filtering or sorting, or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values taken from a given tab-delimited table.<br />
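
As an illustration of the tab-delimited table, the sketch below writes a small hypothetical table and assembles a matching call. It assumes that the column indices of the parameters <code>i</code> and <code>ac</code> of this tool are zero-based, as suggested by their valid range starting at 0. The file names, transcript IDs and the attribute name <code>expression</code> are made up.

```shell
#!/bin/sh
# Hypothetical table: column 0 holds the feature ID, column 1 the value.
table="expression.tabular"
printf 'transcript_1\thigh\ntranscript_2\tlow\n' > "$table"

# Add the attribute "expression" to all mRNA features listed in the table.
cmd="java -jar GeMoMa-1.7.jar CLI AddAttribute a=annotation.gff f=mRNA"
cmd="$cmd attribute=expression t=$table i=0 type=VALUES ac=1"

# Print the command for inspection; it can then be run with: eval "$cmd"
echo "$cmd"
```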
<br />
''AddAttribute'' may be called with<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (annotation file, mime = gff,gff3)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">attribute</font></td><br />
<td>attribute (the name of the attribute that is added to the annotation)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>table (a tab-delimited file containing IDs and additional attribute, mime = tabular)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">type</font></td><br />
<td>type (type of additional attribute, range={VALUES, BINARY}, default = VALUES)</td><br />
<td style="width:100px;">STRING</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;VALUES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ac</font></td><br />
<td>attribute column (the attribute column in the tab-delimited file, valid range = [0, 2147483647])</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;BINARY&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.7.jar CLI AddAttribute a=<annotation> attribute=<attribute> t=<table> i=<ID_column> ac=<attribute_column>
</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa&diff=1100
GeMoMa
2020-05-20T05:24:32Z
<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
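
Whether the required programs are available can be checked before the first run. The following sketch assumes that the executables are called <code>java</code>, <code>tblastn</code> and <code>mmseqs</code> and are expected on the <code>PATH</code>; the helper function <code>check_tool</code> is hypothetical and not part of GeMoMa.

```shell
#!/bin/sh
# Report whether a program can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

check_tool java
# Only one of the two search programs is needed.
check_tool tblastn
check_tool mmseqs
```

The installed Java version can then be compared against the requirement with <code>java -version</code>.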
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated by '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* the reference ID (optional), annotation and genome. If you have several references, just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: recompute the score from mmseqs, since mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output the individual predictions for each reference as a separate file, which allows rerunning the combination step ('''GAF''') easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
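
The recommendations above can be combined into one call with several references and mapped RNA-seq reads. The following is a minimal sketch, assuming the parameters <code>r=MAPPED</code> and <code>ERE.m</code> as listed in the parameter table of the GeMoMa pipeline; the species names, file paths, thread count and output directory are hypothetical. The command is only assembled and printed.

```shell
#!/bin/sh
# Hypothetical inputs; replace them with your own files.
target="target_genome.fa"
bam="rnaseq_mapped.bam"

cmd="java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=8 outdir=gemoma_out"
cmd="$cmd tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true"
cmd="$cmd t=$target"

# Repeat the tags i, a and g once per reference species.
for ref in speciesA speciesB; do
    cmd="$cmd i=$ref a=$ref/annotation.gff g=$ref/genome.fa"
done

# RNA-seq evidence from reads mapped against the target genome.
cmd="$cmd r=MAPPED ERE.m=$bam"

# Print the command for inspection; it can then be run with: eval "$cmd"
echo "$cmd"
```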
<br />
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute the work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
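<br />
The default value of ''GAF.f'' listed above, <code>start=='M' and stop=='*' and score/aa>=0.75</code>, can be read as a predicate over the GFF attributes of one prediction. The following Python sketch illustrates that reading only; it is not GeMoMa's implementation. The function names are hypothetical, the attribute column is assumed to use the common <code>key=value;key=value</code> layout, and missing attributes are treated as NaN in the spirit of ''GAF.d''.<br />
<br />
```python
def parse_attributes(column9):
    """Parse a GFF attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_default_filter(attributes):
    """Illustrative reading of: start=='M' and stop=='*' and score/aa>=0.75."""
    # Missing numeric attributes become NaN, so every comparison with them is False.
    score = float(attributes.get("score", "nan"))
    aa = float(attributes.get("aa", "nan"))
    complete = attributes.get("start") == "M" and attributes.get("stop") == "*"
    return complete and aa > 0 and score / aa >= 0.75
```
<br />
A prediction with <code>score=300;aa=350</code> and complete start and stop passes (relative score about 0.86), whereas the same prediction with <code>score=200</code> does not (about 0.57).<br />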
<br />
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file><br />
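<br />
To make the parameters ''mmq'' and ''u'' more concrete, the following Python sketch shows how a single SAM record could be filtered by mapping quality and by the secondary-alignment flag, and how intron coordinates follow from <code>N</code> operations in the CIGAR string. It relies only on the SAM format (FLAG in column 2, MAPQ in column 5, flag bit 0x100 for secondary alignments). The function names are hypothetical, and this is not necessarily how '''ERE''' processes reads internally.<br />
<br />
```python
import re

SECONDARY_FLAG = 0x100            # SAM flag bit: secondary alignment
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def keep_alignment(sam_line, min_mapping_quality=40, use_secondary=True):
    """Decide whether a SAM record passes the MAPQ and secondary-alignment checks."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if mapq < min_mapping_quality:
        return False
    if not use_secondary and flag & SECONDARY_FLAG:
        return False
    return True


def introns_from_cigar(position, cigar):
    """Return 1-based, inclusive (start, end) intervals for each N operation."""
    introns = []
    reference_position = position
    for length, operation in CIGAR_OP.findall(cigar):
        length = int(length)
        if operation == "N":
            introns.append((reference_position, reference_position + length - 1))
        if operation in REFERENCE_CONSUMING:
            reference_position += length
    return introns
```
<br />
For a read mapped at position 1000 with CIGAR <code>10M100N20M</code>, the skipped region, i.e. the candidate intron, spans positions 1010 to 1109.<br />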
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome><br />
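<br />
The kind of summary described above, i.e. the distribution of introns over the strands and of dinucleotides at splice sites, can be sketched in a few lines of Python. The sketch assumes 1-based, inclusive intron coordinates and a strand of <code>+</code> or <code>-</code>; all function names are hypothetical, and the actual output of '''CheckIntrons''' may be organized differently.<br />
<br />
```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def splice_site_dinucleotides(sequence, start, end, strand):
    """Return (donor, acceptor) dinucleotides of an intron in transcript orientation."""
    first = sequence[start - 1:start + 1].upper()  # first two intron bases on the forward strand
    last = sequence[end - 2:end].upper()           # last two intron bases on the forward strand
    if strand == "+":
        return first, last
    # On the reverse strand the donor lies at the genomic end of the intron.
    return reverse_complement(last), reverse_complement(first)


def summarize(sequence, introns):
    """Count (strand, 'donor-acceptor') combinations over all introns."""
    counts = Counter()
    for start, end, strand in introns:
        donor, acceptor = splice_site_dinucleotides(sequence, start, end, strand)
        counts[(strand, donor + "-" + acceptor)] += 1
    return counts
```
<br />
A strong excess of the canonical <code>GT-AG</code> pair on one strand, for example, is the kind of signal such a table makes visible.<br />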
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
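<br />
The two removal criteria named above, intron length and relative expression, can be pictured with the following Python sketch. It is a rough illustration under explicit assumptions and not the actual rule used by '''DenoiseIntrons''': relative expression is taken here as the number of supporting split reads divided by the average coverage in the context region, and introns without any context coverage are only checked for length. The function and argument names are hypothetical; the default thresholds mirror the parameters ''m'' and ''me''.<br />
<br />
```python
def keep_intron(start, end, split_reads, context_coverage,
                max_intron_length=15000, min_expression=0.01):
    """Return True if the intron passes both (assumed) checks."""
    length = end - start + 1
    if length > max_intron_length:
        return False
    if context_coverage <= 0:
        # Assumption: without coverage in the context, only the length check applies.
        return True
    # Assumption: relative expression = supporting reads / average context coverage.
    return split_reads / context_coverage >= min_expression
```
<br />
Under these assumptions an intron supported by one split read in a region with an average coverage of 500 would be removed (0.002 is below 0.01), while one supported by five reads at a coverage of 50 would be kept.<br />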
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list><br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g. proteins or CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
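<br />
As an illustration of the genetic code file described above, the following Python sketch reads such a two-column, tab-delimited table into a codon lookup. The separator inside the triplet list is not specified here; a comma is an assumption of this sketch, and the function name is hypothetical.<br />
<br />
```python
def parse_genetic_code(text, triplet_separator=","):
    """Map each triplet to its one-letter amino acid.

    Each line is expected to look like: <amino acid><TAB><list of triplets>.
    The separator of the triplet list is an assumption of this sketch.
    """
    codon_to_amino_acid = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t", 1)
        for triplet in triplets.split(triplet_separator):
            triplet = triplet.strip().upper()
            if triplet:
                codon_to_amino_acid[triplet] = amino_acid.strip()
    return codon_to_amino_acid
```
<br />
A line such as <code>*</code>, a tab, and <code>TAA,TAG,TGA</code> would then assign the stop symbol to all three triplets.<br />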
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the output to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome><br />
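<br />
The three options of the parameter ''Ambiguity'' can be illustrated with a minimal Python sketch. It is not the implementation used by '''Extractor''': the function name <code>translate_codon</code>, the reduced IUPAC table and the rule that a codon only counts as ambiguous if its expansions encode different amino acids are assumptions made for illustration.<br />
<br />
```python
import itertools
import random

# Standard genetic code in NCBI order (first, second, third base over "TCAG").
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Reduced IUPAC table (assumption: only these codes occur in the input).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def translate_codon(codon, mode="EXCEPTION", rng=random):
    """Translate one codon under a hypothetical reading of the three modes."""
    expansions = ("".join(p) for p in itertools.product(*(IUPAC[b] for b in codon)))
    candidates = sorted({CODON_TABLE[c] for c in expansions})
    if len(candidates) == 1:
        # All expansions encode the same amino acid (assumed to be unproblematic).
        return candidates[0]
    if mode == "EXCEPTION":
        # The caller would drop the whole transcript.
        raise ValueError(f"ambiguous codon {codon}")
    if mode == "AMBIGUOUS":
        return "X"
    if mode == "RANDOM":
        return rng.choice(candidates)
    raise ValueError(f"unknown mode {mode}")
```
<br />
With the default EXCEPTION a single ambiguous codon discards the transcript, so AMBIGUOUS or RANDOM may be worth considering for genomes with many ambiguous bases.<br />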
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should set the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
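<br />
Clustering the search results by their first column can be sketched in Python as follows. The sketch assumes tab-separated hit lines whose first column is the query ID; the function name <code>cluster_by_query</code> and the example hits are made up, and the sort order used internally by the parameter ''sort'' may differ.<br />
<br />
```python
def cluster_by_query(lines):
    """Return hit lines sorted by their first tab-separated column.

    Python's sort is stable, so hits of the same query keep their
    original relative order.
    """
    # Skip empty lines and comment lines before sorting.
    hits = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return sorted(hits, key=lambda ln: ln.split("\t", 1)[0])


# Made-up hits: query ID, target contig, percent identity.
unsorted = [
    "geneB_0\tchr1\t88.0",
    "geneA_0\tchr2\t95.0",
    "geneB_0\tchr3\t70.0",
    "geneA_1\tchr2\t91.0",
]
for line in cluster_by_query(unsorted):
    print(line)
```
<br />
After sorting, all hits of one query are adjacent, which is the property the paragraph above asks for.<br />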
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
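<br />
For scripted runs, the call can be assembled programmatically. The following Python sketch only builds the argument list from the <code>key=value</code> syntax shown above. The helper name <code>gemoma_command</code> and all file names are placeholders, and passing a repeatable parameter such as ''i'' as several <code>i=...</code> arguments is an assumption.<br />
<br />
```python
import shlex


def gemoma_command(jar, module, params):
    """Build the argument list for one GeMoMa module call.

    params maps parameter names to a value or to a list of values;
    a list is expanded to repeated key=value arguments (assumption).
    """
    cmd = ["java", "-jar", jar, "CLI", module]
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        cmd.extend(f"{key}={v}" for v in values)
    return cmd


# Placeholder file names; replace them with your own files.
cmd = gemoma_command(
    "GeMoMa-1.6.4.jar",
    "GeMoMa",
    {
        "s": "search_results.tabular",
        "t": "target_genome.fa",
        "c": "cds-parts.fasta",
        "a": "assignment.tabular",
        "i": ["introns_1.gff", "introns_2.gff"],
        "sort": "true",
        "outdir": "results",
    },
)
# Print the command; use subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```
<br />
Keeping the parameters in a dictionary makes it easy to store the exact settings of a run next to its results.<br />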
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
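<br />
The second and third step can be sketched in Python using the default values of the parameters ''filter'' (<code>start=='M' and stop =='*' and score/aa>=0.75</code>) and ''sorting'' (<code>evidence,score</code>). This is a simplified illustration, not the implementation of '''GAF''': the function names and the example predictions are made up, and sorting in descending order is an assumption.<br />
<br />
```python
def passes_default_filter(attrs):
    """Default GAF filter: start=='M' and stop=='*' and score/aa>=0.75."""
    return (
        attrs["start"] == "M"
        and attrs["stop"] == "*"
        and attrs["score"] / attrs["aa"] >= 0.75
    )


def sort_cluster(predictions):
    """Sort by evidence, then score (descending order is an assumption)."""
    return sorted(predictions, key=lambda p: (p["evidence"], p["score"]), reverse=True)


# Made-up predictions of one cluster with the attributes used above.
cluster = [
    {"id": "t1", "start": "M", "stop": "*", "score": 400, "aa": 500, "evidence": 1},
    {"id": "t2", "start": "M", "stop": "*", "score": 450, "aa": 500, "evidence": 2},
    {"id": "t3", "start": "L", "stop": "*", "score": 490, "aa": 500, "evidence": 3},
    {"id": "t4", "start": "M", "stop": "*", "score": 300, "aa": 500, "evidence": 2},
]
kept = sort_cluster([p for p in cluster if passes_default_filter(p)])
print([p["id"] for p in kept])
```
<br />
In this example t3 is removed because it does not start with M, and t4 is removed because its relative score is 0.6.<br />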
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
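<br />
Whether a table provides the required columns can be checked before running '''GAF'''. The Python sketch below is an illustration only: the helper name <code>missing_columns</code>, the example table and the use of a full match of each pattern against the column names are assumptions.<br />
<br />
```python
import csv
import io
import re

# Required column names, given as regular expressions.
REQUIRED = [r"transcriptName", r"GO", r".*defline"]


def missing_columns(header):
    """Return the required column patterns not matched by any header field."""
    return [pat for pat in REQUIRED
            if not any(re.fullmatch(pat, col) for col in header)]


# Made-up tab-delimited annotation info table.
table = io.StringIO(
    "transcriptName\tGO\texample-defline\n"
    "tx1\tGO:0000001\texample protein\n"
)
reader = csv.reader(table, delimiter="\t")
header = next(reader)
# An empty list means that all required columns are present.
print(missing_columns(header))
```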
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It allows combining the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows integrating predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file><br />
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
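<br />
The F1 measure reported by ''Compare transcripts'' can be illustrated with a minimal Python sketch. It assumes that the F1 measure is computed from the overlap of coding nucleotide positions between a predicted and an annotated transcript. This page does not state the exact definition used by the tool, so the function names and the nucleotide-level definition are assumptions made for illustration only.<br />
<br />
```python
def coding_positions(exons):
    """Return the set of positions covered by (start, end) exons, 1-based inclusive."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def f1_measure(predicted_exons, annotated_exons):
    """Nucleotide-level F1 of a predicted versus an annotated transcript.

    Returns None (standing in for NA) if the two transcripts do not overlap.
    This definition is an assumption, not taken from the GeMoMa source.
    """
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    shared = len(predicted & annotated)
    if shared == 0:
        return None
    precision = shared / len(predicted)
    recall = shared / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Hypothetical coordinates, chosen only to illustrate the three cases.
print(f1_measure([(1, 10), (21, 30)], [(1, 10), (21, 30)]))  # identical transcripts
print(f1_measure([(1, 10)], [(6, 15)]))                      # partial overlap
print(f1_measure([(1, 10)], [(100, 110)]))                   # no overlap
```
<br />
Under these assumptions, identical transcripts reach the maximum value, partially overlapping transcripts score in between, and non-overlapping transcripts yield the NA case.<br />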
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
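<br />
If you want to filter or rank predictions by hand, the attributes in the table above can be read from the last GFF column. The following Python sketch is only an illustration: the example line, the thresholds, and the decision to let NA values of tie pass are assumptions and not GeMoMa defaults. The thresholds assume values between 0 and 1; adjust them if your files use another scale.<br />
<br />
```python
def parse_attributes(column9):
    """Split the GFF attributes column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def numeric(value):
    """Convert an attribute value to float; missing values and 'NA' yield None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def keep_prediction(gff_line, min_tie=1.0, min_tpc=0.5):
    """Decide whether a 'prediction' feature passes the (hypothetical) thresholds."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "prediction":
        return False
    attributes = parse_attributes(fields[8])
    tie = numeric(attributes.get("tie"))
    tpc = numeric(attributes.get("tpc"))
    tie_ok = tie is None or tie >= min_tie   # NA (e.g. single coding exon) passes here
    tpc_ok = tpc is not None and tpc >= min_tpc
    return tie_ok and tpc_ok


# Hypothetical GFF line, not taken from real GeMoMa output.
example = "chr1\tGeMoMa\tprediction\t1000\t2500\t.\t+\t.\tID=gene1_R0;aa=250;tie=1;tpc=0.95"
print(parse_attributes(example.split("\t")[8]))
print(keep_prediction(example))
```
<br />
GAF offers filtering on these attributes directly; a script like this is mainly useful for quick inspection of an existing GFF.<br />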
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. Ideally, the fasta comments contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
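<br />
The first check above (identical contig/chromosome names in annotation and genome) can be sketched in a few lines of Python. This is a minimal sketch and not part of GeMoMa. It assumes that the sequence name is the first whitespace-delimited token of each fasta comment line, and the example data are hypothetical.<br />
<br />
```python
def fasta_names(fasta_lines):
    """Collect sequence names: the first whitespace-delimited token of each '>' line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def gff_names(gff_lines):
    """Collect the sequence names from the first column of a GFF/GTF file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Hypothetical in-memory data; with real files, pass open file handles instead.
fasta = [">chr1 assembly v1", "ACGTACGT", ">chr2", "GGCCGGCC"]
gff = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t6\t.\t+\t0\tID=cds1;Parent=t1",
    "Chr2\tsrc\tCDS\t1\t6\t.\t+\t0\tID=cds2;Parent=t2",
]

missing = gff_names(gff) - fasta_names(fasta)
print(sorted(missing))  # names used in the annotation but absent from the genome
```
<br />
Any name reported as missing (here a name that differs only in capitalization) points to a mismatch between the two files.<br />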
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
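<br />
The reading-frame condition mentioned above (an intron can only be gained or lost this way if its length is divisible by 3) can be checked with a short Python sketch. It assumes 1-based, inclusive coordinates as used in GFF files; the coordinates in the example are hypothetical.<br />
<br />
```python
def intron_length(start, end):
    """Length of an intron given 1-based, inclusive start and end coordinates."""
    return end - start + 1


def preserves_reading_frame(start, end):
    """True if including or excluding this intron keeps the downstream reading frame."""
    return intron_length(start, end) % 3 == 0


# Hypothetical intron coordinates.
print(preserves_reading_frame(101, 190))  # length 90
print(preserves_reading_frame(101, 191))  # length 91
```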
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but will not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
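<br />
Before passing a user-specified genetic code to GeMoMa, the format described above can be validated with a small Python sketch. The parser below is an illustration and not GeMoMa's own reader. It only checks what the description states (two tab-separated columns, comma-separated triplets over A, C, G, and T), and the example fragment is hypothetical and incomplete.<br />
<br />
```python
def parse_genetic_code(text):
    """Parse 'amino acid<TAB>comma-separated triplets' lines into a codon -> amino acid dict."""
    codon_to_aa = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected two tab-separated columns")
        amino_acid, triplets = parts[0].strip(), parts[1]
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            # DNA alphabet only: a 'U' (RNA alphabet) is reported as an error.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid triplet {codon!r}")
            if codon in codon_to_aa:
                raise ValueError(f"line {line_number}: triplet {codon} listed twice")
            codon_to_aa[codon] = amino_acid
    return codon_to_aa


# Hypothetical fragment of a genetic code table (only three amino acids).
fragment = "M\tATG\nW\tTGG\nF\tTTT,TTC\n"
code = parse_genetic_code(fragment)
print(code)
```
<br />
If you start from a table written in the RNA alphabet, replace U by T before using it, since the parser above (like the format described here) expects DNA triplets.<br />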
; I would like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
: There are at least two ways to do this. If you use GeMoMaPipeline you can <br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or<br />
: (B) Use <code>s=pre-extracted</code>, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
: Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1099GeMoMa2020-05-20T05:23:42Z<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file, which makes it easy and quick to rerun the combination step ('''GAF''')<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
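<br />
For several reference organisms, the repeated <code>i</code>, <code>a</code>, <code>g</code> tags can be generated by a script. The following Python sketch only assembles the argument list from the parameters shown above. The function name and all file names are hypothetical, and the sketch does not start GeMoMa itself.<br />
<br />
```python
def build_pipeline_command(jar, target_genome, references, threads, outdir):
    """Assemble a GeMoMaPipeline call; references is a list of (id, annotation, genome)."""
    command = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        "tblastn=false",
        "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO",
        "o=true",
        f"t={target_genome}",
    ]
    # Repeat the tags i, a, g once per reference organism.
    for ref_id, annotation, genome in references:
        command += [f"i={ref_id}", f"a={annotation}", f"g={genome}"]
    return command


# Hypothetical file names for two reference organisms.
command = build_pipeline_command(
    jar="GeMoMa-1.6.4.jar",
    target_genome="target.fasta",
    references=[
        ("refA", "refA.gff", "refA.fasta"),
        ("refB", "refB.gff", "refB.fasta"),
    ],
    threads=8,
    outdir="results",
)
print(" ".join(command))
```
<br />
The resulting list can be printed for inspection, as done here, or handed to <code>subprocess.run</code> once the files exist.<br />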
<br />
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute work across multiple machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
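Filter expressions such as the one described for <code>GAF.f</code> contain spaces, quotes and characters like <code>*</code>, <code>&gt;</code> and parentheses that a shell would otherwise interpret. The sketch below shows one way to pass such an expression as a single argument from a POSIX shell. The file names and the prefix value are hypothetical, and the filter is the more sophisticated example given in the parameter table.<br />

```shell
#!/bin/sh
# Store the call as positional parameters so that each argument stays intact.
# Double quotes keep the whole filter together; the single quotes inside
# are passed on literally, as the filter syntax requires.
set -- java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline \
  t=target.fa a=reference.gff g=reference.fa \
  AnnotationFinalizer.p=GENE_ \
  "GAF.f=start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Show one argument per line for inspection; execute the call with: "$@"
printf '%s\n' "$@"
```

Keeping the arguments as positional parameters avoids a second round of shell parsing, which is where unquoted filter expressions typically break.<br />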
<br />
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file><br />
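A slightly fuller call might look like the sketch below, which uses only parameters from the table above: two mapped read files via a repeated <code>m</code>, a strandedness setting, coverage output and an explicit minimum mapping quality. The BAM file names and the output directory are hypothetical, and <code>FR_FIRST_STRAND</code> is only appropriate if it matches your library preparation.<br />

```shell
#!/bin/sh
# Hypothetical BAM files from two RNA-seq samples mapped to the target genome.
BAM1=sample1.bam
BAM2=sample2.bam

# m is repeated once per mapped read file; c=true additionally writes coverage.
CMD="java -jar GeMoMa-1.6.4.jar CLI ERE s=FR_FIRST_STRAND \
m=$BAM1 m=$BAM2 c=true mmq=40 outdir=ere_out"

# Print the assembled call for inspection; execute it with: eval "$CMD"
echo "$CMD"
```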
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome><br />
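To check introns obtained from RNA-seq, an intron GFF can be supplied with <code>i</code> in addition to the target genome. The sketch below assumes a hypothetical intron file, for instance one produced by '''ERE''', and a hypothetical output directory. It only uses parameters from the table above.<br />

```shell
#!/bin/sh
# Hypothetical inputs: the target genome and an intron GFF.
TARGET=target.fa
INTRONS=introns.gff

# i may be repeated for several intron files; v=true requests verbose output.
CMD="java -jar GeMoMa-1.6.4.jar CLI CheckIntrons \
t=$TARGET i=$INTRONS v=true outdir=checkintrons_out"

# Print the assembled call for inspection; execute it with: eval "$CMD"
echo "$CMD"
```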
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
<br />
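The filtering described for '''DenoiseIntrons''' can be illustrated with a minimal Python sketch. It is not GeMoMa's implementation: the class <code>Intron</code>, the helper names, the dictionary-based coverage of a single contig and, in particular, the definition of relative expression (split reads divided by the mean coverage in the context next to the splice sites) are assumptions made for illustration. Only the default values (15000, 0.01, 10) are taken from the parameter table.<br />

```python
from dataclasses import dataclass


@dataclass
class Intron:
    contig: str
    start: int        # first intron position (1-based, inclusive)
    end: int          # last intron position (1-based, inclusive)
    split_reads: int  # number of split reads supporting the intron


def mean_coverage(coverage, first, last):
    """Mean coverage over positions first..last; missing positions count as 0."""
    if last < first:
        return 0.0
    total = sum(coverage.get(pos, 0) for pos in range(first, last + 1))
    return total / (last - first + 1)


def denoise(introns, coverage, max_intron_length=15000,
            min_expression=0.01, context=10):
    """Drop very long introns and introns with low relative expression.

    ASSUMPTION: relative expression = split reads / mean coverage of the
    flanking context. The real module may define it differently.
    """
    kept = []
    for intron in introns:
        if intron.end - intron.start + 1 > max_intron_length:
            continue  # too long, possibly an artefact
        upstream = mean_coverage(coverage, intron.start - context, intron.start - 1)
        downstream = mean_coverage(coverage, intron.end + 1, intron.end + context)
        region = max(upstream, downstream)
        if region > 0 and intron.split_reads / region < min_expression:
            continue  # weakly supported compared to its surroundings
        kept.append(intron)
    return kept
```

In this sketch, an intron without any coverage in its context is kept, because no relative expression can be computed for it.<br />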
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list><br />
<br />
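The parameter ''number of tries'' suggests a simple retry loop around each download. The following Python sketch shows this idea only. The function name <code>download_with_retries</code>, the <code>download</code> callable and the waiting time are hypothetical and not part of GeMoMa; only the valid range [1, 100] and the default of 10 come from the parameter table.<br />

```python
import time


def download_with_retries(download, tries=10, wait_seconds=0.0):
    """Call download() until it succeeds or the number of tries is used up.

    `download` is a hypothetical zero-argument callable that returns the
    downloaded data and raises OSError (e.g. urllib.error.URLError) on failure.
    """
    if not 1 <= tries <= 100:
        raise ValueError("tries must be in the range [1, 100]")
    last_error = None
    for attempt in range(1, tries + 1):
        try:
            return download()
        except OSError as error:
            last_error = error
            if attempt < tries:
                time.sleep(wait_seconds)  # short pause before the next try
    raise RuntimeError(f"giving up after {tries} tries") from last_error
```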
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins or CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
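As an illustration of the two-column genetic code file, the following Python sketch reads such a file into a codon table. The function name <code>read_genetic_code</code> is hypothetical, and the separator within the second column is an assumption (commas or whitespace are accepted here); please check an existing genetic code file for the exact format.<br />

```python
import io


def read_genetic_code(handle):
    """Read a tab-delimited genetic code: amino acid <TAB> list of triplets.

    ASSUMPTION: triplets in the second column are separated by commas
    and/or whitespace.
    """
    codon_to_amino_acid = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t", 1)
        for triplet in triplets.replace(",", " ").split():
            codon_to_amino_acid[triplet.upper()] = amino_acid
    return codon_to_amino_acid


# Tiny example with three amino acid symbols
example = io.StringIO("M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n")
table = read_genetic_code(example)
```
<br />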
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome><br />
<br />
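The three options of the parameter ''Ambiguity'' can be illustrated for a single codon with the following Python sketch. It is an assumption-laden illustration, not the code of '''Extractor''': the function <code>translate_codon</code> is hypothetical, the standard genetic code is used, an exception stands in for removing the transcript, and codons whose expansions all encode the same amino acid are treated as unambiguous.<br />

```python
import itertools
import random

# IUPAC nucleotide codes and the bases they stand for
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code (NCBI translation table 1), base order T, C, A, G
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def translate_codon(codon, mode="EXCEPTION", code=STANDARD_CODE, rng=random):
    """Translate one codon that may contain IUPAC ambiguity codes."""
    expansions = itertools.product(*(IUPAC[base] for base in codon.upper()))
    candidates = sorted({code["".join(triplet)] for triplet in expansions})
    if len(candidates) == 1:
        return candidates[0]  # all expansions agree (ASSUMPTION: not ambiguous)
    if mode == "EXCEPTION":
        # the caller would drop the whole transcript in this case
        raise ValueError(f"ambiguous codon: {codon}")
    if mode == "AMBIGUOUS":
        return "X"
    if mode == "RANDOM":
        return rng.choice(candidates)
    raise ValueError(f"unknown mode: {mode}")
```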
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
<br />
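To check beforehand whether tabular search results are already clustered by their first column, a small Python sketch like the following may help. Both function names are hypothetical, and the sketch only assumes tab-delimited lines whose first column is the query ID. It is no replacement for the parameter ''sort''.<br />

```python
def is_clustered(lines):
    """True if all lines with the same first column form one contiguous block."""
    seen, previous = set(), None
    for line in lines:
        query = line.split("\t", 1)[0]
        if query != previous:
            if query in seen:
                return False  # this query ID already occurred in an earlier block
            seen.add(query)
            previous = query
    return True


def sort_by_query(lines):
    """Stable sort by first column, so hits of one query keep their order."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


hits = ["q2\tchr1\t10", "q1\tchr2\t20", "q2\tchr1\t30"]
sorted_hits = sort_by_query(hits)
```
<br />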
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
<br />
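The list file of the parameter ''selected'' can be pictured with the following Python sketch. It is only an illustration: the names <code>Region</code> and <code>read_selected</code> are hypothetical, and tab-delimited columns are an assumption. The column order (transcript ID, chromosome, strand, start, end) follows the parameter description.<br />

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    strand: str
    start: int
    end: int


def read_selected(lines):
    """Map transcript IDs to an optional target region.

    ASSUMPTION: columns are tab-delimited. Lines with fewer than five
    columns select the transcript without restricting the region.
    """
    selected = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip empty lines
        region = None
        if len(fields) >= 5:
            region = Region(fields[1], fields[2], int(fields[3]), int(fields[4]))
        selected[fields[0]] = region
    return selected


example = ["transcript1\n", "transcript2\tchr3\t+\t1000\t5000\n"]
selection = read_selected(example)
```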
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
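The three steps above can be sketched in Python as follows. This is a strongly simplified illustration under several assumptions, not the GAF algorithm itself: predictions are called redundant if contig, strand and CDS parts are identical, the filter is hard-coded to the default criterion, only the best prediction per cluster is reported (alternative transcripts are ignored), and all names are hypothetical.<br />

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    contig: str
    strand: str
    exons: tuple        # CDS parts as ((start, end), ...), sorted by start
    score: float
    aa: int             # number of amino acids
    start: str          # first amino acid
    stop: str           # last symbol, '*' for a stop codon
    weight: float = 1.0
    evidence: int = 1
    sum_weight: float = 0.0


def merge_redundant(predictions):
    """Step 1: collapse identical predictions, counting evidence and sumWeight."""
    merged = {}
    for p in predictions:
        key = (p.contig, p.strand, p.exons)
        if key in merged:
            m = merged[key]
            m.evidence += 1
            m.sum_weight += p.weight
            m.score = max(m.score, p.score)
        else:
            p.evidence, p.sum_weight = 1, p.weight
            merged[key] = p
    return list(merged.values())


def passes_filter(p):
    """Step 2: the default filter start=='M' and stop=='*' and score/aa>=0.75."""
    return p.start == "M" and p.stop == "*" and p.score / p.aa >= 0.75


def clusters(predictions):
    """Step 3a: group overlapping predictions on the same contig and strand."""
    result, current, current_end = [], [], None
    ordered = sorted(predictions, key=lambda p: (p.contig, p.strand, p.exons[0][0]))
    for p in ordered:
        same = current and (current[0].contig, current[0].strand) == (p.contig, p.strand)
        if same and p.exons[0][0] <= current_end:
            current.append(p)
            current_end = max(current_end, p.exons[-1][1])
        else:
            if current:
                result.append(current)
            current, current_end = [p], p.exons[-1][1]
    if current:
        result.append(current)
    return result


def gaf_sketch(predictions):
    """Steps 1 to 3; sorting uses (evidence, score) as in the default sorting."""
    kept = [p for p in merge_redundant(predictions) if passes_filter(p)]
    return [max(c, key=lambda p: (p.evidence, p.score)) for c in clusters(kept)]
```

Note that <code>merge_redundant</code> updates the given prediction objects in place, which keeps the sketch short.<br />
<br />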
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It allows to combine the predictions from multiple reference organisms, but also works using only one reference organism. However, GAF also allows to integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file><br />
<br />
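To get a feeling for how a ''filter'' expression acts on the GFF attributes of a prediction, the following Python sketch evaluates such expressions on an attribute string. It is an illustration only and not the parser used by GAF: the functions <code>parse_attributes</code> and <code>evaluate</code> are hypothetical, the expression is parsed with Python's <code>ast</code> module (which happens to accept the filter syntax shown in the table), and missing attributes are treated as NaN.<br />

```python
import ast
import math
import operator

_COMPARE = {ast.Eq: operator.eq, ast.NotEq: operator.ne, ast.Lt: operator.lt,
            ast.LtE: operator.le, ast.Gt: operator.gt, ast.GtE: operator.ge}
_ARITH = {ast.Add: operator.add, ast.Sub: operator.sub,
          ast.Mult: operator.mul, ast.Div: operator.truediv}


def parse_attributes(column9):
    """Split 'key=value;key=value' and convert numeric values to float."""
    attributes = {}
    for item in column9.strip().strip(";").split(";"):
        key, _, value = item.partition("=")
        try:
            attributes[key.strip()] = float(value)
        except ValueError:
            attributes[key.strip()] = value
    return attributes


def evaluate(expression, attributes):
    """Evaluate a filter expression; only a small, safe subset is supported."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BoolOp):
            values = [visit(value) for value in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _COMPARE):
            left, right = visit(node.left), visit(node.comparators[0])
            return _COMPARE[type(node.ops[0])](left, right)
        if isinstance(node, ast.BinOp) and type(node.op) in _ARITH:
            return _ARITH[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return attributes.get(node.id, math.nan)  # ASSUMPTION: missing = NaN
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == "isNaN" and len(node.args) == 1):
            value = visit(node.args[0])
            return isinstance(value, float) and math.isnan(value)
        raise ValueError("unsupported element in filter expression")
    return bool(visit(ast.parse(expression.strip(), mode="eval")))


DEFAULT_FILTER = "start=='M' and stop =='*' and score/aa>=0.75"
attrs = parse_attributes("ID=t1;start=M;stop=*;score=300;aa=350;evidence=2;tie=NaN")
```

With these example attributes, <code>evaluate(DEFAULT_FILTER, attrs)</code> accepts the prediction, since it is complete and 300/350 exceeds 0.75.<br />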
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows to predict UTRs for annotated coding sequences and to generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
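The F1 measure described for CompareTranscripts can be illustrated with a minimal Python sketch. It assumes that the comparison is made over annotated nucleotide positions and that "NA" corresponds to no overlap; the exact unit used by CompareTranscripts is not stated here, so the function names and coordinates below are hypothetical and serve only as an illustration.

```python
def positions(exons):
    """Expand inclusive (start, end) intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end + 1)}


def f1_measure(predicted, reference):
    """Harmonic mean of precision and recall over two position sets.

    Returns None (shown as NA) if the two sets do not overlap.
    """
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return None
    precision = overlap / len(predicted)   # fraction of predicted positions that are annotated
    recall = overlap / len(reference)      # fraction of annotated positions that are predicted
    return 2 * precision * recall / (precision + recall)


# Hypothetical transcript: the prediction misses 50 bases of the last exon.
prediction = positions([(100, 199), (300, 399)])
annotation = positions([(100, 199), (300, 449)])
print(round(f1_measure(prediction, annotation), 3))  # 0.889
```

In this example the precision is 1.0 and the recall is 0.8, so the F1 measure is below 1 although every predicted base is correct.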
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
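The attributes in the table can also be evaluated by hand. The following Python sketch reads the ninth GFF column (standard GFF3 <code>key=value;key=value</code> syntax) and selects predictions by their tie value. The example lines and IDs are made up, the feature type "prediction" follows the table above, and the threshold must be on the same scale as the values in your file.

```python
def parse_attributes(column9):
    """Parse a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def numeric(attributes, key):
    """Return the attribute as float, or None if it is missing or not a number (e.g. NA)."""
    try:
        return float(attributes[key])
    except (KeyError, ValueError):
        return None


def select_predictions(gff_lines, feature="prediction", min_tie=1.0):
    """Return the IDs of all features of the given type with tie >= min_tie."""
    selected = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        tie = numeric(attributes, "tie")
        if tie is not None and tie >= min_tie:
            selected.append(attributes.get("ID"))
    return selected


# Hypothetical GFF lines for illustration only.
example = [
    "# made-up example",
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0;aa=120;tie=1;avgCov=12.5",
    "chr1\tGeMoMa\tprediction\t2000\t2900\t.\t-\t.\tID=gene2_R0;aa=80;tie=0.5",
    "chr1\tGeMoMa\tCDS\t100\t300\t.\t+\t0\tParent=gene1_R0",
]
print(select_predictions(example))
```

Predictions without a numeric tie value are skipped in this sketch; depending on the use case you may prefer to keep them.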
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
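The first check can be automated. The Python sketch below is an illustration only and not part of GeMoMa: it lists the sequence names that occur in the first GFF column but not in the FASTA headers. It assumes that the sequence name is the FASTA comment up to the first whitespace.

```python
def fasta_names(fasta_lines):
    """Sequence names: text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(gff_lines):
    """Values of the first GFF column, ignoring comments and empty lines."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def missing_in_genome(fasta_lines, gff_lines):
    """Names used in the annotation that do not exist in the genome file."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


# Hypothetical input; for real files pass open(path) instead of the lists.
genome = [">chr1 some description", "ACGT", ">chr2", "ACGT"]
annotation = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1",
    "Chr3\t.\tgene\t1\t4\t.\t+\t.\tID=g2",
]
print(missing_in_genome(genome, annotation))
```

An empty list means that every contig/chromosome name of the annotation was found in the genome file.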
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too strictly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g., to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
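The position of the VM options matters: they have to be given before <code>-jar</code>. The following Python sketch assembles such a call as an argument list. The helper <code>gemoma_command</code> is hypothetical, and the parameter set in the example is incomplete and only illustrates the <code>key=value</code> syntax used on this page.

```python
def gemoma_command(jar, tool, parameters, xms=None, xmx=None):
    """Build the argument list for a GeMoMa CLI call with optional JVM memory options."""
    command = ["java"]
    if xms:
        command.append(f"-Xms{xms}")  # initially used RAM, e.g. "5G"
    if xmx:
        command.append(f"-Xmx{xmx}")  # maximally used RAM, e.g. "50G"
    command += ["-jar", jar, "CLI", tool]
    # GeMoMa parameters follow the tool name as key=value pairs.
    command += [f"{key}={value}" for key, value in parameters.items()]
    return command


# Incomplete example call; add the parameters required by the tool.
command = gemoma_command(
    "GeMoMa-1.6.4.jar", "GeMoMaPipeline", {"threads": 4}, xms="5G", xmx="50G"
)
print(" ".join(command))
```

The resulting list can be passed to <code>subprocess.run</code> or printed and pasted into a shell.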
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
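Before using a custom genetic code, the file can be checked against the format described above. The following Python sketch is not part of GeMoMa and the function names are hypothetical. It parses the two-column tab-delimited table, rejects triplets that contain characters other than A, C, G, and T (for instance U from an RNA-based table), and lists triplets that are not assigned. Whether all 64 triplets must be assigned is an assumption you should verify against the default genetic code file.

```python
from itertools import product


def parse_genetic_code(lines):
    """Map each DNA triplet to its one-letter amino acid code."""
    code = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")  # exactly two columns expected
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"invalid DNA triplet: {triplet!r}")
            if triplet in code:
                raise ValueError(f"triplet assigned twice: {triplet}")
            code[triplet] = amino_acid.strip()
    return code


def missing_codons(code):
    """Triplets over A, C, G, T that are not assigned in the table."""
    all_codons = {"".join(c) for c in product("ACGT", repeat=3)}
    return sorted(all_codons - set(code))


# Small, deliberately incomplete example table.
table = ["M\tATG\n", "W\tTGG\n", "F\tTTT,TTC\n"]
code = parse_genetic_code(table)
print(code["ATG"], len(missing_codons(code)))
```

A table copied from an RNA-based source fails with a <code>ValueError</code> until U is replaced by T.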
; I would like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as the search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
; Is there a way to use the GeMoMa code to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?<br />
: There are at least two ways to do this. If you use GeMoMaPipeline, you can<br />
: (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or<br />
: (B) Use <code>s=pre-extracted</code>, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.<br />
: Using one of these options, you can search for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows removing prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1098GeMoMa2020-04-29T10:38:32Z<p>Keilwagen: bioconda</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda run:<br />
conda install -c bioconda gemoma <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
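
Before a first run, it can help to check that the required programs are on the search path. The following is a minimal sketch, assuming the executables are called <code>java</code>, <code>tblastn</code> and <code>mmseqs</code>; the function names are hypothetical and not part of GeMoMa, and the sketch does not verify the Java version.

```python
import shutil


def find_required_tools():
    """Map each executable name to its path on PATH, or None if it is missing."""
    names = ["java", "tblastn", "mmseqs"]  # assumed executable names
    return {name: shutil.which(name) for name in names}


def requirements_met(tools):
    """Java is mandatory; at least one search tool (tblastn or mmseqs) is needed."""
    has_java = tools.get("java") is not None
    has_search = (tools.get("tblastn") is not None
                  or tools.get("mmseqs") is not None)
    return has_java and has_search


if __name__ == "__main__":
    found = find_required_tools()
    for name, path in found.items():
        print(f"{name}: {path or 'not found'}")
    print("requirements met:", requirements_met(found))
```

Only the search tool that is actually selected (cf. parameter <code>tblastn</code> of the GeMoMaPipeline) needs to be installed, which is why the check accepts either of the two.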
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you would like to start GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references, just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
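
The placeholders in the call above can also be filled in programmatically. The following is a minimal sketch, assuming the jar name and the key=value syntax shown above; the function <code>build_pipeline_command</code> and the file names are hypothetical. It repeats <code>i</code>, <code>a</code> and <code>g</code> once per reference, as described in the list. Whether each group additionally needs to be preceded by <code>s=own</code> is not stated here, so please check the parameter table of your GeMoMa version.

```python
def build_pipeline_command(jar, target_genome, outdir, threads, references):
    """Build the argument list for a GeMoMaPipeline call.

    references: list of dicts with the keys 'annotation' and 'genome'
    and an optional key 'id' (hypothetical structure for this sketch).
    """
    cmd = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        # settings taken from the example call above
        "tblastn=false",
        "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO",
        "o=true",
        f"t={target_genome}",
    ]
    # repeat i, a, g for every reference; the ID is optional
    for ref in references:
        if ref.get("id"):
            cmd.append(f"i={ref['id']}")
        cmd.append(f"a={ref['annotation']}")
        cmd.append(f"g={ref['genome']}")
    return cmd


if __name__ == "__main__":
    example = build_pipeline_command(
        "GeMoMa-1.6.4.jar", "target.fa", "results", 4,
        [{"id": "refA", "annotation": "refA.gff", "genome": "refA.fa"},
         {"annotation": "refB.gff", "genome": "refB.fa"}],
    )
    print(" ".join(example))
```

The list form can be passed directly to <code>subprocess.run</code>, which avoids quoting problems with file names that contain spaces.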
In addition, we recommend setting several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file, which allows rerunning the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
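
The optional settings mentioned in the last two paragraphs can be added to a pipeline call in the same key=value style. The following is a minimal sketch, assuming mapped reads are passed with <code>r=MAPPED</code> and one <code>ERE.m</code> per BAM/SAM file, as listed in the parameter table below; the function name and the file names are hypothetical.

```python
def rnaseq_and_intron_args(bam_files=(), max_intron_length=None,
                           static_intron_length=None):
    """Return extra GeMoMaPipeline arguments for RNA-seq and intron length.

    Arguments that are left at their default (empty or None) are not
    emitted, so the pipeline defaults stay in effect.
    """
    args = []
    if bam_files:
        # r=MAPPED selects mapped reads; ERE.m can be used multiple times
        args.append("r=MAPPED")
        args.extend(f"ERE.m={bam}" for bam in bam_files)
    if max_intron_length is not None:
        args.append(f"GeMoMa.m={int(max_intron_length)}")
    if static_intron_length is not None:
        args.append("GeMoMa.sil=" + ("true" if static_intron_length else "false"))
    return args


if __name__ == "__main__":
    print(" ".join(rnaseq_and_intron_args(["leaf.bam", "root.bam"], 20000, False)))
```

The returned list is meant to be appended to an existing call of the GeMoMaPipeline. The introns extracted from the mapped reads are denoised by default, so the '''DenoiseIntrons''' parameters, including its own maximum intron length, may need to be adjusted as well.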
<br />
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
<br />
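As a hedged sketch, the call below combines several of the parameters listed above: simple renaming with a custom prefix, additional CDS output, and eight threads. The file names (target.fa, ref.gff, ref.fa), the prefix TG, and the output directory are placeholders, and the ''run'' helper is a hypothetical convenience (not part of GeMoMa) that only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Placeholder inputs: target.fa (target genome), ref.gff and ref.fa (reference).
run java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline \
    t=target.fa a=ref.gff g=ref.fa \
    AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=TG AnnotationFinalizer.d=6 \
    pc=true threads=8 outdir=pipeline_out
```

Calling the script with RUN=1 in the environment executes the command instead of only printing it.<br />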
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file><br />
<br />
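Because ''m'' can be used multiple times, several BAM/SAM files can be processed in one call. The following sketch assumes two hypothetical files (rep1.bam, rep2.bam) from a stranded library and additionally requests coverage output via ''c''. The ''run'' helper is a hypothetical convenience that only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# rep1.bam and rep2.bam are placeholder names for mapped RNA-seq reads.
# s=FR_FIRST_STRAND declares the library type, c=true enables coverage output,
# mmq=40 repeats the documented default for the minimum mapping quality.
run java -jar GeMoMa-1.6.4.jar CLI ERE \
    s=FR_FIRST_STRAND m=rep1.bam m=rep2.bam c=true mmq=40 outdir=ere_out
```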
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome><br />
<br />
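The ''i'' parameter can be repeated, so introns from several sources can be checked together. The sketch below uses placeholder file names (target.fa, introns_a.gff, introns_b.gff) and a hypothetical ''run'' helper that only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Two intron files (placeholder names) are passed by repeating i=...;
# v=true switches on the verbose output.
run java -jar GeMoMa-1.6.4.jar CLI CheckIntrons \
    t=target.fa i=introns_a.gff i=introns_b.gff v=true outdir=checkintrons_out
```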
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
<br />
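For stranded RNA-seq data, the selection ''c''=STRANDED expects a forward and a reverse coverage file instead of a single unstranded one. The sketch below assumes placeholder file names (introns.gff, coverage_forward.bedgraph, coverage_reverse.bedgraph) and sets a stricter expression threshold than the default purely for illustration. The ''run'' helper is hypothetical and only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Placeholder inputs; in practice these would typically come from ERE.
# m=15000 repeats the documented default for the maximum intron length,
# me=0.05 is an illustrative value that is stricter than the default 0.01.
run java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons \
    i=introns.gff c=STRANDED \
    coverage_forward=coverage_forward.bedgraph \
    coverage_reverse=coverage_reverse.bedgraph \
    m=15000 me=0.05 outdir=denoise_out
```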
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This way, it allows you to easily collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list><br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome><br />
<br />
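The ''s'' parameter restricts '''Extractor''' to a list of transcripts. The sketch below writes such a list file and requests protein and CDS output in addition. The transcript IDs (tx0001, tx0002), the file names, and the assumption that columns are separated by tabs are illustrative only. The ''run'' helper is hypothetical and only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Write the list file: the first column holds transcript IDs as given in the
# annotation; further columns are ignored by Extractor (tab separation assumed).
printf 'tx0001\tkept for illustration\n' >  selected.txt
printf 'tx0002\n'                        >> selected.txt

# ref.gff and ref.fa are placeholder names for the reference annotation/genome.
run java -jar GeMoMa-1.6.4.jar CLI Extractor \
    a=ref.gff g=ref.fa s=selected.txt p=true c=true outdir=extractor_out
```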
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
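The three steps described above (Extractor, external search, GeMoMa) can be sketched as follows. This is an illustration under several assumptions, not a definitive recipe: all input file names are placeholders, the '''Extractor''' output names cds-parts.fasta and assignment.tabular are assumed, and the tblastn output format string should be verified against the GeMoMa manual for your version. The ''run'' helper is hypothetical and only prints the commands unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Step 1: extract cds parts and assignment from the reference
# (ref.gff and ref.fa are placeholder names).
run java -jar GeMoMa-1.6.4.jar CLI Extractor a=ref.gff g=ref.fa outdir=extractor_out

# Step 2: external search with NCBI BLAST+, using the cds parts as query.
# Assumption: Extractor wrote cds-parts.fasta and assignment.tabular.
# Assumption: the extended tabular columns below are what GeMoMa expects.
run makeblastdb -in target.fa -dbtype nucl -out target_db
run tblastn -query extractor_out/cds-parts.fasta -db target_db -evalue 100 \
    -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles" \
    -out search_results.txt

# Step 3: build gene models from the search results.
run java -jar GeMoMa-1.6.4.jar CLI GeMoMa \
    s=search_results.txt t=target.fa \
    c=extractor_out/cds-parts.fasta a=extractor_out/assignment.tabular \
    outdir=gemoma_out
```

<br />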
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
<br />
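When RNA-seq evidence is available, introns and coverage can be passed in addition to the mandatory inputs. The sketch below assumes unstranded coverage, placeholder file names, and search results that are not yet sorted (hence ''sort''=true). The ''run'' helper is hypothetical and only prints the command unless RUN=1 is set.<br />

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of GeMoMa): prints the command
# and executes it only when the environment variable RUN is set to 1.
run() {
    printf '+ %s\n' "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# All file names are placeholders. introns.gff and coverage.bedgraph stand for
# RNA-seq evidence (e.g. produced by ERE and cleaned by DenoiseIntrons).
# sort=true lets GeMoMa sort search results that are not sorted by default.
run java -jar GeMoMa-1.6.4.jar CLI GeMoMa \
    s=search_results.txt t=target.fa c=cds-parts.fasta a=assignment.tabular \
    i=introns.gff coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
    sort=true outdir=gemoma_out
```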
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
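The three steps above can be sketched in Python as follows. This is a simplified illustration, not the GAF implementation: predictions are plain dictionaries, redundancy is assumed to mean identical borders on the same contig and strand, evidence simply counts identical predictions, the filter is hard-coded to the default criterion, and all function and key names are hypothetical.<br />

```python
from collections import defaultdict


def combine_and_filter(predictions):
    """Toy version of the three steps for predictions given as dicts.

    Assumed keys (illustrative only): contig, strand, begin, end,
    start, stop, score, aa.
    """
    # Step 1: merge redundant predictions and count them as "evidence".
    # Assumption: redundant = same contig, strand and borders.
    merged = {}
    for p in predictions:
        key = (p["contig"], p["strand"], p["begin"], p["end"])
        if key in merged:
            merged[key]["evidence"] += 1
        else:
            merged[key] = dict(p, evidence=1)

    # Step 2: apply the default filter start=='M' and stop=='*' and score/aa>=0.75.
    kept = [
        p for p in merged.values()
        if p["start"] == "M" and p["stop"] == "*" and p["score"] / p["aa"] >= 0.75
    ]

    # Step 3: cluster overlapping predictions per contig and strand.
    by_locus = defaultdict(list)
    for p in kept:
        by_locus[(p["contig"], p["strand"])].append(p)

    clusters = []
    for group in by_locus.values():
        group.sort(key=lambda p: p["begin"])
        current, current_end = [], None
        for p in group:
            if current and p["begin"] > current_end:
                clusters.append(current)  # no overlap: close the cluster
                current = []
            current.append(p)
            current_end = p["end"] if len(current) == 1 else max(current_end, p["end"])
        clusters.append(current)

    # Sort within each cluster by evidence, then score (best first).
    for cluster in clusters:
        cluster.sort(key=lambda p: (p["evidence"], p["score"]), reverse=True)
    return clusters
```

The first element of each returned cluster is the highest-ranked prediction under the default sorting (evidence, score).<br />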
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. In addition, GAF can integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotations. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file><br />
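To make the default filter more concrete, the following Python sketch re-implements the criterion <code>start=='M' and stop=='*' and score/aa>=0.75</code> outside of GAF. It assumes that the attributes are stored as <code>key=value</code> pairs separated by semicolons in the ninth GFF column; the function names and the example attribute strings are hypothetical, and GAF's own filter syntax is more general than this fixed check.<br />

```python
def parse_attributes(column9):
    """Split the ninth GFF column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_default_filter(attributes):
    """Check start=='M' and stop=='*' and score/aa>=0.75 for one prediction.

    Missing numeric attributes are treated as NaN, so the comparison fails.
    """
    score = float(attributes.get("score", "nan"))
    aa = float(attributes.get("aa", "nan"))
    complete = attributes.get("start") == "M" and attributes.get("stop") == "*"
    # aa > 0 guards against division by zero and is False for NaN
    return bool(complete and aa > 0 and score / aa >= 0.75)
```

A check like this can be handy for inspecting by hand why a particular prediction was removed by the filter.<br />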
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e., yield too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
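As an illustration of the F1 measure, the following Python sketch compares one predicted transcript with one annotated transcript. It assumes that the comparison is made on the covered nucleotide positions of the coding exons; the exact definition used by the tool may differ, and the function name and coordinate convention (1-based, inclusive) are hypothetical.<br />

```python
def f1_measure(predicted_exons, annotated_exons):
    """F1 of two transcripts given as lists of (start, end) exon intervals.

    Returns None (corresponding to NA) if the transcripts do not overlap.
    """
    predicted = {pos for start, end in predicted_exons for pos in range(start, end + 1)}
    annotated = {pos for start, end in annotated_exons for pos in range(start, end + 1)}
    shared = len(predicted & annotated)
    if shared == 0:
        return None
    precision = shared / len(predicted)  # fraction of predicted positions that are annotated
    recall = shared / len(annotated)     # fraction of annotated positions that are predicted
    return 2 * precision * recall / (precision + recall)
```

Identical exon structures yield 1.0, while a prediction that shares only half of its positions with an annotation of equal length yields 0.5.<br />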
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
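As an example of how such an attribute can be read, the following Python sketch computes a value analogous to tie from a list of predicted introns and a table of introns with their split-read support. It assumes that the value is reported as a fraction between 0 and 1 (as suggested by filters such as <code>tie==1</code>) and that transcripts without introns have no defined value, represented here as NaN; the function name and data layout are hypothetical.<br />

```python
def transcript_intron_evidence(predicted_introns, split_reads, min_reads=1):
    """Fraction of predicted introns supported by RNA-seq introns.

    predicted_introns: list of (start, end) tuples of one transcript
    split_reads: dict mapping (start, end) to the number of supporting split reads
    min_reads: minimal number of split reads for an intron to count (cf. parameter r)
    """
    if not predicted_introns:
        return float("nan")  # single-exon transcript: undefined in this sketch
    supported = sum(
        1 for intron in predicted_introns
        if split_reads.get(intron, 0) >= min_reads
    )
    return supported / len(predicted_introns)
```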
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
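The first check above can be automated with a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes that the contig/chromosome name is the first whitespace-separated token of each FASTA comment line and the first tab-separated column of each GFF/GTF line.<br />

```python
def contigs_missing_in_fasta(gff_lines, fasta_lines):
    """Return the contig names used in the annotation but absent from the genome."""
    fasta_names = {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }
    gff_names = {
        line.split("\t")[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }
    return sorted(gff_names - fasta_names)
```

Both arguments can be open file handles, e.g. <code>contigs_missing_in_fasta(open("annotation.gff"), open("genome.fa"))</code> with placeholder file names. An empty result means that every contig name in the annotation also occurs in the genome file.<br />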
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts may overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
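The answers on memory and runtime above can be combined in one call. The following is a minimal sketch, assuming placeholder values for the memory limits, the thread count and all file names; it only assembles and prints the command so that it can be inspected before it is actually run.<br />

```shell
# Placeholder values (assumptions); adjust them to your machine and data.
min_ram="5G"    # initially used RAM, passed as -Xms
max_ram="50G"   # maximally used RAM, passed as -Xmx
threads=8       # number of threads for GeMoMaPipeline

# Java VM options go before -jar, tool parameters after the tool name.
cmd="java -Xms${min_ram} -Xmx${max_ram} -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline"
cmd="${cmd} threads=${threads} tblastn=false"
cmd="${cmd} t=target.fa a=reference.gff g=reference.fa outdir=results"

# Print the assembled command instead of executing it.
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the terminal or run from a script.<br />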
<br />
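The genetic code format described in the FAQ (two tab-delimited columns, DNA triplets using A, C, G and T) can be checked before passing a file to GeMoMa. The following is a rough sketch under stated assumptions: the three-line file is an illustrative excerpt and not a complete genetic code, and the helper <code>check_genetic_code</code> is a hypothetical function that is not part of GeMoMa.<br />

```shell
# Illustrative excerpt only (three amino acids), not a complete genetic code.
# Column 1: one-letter amino acid code; column 2: comma-separated DNA triplets.
code_file=$(mktemp)
printf 'M\tATG\nW\tTGG\nF\tTTT,TTC\n' > "$code_file"

# Hypothetical helper: reports every line that does not consist of a single
# letter, a tab, and a comma-separated list of triplets over A, C, G, T.
check_genetic_code() {
  awk -F '\t' '
    NF != 2 || length($1) != 1 || $2 !~ /^[ACGT][ACGT][ACGT](,[ACGT][ACGT][ACGT])*$/ {
      print "line " NR ": unexpected format"; bad = 1
    }
    END { exit bad }
  ' "$1"
}

check_genetic_code "$code_file" && echo "format looks fine"
```

A file that still uses the RNA alphabet, for instance one copied from the NCBI table, would be reported because of the U in its triplets.<br />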
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1096GeMoMa2020-04-25T20:59:14Z<p>Keilwagen: conda</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
GeMoMa is now available at [https://anaconda.org Anaconda]. Here is the [https://anaconda.org/keili/gemoma direct link to the package]. <br />
However, you can also install GeMoMa manually.<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
There are several parameters that need to be set, indicated with '''&lt;'''foo'''&gt;'''. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
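As a minimal sketch of how mapped RNA-seq reads might be passed to the pipeline, the following snippet extends the recommended call with the GeMoMaPipeline parameters <code>r</code> (RNA-seq evidence) and <code>ERE.m</code> (mapped reads file). The thread count and all file names are placeholders (assumptions for illustration), and the command is only printed, not executed.<br />

```shell
# Placeholder file names (assumptions); replace them with your own data.
target="target.fa"
ref_annotation="reference.gff"
ref_genome="reference.fa"
bam="mapped_reads.bam"   # reads mapped against the target genome

# Recommended settings from this section.
cmd="java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=4 outdir=results"
cmd="${cmd} tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true"
cmd="${cmd} t=${target} a=${ref_annotation} g=${ref_genome}"
# r=MAPPED selects mapped reads as RNA-seq evidence; ERE.m names the BAM/SAM file.
cmd="${cmd} r=MAPPED ERE.m=${bam}"

# Print the assembled command instead of executing it.
echo "$cmd"
```

According to the parameter table of the pipeline, <code>ERE.m</code> can be used multiple times, so several BAM/SAM files could be appended in the same way.<br />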
<br />
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
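<br />
The default filter of '''GAF''' described above (start=='M' and stop=='*' and score/aa>=0.75) can be illustrated with a minimal Python sketch. It only mimics the documented expression; the function name, the example predictions, and the handling of missing attributes as NaN are assumptions made for illustration and are not taken from the GeMoMa source code.<br />
<br />
```python
import math


def passes_default_filter(attrs):
    """Mimic the documented default filter: start=='M' and stop=='*' and score/aa>=0.75."""
    # Missing numeric attributes are treated as NaN (assumption of this sketch).
    score = attrs.get("score", math.nan)
    aa = attrs.get("aa", math.nan)
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    # Any comparison with NaN is False, so predictions without a score are rejected.
    return complete and aa > 0 and score / aa >= 0.75


# Hypothetical GFF attributes of three predictions.
predictions = [
    {"ID": "t1", "start": "M", "stop": "*", "score": 400, "aa": 500},  # relative score 0.8
    {"ID": "t2", "start": "M", "stop": "*", "score": 300, "aa": 500},  # relative score 0.6
    {"ID": "t3", "start": "L", "stop": "*", "score": 480, "aa": 500},  # incomplete start
]
kept = [p["ID"] for p in predictions if passes_default_filter(p)]
print(kept)
```
<br />
In this sketch only the first prediction is kept: the second fails the relative score criterion and the third does not start with a methionine.<br />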
<br />
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file><br />
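<br />
Since the parameter ''m'' can be used multiple times, a call with several BAM files simply repeats the key. The following Python sketch assembles such a command line from a dictionary. The helper name and the file names are hypothetical; the parameter keys and values are those listed in the table above.<br />
<br />
```python
import shlex


def gemoma_command(tool, params, jar="GeMoMa-1.6.4.jar"):
    """Build `java -jar <jar> CLI <tool> key=value ...`; list values repeat the key."""
    args = ["java", "-jar", jar, "CLI", tool]
    for key, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        args.extend(f"{key}={v}" for v in values)
    return args


# leaf.bam and root.bam are placeholder file names.
cmd = gemoma_command(
    "ERE",
    {"s": "FR_FIRST_STRAND", "m": ["leaf.bam", "root.bam"], "c": "true", "mmq": 40},
)
print(shlex.join(cmd))
```
<br />
The returned list can be passed directly to <code>subprocess.run</code>, which avoids quoting problems with file names that contain spaces.<br />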
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome><br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list><br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins, CDSs, etc. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
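<br />
As an illustration of the two-column genetic code file, the following Python sketch reads such a table into a dictionary. The description above does not state how the triplets within the second column are separated; the comma used here is an assumption, as are the function name and the example content.<br />
<br />
```python
import io


def read_genetic_code(handle, sep=","):
    """Return {triplet: amino acid} from lines with two tab-delimited columns."""
    code = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t")[:2]
        # The separator between triplets is an assumption of this sketch.
        for triplet in triplets.split(sep):
            triplet = triplet.strip().upper()
            if triplet:
                code[triplet] = amino_acid
    return code


# Hypothetical excerpt of a genetic code in which TGA encodes tryptophan.
example = io.StringIO("M\tATG\nW\tTGG,TGA\n*\tTAA,TAG\n")
code = read_genetic_code(example)
print(code["TGA"])
```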
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome><br />
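<br />
The three options of the parameter ''Ambiguity'' can be illustrated with a small Python sketch that translates a single codon using the standard genetic code. This is not the implementation used by '''Extractor'''. In particular, the sketch assumes that a codon counts as ambiguous as soon as it contains a symbol other than A, C, G, or T, and it raises an error for EXCEPTION instead of removing a transcript.<br />
<br />
```python
import itertools
import random

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}
# IUPAC nucleotide codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, mode="AMBIGUOUS", rng=random):
    """Translate one codon; `mode` mirrors EXCEPTION, AMBIGUOUS, and RANDOM."""
    codon = codon.upper()
    if all(base in BASES for base in codon):
        return CODON_TABLE[codon]
    if mode == "EXCEPTION":
        # Extractor would remove the transcript; the sketch signals this by an error.
        raise ValueError(f"ambiguous codon {codon}")
    if mode == "AMBIGUOUS":
        return "X"
    if mode == "RANDOM":
        expansions = itertools.product(*(IUPAC[base] for base in codon))
        options = sorted({CODON_TABLE["".join(e)] for e in expansions})
        return rng.choice(options)
    raise ValueError(f"unknown mode {mode}")


print(translate_codon("ATG"), translate_codon("ATN", "AMBIGUOUS"))
```
<br />
With RANDOM, the codon ATN yields either I or M in this sketch, because ATA, ATC, and ATT encode isoleucine and ATG encodes methionine.<br />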
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences, feed the results in ''query cds parts'', and leave ''assignment'' unselected.<br />
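<br />
Sorting by the first column brings all hits of one query next to each other. A minimal Python sketch of this idea is shown below. The function name and the three-column example hits are hypothetical and much shorter than real tabular search output.<br />
<br />
```python
def sort_search_results(lines):
    """Stable sort of tab-separated search hits by their first column (the query ID)."""
    hits = [line for line in lines if line.strip()]
    return sorted(hits, key=lambda line: line.split("\t", 1)[0])


# Hypothetical hits: query ID, target contig, percent identity.
raw = [
    "geneB_1\tchr2\t88.0\n",
    "geneA_0\tchr1\t95.0\n",
    "geneB_0\tchr2\t91.0\n",
    "geneA_0\tchr3\t70.0\n",
]
sorted_hits = sort_search_results(raw)
for line in sorted_hits:
    print(line, end="")
```
<br />
Because Python's sort is stable, hits of the same query keep their original relative order.<br />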
<br />
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag that switches between a static intron length, which is specified by the user and identical for all genes, and a dynamic intron length, which is the gene-specific maximum intron length in the reference organism plus the user-given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. If columns 2 to 5 contain the chromosome, strand, start and end of a region, they define a target region that the prediction should overlap, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned GFF. It might be beneficial to set this to a specific value for some genome browsers, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The maximal number of seconds to be used for the predictions of one transcript; if it is exceeded, GeMoMa does not output a prediction for this transcript, valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (Selects whether the search results are trusted as given, re-scored or re-aligned, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
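As a rough sketch of how this call could be scripted, the following Python snippet assembles the key=value arguments listed in the table above. All file names are hypothetical placeholders, the helper <code>build_gemoma_command</code> is not part of GeMoMa, and placing <code>coverage_unstranded</code> directly after <code>coverage=UNSTRANDED</code> is an assumption derived from the nested layout of the parameter table.

```python
import shlex


def build_gemoma_command(jar, search_results, target_genome, cds_parts,
                         assignment=None, introns=(), min_reads=None,
                         unstranded_coverage=None, outdir=None):
    """Assemble the argument list for the GeMoMa module as key=value pairs."""
    cmd = ["java", "-jar", jar, "CLI", "GeMoMa",
           f"s={search_results}", f"t={target_genome}", f"c={cds_parts}"]
    if assignment is not None:
        cmd.append(f"a={assignment}")
    # 'i' can be used zero or multiple times: one entry per intron file.
    cmd.extend(f"i={path}" for path in introns)
    if min_reads is not None:
        cmd.append(f"r={min_reads}")
    if unstranded_coverage is not None:
        # Assumption: the selection is followed by its nested parameter.
        cmd += ["coverage=UNSTRANDED",
                f"coverage_unstranded={unstranded_coverage}"]
    if outdir is not None:
        cmd.append(f"outdir={outdir}")
    return cmd


# Hypothetical input files; replace them with your own data.
cmd = build_gemoma_command(
    "GeMoMa-1.6.4.jar", "search.txt", "target.fa", "cds-parts.fasta",
    assignment="assignment.tabular",
    introns=["introns_1.gff", "introns_2.gff"],
    min_reads=2, outdir="out")
print(shlex.join(cmd))
# To execute the call: subprocess.run(cmd, check=True)
```

Building the call as a list keeps each key=value pair a separate argument, so file names containing spaces do not need manual quoting.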
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
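The first step can be pictured with a small Python sketch. It is not GAF's implementation: it assumes that two predictions are redundant when they share contig, strand and coding exon coordinates, and it uses hypothetical dictionary keys (<code>ref</code>, <code>chrom</code>, <code>strand</code>, <code>cds</code>) to represent a prediction. The attributes <code>evidence</code> and <code>sumWeight</code> are then derived per group.

```python
from collections import defaultdict


def merge_redundant(predictions, weights):
    """Group identical predictions and add 'evidence' and 'sumWeight'.

    predictions: list of dicts with the (hypothetical) keys
                 'ref', 'id', 'chrom', 'strand' and 'cds' (list of (start, end)).
    weights:     dict mapping a reference name to its weight (default 1.0).
    """
    groups = defaultdict(list)
    for prediction in predictions:
        # Assumption: identical coding exons on the same contig and strand
        # mean that the predictions are redundant.
        key = (prediction["chrom"], prediction["strand"], tuple(prediction["cds"]))
        groups[key].append(prediction)

    merged = []
    for members in groups.values():
        references = sorted({member["ref"] for member in members})
        representative = dict(members[0])
        # Number of reference organisms yielding this prediction.
        representative["evidence"] = len(references)
        # Sum of the weights of the supporting references.
        representative["sumWeight"] = sum(weights.get(ref, 1.0) for ref in references)
        merged.append(representative)
    return merged


predictions = [
    {"ref": "A", "id": "a1", "chrom": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"ref": "B", "id": "b7", "chrom": "chr1", "strand": "+", "cds": [(100, 200), (300, 400)]},
    {"ref": "B", "id": "b9", "chrom": "chr1", "strand": "+", "cds": [(100, 200), (300, 450)]},
]
for entry in merge_redundant(predictions, {"A": 1.0, "B": 2.0}):
    print(entry["id"], entry["evidence"], entry["sumWeight"])
```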
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info from any source can be used as long as it is provided as tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or as filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes, default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied, for instance, using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file><br />
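To make the default filter from the parameter table more concrete, the following Python sketch evaluates <code>start=='M' and stop=='*' and score/aa>=0.75</code> on the attribute column of a GFF line. It only illustrates the semantics and is not how GAF parses or evaluates filters; the helper names and the example attribute strings are hypothetical, and treating missing numeric attributes as NaN mirrors the description of the default attributes parameter.

```python
import math


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def number(attributes, key):
    """Return a numeric attribute; missing or non-numeric values become NaN."""
    try:
        return float(attributes[key])
    except (KeyError, ValueError):
        return math.nan


def passes_default_filter(attributes):
    """Completeness (start and stop codon) plus relative score score/aa >= 0.75."""
    complete = attributes.get("start") == "M" and attributes.get("stop") == "*"
    aa = number(attributes, "aa")
    score = number(attributes, "score")
    # Comparisons involving NaN are False, so incomplete records are rejected.
    return complete and aa > 0 and score / aa >= 0.75


examples = [
    "ID=t1;aa=100;score=80;start=M;stop=*",
    "ID=t2;aa=100;score=70;start=M;stop=*",
    "ID=t3;aa=100;score=90;start=L;stop=*",
]
for line in examples:
    attributes = parse_attributes(line)
    print(attributes["ID"], passes_default_filter(attributes))
```

Dividing the score by the number of amino acids makes predictions of different lengths comparable, which is why the filter uses the relative score rather than the raw score.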
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., overly long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned GFF. It might be beneficial to set this to a specific value for some genome browsers, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (enables the prediction of UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (generates generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true, the new name is added as a new attribute &quot;Name&quot;; otherwise, the &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
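For orientation, the F1 measure is the harmonic mean of precision and recall. The Python sketch below computes it at the nucleotide level over coding positions, which is an assumption made for illustration; the exact unit used by CompareTranscripts is not specified here. Exons are given as hypothetical 1-based inclusive (start, end) pairs on the same contig and strand.

```python
def coding_positions(exons):
    """Expand 1-based inclusive (start, end) exons into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def f1_measure(predicted_exons, annotated_exons):
    """Return the F1 measure of two transcripts, or None if they do not overlap."""
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    shared = len(predicted & annotated)
    if shared == 0:
        return None  # corresponds to NA: no overlapping annotation
    precision = shared / len(predicted)  # share of the prediction that is annotated
    recall = shared / len(annotated)     # share of the annotation that is predicted
    return 2 * precision * recall / (precision + recall)


print(f1_measure([(1, 100), (201, 300)], [(1, 100), (201, 300)]))  # identical transcripts
print(f1_measure([(1, 100)], [(51, 150)]))                         # half of each overlaps
print(f1_measure([(1, 100)], [(500, 600)]))                        # no overlap
```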
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
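<br />
The attributes listed above are written to the ninth column of the GFF. As a minimal sketch of filtering by hand, the following Python helpers parse that column and keep predictions with sufficient intron evidence. It assumes the usual GFF3 <code>key=value;key=value</code> layout, that missing values are reported as <code>NA</code>, and that <code>tie</code> is reported as a fraction between 0 and 1 (adjust the threshold if your file uses a different scale). The function names, the example line and the threshold are illustrative and not part of GeMoMa.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column into a dict of key -> value strings."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_filter(gff_line, feature="prediction", min_tie=1.0):
    """Return True for lines of the given feature whose tie is NA or >= min_tie."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9 or columns[2] != feature:
        return False
    tie = parse_attributes(columns[8]).get("tie", "NA")
    # NA is assumed to mark transcripts without introns (single coding exon)
    return tie == "NA" or float(tie) >= min_tie


# Hypothetical example line, not taken from real GeMoMa output
line = "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1;aa=250;score=512;tie=0.5;iAA=0.9"
print(parse_attributes(line.split("\t")[8])["aa"])
print(passes_filter(line))
print(passes_filter(line, min_tie=0.5))
```

For routine filtering and ranking, GAF remains the intended tool; a script like this is mainly useful for quick inspection of individual attributes.<br />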
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
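<br />
The first check in this answer, matching contig/chromosome names between annotation and genome, can be sketched in Python as follows. This is an illustrative helper under stated assumptions, not part of GeMoMa: it takes the FASTA name to be the header text up to the first whitespace and the annotation name to be the first tab-separated column. All function names are hypothetical.<br />

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_names(gff_lines):
    """Collect the sequence names from the first column of a GFF/GTF."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def missing_in_fasta(gff_lines, fasta_lines):
    """Names used in the annotation that have no matching FASTA record."""
    return sorted(gff_names(gff_lines) - fasta_names(fasta_lines))


# Tiny in-memory example; real files would be read with open(path)
fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
gff = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1",
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2",
]
print(missing_in_fasta(gff, fasta))
```

Any name reported here appears in the annotation but not in the genome file, so genes on that sequence cannot be extracted.<br />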
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We called a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically you should set: '''-Xms''' the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''' the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
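<br />
To make the expected format concrete, here is a small Python sketch that parses and sanity-checks such a two-column table: one-letter amino acid code, a tab, then a comma-separated list of DNA triplets. It is an illustration of the format described above, not GeMoMa's own parser; the function names and the two-row excerpt are assumptions made for the example.<br />

```python
from itertools import product


def parse_genetic_code(lines):
    """Map each DNA triplet to the one-letter code given in the first column."""
    codon_to_aa = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # Genomic DNA is expected, so U (RNA alphabet) is rejected
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: duplicate triplet {triplet}")
            codon_to_aa[triplet] = amino_acid.strip()
    return codon_to_aa


def unassigned_triplets(codon_to_aa):
    """List the triplets over A, C, G, T that the table does not cover."""
    every = ("".join(p) for p in product("ACGT", repeat=3))
    return [t for t in every if t not in codon_to_aa]


# Two-row excerpt for illustration only; a real table lists many more rows
example = ["M\tATG", "W\tTGG"]
code = parse_genetic_code(example)
print(code["ATG"], len(unassigned_triplets(code)))
```

Running such a check on a table converted from the NCBI page catches the most common mistake, namely triplets left in the RNA alphabet.<br />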
; I like to accelerate GeMoMa. What can I do?<br />
:If you like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline you just have to select <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen https://www.jstacs.de/index.php?title=GeMoMa&diff=1095 GeMoMa 2020-04-25T20:37:05Z<p>Keilwagen: </p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline allowing to use GeMoMa easily. If you like to start GeMoMa for the first time, we recommend to use the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
There are several parameters that need to be set, indicated with '''&lt;'''foo'''&gt;'''. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend to set several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file allowing to rerun the combination step ('''GAF''') very easily and quickly<br />
If you like to specify the maximum intron length please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data either from own experiments or publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section '''DenoiseIntrons'''.<br />
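<br />
When several references are used, assembling the call by hand becomes error-prone. The following Python sketch builds the recommended command line shown above and repeats the tags <code>i</code>, <code>a</code>, <code>g</code> once per reference. It only uses the parameters named in this section; the helper name <code>pipeline_command</code> and all file names are hypothetical.<br />

```python
import shlex


def pipeline_command(jar, target, references, threads, outdir):
    """Build the argument list for the recommended GeMoMaPipeline call.

    references is a list of (annotation, genome, reference_id) tuples;
    reference_id may be None because the ID is optional.
    """
    command = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}", f"outdir={outdir}",
        "tblastn=false", "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO", "o=true", f"t={target}",
    ]
    for annotation, genome, reference_id in references:
        if reference_id:
            command.append(f"i={reference_id}")
        command += [f"a={annotation}", f"g={genome}"]
    return command


# Hypothetical file names for illustration
cmd = pipeline_command(
    "GeMoMa-1.6.4.jar", "target.fa",
    [("ref1.gff", "ref1.fa", "ref1"), ("ref2.gff", "ref2.fa", "ref2")],
    threads=4, outdir="results",
)
print(shlex.join(cmd))
```

The resulting list can be passed to <code>subprocess.run(cmd, check=True)</code>; keeping it as a list avoids quoting problems with paths that contain spaces.<br />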
<br />
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute the work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
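<br />
A larger call might add mapped RNA-seq reads as evidence. The following dry-run sketch only prints the command line instead of starting Java. All file names are hypothetical placeholders, and FR_FIRST_STRAND, threads=4 and outdir=results are example values, not recommendations. It uses only parameters listed in the table above.<br />
<br />
```shell
# Dry-run sketch: prints a GeMoMaPipeline call with mapped RNA-seq evidence.
# $1 = target genome, $2 = reference annotation, $3 = reference genome,
# $4 = BAM/SAM file with mapped reads (all hypothetical file names).
gemoma_pipeline_cmd() {
    # r=MAPPED selects the nested ERE.* parameters of the pipeline
    set -- java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline \
        "t=$1" "a=$2" "g=$3" \
        r=MAPPED ERE.s=FR_FIRST_STRAND "ERE.m=$4" \
        threads=4 outdir=results
    echo "$*"
}

gemoma_pipeline_cmd target.fa ref.gff ref.fa rnaseq.bam
```
<br />
Replacing the echo line by "$@" would start the pipeline instead of printing the call.<br />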
<br />
<br />
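The filter expressions of '''GAF''' contain spaces, single quotes, parentheses and the characters * and >, which a shell would otherwise interpret. A minimal sketch of one way to quote them, assuming a POSIX shell; the helper count_args is hypothetical and only shows that each key=value pair arrives as a single argument:<br />
<br />
```shell
# Filter strings copied from the descriptions of GAF.f and GAF.a.
# Double quotes protect the spaces, single quotes, parentheses, '*' and '>'.
gaf_filter="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"
alt_filter="tie==1 or evidence>1"

# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

# Each quoted key=value pair stays one argument, so this prints 2.
count_args "GAF.f=$gaf_filter" "GAF.a=$alt_filter"
```
<br />
The same two quoted arguments could be appended to a GeMoMaPipeline call.<br />
<br />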
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file><br />
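<br />
Because m can be used multiple times, several BAM/SAM files can be passed in one call. The following dry-run sketch prints such a call, assuming file names without spaces; the file names are hypothetical, and c=true is added to request the coverage output as well.<br />
<br />
```shell
# Dry-run sketch: prints an ERE call for several mapped-read files.
# $1 = strandedness (FR_UNSTRANDED, FR_FIRST_STRAND or FR_SECOND_STRAND),
# remaining arguments = BAM/SAM files (hypothetical names).
ere_cmd() {
    stranded=$1
    shift
    cmd="java -jar GeMoMa-1.6.4.jar CLI ERE s=$stranded"
    # repeat m= once per mapped-read file
    for bam in "$@"; do
        cmd="$cmd m=$bam"
    done
    # c=true additionally requests the coverage output
    echo "$cmd c=true"
}

ere_cmd FR_FIRST_STRAND leaf.bam root.bam
```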
<br />
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome><br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
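<br />
For stranded RNA-seq data, the selection c=STRANDED expects a forward and a reverse coverage file. The following dry-run sketch prints such a call; the file names are hypothetical placeholders, and m, me and context are simply set to the defaults from the table above to show where they would be changed.<br />
<br />
```shell
# Dry-run sketch: prints a DenoiseIntrons call with stranded coverage.
# $1 = intron GFF, $2 = forward coverage, $3 = reverse coverage
# (all hypothetical file names).
denoise_cmd() {
    set -- java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons \
        "i=$1" c=STRANDED "coverage_forward=$2" "coverage_reverse=$3" \
        m=15000 me=0.01 context=10
    echo "$*"
}

denoise_cmd introns.gff coverage_forward.bedgraph coverage_reverse.bedgraph
```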
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This way, it allows you to easily collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list><br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
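<br />
As an illustration of the two mandatory inputs plus optional outputs, the following dry-run sketch prints an '''Extractor''' call that also requests protein and CDS sequences. The file names are hypothetical placeholders; the parameters a, g, p and c are those listed in the table below.<br />
<br />
```shell
# Dry-run sketch: prints an Extractor call that also returns proteins and CDSs.
# $1 = reference annotation (GFF or GTF), $2 = reference genome (FASTA)
# (hypothetical file names).
extractor_cmd() {
    set -- java -jar GeMoMa-1.6.4.jar CLI Extractor "a=$1" "g=$2" p=true c=true
    echo "$*"
}

extractor_cmd ref.gff.gz ref.fa.gz
```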
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome><br />
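<br />
The three options of the parameter ''Ambiguity'' can be pictured with the following Python sketch. It is a simplified illustration and not the Extractor implementation: it uses the standard genetic code, raises an error where Extractor would remove the transcript, and treats a codon whose expansions all encode the same amino acid as unambiguous, which is an assumption. The function name translate_codon is hypothetical.<br />
<br />
```python
import itertools
import random

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, policy="EXCEPTION", rng=random):
    """Translate one codon that may contain IUPAC ambiguity codes."""
    expansions = itertools.product(*(IUPAC[base] for base in codon.upper()))
    options = sorted({CODON_TABLE["".join(exp)] for exp in expansions})
    if len(options) == 1:
        return options[0]          # all expansions agree
    if policy == "EXCEPTION":
        raise ValueError(f"ambiguous codon {codon}: {options}")
    if policy == "AMBIGUOUS":
        return "X"                 # placeholder amino acid
    if policy == "RANDOM":
        return rng.choice(options) # pick one of the possibilities
    raise ValueError(f"unknown policy: {policy}")


print(translate_codon("GCN"), translate_codon("ATN", "AMBIGUOUS"))
```
<br />
With EXCEPTION the caller has to handle the error, which corresponds to dropping the affected transcript; AMBIGUOUS and RANDOM always return a residue.<br />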
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g. '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
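<br />
What sorting according to the first column achieves can be sketched in a few lines of Python. The function name sort_by_query and the example hit lines are hypothetical, and the sketch assumes tab-delimited search results whose first column holds the query identifier.<br />
<br />
```python
def sort_by_query(lines):
    """Stable sort of tab-delimited hit lines by their first column."""
    hits = [line for line in lines if line.strip()]
    return sorted(hits, key=lambda line: line.split("\t", 1)[0])


# Hypothetical, shortened hit lines: query id, target contig, score.
unsorted_hits = [
    "geneB_0\tchr2\t87",
    "geneA_1\tchr1\t55",
    "geneB_1\tchr2\t40",
    "geneA_0\tchr1\t91",
    "geneA_0\tchr3\t35",
]

for line in sort_by_query(unsorted_hits):
    print(line)
```
<br />
After sorting, all hits of one query are adjacent, so they can be processed as one block. Because the sort is stable, hits of the same query keep their original relative order.<br />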
<br />
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you would like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest decreasing the value of ''contig threshold'', which allows GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows avoiding stop codons in a transcript (except at the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
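<br />
When the modules are called from a script, the key=value arguments can be assembled programmatically. The following Python sketch only builds the argument lists and does not start Java. The helper gemoma_command and all file names are hypothetical; writing BOOLEAN values as true/false and repeating a key for parameters that can be used multiple times are assumptions based on the tables above.<br />
<br />
```python
def gemoma_command(module, jar="GeMoMa-1.6.4.jar", **params):
    """Build the argument list for one module call (nothing is executed)."""
    command = ["java", "-jar", jar, "CLI", module]
    for key, value in params.items():
        # A list or tuple stands for a parameter that is given repeatedly.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"
            command.append(f"{key}={item}")
    return command


# Hypothetical file names, chosen for illustration only.
extract = gemoma_command("Extractor", a="reference.gff", g="reference.fasta")
predict = gemoma_command(
    "GeMoMa",
    s="search_results.tabular",
    t="target.fasta",
    c="cds-parts.fasta",
    a="assignment.tabular",
    sort=True,
    i=["introns_1.gff", "introns_2.gff"],
)

print(" ".join(extract))
print(" ".join(predict))
```
<br />
The resulting lists can be handed to subprocess.run from the Python standard library. Parameter names that are not valid Python identifiers, such as intron-loss-gain-penalty, can be passed by unpacking a dictionary.<br />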
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
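<br />
The first step and the sorting within a cluster can be pictured with the following Python sketch. It is a strongly simplified illustration and not the GAF implementation. It assumes that two predictions are redundant if they share contig, strand and CDS coordinates, that ''evidence'' counts the input sources supporting a prediction, and that ''sumWeight'' adds up the weights of these sources. All names and values are hypothetical.<br />
<br />
```python
from collections import defaultdict


def merge_redundant(predictions, weights):
    """Collapse identical predictions and add evidence and sumWeight."""
    groups = defaultdict(list)
    for pred in predictions:
        key = (pred["contig"], pred["strand"], tuple(pred["cds"]))
        groups[key].append(pred)
    merged = []
    for group in groups.values():
        sources = sorted({pred["source"] for pred in group})
        # Keep a copy of the best-scoring member as representative.
        representative = dict(max(group, key=lambda pred: pred["score"]))
        representative["evidence"] = len(sources)
        representative["sumWeight"] = sum(weights.get(s, 1.0) for s in sources)
        merged.append(representative)
    return merged


# Hypothetical predictions from two reference organisms.
predictions = [
    {"id": "a1", "source": "refA", "contig": "chr1", "strand": "+",
     "cds": [(100, 200), (300, 400)], "score": 310},
    {"id": "b1", "source": "refB", "contig": "chr1", "strand": "+",
     "cds": [(100, 200), (300, 400)], "score": 295},
    {"id": "a2", "source": "refA", "contig": "chr1", "strand": "+",
     "cds": [(100, 200), (320, 400)], "score": 330},
]
weights = {"refA": 1.0, "refB": 2.0}

merged = merge_redundant(predictions, weights)
# Sort as with the default sorting "evidence,score", highest values first.
ranked = sorted(merged, key=lambda p: (p["evidence"], p["score"]), reverse=True)
print([(p["id"], p["evidence"], p["sumWeight"]) for p in ranked])
```
<br />
In this toy example the prediction supported by both sources is ranked first although the other prediction has a higher score, because evidence is the first sorting criterion.<br />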
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It allows combining the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows integrating predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file><br />
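<br />
To get a feeling for which predictions pass the default values of ''filter'' and ''alternative transcript filter'', the two expressions can be re-implemented in a few lines of Python. This sketch is not the expression parser used by GAF. It assumes that the attributes are stored as key=value pairs separated by semicolons in the ninth GFF column and that missing attributes behave like NaN; the function names and the example attribute strings are hypothetical.<br />
<br />
```python
import math


def parse_attributes(column9):
    """Split a GFF attribute column into a dict of strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def number(attrs, key):
    """Return an attribute as float, NaN if missing or not numeric."""
    try:
        return float(attrs[key])
    except (KeyError, ValueError):
        return math.nan


def passes_default_filter(attrs):
    """start=='M' and stop=='*' and score/aa>=0.75"""
    aa = number(attrs, "aa")
    relative_score = number(attrs, "score") / aa if aa > 0 else math.nan
    return (attrs.get("start") == "M" and attrs.get("stop") == "*"
            and relative_score >= 0.75)


def passes_alternative_filter(attrs):
    """tie==1 or evidence>1"""
    return number(attrs, "tie") == 1 or number(attrs, "evidence") > 1


# Hypothetical attribute columns of two transcript predictions.
good = parse_attributes("ID=t1;aa=200;score=180;start=M;stop=*;tie=1;evidence=1")
weak = parse_attributes("ID=t2;aa=200;score=120;start=M;stop=*")

print(passes_default_filter(good), passes_default_filter(weak))
print(passes_alternative_filter(good), passes_alternative_filter(weak))
```
<br />
Comparisons with NaN are always false, so a prediction that lacks an attribute fails every criterion that uses this attribute unless the criterion is combined with 'or'.<br />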
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It allows predicting UTRs for annotated coding sequences and generating generic gene and transcript names. UTR prediction might be negatively influenced (i.e. too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
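The attribute tie (transcript intron evidence) can be illustrated with a minimal sketch. It assumes that tie is the fraction of predicted introns that have at least <code>r</code> supporting split reads and that transcripts without introns are reported as NA; the function name, the intron tuple layout and the scaling to [0, 1] are assumptions made for illustration, not the implementation used by GeMoMa.<br />

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical intron representation: (contig, strand, start, end)
Intron = Tuple[str, str, int, int]


def transcript_intron_evidence(
    predicted: List[Intron],
    split_reads: Dict[Intron, int],
    min_reads: int = 1,
) -> Optional[float]:
    """Fraction of predicted introns with at least min_reads split reads."""
    if not predicted:
        # Single coding exon: there is no intron to evaluate (NA).
        return None
    supported = sum(
        1 for intron in predicted if split_reads.get(intron, 0) >= min_reads
    )
    return supported / len(predicted)


# Made-up example: two predicted introns, one of them seen in RNA-seq data.
introns = [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)]
reads = {("chr1", "+", 100, 200): 5}
tie = transcript_intron_evidence(introns, reads)
```

Raising <code>min_reads</code> mirrors the effect of the parameter <code>r</code> (reads): introns with fewer supporting split reads no longer count as evidence.<br />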
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
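To make the F1 measure used by CompareTranscripts more concrete, here is a minimal sketch. It assumes that precision and recall are computed on the nucleotide positions covered by the coding exons of one predicted and one annotated transcript; this page does not state how the tool defines them, so the helper names and the nucleotide-level definition are assumptions.<br />

```python
from typing import Iterable, Optional, Set, Tuple

# 1-based, inclusive (start, end) coordinates, as used in GFF files
Interval = Tuple[int, int]


def covered_positions(exons: Iterable[Interval]) -> Set[int]:
    """Collect all positions covered by a list of exons."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def f1_measure(
    predicted: Iterable[Interval], annotated: Iterable[Interval]
) -> Optional[float]:
    """Harmonic mean of precision and recall; None (NA) without overlap."""
    pred = covered_positions(predicted)
    true = covered_positions(annotated)
    shared = len(pred & true)
    if shared == 0:
        return None
    precision = shared / len(pred)
    recall = shared / len(true)
    return 2 * precision * recall / (precision + recall)
```

With this definition, identical exon lists give 1, a prediction that is shifted by half its length gives 0.5, and disjoint transcripts give NA.<br />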
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
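The attributes listed above can also be inspected by hand. The following minimal sketch assumes the usual GFF3 layout, in which the ninth tab-separated column holds <code>key=value</code> pairs separated by semicolons; the example line, its values and the threshold are made up for illustration.<br />

```python
from typing import Dict, Optional


def parse_attributes(column: str) -> Dict[str, str]:
    """Split the ninth GFF column ("key=value;key=value") into a dictionary."""
    attributes: Dict[str, str] = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def numeric_attribute(attributes: Dict[str, str], name: str) -> Optional[float]:
    """Return an attribute as float, or None if it is absent or NA."""
    value = attributes.get(name)
    if value is None or value == "NA":
        return None
    return float(value)


# Hypothetical GFF line; all values are made up for illustration.
line = (
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\t"
    "ID=example_R0;aa=250;tie=1;minSplitReads=4;avgCov=37.5"
)
attributes = parse_attributes(line.split("\t")[8])
tie = numeric_attribute(attributes, "tie")
# Arbitrary example criterion: keep predictions whose introns are all supported.
keep = tie is not None and tie >= 1.0
```

Treating NA separately matters because attributes such as tie are not defined for single coding exon genes.<br />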
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
; I like to accelerate GeMoMa. What can I do?<br />
:If you like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline you just have to select <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline you just have to select <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
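The genetic code format described in the FAQ above (two tab-delimited columns: one-letter amino acid, then a comma-separated list of DNA triplets) can be checked with a minimal sketch like the following. The function name and the validation rules are assumptions for illustration; GeMoMa's own parser may accept or reject other inputs.<br />

```python
from typing import Dict


def parse_genetic_code(text: str) -> Dict[str, str]:
    """Map each DNA triplet to the one-letter code given in the first column."""
    codon_to_aa: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            # Genomic DNA is expected: A, C, G, T, but not U.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"invalid triplet: {codon!r}")
            if codon in codon_to_aa:
                raise ValueError(f"triplet listed twice: {codon}")
            codon_to_aa[codon] = amino_acid.strip()
    return codon_to_aa


# Three rows in the described format (excerpt, not a complete genetic code).
excerpt = "M\tATG\nW\tTGG\nF\tTTT,TTC\n"
table = parse_genetic_code(excerpt)
```

A table copied from a source that uses the RNA alphabet fails this check until U is replaced by T.<br />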
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa&diff=1094
GeMoMa
2020-04-24T11:56:47Z
<p>Keilwagen: 1.6.4</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own [https://galaxyproject.org/ Galaxy] instance.<br />
<br />
{|<br />
|__TOC__<br />
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]<br />
|}<br />
<br />
== Installation ==<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== In a nutshell ==<br />
<br />
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using the GeMoMaPipeline like this<br />
java -jar GeMoMa-1.6.4beta.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome><br />
Several parameters, indicated with '''&lt;'''foo'''&gt;''', need to be set. You can specify<br />
* the number of threads<br />
* the output directory<br />
* the target genome<br />
* and the reference ID (optional), annotation and genome. If you have several references just repeat the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.<br />
In addition, we recommend setting several parameters:<br />
* <code>tblastn=false</code>: use mmseqs instead of tblastn, since mmseqs is faster<br />
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed as mmseqs uses an approximation<br />
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts<br />
* <code>o=true</code>: output individual predictions for each reference as a separate file, which makes it quick and easy to rerun the combination step ('''GAF''')<br />
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.<br />
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.<br />
<br />
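The call above can also be assembled in a small shell script. The following is a minimal sketch, not an official GeMoMa script: it assumes two reference species and mapped RNA-seq reads, and all file names, the species IDs, the thread count and the output directory are hypothetical placeholders. It only uses parameters documented on this page (<code>r=MAPPED</code> and <code>ERE.m</code> are listed in the parameter table of the GeMoMa pipeline), prints the assembled command, and runs it only if the jar file is present.<br />
<br />
```shell
#!/bin/sh
# Sketch: assemble a GeMoMaPipeline call for two reference species
# plus mapped RNA-seq reads. All file names are hypothetical placeholders.
JAR=GeMoMa-1.6.4beta.jar
TARGET=target.fa
BAM=rnaseq_mapped.bam

# Fixed part of the call, using the settings recommended in this section.
set -- java -jar "$JAR" CLI GeMoMaPipeline \
    threads=4 outdir=gemoma_out \
    tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true \
    t="$TARGET" r=MAPPED ERE.m="$BAM"

# Repeat the tags i, a and g once per reference species.
for ref in speciesA speciesB; do
    set -- "$@" i="$ref" a="$ref.gff" g="$ref.fa"
done

# Show the assembled command; run it only if the jar is actually present.
PIPELINE_CMD="$*"
printf '%s\n' "$PIPELINE_CMD"
if [ -f "$JAR" ]; then
    "$@"
fi
```
<br />
Building the argument list with <code>set --</code> keeps every <code>key=value</code> pair as a separate argument, so the same loop can be extended to further references.<br />
<br />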
== Tools ==<br />
<br />
=== GeMoMa pipeline ===<br />
<br />
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot distribute work across several machines, as for instance in a compute cluster. It basically runs: '''Extract RNA-seq evidence (ERE)''', '''DenoiseIntrons''', '''Extractor''', external search (tblastn or mmseqs), '''Gene Model Mapper (GeMoMa)''', '''GeMoMa Annotation Filter (GAF)''', and '''AnnotationFinalizer'''.<br />
<br />
''GeMoMa pipeline'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GeMoMaPipeline<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix><br />
<br />
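Filter expressions such as the value of <code>GAF.f</code> contain spaces, single quotes, <code>*</code>, <code>&gt;</code> and parentheses, all of which a shell would otherwise interpret. The following sketch shows one way to pass such a filter as a single argument, assuming a POSIX shell; the file names and the prefix are hypothetical placeholders, and the filter is the more sophisticated example quoted in the description of <code>GAF.f</code> above. The script prints one argument per line and runs the command only if the jar file is present.<br />
<br />
```shell
#!/bin/sh
# Sketch: pass a GAF.f filter expression through the shell unchanged.
# File names and the prefix are hypothetical placeholders.
JAR=GeMoMa-1.6.4beta.jar

# Filter taken from the description of GAF.f.
# Double quotes keep the single quotes, *, > and parentheses literal.
FILTER="start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

set -- java -jar "$JAR" CLI GeMoMaPipeline \
    t=target.fa a=reference.gff g=reference.fa \
    AnnotationFinalizer.p=TARGET_ \
    "GAF.f=$FILTER"

# Print one argument per line: the whole filter must appear on a single line.
ARG_COUNT=$#
printf '[%s]\n' "$@"

# Run the command only if the jar is actually present.
if [ -f "$JAR" ]; then
    "$@"
fi
```
<br />
If the filter were left unquoted, the shell would split it at the spaces and treat <code>&gt;</code> as an output redirection, so the tool would never see the complete expression.<br />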
<br />
=== Extract RNA-seq Evidence ===<br />
<br />
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool '''DenoiseIntrons'''. Introns and coverage results can be used in '''GeMoMa''' to improve the predictions and might help to select better gene models in '''GAF'''. In addition, introns and coverage can be used to predict UTRs by '''AnnotationFinalizer'''.<br />
<br />
''Extract RNA-seq Evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI ERE<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI ERE m=<mapped_reads_file><br />
<br />
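Since the parameter <code>m</code> can be used multiple times, several mapped read files can be handed to '''ERE''' in one call. The following is a minimal sketch under stated assumptions: the BAM file names and the output directory are hypothetical, and <code>s=FR_FIRST_STRAND</code> is only appropriate if the library was prepared accordingly (see the description of <code>s</code> above). The script prints the assembled command and runs it only if the jar file is present.<br />
<br />
```shell
#!/bin/sh
# Sketch: run ERE on several mapped read files of a stranded library.
# The BAM file names and the output directory are hypothetical placeholders.
JAR=GeMoMa-1.6.4beta.jar

# c=true additionally requests the coverage output.
set -- java -jar "$JAR" CLI ERE s=FR_FIRST_STRAND c=true outdir=ere_out

# The parameter m can be used multiple times, once per BAM/SAM file.
for bam in leaf.bam root.bam; do
    set -- "$@" m="$bam"
done

# Show the assembled command; run it only if the jar is actually present.
ERE_CMD="$*"
printf '%s\n' "$ERE_CMD"
if [ -f "$JAR" ]; then
    "$@"
fi
```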
<br />
=== CheckIntrons ===<br />
<br />
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.<br />
<br />
''CheckIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI CheckIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI CheckIntrons t=<target_genome><br />
<br />
<br />
=== DenoiseIntrons ===<br />
<br />
This module analyzes introns extracted by '''ERE'''. Introns that are very long or have a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules '''GeMoMa''', '''AnnotationEvidence''', and '''AnnotationFinalizer'''.<br />
<br />
''DenoiseIntrons'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI DenoiseIntrons<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded><br />
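<br />
The length criterion described above can be illustrated outside the module. The following is a minimal sketch, not the implementation used by '''DenoiseIntrons''': it assumes a tab-delimited GFF with start and end in columns 4 and 5, it ignores the relative expression criterion, and the function name <code>filter_intron_length</code> is hypothetical.<br />

```shell
#!/bin/sh
# Keep only GFF records whose feature length (end - start + 1) does not
# exceed the given maximum; comment lines are passed through unchanged.
# Assumption: columns 4 and 5 hold the 1-based start and end of the intron.
filter_intron_length() {
    # $1: maximum intron length, $2: intron GFF file
    awk -F '\t' -v max="$1" '/^#/ { print; next } ($5 - $4 + 1) <= max' "$2"
}
```

For instance, <code>filter_intron_length 15000 introns.gff</code> applies the default of parameter ''m''; the file name is a placeholder.<br />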
<br />
<br />
=== NCBI Reference Retriever ===<br />
<br />
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start '''GeMoMaPipeline''' or '''Extractor'''.<br />
<br />
''NCBI Reference Retriever'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI NRR<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI NRR rl=<reference_list><br />
<br />
<br />
=== Extractor ===<br />
<br />
This tool can be used to create input files for '''GeMoMa''', i.e., it creates at least a FASTA file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and of CDS parts to transcripts. In addition, '''Extractor''' can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.<br />
<br />
''Extractor'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI Extractor<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI Extractor a=<annotation> g=<genome><br />
<br />
<br />
=== GeneModelMapper ===<br />
<br />
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g., tblastn or mmseqs).<br />
<br />
As a first step, you should run '''Extractor''' to obtain ''cds parts'' and ''assignment''. Second, you should run a search algorithm, e.g., '''tblastn''' or '''mmseqs''', with ''cds parts'' as query. Finally, these search results are used in '''GeMoMa'''. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g., mmseqs), you should set the parameter ''sort''.<br />
If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in ''query cds parts'' and leave ''assignment'' unselected.<br />
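<br />
As an alternative to the ''sort'' parameter, the search results can be sorted before they are passed to '''GeMoMa'''. A minimal sketch, assuming tab-delimited search results with the query ID in the first column; the function name <code>sort_by_query</code> and the file names are hypothetical:<br />

```shell
#!/bin/sh
# Sort tab-delimited search results so that all hits of one query
# (first column) end up on consecutive lines.
# -s requests a stable sort, i.e., hits of the same query keep their order.
sort_by_query() {
    # $1: unsorted search results, $2: sorted output file
    LC_ALL=C sort -s -t "$(printf '\t')" -k1,1 "$1" > "$2"
}
```

A call such as <code>sort_by_query mmseqs.tabular search_sorted.tabular</code> would then produce a file that can be used for parameter ''s''.<br />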
<br />
If you would like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run '''ERE''' on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running '''DenoiseIntrons''' to remove such spurious introns. Finally, you can use the obtained ''introns'' (and ''coverage'') in GeMoMa.<br />
<br />
If you like to obtain multiple predictions per gene model of the reference organism, you should set ''predictions'' accordingly. In addition, we suggest to decrease the value of ''contig threshold'' allowing GeMoMa to evaluate more candidate contigs/chromosomes.<br />
<br />
If you change the values of ''contig threshold'', ''region threshold'' and ''hit threshold'', this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.<br />
<br />
You can filter your predictions using '''GAF''', which also allows for combining predictions from different reference organisms.<br />
<br />
Finally, you can predict UTRs and rename predictions using '''AnnotationFinalizer'''.<br />
<br />
If you would like to run the complete GeMoMa pipeline and not only a specific module, you can run the multi-threaded module '''GeMoMaPipeline'''.<br />
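<br />
The order of the individual modules described above can be summarized as a dry run that only prints the calls. This is a sketch under stated assumptions: the parameter names are taken from the tables on this page, whereas the output file names (<code>cds-parts.fasta</code>, <code>assignment.tabular</code>, <code>predicted_annotation.gff</code>), the directory names, and the function name <code>gemoma_steps</code> are assumptions that should be checked against the files the modules actually write.<br />

```shell
#!/bin/sh
# Dry run: print the module calls in the order described above.
# All file and directory names below are assumed placeholders.
JAR=GeMoMa-1.6.4beta.jar

gemoma_steps() {
    # $1: reference annotation, $2: reference genome, $3: target genome
    echo "java -jar $JAR CLI Extractor a=$1 g=$2 outdir=extract"
    echo "# run tblastn or mmseqs with extract/cds-parts.fasta as query against $3"
    echo "java -jar $JAR CLI GeMoMa s=search.tabular t=$3 c=extract/cds-parts.fasta a=extract/assignment.tabular sort=true outdir=gemoma"
    echo "java -jar $JAR CLI GAF g=gemoma/predicted_annotation.gff outdir=gaf"
    echo "# optionally run AnnotationFinalizer on the filtered predictions"
}
```

Calling <code>gemoma_steps ref.gff ref.fa target.fa</code> prints the commands so that they can be reviewed before they are executed.<br />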
<br />
''GeneModelMapper'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GeMoMa<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment><br />
<br />
<br />
=== GeMoMa Annotation Filter ===<br />
<br />
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.<br />
<br />
The algorithm does the following:<br />
First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced).<br />
Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation.<br />
Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.<br />
<br />
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript.<br />
Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.<br />
<br />
Initially, GAF was built to combine gene predictions obtained from '''GeMoMa'''. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF also allows integrating predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running '''AnnotationEvidence''' for each of these input files to add additional attributes that can be used for sorting and filtering within '''GAF'''. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.<br />
<br />
''GeMoMa Annotation Filter'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GAF<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI GAF g=<gene_annotation_file><br />
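<br />
Filter expressions contain spaces, single quotes, <code>*</code> and <code>></code>, all of which a shell would interpret if they were left unquoted. The following sketch shows one way to pass such a filter together with several prediction files; the filter string is the more sophisticated example from the table above, while the function name <code>gaf_args</code> and the file names are hypothetical.<br />

```shell
#!/bin/sh
# Filter expression taken from the parameter table above. Double quotes keep
# the spaces, the single quotes, '*' and '>' away from the shell.
GAF_FILTER="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Print the GAF call for any number of prediction files, one argument per
# line, so that the quoting can be inspected before the tool is run.
gaf_args() {
    printf '%s\n' java -jar GeMoMa-1.6.4beta.jar CLI GAF
    for gff in "$@"; do
        printf 'g=%s\n' "$gff"
    done
    printf 'f=%s\n' "$GAF_FILTER"
}
```

For a real run, the same quoting would be used directly on the command line, e.g. <code>java -jar GeMoMa-1.6.4beta.jar CLI GAF g=ref1.gff g=ref2.gff "f=$GAF_FILTER"</code>.<br />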
<br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions that are too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity, as well as unstranded RNA-seq libraries. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''AnnotationFinalizer'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI AnnotationFinalizer<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. parameter &quot;name attribute&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>name attribute (if true the new name is added as new attribute &quot;Name&quot;, otherwise &quot;Parent&quot; and &quot;ID&quot; values are modified accordingly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix><br />
<br />
<br />
=== Annotation evidence ===<br />
<br />
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in '''GAF'''. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use '''ERE''' to preprocess the mapped reads.<br />
<br />
''Annotation evidence'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI AnnotationEvidence<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI AnnotationEvidence a=<annotation> g=<genome><br />
<br />
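To make the two RNA-seq based attributes more concrete, the following Python sketch computes tpc (share of covered bases of a transcript) and tie (share of introns with RNA-seq support) from in-memory intervals. It rests on several assumptions that the text above does not confirm: intervals are 1-based and inclusive, values are expressed as fractions between 0 and 1 (matching filter examples such as tpc==1.0), and NaN is returned if there is nothing to evaluate. The function names are hypothetical, and reading the actual coverage and intron files is left out.

```python
# Sketch only: illustrates the idea behind tpc and tie, not GeMoMa's code.
# Intervals are assumed to be 1-based, inclusive (start, end) tuples on one contig.

def tpc(exons, covered):
    """Fraction of exon bases lying in at least one covered interval."""
    covered_positions = set()
    for start, end in covered:
        covered_positions.update(range(start, end + 1))
    total = 0
    hit = 0
    for start, end in exons:
        for pos in range(start, end + 1):
            total += 1
            if pos in covered_positions:
                hit += 1
    return hit / total if total else float("nan")

def tie(introns, supported_introns):
    """Fraction of predicted introns that also occur in the RNA-seq introns."""
    if not introns:
        return float("nan")  # assumption: no introns means no value
    supported = sum(1 for intron in introns if intron in supported_introns)
    return supported / len(introns)

exons = [(1, 100), (201, 300)]
coverage = [(1, 100), (201, 250)]   # intervals with coverage > 0
introns = [(101, 200)]
rna_seq_introns = {(101, 200), (500, 600)}

print(tpc(exons, coverage), tie(introns, rna_seq_introns))
```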
<br />
=== Compare transcripts ===<br />
<br />
This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.<br />
<br />
''Compare transcripts'' may be called with<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI CompareTranscripts<br />
<br />
and has the following parameters<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
'''Example:'''<br />
<br />
java -jar GeMoMa-1.6.4beta.jar CLI CompareTranscripts p=<prediction> a=<annotation><br />
<br />
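The F1 measure is the harmonic mean of precision and recall. The following Python sketch computes it for one predicted and one annotated transcript, assuming that agreement is counted per nucleotide of the coding exons; the text above does not state at which level CompareTranscripts counts agreement, so this is an assumption. The function names are hypothetical, and None stands in for the NA value.

```python
# Sketch only: nucleotide-level F1 between two transcripts.
# Assumption: exons are 1-based, inclusive (start, end) tuples on the same strand and contig.

def positions(intervals):
    """Set of all positions covered by the given intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered

def f1(predicted, annotated):
    """Harmonic mean of precision and recall, or None if there is no overlap."""
    pred = positions(predicted)
    anno = positions(annotated)
    true_positives = len(pred & anno)
    if true_positives == 0:
        return None  # corresponds to NA: no overlapping annotation
    precision = true_positives / len(pred)
    recall = true_positives / len(anno)
    return 2 * precision * recall / (precision + recall)

print(f1([(1, 100)], [(1, 100)]))    # identical transcripts
print(f1([(1, 100)], [(51, 150)]))   # half of each transcript overlaps
print(f1([(1, 10)], [(20, 30)]))     # no overlap
```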
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| aa || amino acids || GeMoMa || || prediction || the number of amino acids in the protein<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
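If you want to inspect these attributes by hand, they can be read from the ninth column of the GFF. The following Python sketch splits that column into key/value pairs and derives the relative score (score/aa). It assumes the usual GFF3 layout of semicolon-separated key=value pairs; the example line, including its ID and source column, is made up for illustration, and real GeMoMa output may contain further attributes.

```python
# Sketch only: reads GFF3-style attributes (key=value pairs separated by ';').
# The example line is hypothetical and not taken from real GeMoMa output.

def parse_attributes(column9):
    """Turn 'a=1;b=2' into {'a': '1', 'b': '2'}; values stay strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs

line = "chr1\tGAF\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0;aa=250;score=200;tie=1;evidence=2"
columns = line.split("\t")
attrs = parse_attributes(columns[8])

# Relative score as used in the default GAF filter (score/aa>=0.75).
relative_score = float(attrs["score"]) / int(attrs["aa"])
print(attrs["ID"], relative_score)
```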
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:If you like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline you just have to select <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs which is typically much faster. If you run the GeMoMaPipeline you just have to select <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
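The genetic code format described in the FAQs (two tab-delimited columns: one-letter amino acid code, comma-separated list of DNA triplets) can be checked before handing a file to GeMoMa. The following Python sketch parses such a table and rejects triplets that are not written with A, C, G, and T. It is an illustration under assumptions: the function name is hypothetical, the three-line example is only a fragment and not the default genetic code file, and whether GeMoMa accepts empty or comment lines is not stated in the text.

```python
# Sketch only: validates a genetic code table in the format described in the FAQs.
# Assumption: one amino acid per line, no header and no comment lines.

def parse_genetic_code(text):
    """Return a dict mapping each DNA triplet to its one-letter amino acid."""
    codon_to_aa = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skipping empty lines is a choice made in this sketch
        amino_acid, triplets = line.split("\t")
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError("not a DNA triplet: " + codon)
            if codon in codon_to_aa:
                raise ValueError("triplet listed twice: " + codon)
            codon_to_aa[codon] = amino_acid
    return codon_to_aa

# Fragment for illustration only; a complete table lists all 64 triplets.
example = "M\tATG\n*\tTAA,TAG,TGA\nW\tTGG\n"
table = parse_genetic_code(example)
print(table["ATG"], table["TGA"], len(table))
```

If you start from a table written in the RNA alphabet, replace U by T before using it, since a triplet such as AUG is rejected by this check.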
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.4] (24.04.2020)<br />
* improved help section<br />
* change gff attribute "AA" to "aa"<br />
* GAF:<br />
** bugfix overlapping genes<br />
** accelerated computation<br />
* GeMoMa:<br />
** bugfix: if no assignment file is used and protein IDs are prefixes of other protein IDs<br />
** change GFF attribute AA to aa<br />
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1093GeMoMa2020-04-21T09:48:15Z<p>Keilwagen: </p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Installation ==<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
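
A quick way to check these requirements from a shell is sketched below. It is only an illustration and not part of GeMoMa: the helper name <code>have_tool</code> is made up for this example, and the executable names <code>tblastn</code> and <code>mmseqs</code> are assumed to be the names under which BLAST+ and MMseqs2 are installed on your PATH. The sketch only tests for presence; it does not verify that the Java version is 1.8 or later.<br />

```shell
#!/bin/sh
# have_tool is a hypothetical helper: it succeeds if the given executable is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Java is required in any case.
if have_tool java; then
    echo "java: found"
else
    echo "java: missing"
fi

# Either tblastn or mmseqs is sufficient for the similarity search.
if have_tool tblastn || have_tool mmseqs; then
    echo "search tool: found"
else
    echo "search tool: missing (install blast or mmseqs)"
fi
```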
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Tools ==<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever, which makes it easy to retrieve data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
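
As an illustration of the <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax, the following sketch assembles an NCBIReferenceRetriever call from the parameters listed in the table above and prints it. The file name <code>reference_list.txt</code> and the variable names are assumptions made for this example, and the jar name keeps the <code>&lt;version&gt;</code> placeholder, which has to be replaced by the name of the jar from the downloaded zip file.<br />

```shell
#!/bin/sh
# Path to the GeMoMa jar; set GEMOMA_JAR or replace the placeholder with the real file name.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-<version>.jar}"
# reference_list.txt is a hypothetical file name for the list of reference organisms.
REF_LIST="reference_list.txt"

# Assemble the call from the documented parameters:
# r = reference directory, n = number of tries, rl = reference list, outdir = output directory.
# The values for r, n and outdir are the defaults given in the table.
NRR_CMD="java -jar $GEMOMA_JAR CLI NRR r=references/ n=10 rl=$REF_LIST outdir=."

# Print the command for inspection before running it.
echo "$NRR_CMD"
```

Printing the command first makes it easy to check the parameter values before starting any download.<br />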
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
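
A minimal sketch of such a call for one reference species with mapped RNA-seq reads might look as follows. All file names and the ID <code>ref1</code> are hypothetical placeholders, the jar name keeps the <code>&lt;version&gt;</code> placeholder, and placing <code>s=own</code> directly before its sub-parameters <code>i</code>, <code>a</code> and <code>g</code> is an assumption about how the selection is written on the command line. The script only prints the assembled command.<br />

```shell
#!/bin/sh
# Path to the GeMoMa jar; set GEMOMA_JAR or replace the placeholder with the real file name.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-<version>.jar}"

# All file names and the reference ID below are hypothetical placeholders.
TARGET="target_genome.fa"
REF_ID="ref1"
REF_ANNOTATION="ref1_annotation.gff"
REF_GENOME="ref1_genome.fa"
BAM="mapped_reads.bam"

# t        = target genome
# s=own    = reference species given by annotation (a) and genome (g), with ID (i)
# tblastn  = false selects mmseqs instead of tblastn
# r=MAPPED = RNA-seq evidence from mapped reads, given via ERE.m
PIPELINE_CMD="java -jar $GEMOMA_JAR CLI GeMoMaPipeline \
t=$TARGET \
s=own i=$REF_ID a=$REF_ANNOTATION g=$REF_GENOME \
tblastn=false \
r=MAPPED ERE.m=$BAM"

# Print the command for inspection before running it.
echo "$PIPELINE_CMD"
```

Parameters that are not set explicitly keep the defaults listed in the parameter table.<br />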
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, provided columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
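As a rough illustration of the output-related options above, the following sketch assembles a pipeline call and prints it instead of executing it. The file name target.fa, the prefix GENE_, the thread count and the directory results are placeholders, and the reference-species parameters documented in the upper part of the table are deliberately left out, so this is not a complete invocation.

```shell
# Placeholders (assumptions): jar name and target genome file.
jar="GeMoMa-1.9.jar"
target="target.fa"

# Assemble the call as positional parameters so that every
# key=value pair stays a single argument.
set -- java -jar "$jar" CLI GeMoMaPipeline \
  t="$target" \
  AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE_ AnnotationFinalizer.d=5 \
  p=true pc=true o=true \
  threads=4 outdir=results

# Dry run: print the assembled command. Replace the echo with "$@" to execute it.
cmd="$*"
echo "$cmd"
```

Parameters of a sub-module are addressed with the module name as prefix (here AnnotationFinalizer.), whereas pipeline-level options such as p, pc, o, threads and outdir are given without a prefix.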
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
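Since m can be given multiple times, several BAM files can be passed in one call. The sketch below builds such a call in a loop and prints it; the BAM file names and the output directory are hypothetical, and the jar name is assumed to be GeMoMa-1.9.jar.

```shell
jar="GeMoMa-1.9.jar"   # assumed jar name; adjust to your version

# Start with the fixed part of the call.
set -- java -jar "$jar" CLI ERE s=FR_FIRST_STRAND

# Append one m=<file> argument per mapped-reads file (hypothetical names).
for bam in liver.bam root.bam; do
  set -- "$@" m="$bam"
done

# Request coverage output and keep the default mapping-quality threshold.
set -- "$@" c=true mmq=40 outdir=ere_out

# Dry run: print the assembled command. Replace the echo with "$@" to execute it.
cmd="$*"
echo "$cmd"
```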
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== DenoiseIntrons ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
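The nested coverage selection can be confusing at first: the selection itself is set with c, and the files are then given with the parameter names listed for that selection. The following sketch shows a stranded call as a dry run; all file names and the output directory are assumptions, as is the jar name GeMoMa-1.9.jar.

```shell
jar="GeMoMa-1.9.jar"   # assumed jar name; adjust to your version

# c=STRANDED selects the stranded variant, which expects both
# coverage_forward and coverage_reverse (hypothetical file names).
set -- java -jar "$jar" CLI DenoiseIntrons \
  i=introns.gff \
  c=STRANDED \
  coverage_forward=coverage_forward.bedgraph \
  coverage_reverse=coverage_reverse.bedgraph \
  m=15000 me=0.01 context=10 \
  outdir=denoise_out

# Dry run: print the assembled command. Replace the echo with "$@" to execute it.
cmd="$*"
echo "$cmd"
```

For unstranded data, c=UNSTRANDED together with a single coverage_unstranded file would take the place of the three coverage arguments.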
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
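A minimal sketch of an Extractor call, printed as a dry run, might look as follows. The annotation and genome file names and the output directory are placeholders, and the jar name is assumed to be GeMoMa-1.9.jar.

```shell
jar="GeMoMa-1.9.jar"   # assumed jar name; adjust to your version

# a and g are the reference annotation and genome (hypothetical file names);
# p=true and c=true additionally request protein and CDS sequences,
# r=true lets the tool try to repair transcripts that cannot be parsed.
set -- java -jar "$jar" CLI Extractor \
  a=reference.gff g=reference.fa \
  p=true c=true r=true \
  outdir=extractor_out

# Dry run: print the assembled command. Replace the echo with "$@" to execute it.
cmd="$*"
echo "$cmd"
```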
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AS), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
<br />
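The following sketch shows how the mandatory inputs and the optional RNA-seq evidence could be combined in one GeMoMa call. It only prints the command. All file names are assumptions: in particular, cds-parts.fasta, assignment.tabular and proteins.fasta are taken here to be Extractor outputs in a hypothetical directory extractor_out, and search.txt stands for the tblastn or mmseqs result.

```shell
jar="GeMoMa-1.9.jar"     # assumed jar name; adjust to your version
ref="extractor_out"      # hypothetical directory with Extractor results
use_rnaseq="yes"         # set to "no" to omit the RNA-seq evidence

# Mandatory inputs plus assignment and query proteins.
set -- java -jar "$jar" CLI GeMoMa \
  s=search.txt t=target.fa \
  c="$ref/cds-parts.fasta" a="$ref/assignment.tabular" q="$ref/proteins.fasta"

# Optional RNA-seq evidence: introns with at least 2 split reads
# and unstranded coverage (hypothetical file names).
if [ "$use_rnaseq" = "yes" ]; then
  set -- "$@" i=introns.gff r=2 coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph
fi

set -- "$@" m=15000 p=10 outdir=gemoma_out

# Dry run: print the assembled command. Replace the echo with "$@" to execute it.
cmd="$*"
echo "$cmd"
```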
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows using these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
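Filter expressions contain characters such as *, &gt;, parentheses and spaces that the shell would otherwise interpret, so each expression has to be quoted as a single argument. The sketch below shows one way to do this for two input files and prints every argument on its own line. The GFF file names, prefixes and weights are hypothetical, and the assumption that p, w and g are grouped per input file in the order given is not taken from the table above.

```shell
jar="GeMoMa-1.9.jar"   # assumed jar name; adjust to your version

# Double quotes keep the whole expression in one argument; the single quotes
# inside are passed on literally, and '*' and '>' are not interpreted by the shell.
filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# One p/w/g group per input file (hypothetical names and weights).
set -- java -jar "$jar" CLI GAF \
  p=speciesA_ w=1.0 g=speciesA.gff \
  p=speciesB_ w=0.5 g=speciesB.gff \
  f="$filter" atf="tie==1 or evidence>1" m=3 outdir=gaf_out

# Record how many arguments were built and what the filter argument looks like.
nargs=$#
filter_arg=${12}

# Dry run: print one argument per line. Replace the printf with "$@" to execute.
printf '%s\n' "$@"
```

Printing one argument per line makes it easy to check that the filter arrived as a single, unmodified argument before running the real command.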
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
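
When these attributes are used in custom scripts, some values may be non-numeric; for instance, tae, tde, tie, and minSplitReads are reported as NA for single coding exon genes. The following Python sketch converts the numeric attributes and evaluates the expression <code>tie==1 or evidence>1</code>. The helper names are hypothetical and the sketch assumes <code>key=value</code> pairs separated by semicolons; it is not part of GeMoMa.

```python
import math

# attributes from the table above that carry numbers
NUMERIC_ATTRIBUTES = {
    "score", "minCov", "avgCov", "tpc", "tae", "tde", "tie",
    "minSplitReads", "iAA", "pAA", "evidence", "sumWeight",
}


def typed_attributes(column9):
    """Parse 'key=value;...' and convert known numeric attributes to float.

    'NA' (or any other unparsable value) becomes NaN, so that every
    comparison with it evaluates to False.
    """
    result = {}
    for field in column9.strip().split(";"):
        if "=" not in field:
            continue
        key, value = field.split("=", 1)
        if key in NUMERIC_ATTRIBUTES:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = math.nan
        else:
            result[key] = value
    return result


def intron_or_evidence_support(attrs):
    """Mirror the expression: tie==1 or evidence>1."""
    return attrs.get("tie", math.nan) == 1.0 or attrs.get("evidence", math.nan) > 1


# single coding exon gene: tie is NA, but two references support the prediction
print(intron_or_evidence_support(typed_attributes("ID=t1;tie=NA;evidence=2")))
```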
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (GFF) and genome file (FASTA) are identical. Ideally, the FASTA comment lines contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
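
The first check can be automated with a few lines of Python. The sketch below is only an illustration with hypothetical function names; it assumes that the sequence name is the FASTA header text up to the first whitespace and that the annotation is tab-delimited with the sequence name in the first column.

```python
def fasta_names(fasta_lines):
    """Return the sequence names: the header text up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                names.add(header[0])
    return names


def gff_names(gff_lines):
    """Return the names used in the first column of a GFF/GTF file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives, and empty lines
        names.add(line.rstrip("\n").split("\t", 1)[0])
    return names


def unmatched_annotation_names(fasta_lines, gff_lines):
    """Names that occur in the annotation but not in the genome file."""
    return sorted(gff_names(gff_lines) - fasta_names(fasta_lines))


# tiny in-memory example; with real data pass two open file handles instead
fasta = [">chr1 assembled chromosome 1\n", "ACGT\n", ">chr2\n", "ACGT\n"]
gff = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1\n",
    "Chr2\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2\n",
]
print(unmatched_annotation_names(fasta, gff))
```

Any name reported here (in the example the differently capitalized Chr2) points to a mismatch between annotation and genome file.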
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
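
A very simple way of filtering by hand is sketched below in Python. It is much simpler than GAF, which for instance can also keep alternative transcripts, and it is not GAF's algorithm: predictions are visited from the best to the worst value of one attribute, and a prediction is kept only if it does not overlap an already kept prediction on the same contig and strand. The record layout (a dict with contig, strand, start, end, and score) and the function names are assumptions made for this example.

```python
def overlaps(a, b):
    """True if two predictions share a contig and strand and their ranges intersect."""
    return (
        a["contig"] == b["contig"]
        and a["strand"] == b["strand"]
        and a["start"] <= b["end"]
        and b["start"] <= a["end"]
    )


def keep_best_per_locus(predictions, key="score"):
    """Greedy selection: keep a prediction only if no better one overlaps it."""
    kept = []
    for candidate in sorted(predictions, key=lambda p: p[key], reverse=True):
        if not any(overlaps(candidate, other) for other in kept):
            kept.append(candidate)
    return kept


predictions = [
    {"id": "a", "contig": "chr1", "strand": "+", "start": 100, "end": 900, "score": 100},
    {"id": "b", "contig": "chr1", "strand": "+", "start": 500, "end": 1200, "score": 80},
    {"id": "c", "contig": "chr1", "strand": "-", "start": 500, "end": 1200, "score": 50},
]
print([p["id"] for p in keep_best_per_locus(predictions)])
```

Passing <code>key="avgCov"</code> ranks by average coverage instead, provided that this attribute has been added to each record.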
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
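
To check whether the phases of a transcript are plausible, recall that the GFF phase of a CDS entry is the number of bases that have to be skipped at its beginning to reach the first base of the next complete codon. The following Python sketch infers the phases from the lengths of the CDS parts. It illustrates the arithmetic only and is not the repair code of the Extractor; the function names are hypothetical, and the sketch assumes a complete CDS whose first part starts with a full codon.

```python
def infer_phases(cds_lengths):
    """Return the phase of each CDS part, given the part lengths in
    transcription order (for the reverse strand: highest coordinates first)."""
    phases = []
    phase = 0  # a complete CDS starts with a full codon
    for length in cds_lengths:
        phases.append(phase)
        # bases left over after removing the phase and all complete codons
        remainder = (length - phase) % 3
        # bases of the next part that are needed to complete the split codon
        phase = (3 - remainder) % 3
    return phases


def phases_consistent(cds_lengths, annotated_phases):
    """Compare annotated phases with the phases inferred from the lengths."""
    return infer_phases(cds_lengths) == list(annotated_phases)


# three CDS parts of length 10, 8, and 6 that are all annotated with phase 0
print(infer_phases([10, 8, 6]))
print(phases_consistent([10, 8, 6], [0, 0, 0]))
```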
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
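
Whether an intron indicated by RNA-seq data can be such a case at all can be checked from its coordinates. A minimal Python sketch, assuming 1-based inclusive coordinates as used in GFF files (the function name is hypothetical):

```python
def preserves_reading_frame(intron_start, intron_end):
    """True if the intron length is a multiple of 3 (1-based, inclusive coordinates)."""
    length = intron_end - intron_start + 1
    return length % 3 == 0


print(preserves_reading_frame(101, 190))  # length 90
print(preserves_reading_frame(101, 191))  # length 91
```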
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically you should set: '''-Xms''' the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''' the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter query protein. Again you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
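
The VM options have to be placed before <code>-jar</code>, otherwise they are passed to the program instead of the Java virtual machine. If GeMoMa is started from a script, the call can be assembled as in the following Python sketch. The helper <code>gemoma_command</code> and the file name <code>target.fasta</code> are hypothetical, and further parameters that a real run needs are omitted.

```python
import shlex


def gemoma_command(jar, tool, parameters, initial_heap="5G", max_heap="50G"):
    """Build the argument list for a GeMoMa call with explicit JVM memory options."""
    # JVM options first, then -jar, then the GeMoMa arguments
    command = ["java", f"-Xms{initial_heap}", f"-Xmx{max_heap}", "-jar", jar, "CLI", tool]
    command += [f"{name}={value}" for name, value in parameters.items()]
    return command


command = gemoma_command(
    "GeMoMa-1.9.jar",
    "GeMoMaPipeline",
    {"t": "target.fasta", "threads": 4},  # incomplete parameter set, for illustration
)
print(shlex.join(command))
# To actually start the run: subprocess.run(command, check=True)
```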
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
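
Before passing a custom genetic code to GeMoMa, the file can be checked with a short script. The Python sketch below parses the two-column format described above and rejects triplets that are not written with A, C, G, and T. It is an illustration with a hypothetical function name, not the parser used by GeMoMa, and the use of * for stop codons in the example is an assumption that should be verified against the linked template.

```python
def parse_genetic_code(text):
    """Parse 'amino acid<TAB>triplet,triplet,...' lines into {triplet: amino acid}."""
    codon_to_aa = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        amino_acid, triplets = line.split("\t")  # exactly two columns expected
        for triplet in triplets.strip().split(","):
            triplet = triplet.strip()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: duplicate triplet {triplet!r}")
            codon_to_aa[triplet] = amino_acid.strip()
    return codon_to_aa


# excerpt of a table; a complete genetic code assigns all 64 triplets
excerpt = "M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n"
code = parse_genetic_code(excerpt)
print(len(code), "triplets read; complete:", len(code) == 64)
```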
; I like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1092GeMoMa2020-04-21T05:30:02Z<p>Keilwagen: new structure</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== References ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Installation ==<br />
<br />
=== Requirements ===<br />
<br />
To run GeMoMa, you need the following software on your computer:<br />
* Java v1.8 or later<br />
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]<br />
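<br />
The requirements above can be checked before a run. The following is a minimal sketch, assuming that the programs are installed under the executable names <code>java</code>, <code>tblastn</code> and <code>mmseqs</code> and are found on the PATH; the function names <code>have_tool</code> and <code>check_requirements</code> are hypothetical helpers and not part of GeMoMa. The sketch does not verify the Java version.<br />

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the given program is found on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeeds if java and at least one search tool
# (tblastn or mmseqs) are available; prints a message to stderr otherwise.
check_requirements() {
  if ! have_tool java; then
    echo "java not found on PATH" >&2
    return 1
  fi
  if have_tool tblastn || have_tool mmseqs; then
    return 0
  fi
  echo "neither tblastn nor mmseqs found on PATH" >&2
  return 1
}
```

Calling <code>check_requirements</code> at the start of a job script stops the job early if a program is missing.<br />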
<br />
=== Download ===<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Tools ==<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
The module NCBIReferenceRetriever (NRR) retrieves data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
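<br />
As an illustration of the <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax, the following sketch wraps an NRR call in a shell function. It uses only the parameters listed in the table above. The variables <code>JAVA</code> and <code>GEMOMA_JAR</code>, the function name <code>run_nrr</code> and all file names are assumptions made for this example; the values <code>references/</code> and <code>10</code> are the defaults from the table and are written out only for clarity.<br />

```shell
#!/bin/sh
# Hypothetical helper variables, not defined by GeMoMa.
JAVA="${JAVA:-java}"
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa.jar}"   # placeholder: path to the GeMoMa jar file

# Usage: run_nrr <reference_list_file> <output_directory>
# Downloads the reference data named in the list file into references/.
run_nrr() {
  "$JAVA" -jar "$GEMOMA_JAR" CLI NRR \
    rl="$1" \
    r=references/ \
    n=10 \
    outdir="$2"
}
```

A call could then look like <code>run_nrr reference_list.txt nrr_output</code>, where both names are placeholders.<br />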
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
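As a rough sketch of how the renaming and output options above fit together, the following POSIX shell snippet checks the documented range of the digits parameter and assembles an option string that could be appended to a complete GeMoMaPipeline call. The key <code>AnnotationFinalizer.r</code> for the rename selection, the prefix <code>TARGET_</code> and the directory name <code>pipeline_out</code> are assumptions made for illustration; the target genome and reference species parameters are deliberately left out.

```shell
# Sketch only: AnnotationFinalizer.r, the prefix and the output directory are assumptions.
finalizer_opts() {
  # $1 = prefix of the generic name, $2 = number of informative digits
  if [ "$2" -lt 4 ] || [ "$2" -gt 10 ]; then
    echo "digits must be in the valid range [4, 10]" >&2
    return 1
  fi
  echo "AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=$1 AnnotationFinalizer.d=$2"
}

# Combine the renaming options with the output options documented above.
pipeline_opts="$(finalizer_opts TARGET_ 5) p=true pc=true o=true threads=4 outdir=pipeline_out"
echo "$pipeline_opts"
```

The printed string is meant to be appended to a <code>java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline</code> call that already contains the target genome and reference species parameters.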
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
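Because <code>m</code> can be given multiple times, a call with several BAM files can be assembled in a loop. The snippet below is a minimal sketch that only prints the command; the BAM file names and the output directory <code>ere_out</code> are hypothetical, and the jar name follows the GeMoMa-1.9 example used for the pipeline.

```shell
# Hypothetical input files; replace them with your own mapped reads.
GEMOMA_JAR="GeMoMa-1.9.jar"
ere_args=""
for bam in sample1.bam sample2.bam; do
  # repeat the m parameter once per BAM/SAM file
  ere_args="$ere_args m=$bam"
done

ere_cmd="java -jar $GEMOMA_JAR CLI ERE s=FR_FIRST_STRAND c=true mmq=40 outdir=ere_out$ere_args"
echo "$ere_cmd"
```

Printing the command first makes it easy to review the parameters; as long as no path contains spaces, it can then be executed with <code>$ere_cmd</code>.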
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
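A minimal sketch of a CheckIntrons call using the parameters listed above. The genome file, the intron file path and the output directory are hypothetical names chosen for illustration.

```shell
# Hypothetical paths; the intron file is assumed to come from an earlier ERE run.
GEMOMA_JAR="GeMoMa-1.9.jar"
check_cmd="java -jar $GEMOMA_JAR CLI CheckIntrons t=target_genome.fa i=ere_out/introns.gff v=true outdir=check_out"
echo "$check_cmd"
```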
<br />
=== DenoiseIntrons ===<br />
This tool removes potentially incorrectly extracted introns. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
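The coverage selection takes different file parameters depending on whether the data is stranded. The helper below is a sketch of that choice for DenoiseIntrons; the function name <code>coverage_args</code> and all file and directory names are hypothetical.

```shell
# Hypothetical helper: emits the coverage parameters for the chosen selection.
coverage_args() {
  # $1 = UNSTRANDED or STRANDED; $2 and $3 = coverage file(s)
  case "$1" in
    UNSTRANDED) echo "c=UNSTRANDED coverage_unstranded=$2" ;;
    STRANDED)   echo "c=STRANDED coverage_forward=$2 coverage_reverse=$3" ;;
    *) echo "unknown coverage mode: $1" >&2; return 1 ;;
  esac
}

# Assemble and print a call with the documented default thresholds.
denoise_cmd="java -jar GeMoMa-1.9.jar CLI DenoiseIntrons i=ere_out/introns.gff $(coverage_args UNSTRANDED ere_out/coverage.bedgraph) m=15000 me=0.01 context=10 outdir=denoise_out"
echo "$denoise_cmd"
```

For stranded data, <code>coverage_args STRANDED forward.bedgraph reverse.bedgraph</code> would yield the two file parameters of the STRANDED selection instead.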
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
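The <code>s</code> parameter expects a list file whose first column holds transcript IDs. The sketch below writes such a file and prints a matching Extractor call. The transcript IDs, the tab as column separator and all file names are assumptions for illustration only.

```shell
# Hypothetical transcript IDs; the second column is ignored by Extractor.
sel_file="selected_transcripts.txt"
printf '%s\t%s\n' "transcriptA.1" "note one" "transcriptB.1" "note two" > "$sel_file"

# Print an Extractor call that is restricted to the listed transcripts.
extract_cmd="java -jar GeMoMa-1.9.jar CLI Extractor a=reference_annotation.gff g=reference_genome.fa s=$sel_file p=true c=true outdir=extract_out"
echo "$extract_cmd"
```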
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except as the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
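The following sketch prints a GeMoMa call that combines search results, Extractor output and RNA-seq evidence. All file and directory names (including the names assumed for the Extractor, ERE and DenoiseIntrons outputs) are hypothetical placeholders, not names guaranteed by the tools.

```shell
# All paths are hypothetical placeholders for the outputs of earlier steps.
gemoma_cmd="java -jar GeMoMa-1.9.jar CLI GeMoMa s=search_results.tabular t=target_genome.fa c=extract_out/cds-parts.fasta a=extract_out/assignment.tabular q=extract_out/proteins.fasta i=denoise_out/denoised_introns.gff coverage=UNSTRANDED coverage_unstranded=ere_out/coverage.bedgraph Score=ReAlign outdir=gemoma_out"
echo "$gemoma_cmd"
```

Note that the coverage selection is named <code>coverage</code> here, whereas DenoiseIntrons uses <code>c</code> for the same choice.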
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
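Filter expressions contain spaces, quotes and comparison operators, so they need to be quoted for the shell. The sketch below stores the more sophisticated filter from the parameter description in a variable and prints a GAF call for two prediction files. The prefixes, weights and file names are hypothetical, and giving <code>p</code>, <code>w</code> and <code>g</code> as one group per input file is an assumption about how the repeated parameters are paired.

```shell
# Double quotes keep the single quotes, '*' and '>' of the filter literal.
gaf_filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# Hypothetical prefixes, weights and prediction files for two reference species.
gaf_cmd="java -jar GeMoMa-1.9.jar CLI GAF p=speciesA w=1.0 g=speciesA_predictions.gff p=speciesB w=0.5 g=speciesB_predictions.gff m=3 outdir=gaf_out"

# Print the full call; the filter is passed as one quoted argument.
echo "$gaf_cmd \"f=$gaf_filter\""
```

When running the call directly, the filter should be passed as a single argument, for example <code>"f=$gaf_filter"</code>, so that the shell does not split or expand it.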
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
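The parameter table above can be turned into a command line mechanically. The following is a minimal Python sketch, not part of GeMoMa: the helper name <code>annotation_evidence_args</code> and all file names are hypothetical, and it assumes that the parameters belonging to a selection (e.g. <code>coverage_forward</code> for <code>c=STRANDED</code>) are passed as further <code>&lt;parameter&gt;=&lt;value&gt;</code> pairs directly after the selection.

```python
def annotation_evidence_args(annotation, genome, introns=(), reads=1,
                             coverage=(), outdir="."):
    """Build the <parameter>=<value> arguments for AnnotationEvidence."""
    if not 1 <= reads <= 2147483647:
        raise ValueError("r (reads) must be in [1, 2147483647]")
    args = [f"a={annotation}", f"g={genome}"]
    # i may be used multiple times: one entry per introns file.
    args += [f"i={path}" for path in introns]
    args.append(f"r={reads}")
    if len(coverage) == 1:
        # UNSTRANDED expects exactly one coverage file.
        args += ["c=UNSTRANDED", f"coverage_unstranded={coverage[0]}"]
    elif len(coverage) == 2:
        # STRANDED expects a forward and a reverse coverage file.
        forward, reverse = coverage
        args += ["c=STRANDED", f"coverage_forward={forward}",
                 f"coverage_reverse={reverse}"]
    elif coverage:
        raise ValueError("give one (unstranded) or two (forward, reverse) files")
    # Without coverage files, c is omitted and its default (NO) applies.
    args.append(f"outdir={outdir}")
    return args


# Hypothetical file names, for illustration only.
args = annotation_evidence_args(
    "annotation.gff", "genome.fa",
    introns=["introns.gff"],
    coverage=("forward.bedgraph", "reverse.bedgraph"),
)
print(" ".join(args))
```

The printed arguments would be appended to the <code>java -jar ... CLI AnnotationEvidence</code> call shown above.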
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. This web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
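If you want to rank or filter predictions by hand using these attributes, the ninth GFF column can be parsed with a few lines of code. The following Python sketch is only an illustration: the function names, the thresholds and the example attribute string are made up, and the scale of tie and tpc (fraction or percent) should be checked against your own files before choosing thresholds. Values given as NA (as used for tie of single coding exon genes) are not counted against a prediction.

```python
def parse_attributes(column):
    """Split the attributes column of a GFF line into a dictionary."""
    attributes = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def as_number(attributes, name):
    """Return an attribute as float, or None if it is absent or NA."""
    value = attributes.get(name)
    if value in (None, "", "NA"):
        return None
    return float(value)


def keep_prediction(attributes, min_tie, min_tpc):
    """Keep a prediction unless tie or tpc is present and below its threshold."""
    tie = as_number(attributes, "tie")
    tpc = as_number(attributes, "tpc")
    tie_ok = tie is None or tie >= min_tie
    tpc_ok = tpc is None or tpc >= min_tpc
    return tie_ok and tpc_ok


# Made-up attribute string, for illustration only.
example = "ID=example_R0;score=412;tie=0.5;tpc=0.9;minSplitReads=3"
attributes = parse_attributes(example)
print(attributes["tie"], keep_prediction(attributes, min_tie=1.0, min_tpc=0.5))
```

For routine use, GAF performs this kind of filtering automatically; the sketch is only meant to show how the attributes can be accessed.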
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
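The first check above (identical contig/chromosome names in annotation and genome) can be automated. The following Python sketch uses hypothetical helper names and compares the first column of the annotation with the first whitespace-delimited token of each FASTA header; it does not handle special cases such as an embedded <code>##FASTA</code> section in the GFF.

```python
def fasta_names(lines):
    """Contig names: first whitespace-delimited token of each FASTA header."""
    return {line[1:].split()[0]
            for line in lines
            if line.startswith(">") and line[1:].strip()}


def annotation_names(lines):
    """Sequence names used in the first column of a GFF/GTF file."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_contigs(annotation_lines, fasta_lines):
    """Names that occur in the annotation but not in the genome file."""
    return sorted(annotation_names(annotation_lines) - fasta_names(fasta_lines))


# Tiny made-up inputs; in practice pass open file handles instead of lists.
gff = ["##gff-version 3\n",
       "chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tParent=t1\n",
       "Chr2\tsrc\tCDS\t5\t64\t.\t-\t0\tParent=t2\n"]
fasta = [">chr1 some description\n", "ACGT\n", ">chr2\n", "ACGT\n"]
print(missing_contigs(gff, fasta))
```

An empty result means that every sequence name used in the annotation is present in the genome file.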
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns, adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
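The length condition mentioned above is easy to check for a given intron. The following Python sketch assumes GFF-style coordinates (1-based, start and end inclusive); the function name is hypothetical.

```python
def preserves_reading_frame(start, end):
    """True if an intron with inclusive 1-based coordinates has a length divisible by 3."""
    length = end - start + 1
    return length % 3 == 0


# An intron from 101 to 190 is 90 bp long, one from 101 to 191 is 91 bp long.
print(preserves_reading_frame(101, 190), preserves_reading_frame(101, 191))
```

Only introns for which this check holds can be gained or lost without shifting the reading frame of the downstream coding sequence.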
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, which is performed if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
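The VM options have to be placed before <code>-jar</code>, otherwise they are handed to the program as ordinary arguments. The following Python sketch assembles such a call; the helper name, the jar file name and the parameter values are placeholders, and the memory sizes are just the example values from the answer above.

```python
def java_command(jar, tool, parameters, xms="5G", xmx="50G"):
    """Assemble a java call with heap options placed before -jar."""
    return ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar, "CLI", tool, *parameters]


# Placeholder jar name and parameters, for illustration only.
command = java_command("GeMoMa.jar", "GeMoMaPipeline", ["t=target.fa", "outdir=results"])
print(" ".join(command))
```

The resulting list can be passed to <code>subprocess.run</code> or joined into a shell command line.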
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
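The described format (amino acid, tab, comma-separated triplets) can be validated before handing the file to GeMoMa. The following Python sketch is an illustration with hypothetical function names and a made-up excerpt of a table; it checks that each triplet uses only A, C, G and T and is assigned once, and it offers a small helper to convert triplets from the RNA alphabet used by NCBI.

```python
def parse_genetic_code(lines):
    """Read a two-column, tab-delimited genetic code into a triplet -> amino acid map."""
    table = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # GeMoMa works on genomic DNA, so U is not accepted here.
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in table:
                raise ValueError(f"line {number}: triplet {triplet} assigned twice")
            table[triplet] = amino_acid
    return table


def rna_to_dna(triplet):
    """Convert a triplet from the RNA alphabet (U) to the DNA alphabet (T)."""
    return triplet.upper().replace("U", "T")


# Made-up excerpt; a complete table lists all 64 triplets.
excerpt = ["K\tAAA,AAG\n", "M\tATG\n", "W\tTGG\n"]
table = parse_genetic_code(excerpt)
print(table["ATG"], rna_to_dna("AUG"))
```

A complete file would additionally be expected to cover all 64 triplets, which can be checked with <code>len(table) == 64</code>.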
; I like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=&lt;your_number&gt;</code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1091GeMoMa2020-03-30T08:41:45Z<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa exploits the conservation of amino acid sequences and of intron positions. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. This web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever (NRR), which retrieves data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
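<br />
As a rough sketch of how such a call could look, the following shell snippet assembles an NRR command from the parameters listed above. The jar file name and <code>reference_list.txt</code> are hypothetical placeholders, and the values for <code>r</code> and <code>n</code> simply repeat the documented defaults. The command is printed first and only executed if the jar is found in the current directory.<br />

```shell
# Hypothetical jar name; substitute the GeMoMa version you downloaded.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Parameters are passed as <parameter>=<value> pairs.
# reference_list.txt is a placeholder for your own list of reference organisms.
set -- CLI NRR \
  r=references/ \
  n=10 \
  rl=reference_list.txt \
  outdir=.

# Show the assembled command.
echo "java -jar $GEMOMA_JAR $*"

# Run it only if the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then
  java -jar "$GEMOMA_JAR" "$@"
fi
```

Parameters that have a default (here <code>r</code>, <code>n</code> and <code>outdir</code>) could presumably be left out; they are spelled out only to make the call easier to read.<br />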
<br />
=== GeMoMaPipeline ===<br />
<br />
To run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of that server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
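<br />
The parameters above can be combined as in the following dry-run sketch. The jar version is taken from the pipeline call at the top of this page; the BAM file names and the output directory are hypothetical placeholders. The sketch only assembles and prints the call, so it can be inspected before anything is executed.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# Assemble the ERE call: stranded library, two BAM files (m can be repeated),
# coverage output switched on, explicit output directory.
set -- java -jar "$GEMOMA_JAR" CLI ERE \
    s=FR_FIRST_STRAND \
    m=sample1.bam m=sample2.bam \
    c=true \
    outdir=ere_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and BAM files exist
```

Setting <code>c=true</code> is only needed if the coverage is to be used by later steps; it defaults to false.<br />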
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
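<br />
As an illustration of the table above, a CheckIntrons call might be assembled as follows. This is a dry-run sketch: the target genome, the intron file and the output directory are hypothetical names, and the command is printed rather than executed.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# Target genome plus one intron GFF (i can be repeated), verbose output on.
set -- java -jar "$GEMOMA_JAR" CLI CheckIntrons \
    t=target.fasta \
    i=introns.gff \
    v=true \
    outdir=checkintrons_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and input files exist
```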
<br />
=== DenoiseIntrons ===<br />
This tool removes potentially incorrectly extracted introns. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
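<br />
The nested coverage parameters can be read as "first choose the selection, then give the file(s) that belong to it". The following dry-run sketch assumes that the selection <code>c=UNSTRANDED</code> is followed directly by its nested parameter <code>coverage_unstranded</code>; this ordering is inferred from the table layout and is not stated explicitly on this page. All file and directory names are hypothetical.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# Introns plus unstranded coverage; me and context are set to their
# documented defaults only to make them visible in the call.
# Assumption: the nested coverage file follows its selection c=UNSTRANDED.
set -- java -jar "$GEMOMA_JAR" CLI DenoiseIntrons \
    i=introns.gff \
    c=UNSTRANDED coverage_unstranded=coverage.bedgraph \
    me=0.01 \
    context=10 \
    outdir=denoise_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and input files exist
```

For stranded data, the selection would be <code>c=STRANDED</code> with both <code>coverage_forward</code> and <code>coverage_reverse</code> given.<br />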
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
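<br />
A typical Extractor call needs only the reference annotation and the reference genome. The following dry-run sketch additionally requests the protein and CDS sequences; the file names and the output directory are hypothetical, and the command is printed rather than executed.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# Reference annotation and genome are the two non-optional FILE parameters;
# p and c additionally request protein and CDS sequences.
set -- java -jar "$GEMOMA_JAR" CLI Extractor \
    a=reference.gff \
    g=reference.fasta \
    p=true \
    c=true \
    outdir=extractor_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and input files exist
```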
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
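<br />
To show how the inputs fit together, the following dry-run sketch assembles a GeMoMa call from search results, the target genome, the CDS parts and the assignment file, plus one intron file filtered to introns with at least two supporting split reads. All file and directory names are hypothetical; the command is printed rather than executed.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# s, t and c are the required inputs; a (assignment) is optional.
# i can be given zero or multiple times; r=2 keeps only introns with
# at least two supporting split reads.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMa \
    s=search.tabular \
    t=target.fasta \
    c=cds-parts.fasta \
    a=assignment.tabular \
    i=denoised_introns.gff \
    r=2 \
    outdir=gemoma_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and input files exist
```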
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
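<br />
Because filter expressions contain quotes, spaces, <code>*</code> and <code>&gt;</code>, they need to be protected from the shell. The following dry-run sketch keeps the filter in a double-quoted variable and passes it as a single argument; it uses the "more sophisticated" filter quoted in the table above. The two prediction files and the output directory are hypothetical names, and the command is printed rather than executed.<br />
<br />
```shell
# Dry-run sketch; all file and directory names are hypothetical.
GEMOMA_JAR=GeMoMa-1.9.jar

# Double quotes keep the single quotes, the '*' and the '>' literal,
# so the shell neither globs nor redirects.
FILTER="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# Two prediction files from different reference species (g can be repeated),
# at most three transcripts per gene, custom filter as one argument.
set -- java -jar "$GEMOMA_JAR" CLI GAF \
    g=predictions_speciesA.gff \
    g=predictions_speciesB.gff \
    m=3 \
    "f=$FILTER" \
    outdir=gaf_out

printf '%s\n' "$*"   # show the assembled command
# "$@"               # uncomment to execute once the jar and input files exist
```

The same quoting approach applies to the <code>atf</code> parameter, which uses the same filter syntax.<br />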
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
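
As a minimal sketch of how the parameters above fit together, the following Python snippet only assembles an AnnotationEvidence command line and prints it; it does not run GeMoMa. The file names and the helper name <code>annotation_evidence_command</code> are hypothetical, and it is an assumption that the coverage selection and its nested parameter are passed as two separate <code>key=value</code> pairs.

```python
from typing import List, Optional


def annotation_evidence_command(
    jar: str,
    annotation: str,
    genome: str,
    introns: Optional[List[str]] = None,
    min_reads: int = 1,
    coverage_unstranded: Optional[str] = None,
    annotation_output: bool = False,
    outdir: str = ".",
) -> List[str]:
    """Assemble (but do not execute) an AnnotationEvidence command line."""
    if min_reads < 1:
        raise ValueError("r (reads) must be at least 1")
    cmd = ["java", "-jar", jar, "CLI", "AnnotationEvidence",
           f"a={annotation}", f"g={genome}"]
    # 'i' can be used multiple times, once per introns file.
    for path in introns or []:
        cmd.append(f"i={path}")
    cmd.append(f"r={min_reads}")
    if coverage_unstranded is not None:
        # Assumption: selection and nested parameter are two key=value pairs.
        cmd += ["c=UNSTRANDED", f"coverage_unstranded={coverage_unstranded}"]
    cmd.append(f"ao={'true' if annotation_output else 'false'}")
    cmd.append(f"outdir={outdir}")
    return cmd


# Hypothetical file names, for illustration only.
example = annotation_evidence_command(
    "GeMoMa-1.9.jar", "annotation.gff", "genome.fa",
    introns=["introns.gff"],
    coverage_unstranded="coverage.bedgraph",
    annotation_output=True,
    outdir="evidence_out",
)
print(" ".join(example))
```

Returning a list of arguments instead of one string keeps paths with spaces intact if the list is later handed to <code>subprocess.run</code>.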
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
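
The <code>rename</code> selection above has different nested parameters per choice. The following Python sketch shows one way to assemble and validate these arguments before calling AnnotationFinalizer. The helper name <code>rename_arguments</code>, the prefix <code>GENE</code>, and the input file name are hypothetical; treating the prefix as required is an assumption based on the table listing no default for it.

```python
from typing import List

RENAME_MODES = ("COMPOSED", "SIMPLE", "NO")


def rename_arguments(mode: str = "COMPOSED", prefix: str = "",
                     infix: str = "G", suffix: str = "0",
                     digits: int = 5) -> List[str]:
    """Build the key=value arguments for the 'rename' selection."""
    if mode not in RENAME_MODES:
        raise ValueError(f"rename must be one of {RENAME_MODES}")
    args = [f"rename={mode}"]
    if mode == "NO":
        # No nested parameters for selection "NO".
        return args
    if not 4 <= digits <= 10:
        raise ValueError("d (digits) must be in the range [4, 10]")
    if not prefix:
        # Assumption: the table lists no default for p, so require it.
        raise ValueError("p (prefix) must be given")
    args.append(f"p={prefix}")
    if mode == "COMPOSED":
        # infix and suffix only exist for selection "COMPOSED".
        args += [f"infix={infix}", f"s={suffix}"]
    args.append(f"d={digits}")
    return args


# Hypothetical input file and prefix, for illustration only.
command = ["java", "-jar", "GeMoMa-1.9.jar", "CLI", "AnnotationFinalizer",
           "a=filtered_predictions.gff"]
command += rename_arguments("SIMPLE", prefix="GENE", digits=6)
print(" ".join(command))
```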
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
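
If you want to filter predictions by hand using these attributes, a small script can be sufficient. The following Python sketch parses the attribute column of a GFF line and applies thresholds on tie and iAA. The example lines, the thresholds, and the function names are hypothetical. The sketch assumes that the values are stored as fractions between 0 and 1; adjust the thresholds if your files use a 0 to 100 scale. Missing values (NA, or "?" in older versions) are not used for rejection.

```python
from typing import Dict, Optional


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the GFF attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def as_float(value: Optional[str]) -> Optional[float]:
    """Return None for absent or undefined values (NA, ?), else a float."""
    if value is None or value in ("NA", "?"):
        return None
    return float(value)


def keep_prediction(gff_line: str, min_tie: float = 1.0,
                    min_iaa: float = 0.5) -> bool:
    """Keep a prediction unless tie or iAA is defined and below threshold."""
    cols = gff_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return False
    attrs = parse_attributes(cols[8])
    tie = as_float(attrs.get("tie"))
    iaa = as_float(attrs.get("iAA"))
    if tie is not None and tie < min_tie:
        return False
    if iaa is not None and iaa < min_iaa:
        return False
    return True


# Hypothetical GFF lines, for illustration only.
lines = [
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=g1_R0;score=350;tie=1;iAA=0.83",
    "chr1\tGeMoMa\tprediction\t2000\t2900\t.\t-\t.\tID=g2_R0;score=120;tie=0.5;iAA=0.70",
    "chr2\tGeMoMa\tprediction\t50\t400\t.\t+\t.\tID=g3_R0;score=90;tie=NA;iAA=0.90",
]
kept = [parse_attributes(l.split("\t")[8])["ID"] for l in lines if keep_prediction(l)]
print(kept)
```

For routine filtering, GAF offers the same kind of attribute-based filter without custom scripts.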
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: Several parameters affect the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa first makes a preliminary prediction and uses it to decide whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing the contig threshold increases the number of predictions that are (internally) computed, while increasing the number of predictions allows GeMoMa to output more of the computed predictions. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The three most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it lacks an initial start codon, a final stop codon, or both. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: The Extractor removes transcripts for several reasons. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' provide a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which returns the same predictions but does not include all statistics. The protein alignment can also increase memory usage if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which returns the same predictions but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn is the default search algorithm in GeMoMa. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
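
The genetic code format described in the FAQs above (two tab-delimited columns: the one-letter amino acid code and a comma-separated list of DNA triplets) can be checked with a short script before passing the file to Extractor or GeMoMa. The following Python sketch is only an illustration: the three-line excerpt is hypothetical and not the complete default table, and using <code>*</code> as the symbol for stop codons is an assumption, so please compare with the linked default file.

```python
from typing import Dict


def parse_genetic_code(text: str) -> Dict[str, str]:
    """Parse 'amino acid<TAB>triplet,triplet,...' lines into codon -> amino acid."""
    codon_to_aa: Dict[str, str] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_no}: expected two tab-delimited columns")
        amino_acid, triplets = columns[0].strip(), columns[1]
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            # Genomic DNA is expected: A, C, G, T, and not U.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {line_no}: invalid triplet '{codon}'")
            if codon in codon_to_aa:
                raise ValueError(f"line {line_no}: duplicate triplet '{codon}'")
            codon_to_aa[codon] = amino_acid
    return codon_to_aa


# Hypothetical excerpt; '*' for stop codons is an assumption.
excerpt = "M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n"
code = parse_genetic_code(excerpt)
print(sorted(code.items()))
```

A complete table should assign each of the 64 triplets exactly once, which can be verified with <code>len(code) == 64</code> after parsing the full file.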
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1090GeMoMa2020-03-30T08:40:36Z<p>Keilwagen: /* FAQs */ runtime</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa uses amino acid sequence conservation and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. This web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
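
The Java requirement above can be checked before the first run. The following is a minimal sketch, not part of GeMoMa: the helper name <code>java_major</code> is hypothetical, and the parsing assumes that <code>java -version</code> prints the version as the first quoted token on its first line.

```shell
# Hypothetical helper (not part of GeMoMa): extract the major Java version
# from a version string, e.g. "1.8.0_292" -> 8, "11.0.2" -> 11, "17" -> 17.
java_major() {
    v=$1
    case "$v" in
        1.*) v=${v#1.} ;;   # legacy "1.x" scheme: drop the leading "1."
    esac
    v=${v%%[._-]*}          # keep only the leading number
    printf '%s\n' "$v"
}

# Only query the installed Java if one is on the PATH.
if command -v java >/dev/null 2>&1; then
    # "java -version" writes to stderr; take the first quoted token of line 1.
    installed=$(java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p')
    if [ -n "$installed" ] && [ "$(java_major "$installed")" -ge 8 ]; then
        echo "Java $installed is sufficient"
    else
        echo "Could not confirm Java 1.8 or later" >&2
    fi
fi
```

The legacy scheme (1.8) and the newer scheme (9, 11, 17, ...) are both mapped to a single major number, so one numeric comparison is enough.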
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever, which allows you to retrieve data for reference organisms easily from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
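
As an illustration of the parameters in the table above, the following sketch assembles an NRR call and prints it for inspection. The jar path <code>GeMoMa.jar</code> and the file name <code>reference_list.txt</code> are assumptions; the format of the reference list itself is not described here, so the file is only referenced by name.

```shell
# Placeholders (assumptions): point these at your own files.
GEMOMA_JAR=${GEMOMA_JAR:-GeMoMa.jar}       # path to the GeMoMa jar file
REF_LIST=${REF_LIST:-reference_list.txt}   # hypothetical name of the reference list

# Print the NRR call; every parameter is passed as <parameter>=<value>.
# r, n and outdir are set to the defaults listed in the table.
nrr_command() {
    echo java -jar "$GEMOMA_JAR" CLI NRR \
        rl="$REF_LIST" r=references/ n=10 outdir=.
}

# Show the command first; copy it to the shell to start the download.
nrr_command
```

Printing the command instead of executing it makes it easy to verify the parameter values before any download starts.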
<br />
=== GeMoMaPipeline ===<br />
<br />
To run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
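
To illustrate how the parameters above combine, the following sketch assembles a pipeline call with one reference species (selection "own") and mapped RNA-seq reads. All file names, the ID <code>ref1</code>, the jar path and the <code>DRY_RUN</code> switch are assumptions for illustration. The sketch also assumes that the parameters of a species are listed directly after its <code>s=</code> selection. Since the default COMPOSED renaming lists a prefix without a default value, <code>AnnotationFinalizer.r=NO</code> is used here to keep the call short.

```shell
GEMOMA_JAR=${GEMOMA_JAR:-GeMoMa.jar}   # assumed path to the GeMoMa jar file
DRY_RUN=${DRY_RUN:-1}                  # 1: only print the arguments

# Assemble the GeMoMaPipeline call (all file names are placeholders):
# - s=own starts one reference species, followed by its i, a and g
# - r=MAPPED selects RNA-seq evidence from mapped reads (ERE.* parameters)
# - GAF.f contains spaces, quotes and '*', so it must be double-quoted
gemoma_pipeline() {
    set -- CLI GeMoMaPipeline \
        threads=4 outdir=results \
        AnnotationFinalizer.r=NO \
        t=target.fa \
        s=own i=ref1 a=ref1.gff g=ref1.fa \
        r=MAPPED ERE.s=FR_UNSTRANDED ERE.m=rnaseq.bam \
        "GAF.f=start=='M' and stop=='*' and score/AA>=0.75"
    if [ "$DRY_RUN" = 1 ]; then
        # one argument per line, so quoting problems are easy to spot
        printf '%s\n' "$@"
    else
        java -jar "$GEMOMA_JAR" "$@"
    fi
}

gemoma_pipeline
```

With <code>DRY_RUN=1</code> the function only lists the arguments, one per line. The filter expression appears as a single line in that listing, which confirms that the shell passes it to Java as one argument.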
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
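<br />
As an illustration, a call for stranded data with two mapped-read files could look like the following sketch. The jar version and all file and directory names are placeholders (assumptions), and the script only executes the command if the jar is found in the working directory.<br />

```shell
# Placeholders (assumptions): jar version, BAM file names, output directory.
JAR="GeMoMa-1.9.jar"
# m= is repeated once per BAM/SAM file; c=true also requests the coverage output.
set -- java -jar "$JAR" CLI ERE \
  s=FR_FIRST_STRAND \
  m=sample1.bam m=sample2.bam \
  c=true mmq=40 outdir=ere_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```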
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
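<br />
A minimal sketch of a CheckIntrons call is given below. The file names are placeholders (assumptions); the intron file would typically be one extracted from RNA-seq data.<br />

```shell
# Placeholders (assumptions): jar version, genome and intron file names, output directory.
JAR="GeMoMa-1.9.jar"
# t= is the target genome, i= may be repeated for several intron files.
set -- java -jar "$JAR" CLI CheckIntrons \
  t=target.fasta \
  i=introns.gff \
  v=true outdir=checkintrons_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```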
<br />
=== DenoiseIntrons ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
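<br />
The following sketch shows how the nested coverage selection might be passed for unstranded data. It assumes that the selection (<code>c=UNSTRANDED</code>) is followed by the parameter listed for that selection (<code>coverage_unstranded</code>); the file names are placeholders.<br />

```shell
# Placeholders (assumptions): jar version, intron and coverage file names, output directory.
JAR="GeMoMa-1.9.jar"
# c= selects the coverage type; coverage_unstranded= names the file for that selection.
# m, me and context are set to their documented defaults for illustration.
set -- java -jar "$JAR" CLI DenoiseIntrons \
  i=introns.gff \
  c=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  m=15000 me=0.01 context=10 outdir=denoise_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```

For stranded data, the selection would presumably be <code>c=STRANDED</code> together with <code>coverage_forward</code> and <code>coverage_reverse</code>.<br />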
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
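<br />
A call that additionally requests the protein and CDS sequences and enables the repair option could be sketched as follows. The reference file names and the output directory are placeholders (assumptions).<br />

```shell
# Placeholders (assumptions): jar version, reference annotation/genome names, output directory.
JAR="GeMoMa-1.9.jar"
# a= reference annotation, g= reference genome; p, c and r are boolean flags.
set -- java -jar "$JAR" CLI Extractor \
  a=reference_annotation.gff \
  g=reference_genome.fasta \
  p=true c=true r=true \
  outdir=extractor_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```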
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
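<br />
The sketch below combines search results with RNA-seq introns and unstranded coverage. All file names are placeholders (assumptions); the CDS parts, assignment and protein files are presumably those prepared with Extractor, and the nested coverage parameter is assumed to follow its selection.<br />

```shell
# Placeholders (assumptions): jar version, all input file names, output directory.
JAR="GeMoMa-1.9.jar"
# s= search results, t= target genome, c= CDS parts, a= assignment, q= query proteins.
# i= and coverage= add RNA-seq evidence; sort and Score control handling of the search results.
set -- java -jar "$JAR" CLI GeMoMa \
  s=search_results.tabular \
  t=target.fasta \
  c=cds-parts.fasta a=assignment.tabular q=proteins.fasta \
  i=introns.gff \
  coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  sort=true Score=ReAlign \
  outdir=gemoma_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```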
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows these attributes to be used as sorting or filter criteria. It is especially useful if the gene annotation files were obtained from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
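<br />
Because the filter expressions contain spaces, quotes and the character <code>&gt;</code>, they have to be protected from the shell. The following sketch merges two prediction files and passes the documented default filters explicitly; the GFF file names and the output directory are placeholders (assumptions).<br />

```shell
# Placeholders (assumptions): jar version, prediction file names, output directory.
JAR="GeMoMa-1.9.jar"
# Filter expressions taken from the defaults listed in the table above.
FILTER="start=='M' and stop=='*' and score/AA>=0.75"
ALT_FILTER="tie==1 or evidence>1"
# g= is repeated once per prediction file; each filter is passed as ONE quoted argument.
set -- java -jar "$JAR" CLI GAF \
  g=ref1_predictions.gff g=ref2_predictions.gff \
  "f=$FILTER" "atf=$ALT_FILTER" \
  outdir=gaf_out
NARGS=$#
# Print one argument per line so the quoting is visible.
printf '%s\n' "$@"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```

Without the double quotes, the shell would split the filter at the spaces and interpret <code>&gt;</code> as an output redirection.<br />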
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
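<br />
A comparison of a prediction with an existing annotation could be sketched as follows. The file names are placeholders (assumptions), and the assignment file is assumed to be the transcript info of the reference used for the prediction.<br />

```shell
# Placeholders (assumptions): jar version, annotation file names, output directory.
JAR="GeMoMa-1.9.jar"
# p= predicted annotation, a= annotation to compare against, assignment= transcript info.
set -- java -jar "$JAR" CLI CompareTranscripts \
  p=predicted_annotation.gff \
  a=existing_annotation.gff \
  assignment=assignment.tabular \
  outdir=compare_out
CMD="$*"
# Show the assembled command line.
printf '%s\n' "$CMD"
# Execute only if the jar is actually available.
if [ -f "$JAR" ]; then "$@"; fi
```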
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
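<br />
As a minimal sketch of how the parameters above combine, the following Python snippet only assembles an argument list for AnnotationFinalizer; it does not run anything. The file names and the prefix are hypothetical, and the assumption that nested parameters (g, i, p) are simply passed alongside their selection (u=YES, rename=COMPOSED) should be checked against the tool's own help output.<br />

```python
import shlex

def annotation_finalizer_command(jar, annotation, genome=None, introns=(), prefix=None, outdir="."):
    """Assemble an AnnotationFinalizer call as an argument list (nothing is executed)."""
    args = ["java", "-jar", jar, "CLI", "AnnotationFinalizer", f"a={annotation}"]
    if genome is not None:
        # u=YES selects UTR prediction; the table lists g (genome) and i (introns, repeatable) for it
        args += ["u=YES", f"g={genome}"]
        args += [f"i={path}" for path in introns]
    else:
        args.append("u=NO")
    if prefix is not None:
        # rename=COMPOSED lists p (prefix) without a default; infix, suffix and digits keep theirs
        args += ["rename=COMPOSED", f"p={prefix}"]
    else:
        args.append("rename=NO")
    args.append(f"outdir={outdir}")
    return args

# hypothetical input files and prefix, for illustration only
cmd = annotation_finalizer_command(
    "GeMoMa-1.9.jar",
    "filtered_predictions.gff",
    genome="target.fa",
    introns=["introns.gff"],
    prefix="Tgt",
)
print(shlex.join(cmd))
```

Building the call as a list rather than a single string keeps each <code>name=value</code> pair intact if a path contains spaces.<br />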
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
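<br />
The attributes in the table can also be evaluated by hand. The following Python sketch splits the attribute column of one GFF line and applies a simple filter on tie and iAA. The example line, its values (assumed here to be fractions between 0 and 1) and the thresholds are made up for illustration; they are not defaults of GeMoMa or GAF.<br />

```python
def parse_attributes(column9):
    """Split the ninth GFF column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes

def as_float(attributes, key):
    """Return a numeric attribute, or None if it is absent or not a number (e.g. NA)."""
    try:
        return float(attributes[key])
    except (KeyError, ValueError):
        return None

# made-up example line; real values and attribute order will differ
line = "chr1\tGeMoMa\tprediction\t1000\t2500\t.\t+\t.\tID=gene1_R0;score=412;tie=1;tpc=0.85;iAA=0.91;pAA=0.95"
attrs = parse_attributes(line.split("\t")[8])

# hypothetical thresholds: all introns supported and at least half of the amino acids identical
tie = as_float(attrs, "tie")
keep = tie is not None and tie >= 1.0 and (as_float(attrs, "iAA") or 0.0) >= 0.5
print(attrs["ID"], keep)
```

Treating non-numeric values as missing matters because tie, tae, tde and minSplitReads may be reported as NA, for instance for genes with a single coding exon.<br />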
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p lets GeMoMa output more of the predictions that have been computed, whereas decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it lacks an initial start codon or a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
; I would like to accelerate GeMoMa. What can I do?<br />
:If you would like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set <code>threads=<your_number></code>.<br />
:In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set <code>tblastn=false</code>. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.<br />
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.<br />
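<br />
The genetic code format described in the FAQ above (two tab-delimited columns: one-letter amino acid code, then a comma-separated list of DNA triplets) can be checked before it is passed to GeMoMa. The following Python sketch is not part of GeMoMa; the three example rows, including the use of <code>*</code> for stop codons, are assumptions for illustration, and the linked default file remains the authoritative template.<br />

```python
import io

def read_genetic_code(handle):
    """Read a two-column, tab-delimited genetic code into a codon -> amino acid dict."""
    codon_to_aa = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t")
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            # genomic DNA is expected, so U (as used in mRNA tables) is rejected
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"not a DNA triplet: {codon!r}")
            if codon in codon_to_aa:
                raise ValueError(f"codon listed twice: {codon}")
            codon_to_aa[codon] = amino_acid.strip()
    return codon_to_aa

# three made-up rows for illustration; a complete table covers all 64 codons
example = io.StringIO("M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n")
code = read_genetic_code(example)
print(len(code), code["ATG"])
```

A check like this catches tables copied from the NCBI page without converting U to T, since those triplets are reported as invalid.<br />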
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1071GeMoMa2020-03-09T08:32:34Z<p>Keilwagen: /* GeMoMaPipeline */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever (NRR), which retrieves data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
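<br />
As an illustration, the following sketch assembles an NRR call from the parameters listed above and prints it for inspection instead of executing it. The jar version (1.9), the list file name <code>reference_list.txt</code> and the output directory <code>nrr_out</code> are assumptions made for this example; the expected layout of the reference list is not described on this page.<br />

```shell
#!/bin/sh
# Hypothetical names: adjust the jar version and paths to your installation.
JAR="GeMoMa-1.9.jar"
LIST="reference_list.txt"   # list of reference organisms (parameter rl)

# Assemble the call as positional parameters so each argument stays one word.
set -- java -jar "$JAR" CLI NRR \
  "rl=$LIST" \
  "r=references/" \
  "n=10" \
  "outdir=nrr_out"

# Dry run: show the command. Replace the next two lines with "$@" to execute it.
cmd="$*"
printf '%s\n' "$cmd"
```

The values given for <code>r</code> and <code>n</code> are the documented defaults and could be omitted.<br />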
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">DenoiseIntrons.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
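<br />
A minimal sketch of a pipeline call with one reference species and mapped RNA-seq reads is given below. It only uses parameters from the table above and prints the assembled arguments instead of running the pipeline. The jar version, the ID <code>ref1</code> and all file names are assumptions made for this example. The filter is set to the documented default purely to show how a value containing spaces, quotes and <code>&gt;</code> can be protected from the shell.<br />

```shell
#!/bin/sh
# Hypothetical names: adjust the jar version and all file paths.
JAR="GeMoMa-1.9.jar"

# The filter value contains spaces, quotes and '>', so it is passed as one
# double-quoted word to keep the shell from splitting or redirecting it.
FILTER="start=='M' and stop=='*' and score/AA>=0.75"

set -- java -jar "$JAR" CLI GeMoMaPipeline \
  "threads=4" "outdir=pipeline_out" \
  "t=target.fa" \
  "s=own" "i=ref1" "a=ref1.gff" "g=ref1.fa" \
  "r=MAPPED" "ERE.m=mapped.bam" \
  "GAF.f=$FILTER"

# Dry run: print one argument per line to make the word boundaries visible.
# Replace the printf call with "$@" to execute the pipeline.
nargs=$#
printf '%s\n' "$@"
```

According to the table, the block starting with <code>s</code> can be given multiple times, so further reference species would presumably be added as additional <code>s=... i=... a=... g=...</code> groups.<br />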
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
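<br />
The following sketch shows how an ERE call for two stranded BAM files might be assembled; it prints the command instead of executing it. The jar version, the BAM file names and the output directory are assumptions made for this example, and repeating <code>m</code> once per file is inferred from the note that this parameter can be used multiple times.<br />

```shell
#!/bin/sh
# Hypothetical names: adjust the jar version and paths to your data.
JAR="GeMoMa-1.9.jar"

# s selects the library type, m is repeated once per mapped reads file,
# and c=true additionally requests the coverage output.
set -- java -jar "$JAR" CLI ERE \
  "s=FR_FIRST_STRAND" \
  "m=replicate1.bam" "m=replicate2.bam" \
  "c=true" \
  "outdir=ere_out"

# Dry run: show the command. Replace the next two lines with "$@" to execute it.
cmd="$*"
printf '%s\n' "$cmd"
```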
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
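The kind of check described above can be illustrated with a minimal Python sketch. It is not the CheckIntrons implementation; it only tallies the first and last two bases of each intron. The function names, the toy genome, and the assumption that intron coordinates are 1-based and inclusive (as in GFF) are all chosen for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice_site_dinucleotides(genome, introns):
    """Tally (donor, acceptor) dinucleotide pairs.

    genome:  dict mapping contig name -> sequence
    introns: iterable of (contig, start, end, strand); coordinates are
             assumed to be 1-based and inclusive and to span the intron.
    """
    counts = Counter()
    for contig, start, end, strand in introns:
        seq = genome[contig][start - 1:end].upper()
        if strand == "-":
            # Introns on the reverse strand are read from the other strand.
            seq = reverse_complement(seq)
        counts[(seq[:2], seq[-2:])] += 1
    return counts


# Toy data (hypothetical): one intron per strand.
genome = {"chr1": "AAAGTCCCCAGTTT", "chr2": "TTCTGGGGACTT"}
introns = [("chr1", 4, 11, "+"), ("chr2", 3, 10, "-")]
counts = splice_site_dinucleotides(genome, introns)
print(counts.most_common())
```

In this toy input both introns are counted under the canonical GT-AG pair; a large share of other pairs in real data would hint at problems with the extracted introns or with strand information.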
<br />
=== DenoiseIntrons ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
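Because <code>i</code> and <code>c</code> can be given several times, it can help to assemble the argument list programmatically. The following Python sketch builds such a call from (name, value) pairs. The helper name, the file names and the jar file name are hypothetical, and it is an assumption of this sketch that the selection <code>c=UNSTRANDED</code> is followed directly by its <code>coverage_unstranded</code> file.

```python
def build_command(jar, tool, params):
    """Return an argument list for: java -jar <jar> CLI <tool> name=value ...

    params is a list of (name, value) pairs instead of a dict, because
    some parameters may be used multiple times.
    """
    cmd = ["java", "-jar", jar, "CLI", tool]
    for name, value in params:
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"{name}={value}")
    return cmd


# Hypothetical input files; parameter names are taken from the table above.
cmd = build_command(
    "GeMoMa-1.9.jar",
    "DenoiseIntrons",
    [
        ("i", "sample1_introns.gff"),
        ("i", "sample2_introns.gff"),
        ("c", "UNSTRANDED"),
        ("coverage_unstranded", "sample1_coverage.bedgraph"),
        ("m", 15000),
        ("me", 0.01),
        ("context", 10),
        ("outdir", "denoised"),
    ],
)
print(" ".join(cmd))
```

The resulting list could be passed to <code>subprocess.run(cmd, check=True)</code>; passing a list avoids shell quoting problems with file names that contain spaces.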
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
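The three options of the <code>Ambiguity</code> parameter can be illustrated with a small Python sketch that translates a single codon containing IUPAC ambiguity codes. This is not the Extractor code; it assumes the standard genetic code, and it assumes that a codon whose possible translations all agree is treated as unambiguous. The function and exception names are hypothetical.

```python
import itertools
import random

# Standard genetic code, codons ordered by base T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


class AmbiguousCodonError(ValueError):
    """Signals that the transcript containing this codon should be dropped."""


def translate_codon(codon, mode="EXCEPTION", rng=random):
    """Translate one codon, resolving ambiguity according to mode."""
    expansions = itertools.product(*(IUPAC[base] for base in codon.upper()))
    options = sorted({CODON_TABLE["".join(c)] for c in expansions})
    if len(options) == 1:
        return options[0]           # all expansions agree
    if mode == "EXCEPTION":
        raise AmbiguousCodonError(codon)
    if mode == "AMBIGUOUS":
        return "X"
    if mode == "RANDOM":
        return rng.choice(options)  # pick one of the possible amino acids
    raise ValueError(f"unknown mode: {mode}")


print(translate_codon("ATG"), translate_codon("AAN", mode="AMBIGUOUS"))
```

With <code>mode="EXCEPTION"</code> the caller would catch the exception and skip the whole transcript, which mirrors the default behaviour described in the table.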
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
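The layout of the list file for the <code>selected</code> parameter can be made concrete with a short Python reader. It follows the column description in the table above: column 1 is the transcript ID and columns 2 to 5 optionally give chromosome, strand, start and end. Tab-separated columns, the function name and the transcript IDs are assumptions of this sketch.

```python
import io
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    strand: str
    start: int
    end: int


def read_selected(handle):
    """Map transcript ID -> optional target Region from a list file.

    Lines with fewer than five columns select the transcript without
    restricting the target region (value None).
    """
    selected = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        region = None
        if len(cols) >= 5:
            region = Region(cols[1], cols[2], int(cols[3]), int(cols[4]))
        selected[cols[0]] = region
    return selected


# Hypothetical list file with one plain ID and one ID with a target region.
example = io.StringIO("tx1\ntx2\tchr1\t+\t100\t900\n")
selected = read_selected(example)
print(selected)
```

Reading the file this way before a run is a cheap sanity check that the IDs match those in the reference annotation.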
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
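To make the default filter expression more tangible, the following Python sketch applies the same criterion (start=='M' and stop=='*' and score/AA>=0.75) to the attribute column of a GFF line. It does not reproduce the GAF filter parser; the function names and the example attribute strings are hypothetical, and attributes listed under <code>d</code> are initialised to NaN as described above.

```python
import math

# Attributes that default to NaN when missing (default of parameter d).
DEFAULT_ATTRIBUTES = ("tie", "tde", "tae", "iAA", "pAA", "score")


def parse_attributes(column9, defaults=DEFAULT_ATTRIBUTES):
    """Parse 'key=value;key=value' into a dict, converting numbers to float."""
    attrs = {name: math.nan for name in defaults}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        try:
            attrs[key] = float(value)
        except ValueError:
            attrs[key] = value  # keep non-numeric values as strings
    return attrs


def default_filter(attrs):
    """start=='M' and stop=='*' and score/AA>=0.75 on parsed attributes."""
    aa = attrs.get("AA", math.nan)
    return (
        attrs.get("start") == "M"
        and attrs.get("stop") == "*"
        and attrs["score"] / aa >= 0.75  # comparisons with NaN are False
    )


complete = parse_attributes("ID=tx1;start=M;stop=*;score=300;AA=350;evidence=2")
partial = parse_attributes("ID=tx2;start=L;stop=*;score=300;AA=350")
weak = parse_attributes("ID=tx3;start=M;stop=*;score=100;AA=350")
print(default_filter(complete), default_filter(partial), default_filter(weak))
```

Since any comparison with NaN is false, a prediction that lacks a score is rejected by this criterion unless the filter tests for it explicitly, which is what 'isNaN(score)' is for in the real filter syntax.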
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
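As an illustration of the kind of evidence this tool reports, the Python sketch below computes an intron-support fraction for one transcript. It assumes that <code>tie</code> can be read as the fraction of a transcript's introns that are supported by at least <code>r</code> split reads, and that transcripts without introns get NaN; both are assumptions of this sketch rather than statements from this page, and the function name and data are hypothetical.

```python
import math


def transcript_intron_evidence(transcript_introns, rnaseq_introns, min_reads=1):
    """Fraction of a transcript's introns supported by RNA-seq.

    transcript_introns: list of (contig, strand, start, end) tuples
    rnaseq_introns:     dict mapping the same kind of tuple to the number
                        of supporting split reads
    min_reads:          corresponds to parameter r in the table above
    """
    if not transcript_introns:
        return math.nan  # single-exon transcript: no introns to support
    supported = sum(
        1 for intron in transcript_introns
        if rnaseq_introns.get(intron, 0) >= min_reads
    )
    return supported / len(transcript_introns)


# Hypothetical transcript with three introns, two of them seen in RNA-seq.
introns = [("chr1", "+", 101, 200), ("chr1", "+", 301, 400), ("chr1", "+", 501, 600)]
evidence = {("chr1", "+", 101, 200): 12, ("chr1", "+", 301, 400): 3}
print(transcript_intron_evidence(introns, evidence))
print(transcript_intron_evidence(introns, evidence, min_reads=5))
```

Raising <code>min_reads</code> lowers the fraction for weakly supported introns, which shows how the <code>r</code> parameter influences the reported evidence.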
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool allows to predict UTRs and to rename predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
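The usage line above passes parameters as <code>key=value</code> pairs. The following is a minimal Python sketch of how such a call could be assembled in a script; the helper name <code>build_gemoma_command</code>, the jar file name, and all file names are hypothetical placeholders, and only the parameter keys (<code>a</code>, <code>rename</code>, <code>p</code>, <code>d</code>, <code>outdir</code>) are taken from the table above.

```python
def build_gemoma_command(jar, module, params):
    """Return an argument list for `java -jar <jar> CLI <module> key=value ...`.

    params is a list of (key, value) tuples rather than a dict, because some
    parameters (e.g. the introns file i) may be given multiple times.
    """
    args = ["java", "-jar", jar, "CLI", module]
    for key, value in params:
        args.append(f"{key}={value}")
    return args


# Hypothetical file names; the parameter keys come from the table above.
command = build_gemoma_command(
    "GeMoMa.jar",  # placeholder for GeMoMa-<version>.jar
    "AnnotationFinalizer",
    [
        ("a", "filtered_predictions.gff"),
        ("rename", "SIMPLE"),
        ("p", "gene_"),
        ("d", 5),
        ("outdir", "finalized"),
    ],
)
print(" ".join(command))
```

The resulting list can be handed to <code>subprocess.run(command, check=True)</code>. How nested parameters such as those below <code>u=YES</code> have to be ordered is not covered by this sketch and should be checked against the help output of the tool.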
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
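These attributes can also be used for filtering by hand. The following is a minimal Python sketch, assuming the usual GFF3 layout (nine tab-separated columns, attributes as <code>key=value</code> pairs separated by semicolons) and the default feature tag <code>prediction</code>. The function names, the example lines, and the threshold are hypothetical; thresholds have to be given on the scale that is actually used in your file.

```python
def parse_attributes(column):
    """Split a GFF3 attribute column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def select_predictions(lines, key, threshold, feature="prediction"):
    """Return the lines of the given feature type whose attribute `key` is >= threshold.

    Lines without the attribute, or with a non-numeric value such as NA, are skipped.
    """
    selected = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9 or columns[2] != feature:
            continue
        value = parse_attributes(columns[8]).get(key)
        try:
            if value is not None and float(value) >= threshold:
                selected.append(line)
        except ValueError:
            continue  # e.g. NA for single coding exon genes
    return selected


# Hypothetical example lines, not actual GeMoMa output.
example = [
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1;tie=1;iAA=0.91",
    "chr1\tGeMoMa\tprediction\t2000\t2600\t.\t-\t.\tID=t2;tie=NA;iAA=0.40",
    "chr1\tGeMoMa\tCDS\t100\t300\t.\t+\t0\tParent=t1",
]
kept = select_predictions(example, "iAA", 0.5)
```

This sketch only selects the transcript-level lines; the corresponding CDS lines would have to be collected via their <code>Parent</code> attribute. For routine filtering, GAF is the intended tool.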
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. Ideally, the fasta comments should only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., too low relative score, missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
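The format described in the last answer can be checked before running GeMoMa. The following is a minimal Python sketch under the stated assumptions: two tab-separated columns, a comma-separated list of triplets in the second column, and only the bases A, C, G, and T. The function name <code>parse_genetic_code</code> and the example content are hypothetical, and the sketch assumes that the file contains no header or comment lines.

```python
def parse_genetic_code(text):
    """Parse a two-column, tab-delimited genetic code table into a codon -> amino acid dict.

    Raises ValueError for malformed lines, triplets that are not made of
    A, C, G, T (e.g. RNA triplets containing U), and triplets listed twice.
    """
    codon_to_amino_acid = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        amino_acid, triplets = columns[0].strip(), columns[1]
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid triplet {triplet!r}")
            if triplet in codon_to_amino_acid:
                raise ValueError(f"line {line_number}: triplet {triplet} listed twice")
            codon_to_amino_acid[triplet] = amino_acid
    return codon_to_amino_acid


# Hypothetical excerpt, not the complete default genetic code.
excerpt = "M\tATG\nW\tTGG\nF\tTTT,TTC\n"
code = parse_genetic_code(excerpt)
```

A table copied from the NCBI page mentioned above would be rejected by this check until U has been replaced by T.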
<br />
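The first check from the Extractor question in the FAQs, comparing contig/chromosome names between annotation and genome file, can be sketched in Python as follows. The function names and example lines are hypothetical. The sketch assumes that the sequence name in a FASTA comment is the text between <code>&gt;</code> and the first whitespace, and that the first tab-separated column of a GFF/GTF line holds the sequence name.

```python
def fasta_names(lines):
    """Collect sequence names: the text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_sequence_names(lines):
    """Collect the first column of all non-comment, non-empty GFF/GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def missing_in_fasta(annotation_lines, fasta_lines):
    """Return sequence names used in the annotation but absent from the FASTA file."""
    return sorted(annotation_sequence_names(annotation_lines) - fasta_names(fasta_lines))


# Hypothetical example content.
fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
annotation = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1",
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2",
]
problems = missing_in_fasta(annotation, fasta)
```

For files on disk, the open file handles can be passed directly instead of the lists, e.g. <code>missing_in_fasta(open("annotation.gff"), open("genome.fa"))</code>. A non-empty result points to names such as <code>Chr3</code> versus <code>chr3</code> that differ between the two files.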
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa&diff=1070
GeMoMa
2020-03-09T08:19:34Z
<p>Keilwagen: /* GeMoMa Annotation Filter (GAF) */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever allowing to retrieve data for reference organisms easily from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks on a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
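As a rough illustration of how the parameters above combine, the following sketch assembles a GeMoMaPipeline call for one reference species (<code>s=own</code>) with mapped RNA-seq reads (<code>r=MAPPED</code>). The jar version and all file and directory names are placeholders, and the <code>run</code> helper is a hypothetical convenience that only prints the command unless <code>DRY_RUN=0</code> is set; it is not part of GeMoMa.

```shell
# Hypothetical dry-run helper (not part of GeMoMa):
# prints the command by default, executes it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholder jar name; adjust to the installed version.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-1.9.jar}"

gemoma_pipeline() {
  # t: target genome; s=own with i, a, g: one reference species;
  # r=MAPPED with ERE.m: mapped RNA-seq reads; tblastn=false selects mmseqs.
  # AnnotationFinalizer.r=NO is used here because COMPOSED (the default)
  # additionally needs a prefix (AnnotationFinalizer.p).
  run java -jar "$GEMOMA_JAR" CLI GeMoMaPipeline \
    threads=4 \
    outdir=pipeline_out \
    t=target.fa \
    s=own i=ref1 a=ref1.gff g=ref1.fa \
    r=MAPPED ERE.m=rnaseq.bam \
    tblastn=false \
    AnnotationFinalizer.r=NO
}

gemoma_pipeline
```

Further reference species could presumably be added by repeating the <code>s=own i=... a=... g=...</code> group, since <code>s</code> may be used multiple times.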
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
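Because <code>m</code> may be given multiple times, a call with several BAM files can be assembled in a loop. The sketch below is only an illustration: the jar version, the BAM file names and the output directory are placeholders, and the <code>run</code> helper is a hypothetical dry-run wrapper that prints the command unless <code>DRY_RUN=0</code> is set.

```shell
# Hypothetical dry-run helper (not part of GeMoMa):
# prints the command by default, executes it only if DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholder jar name; adjust to the installed version.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-1.9.jar}"

ere() {
  # Build one m=<file> argument per BAM/SAM file passed to this function.
  m_args=""
  for bam in "$@"; do
    m_args="$m_args m=$bam"
  done
  # $m_args is intentionally unquoted so that it splits into separate
  # arguments (this assumes file names without whitespace).
  run java -jar "$GEMOMA_JAR" CLI ERE s=FR_FIRST_STRAND $m_args c=true mmq=40 outdir=ere_out
}

ere leaf.bam root.bam
```

Setting <code>c=true</code> requests the coverage output in addition to the introns, which is useful if the coverage is needed in a later step.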
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected patterns of di-nucleotides at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
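To make the idea of splice-site di-nucleotides concrete, the following sketch extracts the first two and last two bases of an intron from a sequence; canonical introns typically show GT at the donor and AG at the acceptor site. This is not the implementation used by CheckIntrons: the function <code>splice_dinucleotides</code> is a hypothetical helper, it assumes 1-based inclusive intron coordinates, and it only handles introns on the forward strand.

```shell
# Hypothetical helper (not part of GeMoMa): print the donor and acceptor
# di-nucleotides of a forward-strand intron.
#   $1 = sequence, $2 = intron start, $3 = intron end (1-based, inclusive)
splice_dinucleotides() {
  printf '%s\n' "$1" | awk -v s="$2" -v e="$3" '{
    donor    = toupper(substr($0, s, 2))      # first two bases of the intron
    acceptor = toupper(substr($0, e - 1, 2))  # last two bases of the intron
    print donor "-" acceptor
  }'
}

# Toy sequence with an intron from position 7 to 18.
splice_dinucleotides "ATGGCCGTAAGTTTTCAGGCTTAA" 7 18
```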
<br />
=== DenoiseIntrons ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
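A minimal sketch of an Extractor call, using only parameters from the table above. The file and directory names are hypothetical placeholders, and <code>&lt;version&gt;</code> has to be replaced by the release you actually use. The snippet prints the command first and only executes it if the jar is present.

```shell
# Placeholder: substitute the GeMoMa release you downloaded.
GEMOMA_JAR="GeMoMa-<version>.jar"

# a = reference annotation, g = reference genome, p = also return proteins.
# All file and directory names below are hypothetical.
set -- java -jar "$GEMOMA_JAR" CLI Extractor \
    a=reference.gff g=reference.fasta p=true outdir=extractor_out

# Print the assembled command for inspection.
echo "$*"

# Execute only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then
    "$@"
fi
```

Parameters that are left out (for instance <code>r</code> or <code>Ambiguity</code>) keep the defaults listed in the table.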
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag for switching between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user-given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except as the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
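
As a hedged sketch, a GeMoMa call with the required inputs from the table above might look as follows. All file names are hypothetical placeholders for your search results, target genome, and the CDS parts and assignment files of the reference. <code>&lt;version&gt;</code> again stands for your release. The command is printed and only executed if the jar is present.

```shell
# Placeholder: substitute the GeMoMa release you downloaded.
GEMOMA_JAR="GeMoMa-<version>.jar"

# s = search results (e.g., from tblastn or mmseqs), t = target genome,
# c = CDS parts that were searched, a = assignment of CDS parts to transcripts.
# All file and directory names below are hypothetical.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMa \
    s=search_results.tabular t=target_genome.fasta \
    c=cds-parts.fasta a=assignment.tabular outdir=gemoma_out

# Print the assembled command for inspection.
echo "$*"

# Execute only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then
    "$@"
fi
```

RNA-seq evidence can be added with the repeatable <code>i</code> and <code>coverage</code> parameters described in the table.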
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
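The following sketch combines two GeMoMa prediction files with GAF and passes the more sophisticated filter quoted in the table above. The input file names are hypothetical, and <code>&lt;version&gt;</code> is a placeholder for your release. Because the filter contains spaces, quotes, <code>*</code> and <code>&gt;</code>, it is wrapped in double quotes so that the shell hands it to GAF as one argument.

```shell
# Placeholder: substitute the GeMoMa release you downloaded.
GEMOMA_JAR="GeMoMa-<version>.jar"

# g may be given multiple times, once per prediction file (hypothetical names).
# f is the filter expression from the parameter table; the double quotes keep
# it a single argument and protect '*' and '>' from the shell.
set -- java -jar "$GEMOMA_JAR" CLI GAF \
    g=predictions_species1.gff g=predictions_species2.gff \
    "f=start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)" \
    outdir=gaf_out

# Print the assembled command for inspection.
echo "$*"

# Execute only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then
    "$@"
fi
```

Without the quotes, the shell would split the filter at each space and interpret <code>&gt;</code> as an output redirection, so GAF would not receive the intended expression.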
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with an existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
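A hedged sketch of an AnnotationEvidence call with unstranded coverage. It assumes that the nested parameter <code>coverage_unstranded</code> is given together with the selection <code>c=UNSTRANDED</code>, as the table above suggests. All file names are hypothetical, and <code>&lt;version&gt;</code> is a placeholder for your release.

```shell
# Placeholder: substitute the GeMoMa release you downloaded.
GEMOMA_JAR="GeMoMa-<version>.jar"

# a = annotation to be evaluated, g = genome, i = introns from RNA-seq,
# c selects the coverage type and coverage_unstranded names the coverage file,
# ao=true also returns the annotation with attributes tie, tpc, and AA.
# All file and directory names below are hypothetical.
set -- java -jar "$GEMOMA_JAR" CLI AnnotationEvidence \
    a=annotation.gff g=genome.fasta i=introns.gff \
    c=UNSTRANDED coverage_unstranded=coverage.bedgraph \
    ao=true outdir=evidence_out

# Print the assembled command for inspection.
echo "$*"

# Execute only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then
    "$@"
fi
```

For stranded libraries, the same pattern would use <code>c=STRANDED</code> with <code>coverage_forward</code> and <code>coverage_reverse</code> instead.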
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
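<br />
If you prefer to inspect these attributes by hand, the following minimal Python sketch shows one way to read them from the ninth GFF column and to apply a simple filter. The function names, the example line and the thresholds are hypothetical and not part of GeMoMa; the sketch assumes that tie and iAA are written on a scale from 0 to 1 and that missing evidence is reported as NA, so please check your own GFF before reusing it.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def as_float(attributes, key):
    """Return the attribute as float, or None if it is absent, NA or not numeric."""
    value = attributes.get(key)
    if value is None or value == "NA":
        return None
    try:
        return float(value)
    except ValueError:
        return None


def keep_prediction(gff_line, min_tie=1.0, min_iaa=0.5):
    """Keep a line unless an available tie or iAA value is below its threshold.

    Missing values (e.g. tie=NA, or no iAA because no query proteins were
    given) never remove a prediction; this is a design choice of the sketch.
    """
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return False
    attributes = parse_attributes(columns[8])
    tie = as_float(attributes, "tie")
    iaa = as_float(attributes, "iAA")
    tie_ok = tie is None or tie >= min_tie
    iaa_ok = iaa is None or iaa >= min_iaa
    return tie_ok and iaa_ok


# Hypothetical prediction line, only for illustration.
example = "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0;score=250;tie=1;iAA=0.82;pAA=0.9"
print(keep_prediction(example))
```

For routine use, GAF offers a flexible filter on the same attributes, so a script like this is mainly useful for quick inspection.<br />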
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. too low relative score, missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically you should set: '''-Xms''' the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''' the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter query protein. Again you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
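<br />
Before passing a custom genetic code to GeMoMa, it can help to check that the file follows the format described above. The following Python function is a minimal sketch of such a check and is not part of GeMoMa; the function name and the example fragment are hypothetical, and the rule that each triplet may occur only once is our own assumption.<br />

```python
def parse_genetic_code(text):
    """Parse a two-column, tab-delimited genetic code table.

    Column 1: one-letter amino acid code.
    Column 2: comma-separated list of DNA triplets (A, C, G, T only).
    Returns a dict mapping each triplet to its amino acid.
    """
    codon_to_aa = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        amino_acid = columns[0].strip()
        for triplet in columns[1].split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid DNA triplet '{triplet}'")
            if triplet in codon_to_aa:
                raise ValueError(f"line {line_number}: triplet '{triplet}' listed twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa


# Small fragment of the standard code, only for illustration.
fragment = "M\tATG\nW\tTGG\nF\tTTT,TTC\n"
code = parse_genetic_code(fragment)
print(code["TTC"])
```

A table copied from an RNA-based source fails this check because of the base U, which is the most likely mistake when adapting one of the alternative codes.<br />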
<br />
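The first FAQ entry above recommends checking that the contig/chromosome names in the annotation and the genome file are identical. A minimal Python sketch of this check could look as follows. The function names and the example data are hypothetical; the sketch uses the first whitespace-separated word of each FASTA comment as sequence name, in line with the behaviour described for GeMoMa 1.4 and later.<br />

```python
def fasta_names(fasta_text):
    """Collect sequence names: the first word of each FASTA comment line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_seqids(gff_text):
    """Collect the first column of all non-comment GFF/GTF lines."""
    seqids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def missing_in_genome(fasta_text, gff_text):
    """Return annotation sequence names that do not occur in the genome."""
    return sorted(gff_seqids(gff_text) - fasta_names(fasta_text))


# Hypothetical toy data, only for illustration.
genome = ">chr1 some description\nACGT\n>chr2\nACGT\n"
annotation = (
    "##gff-version 3\n"
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1\n"
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2\n"
)
print(missing_in_genome(genome, annotation))
```

Any name reported by such a check points to annotation entries that cannot be matched to a sequence, for instance because of a differing prefix or capitalization.<br />
<br />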
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1069GeMoMa2020-03-09T08:18:18Z<p>Keilwagen: /* Gene Model Mapper (GeMoMa) */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a time-out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever allowing to retrieve data for reference organisms easily from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
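
As a rough illustration of how the parameters in the table above combine on the command line, the following sketch assembles a pipeline call with a pre-extracted reference, mapped RNA-seq reads and a custom GAF filter. All file names (target.fa, cds-parts.fasta, assignment.tabular, rnaseq.bam) and the helper <code>print_args</code> are hypothetical. It assumes that every key=value pair is passed as one shell word and that the keys of a reference species block follow <code>s</code>; further keys of that block may be required. The sketch only prints the arguments and does not start GeMoMa.

```shell
jar="GeMoMa-1.9.jar"   # jar name as in the pipeline call shown for this tool

# The filter contains spaces, quotes and parentheses, so it is kept in one variable
# and later passed inside double quotes as a single argument.
gaf_filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# Dry-run helper (hypothetical): print each argument on its own line.
print_args() { printf '%s\n' "$@"; }

print_args java -jar "$jar" CLI GeMoMaPipeline \
  t=target.fa \
  s=pre-extracted c=cds-parts.fasta a=assignment.tabular \
  r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=rnaseq.bam \
  "GAF.f=$gaf_filter" \
  threads=4 outdir=results
```

Dropping <code>print_args</code> from the last command would turn the dry run into a real call. The double quotes around <code>GAF.f</code> keep the whole filter expression together as one argument.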
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
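
A minimal sketch of an ERE call for stranded reads from two mapped read files, using only the parameters listed above. The BAM file names, the output directory and the jar version are placeholders, and the sketch assumes that <code>m</code> is simply repeated once per file. The command is stored in a variable and printed instead of being executed.

```shell
jar="GeMoMa-1.9.jar"   # assumed version; substitute the jar you downloaded

# m is given once per BAM/SAM file; c=true additionally requests coverage output.
ere_cmd="java -jar $jar CLI ERE s=FR_FIRST_STRAND m=leaf.bam m=root.bam c=true mmq=40 outdir=ere_out"

# Dry run: show the command; it could be executed with: eval "$ere_cmd"
echo "$ere_cmd"
```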
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected patterns of di-nucleotides at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
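
A short sketch of a CheckIntrons call built from the parameters above. The genome and intron file names are placeholders (the intron GFF is assumed to come from an earlier ERE run), as are the jar version and the output directory. The command is printed, not executed.

```shell
jar="GeMoMa-1.9.jar"            # assumed version
introns="ere_out/introns.gff"   # placeholder path to an intron GFF

# t is the target genome, i the intron file, v=true asks for verbose output.
check_cmd="java -jar $jar CLI CheckIntrons t=target.fa i=$introns v=true outdir=check_out"

# Dry run: show the command; it could be executed with: eval "$check_cmd"
echo "$check_cmd"
```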
<br />
=== DenoiseIntrons ===<br />
This tool removes potentially incorrectly extracted introns. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
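
The coverage parameter <code>c</code> is a selection: its value decides which coverage files must follow. The sketch below shows one way to assemble these arguments for both selections. The helper <code>coverage_args</code> and all file names are hypothetical, and it assumes that the file parameters of a selection are given directly after <code>c</code>. The resulting command is printed, not executed.

```shell
jar="GeMoMa-1.9.jar"   # assumed version

# Hypothetical helper: emit the coverage arguments for the chosen selection.
coverage_args() {
  case "$1" in
    UNSTRANDED) echo "c=UNSTRANDED coverage_unstranded=$2" ;;
    STRANDED)   echo "c=STRANDED coverage_forward=$2 coverage_reverse=$3" ;;
    *)          echo "unknown coverage type: $1" >&2; return 1 ;;
  esac
}

# Stranded example with placeholder coverage files and the documented default thresholds.
denoise_cmd="java -jar $jar CLI DenoiseIntrons i=introns.gff $(coverage_args STRANDED fwd.bedgraph rev.bedgraph) me=0.01 context=10 outdir=denoise_out"

# Dry run: show the command; it could be executed with: eval "$denoise_cmd"
echo "$denoise_cmd"
```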
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the extraction to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sil</font></td><br />
<td>static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. If columns 2 to 5 contain chromosome, strand, start and end of a region, they are used to determine a target region that should be overlapped by the prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
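
The call syntax above can be assembled programmatically. The following is only a minimal sketch in Python: it builds the argument list for one GeMoMa run from the parameter keys listed in the table (s, t, c, a, p, ct, outdir) and prints it without executing anything. All file names and the helper name gemoma_command are hypothetical placeholders, not part of GeMoMa.

```python
import shlex


def gemoma_command(jar, search_results, target_genome, cds_parts,
                   assignment=None, predictions=None,
                   contig_threshold=None, outdir=None):
    """Return the argument list for one GeMoMa run (nothing is executed)."""
    # s, t and c have neither a default nor the OPTIONAL flag in the table above.
    args = ["java", "-jar", jar, "CLI", "GeMoMa",
            f"s={search_results}", f"t={target_genome}", f"c={cds_parts}"]
    # Parameters left as None are omitted, so GeMoMa uses its documented defaults.
    optional = {"a": assignment, "p": predictions,
                "ct": contig_threshold, "outdir": outdir}
    for key, value in optional.items():
        if value is not None:
            args.append(f"{key}={value}")
    return args


# Hypothetical input files; replace them with your own paths.
cmd = gemoma_command("GeMoMa-1.9.jar", "search.tabular", "target.fa",
                     "cds-parts.fasta", assignment="assignment.tabular",
                     predictions=10, contig_threshold=0.4)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once the jar and the input files exist.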
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
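
To make the two default filter expressions concrete, here is a small sketch in Python that re-implements them by hand for a dictionary of GFF attributes. It is not GAF's own filter parser, only an illustration of what the documented defaults decide. It assumes that score, AA, tie and evidence are present and numeric; the function names and the example values are hypothetical.

```python
def passes_default_filter(attrs):
    """Hand-written equivalent of: start=='M' and stop=='*' and score/AA>=0.75."""
    complete = attrs["start"] == "M" and attrs["stop"] == "*"
    relative_score = float(attrs["score"]) / float(attrs["AA"])
    return complete and relative_score >= 0.75


def passes_alternative_filter(attrs):
    """Hand-written equivalent of the default 'atf': tie==1 or evidence>1."""
    return float(attrs["tie"]) == 1.0 or int(attrs["evidence"]) > 1


# Two made-up predictions: the first is complete with a relative score of 0.824,
# the second is complete but only reaches a relative score of 0.6.
good = {"start": "M", "stop": "*", "score": "412", "AA": "500",
        "tie": "1", "evidence": "1"}
weak = {"start": "M", "stop": "*", "score": "300", "AA": "500",
        "tie": "0.5", "evidence": "1"}

for name, attrs in (("good", good), ("weak", weak)):
    print(name, passes_default_filter(attrs), passes_alternative_filter(attrs))
```

Predictions without the required attributes (for instance tie when no intron evidence was supplied) would need extra handling that this sketch leaves out.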
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
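
The attributes in the table can be read from the ninth column of a GFF line. The following Python sketch shows one way to do this, assuming GFF3-style key=value pairs separated by semicolons. The example line, its ID and all of its values are invented for illustration and do not come from a real GeMoMa run.

```python
def parse_gff_attributes(line):
    """Return the attributes (column 9) of one tab-separated GFF line as a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    attributes = {}
    for field in columns[8].split(";"):
        # Skip empty fields, e.g. those caused by a trailing semicolon.
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


# Invented prediction line; the third column holds the tag (default: prediction).
example = "\t".join([
    "contig1", "GeMoMa", "prediction", "100", "900", ".", "+", ".",
    "ID=tx1_R0;score=412;tie=1;tpc=0.95;minCov=3;avgCov=12.5;iAA=0.81;pAA=0.9",
])

attrs = parse_gff_attributes(example)
print(attrs["score"], attrs["tie"], attrs["tpc"])
```

All values are returned as strings, so numeric attributes such as tie or tpc need to be converted before they are compared.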
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. too low relative score, missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
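: Filtering by hand can be scripted. The sketch below is only an illustration and not a replacement for GAF: it keeps the GFF lines whose numeric attribute (here avgCov) reaches a user-chosen minimum. The function names, the example lines and the threshold of 1.0 are assumptions made for this example; lines that lack the attribute or carry a non-numeric value such as NA are dropped, and child features (e.g. CDS lines) are not handled.<br />

```python
def parse_attributes(column):
    """Parse a GFF attribute column such as 'ID=t1;avgCov=3.5' into a dict."""
    attributes = {}
    for entry in column.strip().split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def filter_by_attribute(lines, key, minimum):
    """Yield the GFF lines whose numeric attribute `key` is at least `minimum`."""
    for line in lines:
        if line.startswith("#"):
            continue  # comments and directives carry no attributes
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        value = parse_attributes(fields[8]).get(key)
        try:
            if float(value) >= minimum:
                yield line
        except (TypeError, ValueError):
            # attribute missing (None) or not numeric (e.g. NA)
            continue


if __name__ == "__main__":
    # Hypothetical prediction lines; only the attribute column matters here.
    example = [
        "c1\tGeMoMa\tprediction\t1\t90\t.\t+\t.\tID=t1;avgCov=12.5\n",
        "c1\tGeMoMa\tprediction\t5\t80\t.\t+\t.\tID=t2;avgCov=0.4\n",
        "c1\tGeMoMa\tprediction\t5\t80\t.\t+\t.\tID=t3;avgCov=NA\n",
    ]
    for kept in filter_by_attribute(example, "avgCov", 1.0):
        print(kept, end="")
```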
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
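: To illustrate what correct phases look like, the following sketch derives the phase of each CDS part from the lengths of the preceding parts. It is not the code used by the Extractor; it assumes that the CDS parts are given in transcript order (5' to 3') and that the first part starts with a complete codon. The function name is hypothetical.<br />

```python
def infer_phases(cds_lengths, first_phase=0):
    """Return the GFF phase of each CDS part.

    The phase is the number of bases that have to be skipped at the start
    of a CDS part to reach the first base of the next complete codon.
    """
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon of this part
        # determine how many bases the next part has to contribute.
        phase = (3 - (length - phase) % 3) % 3
    return phases


if __name__ == "__main__":
    # Three CDS parts of 10, 11 and 9 bp (30 bp in total, i.e. 10 codons).
    print(infer_phases([10, 11, 9]))
```

: An annotation that lists phase 0 for all three parts of this example would be inconsistent, which is the kind of error the repair option is meant to address.<br />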
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
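: If GeMoMa is started from a script, the VM options have to be placed before <code>-jar</code>. A minimal sketch of how such a call could be assembled is given below; the helper name, the jar file name and the heap sizes are placeholders chosen for this example, and the module parameters are passed in the usual name=value form.<br />

```python
import shlex


def gemoma_command(jar, module, initial_heap="5G", max_heap="50G", parameters=None):
    """Build the argument list for a GeMoMa CLI call with explicit heap sizes."""
    # JVM options (-Xms, -Xmx) must precede -jar, otherwise they are
    # handed to the program instead of the virtual machine.
    command = ["java", f"-Xms{initial_heap}", f"-Xmx{max_heap}", "-jar", jar, "CLI", module]
    for name, value in (parameters or {}).items():
        command.append(f"{name}={value}")
    return command


if __name__ == "__main__":
    # Placeholder jar name; use the version you downloaded.
    print(shlex.join(gemoma_command("GeMoMa-1.6.3.jar", "GeMoMa")))
```

: The returned list can be passed to <code>subprocess.run</code> unchanged.<br />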
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
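: A custom table can be checked before it is handed to GeMoMa. The following sketch parses the two-column format described above and rejects triplets that are not written in the DNA alphabet. It is an illustration only: the function name is hypothetical, the example rows are made up, and the use of "*" for stop codons is an assumption that should be compared with the linked default file.<br />

```python
def parse_genetic_code(lines):
    """Map each triplet to the one-letter code given in the first column."""
    codon_to_aa = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # Genomic DNA is expected: A, C, G and T, but not U.
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"invalid triplet {triplet!r}: expected three of A, C, G, T")
            if triplet in codon_to_aa:
                raise ValueError(f"triplet {triplet} is assigned twice")
            codon_to_aa[triplet] = amino_acid.strip()
    return codon_to_aa


if __name__ == "__main__":
    # Made-up excerpt of a table; "*" for stop codons is an assumption.
    rows = ["M\tATG\n", "W\tTGG\n", "*\tTAA,TAG,TGA\n"]
    table = parse_genetic_code(rows)
    print(len(table), table["ATG"], table["TGA"])
```

: A table copied from the NCBI page has to be converted from the RNA to the DNA alphabet first; otherwise a check like this one raises an error.<br />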
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1068GeMoMa2020-03-09T08:16:08Z<p>Keilwagen: /* Denoise */ -> /* DenoiseIntrons */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever, which allows retrieving data for reference organisms easily from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks on a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except as the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (generates generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
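In the pipeline, options of the individual modules carry the module name as a prefix (for instance <code>GeMoMa.Score</code> or <code>GAF.f</code>). The following is a minimal sketch of passing such options from a POSIX shell, not a complete call: the reference species arguments (parameter <code>s</code> and its sub-parameters) are deliberately omitted, and the helper name <code>gemoma_pipeline</code>, the file name <code>target.fa</code> and the directory <code>results</code> are hypothetical. Filter expressions contain quotes, spaces and <code>&gt;</code>, so each one has to be quoted as a single argument.

```shell
# Sketch only: prints the command instead of running it.
# To execute for real, start the script with RUN set to the empty string (RUN= sh script.sh).
GEMOMA_JAR="GeMoMa-1.9.jar"   # jar name as used in the call above; adjust to your version
RUN="${RUN-echo}"

# Hypothetical helper: forwards all key=value arguments to GeMoMaPipeline.
gemoma_pipeline() {
  $RUN java -jar "$GEMOMA_JAR" CLI GeMoMaPipeline "$@"
}

# Reference species arguments are omitted here; add them for a real run.
gemoma_pipeline \
  t=target.fa \
  threads=8 outdir=results \
  GeMoMa.Score=ReAlign \
  AnnotationFinalizer.r=NO \
  "GAF.f=start=='M' and stop=='*' and score/AA>=0.75" \
  "GAF.a=tie==1 or evidence>1"
```

The double quotes keep the shell from splitting the filter at spaces and from treating <code>&gt;</code> as an output redirection.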
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
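As an illustration of the table above, the following sketch assembles an ERE call for two mapped read files from a stranded library. It assumes a POSIX shell; the helper name <code>ere</code>, the BAM file names and the output directory are hypothetical, and the jar name has to match the downloaded version.

```shell
# Sketch only: prints the command instead of running it.
# To execute for real, start the script with RUN set to the empty string (RUN= sh script.sh).
GEMOMA_JAR="GeMoMa-1.9.jar"   # adjust to your version
RUN="${RUN-echo}"

# Hypothetical helper: forwards all key=value arguments to ERE.
ere() {
  $RUN java -jar "$GEMOMA_JAR" CLI ERE "$@"
}

# m can be used multiple times, once per BAM/SAM file.
# c=true additionally writes the coverage; mmq=40 is the documented default.
ere s=FR_FIRST_STRAND m=sample1.bam m=sample2.bam c=true mmq=40 outdir=ere_out
```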
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
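A minimal sketch of a CheckIntrons call built from the parameters above, assuming a POSIX shell. The helper name <code>check_introns</code> and all file and directory names are hypothetical placeholders.

```shell
# Sketch only: prints the command instead of running it.
# To execute for real, start the script with RUN set to the empty string (RUN= sh script.sh).
GEMOMA_JAR="GeMoMa-1.9.jar"   # adjust to your version
RUN="${RUN-echo}"

# Hypothetical helper: forwards all key=value arguments to CheckIntrons.
check_introns() {
  $RUN java -jar "$GEMOMA_JAR" CLI CheckIntrons "$@"
}

# t: target genome (FASTA); i: intron GFF, can be used multiple times.
check_introns t=target.fa i=introns_a.gff i=introns_b.gff v=true outdir=check_out
```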
<br />
=== DenoiseIntrons ===<br />
This tool removes potentially incorrectly extracted introns. You can run DenoiseIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI DenoiseIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
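The sketch below shows a possible stand-alone DenoiseIntrons call with stranded coverage, assuming a POSIX shell and assuming that a selection such as <code>c=STRANDED</code> is followed by the parameters of the chosen selection as further key=value pairs. The helper name <code>denoise_introns</code> and all file and directory names are hypothetical. Note that the stand-alone tool uses <code>me</code> and <code>context</code>, whereas the pipeline exposes the same settings as <code>Denoise.m</code> and <code>Denoise.c</code>; here <code>m</code> is the maximum intron length.

```shell
# Sketch only: prints the command instead of running it.
# To execute for real, start the script with RUN set to the empty string (RUN= sh script.sh).
GEMOMA_JAR="GeMoMa-1.9.jar"   # adjust to your version
RUN="${RUN-echo}"

# Hypothetical helper: forwards all key=value arguments to DenoiseIntrons.
denoise_introns() {
  $RUN java -jar "$GEMOMA_JAR" CLI DenoiseIntrons "$@"
}

# The values for m, me and context are the documented defaults, written out for clarity.
denoise_introns \
  i=introns.gff \
  c=STRANDED coverage_forward=cov_fwd.bedgraph coverage_reverse=cov_rev.bedgraph \
  m=15000 me=0.01 context=10 \
  outdir=denoise_out
```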
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file that restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
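A minimal sketch of a stand-alone Extractor call, assuming a POSIX shell; the helper name <code>extractor</code> and all file and directory names are hypothetical. The stand-alone defaults differ from the pipeline: here <code>p</code> defaults to false and <code>Ambiguity</code> to EXCEPTION, while the pipeline uses <code>Extractor.p</code> = true and <code>Extractor.a</code> = AMBIGUOUS. The example sets both explicitly to mirror the pipeline behaviour.

```shell
# Sketch only: prints the command instead of running it.
# To execute for real, start the script with RUN set to the empty string (RUN= sh script.sh).
GEMOMA_JAR="GeMoMa-1.9.jar"   # adjust to your version
RUN="${RUN-echo}"

# Hypothetical helper: forwards all key=value arguments to Extractor.
extractor() {
  $RUN java -jar "$GEMOMA_JAR" CLI Extractor "$@"
}

# a: reference annotation (GFF or GTF); g: reference genome (FASTA).
extractor \
  a=reference.gff g=reference.fa \
  p=true Ambiguity=AMBIGUOUS \
  f=true outdir=extractor_out
```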
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
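The layout of the list file for the parameter "selected" can be illustrated with a minimal Python sketch. It assumes a tab-delimited file (the description above only fixes the column order, not the delimiter), and the transcript IDs and coordinates are made up for illustration.

```python
import csv
import io

def write_selected(rows, handle):
    """Write one line per transcript: the transcript ID, optionally
    followed by chromosome, strand, start and end of the target region."""
    # Assumption: columns are separated by tabs.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if len(row) not in (1, 5):
            raise ValueError("expected 1 or 5 columns, got %d" % len(row))
        writer.writerow(row)

# Hypothetical transcript IDs and regions, for illustration only.
rows = [
    ("transcript_A",),                             # ID only, no target region
    ("transcript_B", "chr1", "+", 10000, 25000),   # ID plus target region
]
buffer = io.StringIO()
write_selected(rows, buffer)
selected_text = buffer.getvalue()
```

Writing to a real file instead of the in-memory buffer only requires passing an opened file handle; the resulting path would then be given to GeMoMa via <code>selected=...</code>.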
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
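To make the default filter more concrete, the following Python sketch spells out the same three criteria (start=='M', stop=='*', score/AA>=0.75) for a prediction whose GFF attributes have already been read into a dictionary. It is only an illustration of the logic, not GAF's own filter parser, and the attribute values are invented.

```python
def passes_default_filter(attrs, min_relative_score=0.75):
    """Return True if a prediction is complete and scores well enough.

    attrs maps GFF attribute names (start, stop, score, AA) to their values.
    """
    # Completeness: the prediction starts with M and ends with a stop (*).
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    # Relative score: GeMoMa score divided by the number of amino acids.
    relative_score = float(attrs["score"]) / float(attrs["AA"])
    return complete and relative_score >= min_relative_score

# Hypothetical attribute values, for illustration only.
complete_prediction = {"start": "M", "stop": "*", "score": "450", "AA": "500"}
low_score_prediction = {"start": "M", "stop": "*", "score": "300", "AA": "500"}
partial_prediction = {"start": "L", "stop": "*", "score": "450", "AA": "500"}

results = [
    passes_default_filter(complete_prediction),   # relative score 0.9
    passes_default_filter(low_score_prediction),  # relative score 0.6
    passes_default_filter(partial_prediction),    # missing start codon
]
```

Only the first of the three example predictions would be kept under these criteria.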
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
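As an illustration of how an attribute such as tie can be read, here is a minimal Python sketch that computes the fraction of predicted introns with RNA-seq support. It assumes that introns are represented as (contig, strand, start, end) tuples and that the value is expressed as a fraction between 0 and 1, in line with filter expressions such as tie==1 used elsewhere on this page. Returning None for transcripts without introns is an assumption of this sketch, not a statement about GeMoMa's output.

```python
def transcript_intron_evidence(predicted_introns, supported_introns):
    """Fraction of predicted introns that are supported by RNA-seq introns."""
    predicted = list(predicted_introns)
    if not predicted:
        # Assumption: no value is defined for single-exon predictions.
        return None
    supported = set(supported_introns)
    hits = sum(1 for intron in predicted if intron in supported)
    return hits / len(predicted)

# Hypothetical introns as (contig, strand, start, end), for illustration only.
predicted = [
    ("chr1", "+", 1200, 1450),
    ("chr1", "+", 1600, 2100),
    ("chr1", "+", 2300, 2900),
]
rna_seq = {
    ("chr1", "+", 1200, 1450),
    ("chr1", "+", 2300, 2900),
    ("chr2", "-", 500, 800),
}
tie = transcript_intron_evidence(predicted, rna_seq)
```

In this example two of the three predicted introns are supported, so the prediction would not pass a filter requiring tie==1.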
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
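A minimal Python sketch of the combination step is given below. It only assembles the GAF command line from the parameters documented in the GAF section (p, w, g, outdir); the jar name, prefixes, weights and file names are hypothetical, and placing each prefix and weight directly before its GFF file is an assumption about how repeated parameters are paired.

```python
def gaf_command(jar, predictions, outdir="."):
    """Build the argument list for one GAF run over several GeMoMa outputs.

    predictions is a list of (prefix, weight, gff_file) tuples, one per
    reference organism.
    """
    argv = ["java", "-jar", jar, "CLI", "GAF"]
    for prefix, weight, gff in predictions:
        # Assumption: prefix and weight precede the file they belong to.
        argv += ["p=" + prefix, "w=" + str(weight), "g=" + gff]
    argv.append("outdir=" + outdir)
    return argv

# Hypothetical jar and file names, for illustration only.
command = gaf_command(
    "GeMoMa.jar",
    [("refA_", 1.0, "refA/predicted_annotation.gff"),
     ("refB_", 2.0, "refB/predicted_annotation.gff")],
    outdir="combined",
)
```

The resulting list could be passed to <code>subprocess.run</code> once the individual GeMoMa runs have finished.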
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options that raise the memory limits. More specifically, you should set '''-Xms''', the initially used RAM, e.g., to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but will not include all statistics. Another possible cause is the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two-column, tab-delimited table, where the first column is the one-letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as a template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
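
The table format described in the last answer can be checked before handing a file to GeMoMa. The following Python snippet is only a minimal sketch and not part of GeMoMa: the function name <code>parse_genetic_code</code> and the three-line toy excerpt are hypothetical, and it assumes that the file contains no header or comment lines.

```python
from io import StringIO


def parse_genetic_code(handle):
    """Read a two-column, tab-delimited genetic code table.

    Column 1: one-letter amino acid code.
    Column 2: comma-separated list of DNA triplets (A, C, G, T only).
    Returns a dict that maps each triplet to its amino acid.
    """
    codon_to_aa = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # reject RNA letters (U) and anything that is not a triplet
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {line_no}: invalid DNA triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {line_no}: triplet {triplet} listed twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa


# Toy excerpt for illustration only, not the complete default code.
example = "M\tATG\nW\tTGG\nK\tAAA,AAG\n"
code = parse_genetic_code(StringIO(example))
print(code)
```

A table copied from the NCBI page would fail this check until every U has been replaced by T, which matches the DNA alphabet that GeMoMa expects.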
<br />
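Filtering predictions by hand using the GFF attributes, as mentioned in the FAQ on overlapping predictions, can be sketched as follows. This is a minimal illustration and not a replacement for GAF: the helper names, the toy GFF lines, and the threshold of 1.0 are hypothetical. It assumes tab-separated GFF lines with <code>key=value</code> attributes separated by semicolons and uses the default tag <code>prediction</code> in the third column.

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def filter_by_attribute(gff_lines, key, minimum, feature_type="prediction"):
    """Keep lines of the given feature type whose numeric attribute is >= minimum."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue  # only transcript-level lines are inspected
        value = parse_attributes(columns[8]).get(key)
        try:
            if value is not None and float(value) >= minimum:
                kept.append(line)
        except ValueError:
            pass  # non-numeric values such as NA never pass the filter
    return kept


# Toy lines for illustration only; real GeMoMa output carries more attributes.
toy_gff = [
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=toy1;avgCov=12.5",
    "chr1\tGeMoMa\tprediction\t2000\t2600\t.\t-\t.\tID=toy2;avgCov=0.2",
    "chr1\tGeMoMa\tprediction\t3000\t3900\t.\t+\t.\tID=toy3;avgCov=NA",
]
kept = filter_by_attribute(toy_gff, key="avgCov", minimum=1.0)
print([parse_attributes(line.split("\t")[8])["ID"] for line in kept])
```

The sketch returns only the transcript-level lines, so the CDS lines that belong to a kept prediction would still have to be collected separately.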
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (04.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (02.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.05.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa&diff=1067
GeMoMa
2020-03-09T08:10:31Z
<p>Keilwagen: /* Version history */ version 1.6.3</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever, which makes it easy to retrieve data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
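
The <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax for the parameters listed above can be illustrated with a small Python sketch that only assembles the command line and does not run it. The helper name <code>build_nrr_command</code>, the jar file name, and the file name <code>reference_list.txt</code> are assumptions for illustration. The format of the reference list itself is not described here and is therefore not covered.

```python
import shlex


def build_nrr_command(jar, reference_list, reference_dir="references/", tries=10, outdir="."):
    """Assemble the argument list for NCBIReferenceRetriever (module name NRR).

    The defaults mirror the parameter table: r=references/, n=10, outdir=.
    """
    if not 1 <= tries <= 100:
        raise ValueError("number of tries must be in the valid range [1, 100]")
    return [
        "java", "-jar", jar, "CLI", "NRR",
        f"r={reference_dir}",
        f"n={tries}",
        f"rl={reference_list}",
        f"outdir={outdir}",
    ]


# Hypothetical file names; adjust them to your installation.
command = build_nrr_command("GeMoMa-1.6.3.jar", "reference_list.txt")
print(shlex.join(command))
```

Keeping the command as a list makes it possible to pass it unchanged to <code>subprocess.run(command, check=True)</code> once Java and the jar file are available.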
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The maximal number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript, valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag that selects whether the search results are trusted as they are, re-scored or re-aligned, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (whether UTRs are predicted using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (whether and how generic gene and transcript names are generated (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows filtering by flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (whether the coverage is written as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
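<br />
As an illustration of the parameters above, a call for stranded data with two BAM files might look as follows. This is a sketch under assumptions: the file names are hypothetical, and repeating the key <code>m</code> is assumed to be how a parameter that "can be used multiple times" is given on the command line.<br />

```shell
#!/bin/sh
# Hypothetical file names; adjust the jar version and paths to your setup.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Stranded library, two mapped-read files, coverage output enabled.
# Assumption: the key m is simply repeated, once per BAM/SAM file.
set -- java -jar "$GEMOMA_JAR" CLI ERE \
  s=FR_FIRST_STRAND \
  m=sample1.bam m=sample2.bam \
  c=true mmq=40 \
  outdir=ere_out
ERE_CMD="$*"
echo "$ERE_CMD"

# Execute only if the jar is present, so the sketch is safe to dry-run.
if [ -f "$GEMOMA_JAR" ]; then
  "$@"
fi
```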
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
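<br />
A minimal sketch of a CheckIntrons call, using only the parameters listed above; the file names are hypothetical placeholders.<br />

```shell
#!/bin/sh
# Hypothetical file names; adjust the jar version and paths to your setup.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Check the splice-site patterns of introns against the target genome.
set -- java -jar "$GEMOMA_JAR" CLI CheckIntrons \
  t=target.fa \
  i=introns.gff \
  v=true \
  outdir=checkintrons_out
CHECKINTRONS_CMD="$*"
echo "$CHECKINTRONS_CMD"

# Execute only if the jar is present, so the sketch is safe to dry-run.
if [ -f "$GEMOMA_JAR" ]; then
  "$@"
fi
```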
<br />
=== Denoise ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run Denoise from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Denoise [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
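<br />
The nested coverage selection above pairs the selector <code>c</code> with the file parameter that belongs to the chosen option. A hedged sketch for unstranded data follows; the file names are hypothetical, and the numeric values simply repeat the documented defaults.<br />

```shell
#!/bin/sh
# Hypothetical file names; adjust the jar version and paths to your setup.
GEMOMA_JAR="GeMoMa-1.9.jar"

# c=UNSTRANDED selects the option; coverage_unstranded supplies its file.
# For stranded data, c=STRANDED would instead be combined with
# coverage_forward and coverage_reverse.
set -- java -jar "$GEMOMA_JAR" CLI Denoise \
  i=introns.gff \
  c=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  m=15000 me=0.01 context=10 \
  outdir=denoise_out
DENOISE_CMD="$*"
echo "$DENOISE_CMD"

# Execute only if the jar is present, so the sketch is safe to dry-run.
if [ -f "$GEMOMA_JAR" ]; then
  "$@"
fi
```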
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
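<br />
A minimal sketch of an Extractor call that additionally requests protein and CDS sequences and enables the repair option; the file names are hypothetical placeholders.<br />

```shell
#!/bin/sh
# Hypothetical file names; adjust the jar version and paths to your setup.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Reference annotation and genome are required; p, c and r are optional flags.
set -- java -jar "$GEMOMA_JAR" CLI Extractor \
  a=reference.gff \
  g=reference.fa \
  p=true c=true r=true \
  outdir=extractor_out
EXTRACTOR_CMD="$*"
echo "$EXTRACTOR_CMD"

# Execute only if the jar is present, so the sketch is safe to dry-run.
if [ -f "$GEMOMA_JAR" ]; then
  "$@"
fi
```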
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The maximal number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript, valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag that enables sorting of the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag that selects whether the search results are trusted as they are, re-scored or re-aligned, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
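<br />
A hedged sketch of a GeMoMa call that combines search results with RNA-seq evidence, using only parameters from the table above. All file names are hypothetical; in particular, the names of the CDS parts and assignment files are placeholders for whatever Extractor produced in your run.<br />

```shell
#!/bin/sh
# Hypothetical file names; adjust the jar version and paths to your setup.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Search results, target genome and CDS parts are the core inputs;
# introns and unstranded coverage add RNA-seq evidence.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMa \
  s=search.tabular \
  t=target.fa \
  c=cds-parts.fasta \
  a=assignment.tabular \
  i=introns.gff r=1 \
  coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  Score=ReAlign \
  outdir=gemoma_out
GEMOMA_CMD="$*"
echo "$GEMOMA_CMD"

# Execute only if the jar is present, so the sketch is safe to dry-run.
if [ -f "$GEMOMA_JAR" ]; then
  "$@"
fi
```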
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
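The default value of the filter parameter can be illustrated with a minimal Python sketch. It is not GAF's implementation; it only mimics the expression <code>start=='M' and stop=='*' and score/AA>=0.75</code> on a single prediction. The sketch assumes that the attributes are stored as <code>key=value</code> pairs separated by semicolons in the attribute column of the GFF; the function names and the example attribute strings are hypothetical.

```python
def parse_attributes(column9):
    """Split a GFF attribute column (key=value pairs joined by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_default_filter(attributes, min_relative_score=0.75):
    """Mimic start=='M' and stop=='*' and score/AA>=0.75 for one prediction."""
    try:
        relative_score = float(attributes["score"]) / float(attributes["AA"])
    except (KeyError, ValueError, ZeroDivisionError):
        return False  # assumption of this sketch: unusable attributes do not pass
    return (
        attributes.get("start") == "M"
        and attributes.get("stop") == "*"
        and relative_score >= min_relative_score
    )


# Hypothetical attribute columns of three predictions.
complete = parse_attributes("ID=t1;AA=200;score=180;start=M;stop=*")
partial = parse_attributes("ID=t2;AA=200;score=180;start=L;stop=*")
weak = parse_attributes("ID=t3;AA=200;score=100;start=M;stop=*")

for name, attrs in (("t1", complete), ("t2", partial), ("t3", weak)):
    print(name, passes_default_filter(attrs))
```

Only the first prediction passes: the second lacks a start codon and the third has a relative score of 0.5.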
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
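To make the intron-based attributes more concrete, the following Python sketch computes a value in the spirit of tie and minSplitReads for one predicted transcript. It is an illustration under stated assumptions, not GeMoMa's code: introns are represented as hypothetical <code>(contig, strand, start, end)</code> tuples, RNA-seq introns carry a split-read count, and the result is expressed as a fraction between 0 and 1 (as in the filter <code>tie==1</code>). Returning <code>None</code> for predictions without introns is a convention of this sketch only.

```python
def transcript_intron_evidence(predicted_introns, rnaseq_introns, min_reads=1):
    """Fraction of predicted introns supported by RNA-seq introns.

    predicted_introns: list of (contig, strand, start, end) tuples
    rnaseq_introns: dict mapping such tuples to their split-read counts
    min_reads: minimal number of split reads required for support
    """
    if not predicted_introns:
        return None  # no introns to evaluate (convention of this sketch)
    supported = sum(
        1 for intron in predicted_introns
        if rnaseq_introns.get(intron, 0) >= min_reads
    )
    return supported / len(predicted_introns)


def minimal_split_reads(predicted_introns, rnaseq_introns):
    """Smallest split-read count over all predicted introns (0 if unsupported)."""
    if not predicted_introns:
        return None
    return min(rnaseq_introns.get(intron, 0) for intron in predicted_introns)


# Hypothetical data: four predicted introns, three of them seen in RNA-seq.
predicted = [
    ("chr1", "+", 100, 200),
    ("chr1", "+", 300, 400),
    ("chr1", "+", 500, 600),
    ("chr1", "+", 700, 800),
]
evidence = {
    ("chr1", "+", 100, 200): 12,
    ("chr1", "+", 300, 400): 2,
    ("chr1", "+", 500, 600): 7,
}

print(transcript_intron_evidence(predicted, evidence))
print(transcript_intron_evidence(predicted, evidence, min_reads=5))
print(minimal_split_reads(predicted, evidence))
```

Raising <code>min_reads</code> corresponds to a stricter reads threshold: fewer RNA-seq introns count as evidence, so the fraction can only stay the same or decrease.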
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
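The first check above (identical contig/chromosome names in the annotation and the genome file) can be sketched in Python as follows. This is a hypothetical helper, not part of GeMoMa. It assumes that the sequence name is the first whitespace-separated token of each FASTA comment line and the first tab-separated column of each GFF line; the example data are made up.

```python
def fasta_sequence_names(fasta_text):
    """Collect sequence names: first token of each FASTA comment line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_sequence_names(gff_text):
    """Collect sequence names: first column of each non-comment GFF line."""
    names = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        names.add(line.split("\t")[0])
    return names


def missing_in_genome(fasta_text, gff_text):
    """Names used in the annotation that do not occur in the genome file."""
    return sorted(gff_sequence_names(gff_text) - fasta_sequence_names(fasta_text))


# Hypothetical inputs; in practice the texts would be read from the files.
genome = ">chr1 some additional information\nACGT\n>chr2\nACGT\n"
annotation = (
    "##gff-version 3\n"
    "chr1\t.\tCDS\t1\t3\t.\t+\t0\tParent=t1\n"
    "Chr3\t.\tCDS\t1\t3\t.\t+\t0\tParent=t2\n"
)

print(missing_in_genome(genome, annotation))
```

Any name reported here occurs in the annotation but not in the genome file, which would explain why the corresponding genes cannot be extracted.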
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing the contig threshold increases the number of predictions that are (internally) computed, whereas increasing the number of predictions allows GeMoMa to output more of the computed predictions. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
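The divisibility condition mentioned above is simple arithmetic and can be sketched in a few lines of Python. The function names are hypothetical; the sketch only shows why an intron that is gained or lost relative to the reference keeps the downstream codons in frame exactly when its length is a multiple of 3.

```python
def preserves_reading_frame(intron_length):
    """True if gaining or losing an intron of this length keeps the frame."""
    return intron_length % 3 == 0


def downstream_frame_shift(intron_length):
    """Offset (0, 1 or 2) by which downstream codons are shifted."""
    return intron_length % 3


for length in (90, 91, 92):
    print(length, preserves_reading_frame(length), downstream_frame_shift(length))
```

An intron of 90 bp leaves the downstream codons untouched, whereas 91 bp or 92 bp would shift them by one or two bases, respectively.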
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but will not include all statistics. Another memory-intensive step can be the protein alignment if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
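The format described above (two tab-delimited columns, amino acid and comma-separated triplets over A, C, G, and T) can be checked with a short Python sketch before passing the file to GeMoMa. This is a hypothetical validator, not the parser used by GeMoMa. It assumes one amino acid per line, no header line, and, in the example, <code>*</code> as the symbol for stop codons; please compare with the linked default genetic code for the exact conventions.

```python
def parse_genetic_code(table_text):
    """Parse a two-column, tab-delimited genetic code into {triplet: amino acid}."""
    codon_to_amino_acid = {}
    for line_number, line in enumerate(table_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected exactly two columns")
        amino_acid, triplets = columns[0].strip(), columns[1]
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # genomic DNA alphabet only: U (RNA) is rejected
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid triplet {triplet!r}")
            if triplet in codon_to_amino_acid:
                raise ValueError(f"line {line_number}: duplicate triplet {triplet!r}")
            codon_to_amino_acid[triplet] = amino_acid
    return codon_to_amino_acid


# Hypothetical excerpt of a genetic code table (not a complete code).
example = "M\tATG\n*\tTAA,TAG,TGA\nW\tTGG\n"
code = parse_genetic_code(example)
print(code["ATG"], code["TGA"], len(code))
```

A table copied from a source using the RNA alphabet (e.g. AUG instead of ATG) is rejected by this check, which mirrors the note above that GeMoMa expects T rather than U.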
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.3] (05.03.2020)<br />
* Jstacs changes:<br />
** CLI: bugfix ExpandableParameterSet<br />
* python wrapper (for *conda)<br />
* updated tests.sh, run.sh, pipeline.sh<br />
* rename Denoise to DenoiseIntrons<br />
* AnnotationEvidence: write phase (as given) to gff<br />
* GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files<br />
* GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false<br />
* GeMoMaPipeline: <br />
** bugfix: time-out<br />
** improve output<br />
** separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.2.zip GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen https://www.jstacs.de/index.php?title=GeMoMa&diff=1066 GeMoMa 2020-01-17T10:20:46Z
<p>Keilwagen: 1.6.2</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== NCBIReferenceRetriever ===<br />
<br />
We provide the module NCBIReferenceRetriever, which makes it easy to retrieve data for reference organisms from NCBI. You can run NCBIReferenceRetriever from the command line with<br/>
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI NRR [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/>
The parameters comprise:<br />
<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">n</font></td><br />
<td>number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rl</font></td><br />
<td>reference list (a list of reference organisms)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
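
As a minimal sketch, the parameters above could be combined as follows. The jar version and the file name <code>reference_list.txt</code> are placeholders chosen for illustration and are not prescribed by GeMoMa; the expected content of the reference list is not described in this section. The script only assembles and prints the command, so it can be inspected before it is actually run.

```shell
#!/bin/sh
# Sketch only: assemble an NCBIReferenceRetriever (NRR) call.
# Assumption: the jar is named GeMoMa-1.6.2.jar; adjust to the downloaded version.
GEMOMA_JAR="GeMoMa-1.6.2.jar"

# rl     = list of reference organisms (hypothetical file name)
# r      = directory for the downloaded genome and annotation files (documented default)
# n      = number of download tries (documented default, valid range 1 to 100)
# outdir = output directory (documented default: current working directory)
NRR_CMD="java -jar $GEMOMA_JAR CLI NRR rl=reference_list.txt r=references/ n=10 outdir=."

# Print the command; execute it yourself once the reference list exists.
echo "$NRR_CMD"
```

Since <code>r</code>, <code>n</code> and <code>outdir</code> are set to their documented defaults here, only <code>rl</code> should strictly be needed in such a call.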
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.
You can run GeMoMaPipeline from the command line with<br/>
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/>
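
For illustration, a call with a single reference species and no RNA-seq evidence might look like the following sketch. All file names, the ID <code>ref1</code> and the jar version are hypothetical placeholders. It is also an assumption that the sub-parameters <code>i</code>, <code>a</code> and <code>g</code> directly follow the <code>s=own</code> selection they belong to. Renaming is switched off (<code>AnnotationFinalizer.r=NO</code>) because the default selection COMPOSED lists a prefix without a default value. The script only assembles and prints the command.

```shell
#!/bin/sh
# Sketch only: assemble a GeMoMaPipeline call for one reference species.
# Assumption: the jar is named GeMoMa-1.6.2.jar; adjust to the downloaded version.
GEMOMA_JAR="GeMoMa-1.6.2.jar"

# t       = target genome (FASTA), hypothetical file name
# s=own   = reference data given as annotation plus genome
# i, a, g = ID, annotation (GFF or GTF) and genome (FASTA) of the reference, hypothetical names
# r=NO    = no RNA-seq evidence (documented default)
# tblastn = use tblastn as search algorithm (documented default)
PIPELINE_CMD="java -jar $GEMOMA_JAR CLI GeMoMaPipeline t=target.fa s=own i=ref1 a=ref1.gff g=ref1.fa r=NO tblastn=true AnnotationFinalizer.r=NO"

# Print the command; execute it yourself once the input files exist.
echo "$PIPELINE_CMD"
```

Further reference species could presumably be added by repeating the <code>s=own</code> block, since the species parameter of GeMoMaPipeline can be used multiple times.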
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
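As a rough illustration of the options above, the following sketch assembles a GeMoMaPipeline call that renames predictions with the SIMPLE scheme and additionally returns the predicted CDSs. The jar version, file names, prefix and thread count are placeholders chosen for this example. The reference-species parameters (s and its sub-parameters from the table above) are deliberately left out, so the printed command is incomplete until they are added.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar
TARGET=target_genome.fa

# Each parameter is passed as <name>=<value>, as listed in the table above.
set -- java -jar "$JAR" CLI GeMoMaPipeline \
  t="$TARGET" \
  AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE_ AnnotationFinalizer.d=5 \
  pc=true debug=false threads=8 outdir=pipeline_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# After adding the reference-species parameters, execute it with: "$@"
```

Keeping the arguments in the positional parameters means the command can be inspected first and then executed unchanged.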
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows filtering by flags in the SAM or BAM file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows outputting the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
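A minimal sketch of an ERE call for stranded data from two mapped files, using only parameters from the table above. The jar version, the BAM file names and the output directory are assumptions made for this example.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# m may be given multiple times, once per BAM/SAM file.
# c=true additionally requests the coverage output.
set -- java -jar "$JAR" CLI ERE \
  s=FR_FIRST_STRAND \
  m=replicate1.bam m=replicate2.bam \
  c=true mmq=40 outdir=ere_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# Execute it with: "$@"
```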
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
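A minimal sketch of a CheckIntrons call with two intron files, using only parameters from the table above. The jar version and all file and directory names are assumptions made for this example.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# i may be given multiple times, once per intron GFF.
set -- java -jar "$JAR" CLI CheckIntrons \
  t=target_genome.fa \
  i=introns_rep1.gff i=introns_rep2.gff \
  outdir=checkintrons_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# Execute it with: "$@"
```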
<br />
=== Denoise ===<br />
This tool removes introns that were potentially extracted incorrectly. You can run Denoise from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Denoise [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">me</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">context</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
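The nested coverage selection can be sketched as follows: choosing c=STRANDED requires both coverage_forward and coverage_reverse, whereas c=UNSTRANDED would take a single coverage_unstranded file instead. The jar version and all file and directory names are assumptions made for this example, and the numeric values simply repeat the documented defaults.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# c selects the coverage type; the matching file parameters follow it.
set -- java -jar "$JAR" CLI Denoise \
  i=introns.gff \
  c=STRANDED \
  coverage_forward=coverage_forward.bedgraph \
  coverage_reverse=coverage_reverse.bedgraph \
  m=15000 me=0.01 context=10 outdir=denoise_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# Execute it with: "$@"
```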
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
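A minimal sketch of an Extractor call that also returns protein and CDS sequences, tries to repair unparsable transcript annotations, and translates ambiguous codons to X. The jar version and all file and directory names are assumptions made for this example.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# a and g are the reference annotation and genome; the rest are optional switches.
set -- java -jar "$JAR" CLI Extractor \
  a=reference_annotation.gff g=reference_genome.fa \
  p=true c=true r=true Ambiguity=AMBIGUOUS \
  outdir=extractor_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# Execute it with: "$@"
```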
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag that enables the output of a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
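A minimal sketch of a GeMoMa call that combines search results with RNA-seq introns and unstranded coverage, using only parameters from the table above. The jar version and all file and directory names are assumptions made for this example. They stand in for the search output, the Extractor output and the RNA-seq evidence of your own run.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# coverage=UNSTRANDED is followed by its matching coverage_unstranded file.
set -- java -jar "$JAR" CLI GeMoMa \
  s=search_results.tabular t=target_genome.fa \
  c=cds-parts.fasta a=assignment.tabular \
  i=introns.gff \
  coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  Score=ReAlign p=10 outdir=gemoma_out

# Dry run: print the assembled command instead of executing it.
echo "$*"
# Execute it with: "$@"
```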
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to keep only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied, for instance, using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">atf</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
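Because the filter expressions contain spaces, quotes, parentheses and the characters * and >, they need to be quoted when passed through a shell. The following sketch combines predictions from two reference species and applies the more sophisticated filter quoted in the table. The jar version, file names and prefixes are assumptions made for this example, and listing the repeated parameters per input file in table order (p before g) is likewise an assumption.

```shell
# Placeholder values chosen for this sketch; adjust them to your data.
JAR=GeMoMa-1.9.jar

# Filter expressions as given in the table; double quotes keep each one a single argument.
FILTER="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"
ALT_FILTER="tie==1 or evidence>1"

# One p/g pair per reference species.
set -- java -jar "$JAR" CLI GAF \
  p=ref1_ g=ref1_predictions.gff \
  p=ref2_ g=ref2_predictions.gff \
  f="$FILTER" atf="$ALT_FILTER" \
  outdir=gaf_out

# Dry run: print one argument per line to verify that the filters were not split.
printf '%s\n' "$@"
# Execute it with: "$@"
```

Printing one argument per line makes it easy to confirm that each filter expression reaches the tool as one argument.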
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">assignment</font></td><br />
<td>assignment (the transcript info for the reference of the prediction)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
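Since all modules follow the same <code>&lt;parameter&gt;=&lt;value&gt;</code> convention, a call can be assembled programmatically. The Python sketch below builds the argument list for CompareTranscripts; the function name, the jar file name and the input file names are placeholders chosen for this example, and parameters are passed as a list of pairs so that a repeatable parameter can occur more than once.

```python
def build_command(jar, module, parameters):
    """Return the argument list for 'java -jar <jar> CLI <module> name=value ...'.

    'parameters' is a list of (name, value) pairs, so a name may repeat.
    """
    command = ["java", "-jar", jar, "CLI", module]
    for name, value in parameters:
        command.append(f"{name}={value}")
    return command


# Placeholder jar and file names; adapt them to your installation and data.
command = build_command(
    "GeMoMa-1.9.jar",
    "CompareTranscripts",
    [("p", "predicted.gff"), ("a", "reference.gff"), ("outdir", "results")],
)
print(" ".join(command))
```

The resulting list can be handed to <code>subprocess.run(command, check=True)</code>, which avoids shell quoting problems with file names that contain spaces.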
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool allows predicting UTRs and renaming predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned GFF. It might be beneficial to set this to a specific value for some genome browsers, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that are deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
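As an illustration of how tie and maxTie relate, the following Python sketch computes the share of predicted introns that are also present in a set of RNA-seq introns, and then takes the maximum over the transcripts of a gene. It is a sketch under assumptions and not GeMoMa's code: the intron representation as (contig, strand, start, end) tuples, the function names and the coordinates are invented for the example, and the value is expressed as a fraction between 0 and 1, in line with filter expressions such as tie==1.

```python
def transcript_intron_evidence(predicted_introns, rna_seq_introns):
    """Fraction of predicted introns that also occur in the RNA-seq introns.

    Returns None for transcripts without introns (single coding exon).
    """
    if not predicted_introns:
        return None
    supported = sum(1 for intron in predicted_introns if intron in rna_seq_introns)
    return supported / len(predicted_introns)


def max_tie(tie_values):
    """Maximal tie over all transcripts of a gene, ignoring undefined values."""
    known = [tie for tie in tie_values if tie is not None]
    return max(known) if known else None


# Hypothetical introns as (contig, strand, start, end).
rna_seq_introns = {
    ("chr1", "+", 1201, 1350),
    ("chr1", "+", 1501, 1720),
    ("chr1", "+", 1901, 2100),
}
transcript_a = [("chr1", "+", 1201, 1350), ("chr1", "+", 1501, 1720)]
transcript_b = transcript_a + [("chr1", "+", 1901, 2100), ("chr1", "+", 2301, 2450)]

tie_a = transcript_intron_evidence(transcript_a, rna_seq_introns)
tie_b = transcript_intron_evidence(transcript_b, rna_seq_introns)
print(tie_a, tie_b, max_tie([tie_a, tie_b]))
```

Here all introns of the first transcript are supported, whereas only three of the four introns of the second transcript are, so the gene-level maximum is determined by the first transcript.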
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (GFF) and genome file (FASTA) are identical. Ideally, the FASTA comments contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons why the Extractor removes transcripts. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two-column tab-delimited table, where the first column is the one-letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as a template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
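The two-column format of the genetic code described above can be checked before it is passed to a module. The Python sketch below parses such a table and rejects triplets that are not written in the DNA alphabet; the function name and the three-line excerpt are assumptions for illustration (including the use of '*' for stop codons), so please compare them with the linked default file rather than treating them as its exact content.

```python
def parse_genetic_code(text):
    """Map each DNA triplet to its one-letter amino acid.

    Expects one line per amino acid: 'amino_acid<TAB>triplet,triplet,...'.
    """
    codon_to_amino_acid = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        amino_acid, triplets = line.split("\t")  # exactly two columns expected
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid DNA triplet {triplet!r}")
            if triplet in codon_to_amino_acid:
                raise ValueError(f"line {line_number}: duplicate triplet {triplet!r}")
            codon_to_amino_acid[triplet] = amino_acid
    return codon_to_amino_acid


# Hypothetical excerpt of a genetic code table in the described format.
example = "M\tATG\n*\tTAA,TAG,TGA\nW\tTGG\n"
table = parse_genetic_code(example)
print(table["ATG"], table["TGA"], len(table))
```

A complete table should assign all 64 triplets, so checking <code>len(table) == 64</code> is a cheap additional sanity test. Tables copied from a source that uses the RNA alphabet need U replaced by T first.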
<br />
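Regarding wrong phases in the CDS entries of a GFF (cf. the FAQ on transcripts filtered out by the Extractor), the following Python sketch shows how phases follow from the lengths of the CDS parts according to the GFF3 convention, where the phase is the number of bases to skip at the start of a CDS part to reach the next complete codon. This illustrates the convention only; how the Extractor's repair option infers phases internally is not described here, and the function name and lengths are assumptions.

```python
def infer_phases(cds_lengths, first_phase=0):
    """Return the GFF3 phase of each CDS part, given lengths in transcript order."""
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon of this part
        # determine how many bases the next part must contribute first.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# Hypothetical CDS parts of 100, 50 and 30 bp (180 bp in total).
print(infer_phases([100, 50, 30]))
```

In this example the first part ends one base into a codon, so the second part gets phase 2, which shows why a phase column that contains 0 for every CDS entry is usually wrong.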
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1065GeMoMa2020-01-17T10:10:03Z<p>Keilwagen: /* Gene Model Mapper (GeMoMa) */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
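
As a rough illustration of how the parameters above combine, the following sketch assembles a GeMoMaPipeline call for one reference species (selection <code>own</code>) with mapped RNA-seq reads (selection <code>MAPPED</code>) and prints it instead of executing it. All file names, the ID <code>ref1</code>, the thread count and the jar name are hypothetical placeholders, not values prescribed by GeMoMa.<br />

```shell
#!/bin/sh
# Dry run: assemble a GeMoMaPipeline call and print it instead of executing it.
# Every file name below is a hypothetical placeholder; substitute your own data.
GEMOMA_JAR="GeMoMa.jar"   # placeholder for the versioned GeMoMa jar file

# One reference species ("own") plus mapped RNA-seq reads as evidence.
# The block s=own i=... a=... g=... may be repeated for further reference species.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMaPipeline \
    threads=4 outdir=results \
    t=target.fa \
    s=own i=ref1 a=ref1.gff g=ref1.fa \
    r=MAPPED ERE.m=rnaseq.bam

PIPELINE_CMD="$*"
echo "$PIPELINE_CMD"
# To run the pipeline for real, execute "$@" instead of printing it.
```

Values that contain spaces or shell metacharacters, such as a filter expression for <code>GAF.f</code>, have to be quoted so that the shell passes them as a single argument.<br />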
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
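
A minimal sketch of an ERE call, assuming two hypothetical BAM files from a stranded library; it only prints the command. The parameter <code>m</code> is repeated once per mapped reads file, and <code>c=true</code> additionally requests coverage output. File names, the output directory and the jar name are placeholders.<br />

```shell
#!/bin/sh
# Dry run: assemble an ERE call and print it instead of executing it.
# File names and the output directory are hypothetical placeholders.
GEMOMA_JAR="GeMoMa.jar"   # placeholder for the versioned GeMoMa jar file

# "m" may be given several times, once per BAM/SAM file.
set -- java -jar "$GEMOMA_JAR" CLI ERE \
    s=FR_FIRST_STRAND \
    m=sample1.bam m=sample2.bam \
    c=true outdir=ere_out

ERE_CMD="$*"
echo "$ERE_CMD"
# To run ERE for real, execute "$@" instead of printing it.
```

The extracted introns, and the coverage if requested, can then presumably be supplied to GeMoMaPipeline through the selection <code>r=EXTRACTED</code> with the <code>introns</code> and <code>coverage</code> parameters listed above.<br />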
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
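Because the parameter <code>i</code> may be given several times, a call with more than one intron file can be assembled step by step. The following is a minimal dry-run sketch that only prints the command. The jar name and all file names are hypothetical placeholders, not names taken from this page.

```shell
#!/bin/sh
# Dry-run sketch: assemble a CheckIntrons call with several intron files.
# The jar name and all file names are assumed placeholders.
jar="GeMoMa-1.9.jar"
target="target.fa"

# Fixed part of the command line.
set -- java -jar "$jar" CLI CheckIntrons "t=$target"

# Repeat the parameter i once per intron GFF.
for introns in introns_leaf.gff introns_root.gff; do
    set -- "$@" "i=$introns"
done

set -- "$@" outdir=checkintrons_out

# Print the command instead of executing it.
checkintrons_cmd="$*"
echo "$checkintrons_cmd"
```

To execute the call for real, run the printed line in a directory that contains the jar and the input files.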
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
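As an illustration of the parameters above, the following sketch prints an Extractor call that also requests the complete protein sequences and translates ambiguous codons to X. The jar name and the input file names are assumptions chosen for this example.

```shell
#!/bin/sh
# Dry-run sketch of an Extractor call; nothing is executed.
# Jar name and input paths are assumed placeholders.
jar="GeMoMa-1.9.jar"
annotation="reference.gff"   # reference annotation (GFF or GTF)
genome="reference.fa"        # reference genome (FASTA)

# a and g are the two input files; p=true additionally returns protein sequences.
extractor_cmd="java -jar $jar CLI Extractor a=$annotation g=$genome p=true Ambiguity=AMBIGUOUS outdir=extractor_out"
echo "$extractor_cmd"
```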
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
For predicting gene models, we provide the tool GeMoMa. You can run GeMoMa from the command line with<br />
<br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMa [&lt;parameter&gt;=&lt;value&gt; ...]</code><br />
<br />
The parameters comprise: <br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
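A GeMoMa call with RNA-seq evidence combines the mandatory inputs with the intron and coverage parameters. The sketch below prints such a call for the coverage selection STRANDED, which requires both <code>coverage_forward</code> and <code>coverage_reverse</code>. All file names and the jar name are hypothetical, and the value <code>r=2</code> is only an example.

```shell
#!/bin/sh
# Dry-run sketch of a GeMoMa call with stranded RNA-seq coverage.
# Jar name and every file name are assumed placeholders.
jar="GeMoMa-1.9.jar"

# Mandatory inputs plus the optional assignment file.
set -- java -jar "$jar" CLI GeMoMa \
    s=search_results.tabular t=target.fa c=cds-parts.fasta a=assignment.tabular

# RNA-seq evidence: introns supported by at least 2 split reads, stranded coverage.
set -- "$@" i=introns.gff r=2 coverage=STRANDED \
    coverage_forward=coverage_forward.bedgraph \
    coverage_reverse=coverage_reverse.bedgraph

set -- "$@" outdir=gemoma_out

# Print the command instead of executing it.
gemoma_cmd="$*"
echo "$gemoma_cmd"
```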
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
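The filter expression contains spaces, quotes, an asterisk and parentheses, so it has to reach GAF as one quoted argument. The following sketch shows one way to do this in a POSIX shell and checks that the filter stays a single argument. The prefixes, file names and jar name are hypothetical, and the pairing of each <code>p</code> with the following <code>g</code> is an assumption about how the repeated parameters are grouped.

```shell
#!/bin/sh
# Dry-run sketch: pass a GAF filter expression as one shell argument.
# Jar name, prefixes and file names are assumed placeholders.
jar="GeMoMa-1.9.jar"

# Double quotes keep the single quotes, '*', '>' and parentheses literal.
filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# One prefix and one prediction file per reference species (assumed pairing).
set -- java -jar "$jar" CLI GAF \
    p=speciesA g=speciesA_predictions.gff \
    p=speciesB g=speciesB_predictions.gff \
    "f=$filter" outdir=gaf_out

# The filter must arrive as exactly one argument (the tenth one here).
gaf_nargs=$#
gaf_filter_arg=${10}
echo "number of arguments: $gaf_nargs"
printf '%s\n' "$@"
```

Printing one argument per line makes it easy to see whether the shell split or expanded any part of the filter.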
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with an existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie, the transcript intron evidence) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
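<br />
The attributes in the table can also be used to filter predictions by hand. The following Python snippet is a minimal sketch of this, not part of GeMoMa: the function names, the thresholds and the example line are hypothetical, and treating a missing value or <code>NA</code> (as reported for single coding exon genes) as passing the filter is an assumption. The example values also assume that attributes such as tie and iAA are reported as fractions between 0 and 1; please check your own GFF before choosing thresholds.<br />
<br />
```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes(attributes, name, minimum):
    """Return True if the attribute is at least `minimum`.

    Assumption: a missing attribute or the value 'NA' is not a reason
    to discard the prediction.
    """
    value = attributes.get(name)
    if value is None or value == "NA":
        return True
    return float(value) >= minimum


# Hypothetical prediction line; only the ninth column is interpreted here.
line = "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1;score=350;tie=1;iAA=0.8"
attributes = parse_attributes(line.split("\t")[8])
keep = passes(attributes, "tie", 1.0) and passes(attributes, "iAA", 0.5)
print(attributes["ID"], "kept" if keep else "discarded")
```

For routine use, the filter of GAF is probably the more convenient choice; a script like this is mainly useful for exploring the attribute values.<br />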
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it lacks an initial start codon or a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, which is performed if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
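<br />
Before passing a user-specified genetic code to GeMoMa, it can help to check the file against the format described above. The following Python snippet is a minimal sketch of such a check, not the parser used by GeMoMa: the function name and the example lines are hypothetical, and rejecting lower-case bases and duplicated triplets is an assumption about what a well-formed file looks like.<br />
<br />
```python
def read_genetic_code(lines):
    """Parse a two-column, tab-delimited genetic code table.

    Column 1: one-letter code of the amino acid.
    Column 2: comma-separated list of triplets over A, C, G and T.
    Returns a dict mapping each triplet to its amino acid.
    """
    codon_to_amino_acid = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected exactly two columns")
        amino_acid, triplets = parts
        for codon in triplets.split(","):
            codon = codon.strip()
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {line_number}: invalid triplet {codon!r}")
            if codon in codon_to_amino_acid:
                raise ValueError(f"line {line_number}: duplicated triplet {codon!r}")
            codon_to_amino_acid[codon] = amino_acid
    return codon_to_amino_acid


# Hypothetical excerpt of a genetic code table.
example = ["M\tATG", "W\tTGG", "F\tTTT,TTC"]
code = read_genetic_code(example)
print(len(code), "triplets read")
```

A triplet written in the RNA alphabet, e.g. AUG copied from the NCBI page, is reported as invalid by this check, which is the most likely mistake when adapting an alternative genetic code.<br />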
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1064GeMoMa2020-01-17T10:04:17Z<p>Keilwagen: /* Extractor */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server allows only a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
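<br />
As an illustration of the <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax, the following Python snippet assembles such a call for one reference species of type "own". It is a minimal sketch under stated assumptions: the function name, the jar file name and all file names are hypothetical, only the parameters t, s, i, a and g from the parameter table are used, and the sub-parameters of s are assumed to follow directly after <code>s=own</code>.<br />
<br />
```python
import shlex


def build_pipeline_command(jar, target_genome, reference_annotation,
                           reference_genome, reference_id=None):
    """Return the argument list for a GeMoMaPipeline call with one reference."""
    command = ["java", "-jar", jar, "CLI", "GeMoMaPipeline",
               f"t={target_genome}", "s=own"]
    if reference_id:
        # optional ID to distinguish different reference species
        command.append(f"i={reference_id}")
    command += [f"a={reference_annotation}", f"g={reference_genome}"]
    return command


# Hypothetical file names; replace them with your own data.
command = build_pipeline_command("GeMoMa-1.6.2.jar", "target.fa",
                                 "reference.gff", "reference.fa",
                                 reference_id="ref1")
print(shlex.join(command))
```

The returned list can be passed to <code>subprocess.run</code> as is, which avoids quoting problems with file names that contain spaces.<br />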
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor site and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (generates generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
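<br />
As a hedged illustration of how the pipeline parameters above combine, the following shell sketch assembles a call that uses mapped RNA-seq reads, denoises the extracted introns, and applies simple renaming. The file names, the thread count, and the prefix <code>gene_</code> are placeholders (assumptions), and the parameters for the reference species, which are documented earlier in this table, are deliberately left out and would have to be added. The command is only printed, not executed.<br />

```shell
# Sketch only: jar version, file names and values are assumptions.
GEMOMA_JAR="GeMoMa-1.9.jar"

PIPELINE_CMD="java -jar $GEMOMA_JAR CLI GeMoMaPipeline"
# general settings: number of threads and output directory
PIPELINE_CMD="$PIPELINE_CMD threads=4 outdir=pipeline_out"
# target genome (FASTA)
PIPELINE_CMD="$PIPELINE_CMD t=target.fa"
# RNA-seq evidence from mapped reads; ERE.* parameters belong to selection MAPPED
PIPELINE_CMD="$PIPELINE_CMD r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=rnaseq.bam"
# remove questionable introns with the documented default threshold
PIPELINE_CMD="$PIPELINE_CMD d=DENOISE Denoise.m=0.01"
# SIMPLE renaming needs a prefix, since no default is documented for it
PIPELINE_CMD="$PIPELINE_CMD AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=gene_"
# additionally write predicted CDSs and the per-reference predictions
PIPELINE_CMD="$PIPELINE_CMD pc=true o=true"

# Print the assembled command for inspection instead of running it.
echo "$PIPELINE_CMD"
```

Parameters that depend on a selection (for example <code>ERE.m</code> for <code>r=MAPPED</code>) are given here directly after the selecting parameter; whether the order matters is not stated on this page.<br />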
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
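<br />
A minimal sketch of an ERE call, assuming two BAM files from a stranded library: the parameter <code>m</code> is repeated once per file, and <code>c=true</code> requests the coverage output. The jar version, the file names, and the output directory are placeholders (assumptions); the command is only printed, not executed.<br />

```shell
# Sketch only: jar version and file names are assumptions.
GEMOMA_JAR="GeMoMa-1.9.jar"

ERE_CMD="java -jar $GEMOMA_JAR CLI ERE"
# library type of the reads
ERE_CMD="$ERE_CMD s=FR_FIRST_STRAND"
# m can be used multiple times, once per BAM/SAM file
ERE_CMD="$ERE_CMD m=sample1.bam"
ERE_CMD="$ERE_CMD m=sample2.bam"
# also output the coverage; keep the documented default mapping quality
ERE_CMD="$ERE_CMD c=true mmq=40"
ERE_CMD="$ERE_CMD outdir=ere_out"

# Print the assembled command for inspection instead of running it.
echo "$ERE_CMD"
```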
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
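<br />
A minimal sketch of a CheckIntrons call, assuming an intron GFF file such as one produced from RNA-seq data. The jar version and all file and directory names are placeholders (assumptions); the command is only printed, not executed.<br />

```shell
# Sketch only: jar version and file names are assumptions.
GEMOMA_JAR="GeMoMa-1.9.jar"

CHECK_CMD="java -jar $GEMOMA_JAR CLI CheckIntrons"
# target genome (FASTA) and the introns (GFF) to be checked against it
CHECK_CMD="$CHECK_CMD t=target.fa i=introns.gff"
# request additional information and choose an output directory
CHECK_CMD="$CHECK_CMD v=true outdir=checkintrons_out"

# Print the assembled command for inspection instead of running it.
echo "$CHECK_CMD"
```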
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts processing to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
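<br />
A minimal sketch of an Extractor call that additionally returns protein and CDS sequences and encodes ambiguous codons as X. Note that the stand-alone tool uses the parameter names <code>Ambiguity</code> and <code>sefc</code>, whereas the pipeline uses <code>Extractor.a</code> and <code>Extractor.s</code>. The jar version and all file and directory names are placeholders (assumptions); the command is only printed, not executed.<br />

```shell
# Sketch only: jar version and file names are assumptions.
GEMOMA_JAR="GeMoMa-1.9.jar"

EXTRACTOR_CMD="java -jar $GEMOMA_JAR CLI Extractor"
# reference annotation (GFF or GTF) and reference genome (FASTA)
EXTRACTOR_CMD="$EXTRACTOR_CMD a=reference.gff g=reference.fa"
# also return the complete protein sequences and CDSs
EXTRACTOR_CMD="$EXTRACTOR_CMD p=true c=true"
# use X for ambiguous amino acids instead of removing the transcript
EXTRACTOR_CMD="$EXTRACTOR_CMD Ambiguity=AMBIGUOUS"
EXTRACTOR_CMD="$EXTRACTOR_CMD outdir=extractor_out"

# Print the assembled command for inspection instead of running it.
echo "$EXTRACTOR_CMD"
```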
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as gff file.<br />
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
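<br />
The parameters above can be assembled programmatically when several reference species are combined. The following is a minimal Python sketch, not part of GeMoMa: the helper name <code>build_gaf_command</code>, the file paths, and the prefixes are hypothetical, and the sketch assumes that <code>p</code>, <code>w</code> and <code>g</code> are given as one group per input file.<br />

```python
def build_gaf_command(jar, inputs, filter_expr=None, outdir="."):
    """Assemble a GAF call as an argument list in <parameter>=<value> form.

    inputs: list of (prefix, weight, gff_path) tuples, one per input file.
    """
    cmd = ["java", "-jar", jar, "CLI", "GAF"]
    for prefix, weight, gff in inputs:
        # p, w and g can be used multiple times (assumed: one group per file)
        cmd += [f"p={prefix}", f"w={weight}", f"g={gff}"]
    if filter_expr is not None:
        cmd.append(f"f={filter_expr}")
    cmd.append(f"outdir={outdir}")
    return cmd


command = build_gaf_command(
    "GeMoMa-1.6.1.jar",
    [
        ("ref1_", 1.0, "ref1/predicted_annotation.gff"),  # hypothetical paths
        ("ref2_", 2.0, "ref2/predicted_annotation.gff"),
    ],
    filter_expr="start=='M' and stop=='*' and score/AA>=0.75",
    outdir="gaf_out",
)
print(" ".join(command))
```

Passing the list to <code>subprocess.run(command, check=True)</code> instead of a single string avoids shell quoting problems with the quotes and the asterisk in the filter expression.<br />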
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
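<br />
If you prefer to filter by hand, the attributes in the table can be read from the ninth GFF column. The following Python sketch is an illustration only: the helper names are hypothetical, the example line is made up rather than real GeMoMa output, and the check merely mimics the default GAF filter (start=='M' and stop=='*' and score/AA>=0.75).<br />

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_default_like_filter(attrs, min_relative_score=0.75):
    """Mimic start=='M' and stop=='*' and score/AA>=0.75 on parsed attributes."""
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    relative_score = float(attrs["score"]) / float(attrs["AA"])
    return complete and relative_score >= min_relative_score


# Made-up prediction line for illustration only.
line = (
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\t"
    "ID=t1_R0;score=410;AA=500;start=M;stop=*;tie=1.0"
)
attrs = parse_attributes(line.split("\t")[8])
print(attrs["tie"], passes_default_like_filter(attrs))
```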
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p lets GeMoMa output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
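: For the first option, the list file for the parameter "selected" can be generated with a few lines of Python. This is a sketch under assumptions: the column layout (transcript ID, then optionally chromosome, strand, start, end) follows the parameter description, but the tab delimiter, the +/- strand notation, and all IDs and coordinates are assumptions for illustration.<br />

```python
import csv
import os
import tempfile


def write_selected_list(path, entries):
    """Write one row per transcript: ID, optionally followed by chromosome, strand, start, end."""
    with open(path, "w", newline="") as handle:
        # Tab-delimited output is an assumption; check your GeMoMa version.
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for transcript_id, region in entries:
            row = [transcript_id]
            if region is not None:
                chromosome, strand, start, end = region
                row += [chromosome, strand, start, end]
            writer.writerow(row)


path = os.path.join(tempfile.mkdtemp(), "selected.tsv")
write_selected_list(
    path,
    [
        ("transcript_A", None),                             # ID only
        ("transcript_B", ("contig_7", "+", 1200, 5400)),    # ID plus target region
    ],
)
with open(path) as handle:
    content = handle.read()
print(content)
```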
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. because of a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3 preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but without all statistics. The protein alignment can be another cause if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.<br />
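: The position of the VM options matters: they have to appear before <code>-jar</code>, because everything after the jar name is passed to the program itself. A minimal Python sketch of such a call follows; the helper name <code>with_jvm_memory</code> is hypothetical and the memory values are only the examples from the answer above.<br />

```python
def with_jvm_memory(jar, tool, params=(), initial="5G", maximum="50G"):
    """Return a java call with -Xms/-Xmx placed before -jar."""
    # JVM options must precede -jar; arguments after the jar go to the program.
    return ["java", f"-Xms{initial}", f"-Xmx{maximum}", "-jar", jar, "CLI", tool, *params]


command = with_jvm_memory("GeMoMa-1.6.1.jar", "GeMoMa", params=["outdir=results"])
print(" ".join(command))
```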
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
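The format described above can be produced and checked with standard command line tools. The following is a minimal sketch and not part of GeMoMa: the file names and the three rows are hypothetical and cover only a fragment of a complete table. It converts triplets from the RNA alphabet to the DNA alphabet and counts lines that violate the two-column layout.<br />

```shell
# Hypothetical fragment of a codon table in the RNA alphabet (amino acid <TAB> triplets).
printf 'M\tAUG\nW\tUGG\nF\tUUU,UUC\n' > rna_code.txt

# Replace U by T in the second column only, so one-letter amino acid codes stay untouched.
awk 'BEGIN { FS = OFS = "\t" } { gsub(/U/, "T", $2); print }' rna_code.txt > my_genetic_code.txt

# Count lines without exactly two columns or with a second column that is
# not a comma-separated list of triplets over A, C, G and T.
bad=$(awk -F '\t' 'NF != 2 || $2 !~ /^[ACGT][ACGT][ACGT](,[ACGT][ACGT][ACGT])*$/ { n++ } END { print n + 0 }' my_genetic_code.txt)
echo "malformed lines: $bad"
```

The check covers only the layout; it does not verify that all 64 triplets are present or that each triplet occurs only once.<br />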
<br />
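Regarding the OutOfMemoryError question above: the Java VM options have to be placed before -jar, otherwise they are handed to the program as ordinary arguments. The following is a minimal sketch in which the jar name, the heap sizes and the target file are placeholders to be adapted; it prints the command instead of running it.<br />

```shell
# Placeholders: adapt the jar name and heap sizes to your installation and machine.
GEMOMA_JAR="GeMoMa-1.9.jar"
MIN_HEAP="5G"    # -Xms: initially used RAM
MAX_HEAP="50G"   # -Xmx: maximally used RAM

# Dry run: print the command instead of executing it.
# Replace the function body by "$@" to actually start Java.
run() { echo "$@"; }

# VM options go before -jar; module name and parameters follow the jar file.
# Further pipeline parameters have to be appended as needed.
run java "-Xms${MIN_HEAP}" "-Xmx${MAX_HEAP}" -jar "$GEMOMA_JAR" CLI GeMoMaPipeline t=target.fa
```

Keeping the heap sizes in variables makes it easy to raise only the maximum when a run fails again.<br />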
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins were used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1063GeMoMa2020-01-17T10:03:56Z<p>Keilwagen: /* CheckIntrons */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
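Whether the installed Java is recent enough can be checked before starting a long run. The helper below is a hypothetical sketch and not part of GeMoMa; it assumes version strings of the form 1.8.0_292 (old scheme) or 11.0.2 (new scheme) as printed by java -version.<br />

```shell
# Map a Java version string to its major version: 1.8.0_292 -> 8, 11.0.2 -> 11.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;                 # old scheme: 1.<major>.x
        *)   echo "$1" | cut -d. -f1 | cut -d- -f1 ;;   # new scheme: <major>.x.y, also e.g. 17-ea
    esac
}

# java -version prints to stderr; the version is the quoted string after the word "version".
if command -v java >/dev/null 2>&1; then
    version=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
    if [ -n "$version" ] && [ "$(java_major "$version")" -ge 8 ]; then
        echo "Java $version is sufficient for GeMoMa"
    else
        echo "Java v1.8 or later is required (found: ${version:-unknown})" >&2
    fi
else
    echo "java not found on PATH" >&2
fi
```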
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the compute server via multi-threading. However, GeMoMaPipeline does not distribute tasks on a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/>
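As an illustration, a run with one reference species and mapped RNA-seq reads could be assembled as sketched below. The version number and all file names are placeholders, and the parameter selection is an assumption made for illustration, not a recommended configuration. The sketch prints the command instead of executing it.<br />

```shell
VERSION="1.6.2"   # placeholder: use the version of the jar you downloaded

# Dry run: print the command instead of executing it.
# Replace the function body by "$@" to actually start the pipeline.
run() { echo "$@"; }

# One reference species given by annotation and genome (selection "own"),
# unstranded mapped reads as RNA-seq evidence, and no renaming of predictions.
run java -jar "GeMoMa-${VERSION}.jar" CLI GeMoMaPipeline \
    t=target.fa \
    s=own i=ref1 a=reference.gff g=reference.fa \
    r=MAPPED ERE.s=FR_UNSTRANDED ERE.m=rnaseq.bam \
    AnnotationFinalizer.r=NO
```

Setting AnnotationFinalizer.r=NO is used here because the default COMPOSED expects a prefix, for which no default is listed.<br />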
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
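As a rough illustration of how the prefixed module parameters above are handed to the pipeline, the following sketch assembles a call that uses the SIMPLE renaming scheme and the default alternative transcript filter. All file names, the prefix value GENE, and the thread count are hypothetical. The reference parameters <code>a</code> and <code>g</code> for <code>s=own</code> are an assumption here, so please check the parameter list above before relying on them. The sketch prints the command and only executes it if the jar is found.

```shell
# Jar name as in the pipeline call on this page; adjust to your installation.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Hypothetical inputs: target.fa, reference.gff, reference.fa.
# a= and g= for s=own are assumed, not confirmed by this part of the page.
# Values containing spaces (the GAF.a filter) are quoted as one argument.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMaPipeline \
    t=target.fa \
    s=own a=reference.gff g=reference.fa \
    "GAF.a=tie==1 or evidence>1" \
    AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE \
    pc=true threads=4 outdir=pipeline_out
PIPELINE_ARGC=$#
PIPELINE_CMD="$*"
echo "$PIPELINE_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```

Quoting the whole <code>GAF.a=...</code> argument keeps the filter expression together and prevents the shell from interpreting <code>&gt;</code> as a redirection.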
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
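A minimal sketch of an ERE call, assuming two hypothetical BAM files from a stranded library: the parameter <code>m</code> is repeated once per file, <code>c=true</code> requests the coverage output, and <code>mmq</code> is set to its default of 40 only to show where it goes. The jar name and all paths are placeholders.

```shell
# Jar name is a placeholder; use the version you have installed.
GEMOMA_JAR="GeMoMa-1.9.jar"

# sample1.bam and sample2.bam are hypothetical mapped-read files.
set -- java -jar "$GEMOMA_JAR" CLI ERE \
    m=sample1.bam m=sample2.bam \
    s=FR_FIRST_STRAND c=true mmq=40 outdir=ere_out
ERE_ARGC=$#
ERE_CMD="$*"
echo "$ERE_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```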
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
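For orientation, a CheckIntrons call might look like the sketch below. The target genome and the intron file are hypothetical names; the intron GFF is assumed to come from a previous ERE run.

```shell
# Jar name is a placeholder; use the version you have installed.
GEMOMA_JAR="GeMoMa-1.9.jar"

# target.fa and introns.gff are hypothetical file names.
set -- java -jar "$GEMOMA_JAR" CLI CheckIntrons \
    t=target.fa i=introns.gff v=true outdir=checkintrons_out
CHECK_CMD="$*"
echo "$CHECK_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```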
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns are ignored, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities, range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
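The following sketch shows one way an Extractor call could be put together. It additionally requests the protein and CDS sequences and maps ambiguous codons to X instead of dropping the transcript. The reference file names are hypothetical.

```shell
# Jar name is a placeholder; use the version you have installed.
GEMOMA_JAR="GeMoMa-1.9.jar"

# reference.gff and reference.fa are hypothetical reference files.
set -- java -jar "$GEMOMA_JAR" CLI Extractor \
    a=reference.gff g=reference.fa \
    p=true c=true Ambiguity=AMBIGUOUS outdir=extractor_out
EXTRACTOR_CMD="$*"
echo "$EXTRACTOR_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```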
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
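As a hedged example, a GeMoMa call with RNA-seq evidence could be assembled as follows. The tool name <code>GeMoMa</code> after <code>CLI</code> is assumed by analogy with the other modules, and every file name is hypothetical. The sketch requires at least two split reads per intron and supplies unstranded coverage.

```shell
# Jar name is a placeholder; use the version you have installed.
GEMOMA_JAR="GeMoMa-1.9.jar"

# All input file names below are hypothetical.
# coverage=UNSTRANDED requires the additional coverage_unstranded file.
set -- java -jar "$GEMOMA_JAR" CLI GeMoMa \
    s=search_results.txt t=target.fa \
    c=cds-parts.fasta a=assignment.tabular \
    i=introns.gff r=2 \
    coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
    outdir=gemoma_out
GEMOMA_ARGC=$#
GEMOMA_CMD="$*"
echo "$GEMOMA_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```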
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
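Because filter expressions contain spaces, quotes, and the characters <code>&gt;</code> and <code>*</code>, they need to reach GAF as a single shell argument. The sketch below combines two hypothetical GeMoMa prediction files and applies the more sophisticated filter quoted in the table above. File names are placeholders.

```shell
# Jar name is a placeholder; use the version you have installed.
GEMOMA_JAR="GeMoMa-1.9.jar"

# Double quotes keep the filter as one argument and stop the shell from
# expanding '*' or treating '>' as a redirection.
set -- java -jar "$GEMOMA_JAR" CLI GAF \
    g=predictions_ref1.gff g=predictions_ref2.gff \
    "f=start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)" \
    outdir=gaf_out
GAF_ARGC=$#
GAF_CMD="$*"
echo "$GAF_CMD"

# Run only when the jar is actually present.
if [ -f "$GEMOMA_JAR" ]; then "$@"; fi
```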
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
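The parameter table above can be turned into a command line in many ways. The following is a minimal sketch in Python that only assembles the argument list, using the parameter keys listed above (g, a, u, rename, p, d, outdir). The function name, its arguments, and the file names are hypothetical, and it is an assumption that the nested parameters of a selection (here p and d for rename=SIMPLE) are given right after the selection itself.

```python
import shlex


def build_finalizer_command(jar, genome, annotation, prefix,
                            digits=5, outdir=".", max_heap=None):
    """Assemble an AnnotationFinalizer call as an argument list.

    The keys g, a, u, rename, p, d and outdir come from the parameter
    table; everything else (function name, argument names) is illustrative.
    """
    # The table gives the valid range [4, 10] for the number of digits.
    if not 4 <= digits <= 10:
        raise ValueError("digits must be within [4, 10]")
    cmd = ["java"]
    if max_heap:
        # Standard JVM option for the maximal heap size, e.g. -Xmx50G.
        cmd.append(f"-Xmx{max_heap}")
    cmd += [
        "-jar", jar, "CLI", "AnnotationFinalizer",
        f"g={genome}",
        f"a={annotation}",
        "u=NO",            # no UTR prediction, hence no introns/coverage needed
        "rename=SIMPLE",   # assumed to be followed by its nested parameters
        f"p={prefix}",
        f"d={digits}",
        f"outdir={outdir}",
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_finalizer_command("GeMoMa-1.6.1.jar", "target.fa",
                                  "filtered_predictions.gff", "GENE_")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Returning a list rather than a string keeps the call safe to pass to <code>subprocess.run</code> without shell quoting issues.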
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
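If you prefer to inspect or filter predictions by hand rather than with GAF, the attributes in the table above can be read from the ninth GFF column. The following Python sketch shows one possible way to do this. The example line, the function names and the thresholds are made up for illustration, and the sketch assumes that tie and iAA are stored as fractions between 0 and 1; please adjust the thresholds if your files use another scale.

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value pairs separated by ';')."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def value_or_none(attrs, key):
    """Return the attribute as float, or None if it is absent or NA.

    NA is used, e.g., for tie of single coding exon genes.
    """
    raw = attrs.get(key)
    if raw is None or raw == "NA":
        return None
    return float(raw)


def passes(attrs, thresholds):
    """Keep a prediction unless an available attribute is below its minimum."""
    for key, minimum in thresholds.items():
        value = value_or_none(attrs, key)
        if value is not None and value < minimum:
            return False
    return True


if __name__ == "__main__":
    # Made-up GFF line, for illustration only.
    line = "chr1\tGAF\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0;tie=1;iAA=0.82;evidence=2"
    attrs = parse_attributes(line.split("\t")[8])
    print(passes(attrs, {"tie": 1.0, "iAA": 0.5}))
```

Attributes that are missing or NA are not used for filtering in this sketch, so predictions made without RNA-seq data or query proteins are kept.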
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
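Before passing a user-specified genetic code to GeMoMa, it may help to check that the file follows the format described in the last answer: two tab-delimited columns, the one-letter amino acid code first, then a comma-separated list of triplets over A, C, G, and T. The following Python sketch performs such a check. The function names are hypothetical and the checks (no U, no triplet listed twice, report of unassigned triplets) are only a plausibility test, not the validation that GeMoMa itself performs.

```python
from itertools import product


def parse_genetic_code(text):
    """Return {triplet: amino acid} from a two-column, tab-delimited table."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip().upper()
            # Genomic DNA alphabet is expected, i.e. T instead of U.
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"invalid triplet: {triplet!r}")
            if triplet in table:
                raise ValueError(f"triplet listed twice: {triplet}")
            table[triplet] = amino_acid
    return table


def missing_triplets(table):
    """List all of the 64 possible triplets that are not assigned."""
    all_triplets = {"".join(p) for p in product("ACGT", repeat=3)}
    return sorted(all_triplets - set(table))


if __name__ == "__main__":
    # Tiny excerpt in the described format, for illustration only.
    excerpt = "M\tATG\nF\tTTT,TTC\n"
    code = parse_genetic_code(excerpt)
    print(code)
    print(len(missing_triplets(code)), "triplets are not assigned in this excerpt")
```

For a complete genetic code file, <code>missing_triplets</code> should return an empty list.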
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1062GeMoMa2020-01-17T10:03:42Z<p>Keilwagen: /* Extract RNA-seq Evidence (ERE) */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could combine several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
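<br />
As a minimal sketch of how the parameters above combine on the command line, the following shell function assembles a GeMoMaPipeline call with one reference species, mapped RNA-seq reads and a custom filter. The jar location, all file names, the ID <code>ref1</code>, the thread count and the function name are hypothetical placeholders. Placing the nested parameters (<code>i</code>, <code>a</code>, <code>g</code>, <code>ERE.m</code>) directly after their selection (<code>s=own</code>, <code>r=MAPPED</code>) is an assumption. The filter follows the more sophisticated example from the description of <code>GAF.f</code> and is wrapped in double quotes so that the shell passes it as a single argument. If the jar file is not found, the arguments are only printed, one per line.<br />

```shell
#!/bin/sh
# Hypothetical jar location; override via the environment if needed.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-1.9.jar}"

gemoma_pipeline() {
  # Collect the call as positional parameters so that quoting is preserved.
  set -- java -jar "$GEMOMA_JAR" CLI GeMoMaPipeline \
    threads=4 outdir=results \
    t=target.fa \
    s=own i=ref1 a=ref1.gff g=ref1.fa \
    r=MAPPED ERE.m=rnaseq.bam \
    "GAF.f=start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"
  if [ -f "$GEMOMA_JAR" ]; then
    "$@"                  # jar found: run the pipeline
  else
    printf '%s\n' "$@"    # dry run: show each argument on its own line
  fi
}

gemoma_pipeline
```

Further reference species could presumably be added by repeating the <code>s=own i=... a=... g=...</code> group, since the table marks <code>s</code> as a parameter that can be used multiple times.<br />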
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
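<br />
A call with two mapped read files from a stranded library might look like the following sketch, in which <code>m</code> is repeated once per BAM file, as the table above permits. The jar location, the file names, the output directory and the function name are hypothetical placeholders; if the jar file is not found, the arguments are only printed.<br />

```shell
#!/bin/sh
# Hypothetical jar location; override via the environment if needed.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-1.9.jar}"

gemoma_ere() {
  set -- java -jar "$GEMOMA_JAR" CLI ERE \
    s=FR_FIRST_STRAND \
    m=sample1.bam m=sample2.bam \
    c=true \
    outdir=ere_out
  if [ -f "$GEMOMA_JAR" ]; then
    "$@"                  # jar found: extract introns (and coverage, c=true)
  else
    printf '%s\n' "$@"    # dry run: show each argument on its own line
  fi
}

gemoma_ere
```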
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
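<br />
The following sketch checks a single intron file against the target genome. The jar location, the file names (including <code>introns.gff</code>), the output directory and the function name are hypothetical placeholders; if the jar file is not found, the arguments are only printed.<br />

```shell
#!/bin/sh
# Hypothetical jar location; override via the environment if needed.
GEMOMA_JAR="${GEMOMA_JAR:-GeMoMa-1.9.jar}"

gemoma_check_introns() {
  set -- java -jar "$GEMOMA_JAR" CLI CheckIntrons \
    t=target.fa \
    i=introns.gff \
    outdir=checkintrons_out
  if [ -f "$GEMOMA_JAR" ]; then
    "$@"                  # jar found: run CheckIntrons
  else
    printf '%s\n' "$@"    # dry run: show each argument on its own line
  fi
}

gemoma_check_introns
```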
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-&lt;version&gt;.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
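<br />
The three options of the parameter Ambiguity can be illustrated with a small Python sketch. It is not the code used by the Extractor: the function name <code>translate_codon</code> is hypothetical, the standard genetic code is assumed, and the sketch assumes that every codon containing a non-ACGT symbol counts as ambiguous.<br />

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC nucleotide codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon, mode="EXCEPTION", rng=random):
    """Translate one codon; 'mode' mimics the documented Ambiguity options."""
    codon = codon.upper()
    if all(base in "ACGT" for base in codon):
        return CODON_TABLE[codon]
    if mode == "EXCEPTION":
        # Assumption: the caller catches this and drops the whole transcript.
        raise ValueError("ambiguous codon: " + codon)
    if mode == "AMBIGUOUS":
        return "X"
    if mode == "RANDOM":
        # Collect all amino acids that the ambiguous codon could encode.
        options = sorted(
            {CODON_TABLE["".join(c)] for c in product(*(IUPAC[b] for b in codon))}
        )
        return rng.choice(options)
    raise ValueError("unknown mode: " + mode)


print(translate_codon("ATG"))
print(translate_codon("ANG", mode="AMBIGUOUS"))
```

With the default EXCEPTION, a single ambiguous codon is enough to lose the transcript, so AMBIGUOUS or RANDOM may be worth considering for reference genomes that contain many N stretches.<br />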
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
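<br />
The logic of the default filter and of the more sophisticated example filter can be sketched in Python as follows. This is only an illustration of the documented criteria, not the evaluator that GAF uses internally. The function names and the attribute values are hypothetical.<br />

```python
def relative_score(attrs):
    """Score per amino acid, written as score/AA in the filter expression."""
    return float(attrs["score"]) / float(attrs["AA"])


def passes_default_filter(attrs):
    """start=='M' and stop=='*' and score/AA>=0.75"""
    return (
        attrs["start"] == "M"
        and attrs["stop"] == "*"
        and relative_score(attrs) >= 0.75
    )


def passes_extended_filter(attrs):
    """Default filter combined with (evidence>1 or tpc==1.0)."""
    return passes_default_filter(attrs) and (
        int(attrs["evidence"]) > 1 or float(attrs["tpc"]) == 1.0
    )


# Hypothetical attribute values of two predictions.
complete = {"start": "M", "stop": "*", "score": 400, "AA": 500, "evidence": 1, "tpc": 1.0}
partial = {"start": "L", "stop": "*", "score": 480, "AA": 500, "evidence": 3, "tpc": 1.0}

for name, attrs in (("complete", complete), ("partial", partial)):
    print(name, passes_default_filter(attrs), passes_extended_filter(attrs))
```

The second prediction is rejected although its relative score is high, because it does not start with a methionine.<br />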
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool allows to predict UTRs and to rename predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
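<br />
To work with these attributes outside of GeMoMa, the ninth GFF column can be split into key/value pairs. The following Python sketch assumes the usual GFF3 convention of <code>key=value</code> pairs separated by semicolons. The function name and the example line, including all of its values, are hypothetical.<br />

```python
def parse_gff_attributes(column9):
    """Split the GFF attributes column into a dictionary of strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        attrs[key.strip()] = value.strip()
    return attrs


# Hypothetical prediction line; real GeMoMa output contains more attributes.
line = (
    "chr1\tGeMoMa\tprediction\t1000\t2500\t.\t+\t.\t"
    "ID=gene1_R0;score=412;tie=1.0;tpc=0.95;minSplitReads=3;iAA=0.87;pAA=0.93"
)

columns = line.split("\t")
attrs = parse_gff_attributes(columns[8])

# Attribute values are strings and have to be converted before comparing them.
fully_supported = float(attrs["tie"]) == 1.0
print(attrs["ID"], attrs["score"], fully_supported)
```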
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments which contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: Several parameters affect the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa first makes a preliminary prediction and uses it to decide whether a contig is promising and should be used to determine the final predictions. Decreasing ct increases the number of predictions that are computed internally, because more contigs are considered. Increasing p allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene since it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
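: Filtering by hand can be sketched with awk. The example below assumes that transcript predictions carry the default tag <code>prediction</code> in the third GFF column and that attributes are given as <code>key=value</code> pairs separated by semicolons; the function name <code>filter_by_attribute</code> and the threshold are hypothetical.<br />

```shell
# Print feature lines of type "prediction" whose numeric attribute
# (e.g. avgCov) is at least the given threshold.
# Hypothetical helper, not part of GeMoMa.
filter_by_attribute() {
  gff="$1"; key="$2"; min="$3"
  awk -F'\t' -v key="$key" -v min="$min" '
    $3 == "prediction" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        # each attribute is expected to have the form key=value
        eq = index(attrs[i], "=")
        if (eq > 0 && substr(attrs[i], 1, eq - 1) == key &&
            substr(attrs[i], eq + 1) + 0 >= min + 0) {
          print
          break
        }
      }
    }' "$gff"
}
```

: For instance, <code>filter_by_attribute predicted_annotation.gff avgCov 1</code> keeps only predictions with an average coverage of at least 1. The sketch prints the prediction lines only, so the corresponding CDS lines would have to be collected separately via their Parent attribute.<br />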
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
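: Whether an annotation might suffer from the phase problem can be checked by counting the values of the phase column (column 8) of all CDS entries. This is a minimal sketch; the function name <code>cds_phase_counts</code> is hypothetical.<br />

```shell
# Count how often each phase value (GFF column 8) occurs in CDS features.
# Output: one line per phase value, "<phase><TAB><count>".
# Hypothetical helper, not part of GeMoMa.
cds_phase_counts() {
  awk -F'\t' '$3 == "CDS" { count[$8]++ }
    END { for (p in count) print p "\t" count[p] }' "$1" | sort
}
```

: If an annotation with many multi-exon genes reports only phase 0, the phases are probably not reliable, and the repair option of the Extractor might be worth testing.<br />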
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
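: The length condition can be checked for a specific intron with simple arithmetic. The sketch below assumes 1-based, inclusive coordinates as used in GFF, so that the intron length is end - start + 1; the function name <code>intron_preserves_frame</code> is hypothetical.<br />

```shell
# Succeeds (exit status 0) if an intron with 1-based, inclusive
# coordinates start..end has a length that is divisible by 3.
# Hypothetical helper, not part of GeMoMa.
intron_preserves_frame() {
  len=$(( $2 - $1 + 1 ))
  [ $(( len % 3 )) -eq 0 ]
}
```

: For example, an intron from position 101 to 190 has a length of 90 and passes the check, whereas an intron from 101 to 200 has a length of 100 and does not.<br />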
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
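: The VM options have to be placed before <code>-jar</code>, otherwise they are passed to the program instead of the Java VM. The following sketch only prints the resulting call; the jar file name, the heap sizes and the function name <code>gemoma_with_heap</code> are placeholders that need to be adapted.<br />

```shell
# Print the java call with explicit heap settings; remove "echo" to run it.
# GEMOMA_JAR and the heap sizes are placeholders to adapt.
GEMOMA_JAR="GeMoMa-1.6.2.jar"
gemoma_with_heap() {
  xms="$1"; xmx="$2"; shift 2
  # VM options go before -jar, module name and parameters after CLI
  echo java "-Xms$xms" "-Xmx$xmx" -jar "$GEMOMA_JAR" CLI "$@"
}
gemoma_with_heap 5G 50G GeMoMaPipeline
```

: Further parameters of the module can be appended to the call as <code>name=value</code> pairs.<br />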
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
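: If an alternative genetic code has already been brought into the two-column layout but still uses the RNA alphabet, the codons can be converted as sketched below. The function name <code>rna_to_dna_code</code> is hypothetical; only the codon column is changed.<br />

```shell
# Convert a two-column, tab-delimited genetic code table from the RNA
# alphabet (U) to the DNA alphabet (T). The amino acid column is kept as is.
# Hypothetical helper, not part of GeMoMa.
rna_to_dna_code() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NF == 2 { gsub(/U/, "T", $2); gsub(/u/, "t", $2); print }' "$1"
}
```

: Lines that do not consist of exactly two tab-separated columns are dropped, so it is worth comparing the number of lines before and after the conversion.<br />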
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1061GeMoMa2020-01-17T10:03:11Z<p>Keilwagen: /* GeMoMaPipeline */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the compute server via multi-threading. However, GeMoMaPipeline does not distribute tasks on a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-<version>.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ai</font></td><br />
<td>annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used zero or multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;DENOISE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.m</font></td><br />
<td>minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Denoise.c</font></td><br />
<td>context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;RAW&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows avoiding stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.a</font></td><br />
<td>alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">debug</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
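<br />
The filter syntax of <code>GAF.f</code> and <code>GAF.a</code> described above can be illustrated with a small sketch. It evaluates a filter expression against the attributes of one prediction using Python's <code>eval</code>, which is an assumption made purely for illustration and not how GAF evaluates filters internally. The helper <code>passes_filter</code> and all attribute values are hypothetical.<br />
<br />
```python
def passes_filter(expression, attributes):
    """Evaluate a filter expression against the attributes of one prediction.

    Illustration only: Python's eval stands in for the real filter engine.
    """
    # No builtins are exposed; only the prediction's attributes are visible.
    return bool(eval(expression, {"__builtins__": {}}, dict(attributes)))


# Default filter and the extended example filter from the parameter description.
DEFAULT_FILTER = "start=='M' and stop =='*' and score/AA>=0.75"
EXTENDED_FILTER = DEFAULT_FILTER + " and (evidence>1 or tpc==1.0)"

# Hypothetical predictions (attribute values are made up).
complete = {"start": "M", "stop": "*", "score": 400, "AA": 500,
            "evidence": 1, "tpc": 1.0}
partial = dict(complete, start="L")         # does not start with methionine
weak = dict(complete, score=300, tpc=0.5)   # relative score 0.6 is below 0.75

for name, prediction in [("complete", complete), ("partial", partial), ("weak", weak)]:
    print(name,
          passes_filter(DEFAULT_FILTER, prediction),
          passes_filter(EXTENDED_FILTER, prediction))
```
<br />
In this sketch only the complete prediction passes both filters: the partial one fails the start criterion and the weak one fails the relative score criterion.<br />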
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
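<br />
As a minimal sketch, an ERE call can be assembled from the parameters listed above. The helper <code>build_ere_command</code> and the BAM file names are hypothetical, and the sketch assumes that a parameter that can be used multiple times is passed by repeating its key (<code>m=... m=...</code>).<br />
<br />
```python
import shlex


def build_ere_command(bam_files, stranded="FR_UNSTRANDED", min_mapping_quality=40,
                      coverage=False, outdir=".", jar="GeMoMa-1.6.1.jar"):
    """Assemble an ERE command line as an argument list (nothing is executed)."""
    args = ["java", "-jar", jar, "CLI", "ERE", f"s={stranded}"]
    # Assumption: repeatable parameters are given once per file.
    args += [f"m={path}" for path in bam_files]
    args += [f"mmq={min_mapping_quality}",
             f"c={str(coverage).lower()}",
             f"outdir={outdir}"]
    return args


# Hypothetical input files for a stranded library.
command = build_ere_command(["leaf.bam", "root.bam"],
                            stranded="FR_FIRST_STRAND",
                            coverage=True,
                            outdir="ere_out")
print(shlex.join(command))
```
<br />
The resulting list could be passed to <code>subprocess.run</code>; building it as a list avoids quoting problems with unusual file names.<br />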
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
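<br />
The kind of check described above can be sketched as follows. This is not the CheckIntrons implementation: the function <code>splice_site_dinucleotides</code>, the toy sequence and the intron coordinates are hypothetical, and only forward-strand introns with 1-based inclusive coordinates are handled.<br />
<br />
```python
from collections import Counter


def splice_site_dinucleotides(sequence, introns):
    """Count donor-acceptor dinucleotide pairs for forward-strand introns.

    introns: iterable of (start, end) pairs, 1-based and inclusive.
    """
    counts = Counter()
    for start, end in introns:
        donor = sequence[start - 1:start + 1].upper()   # first two intron bases
        acceptor = sequence[end - 2:end].upper()        # last two intron bases
        counts[f"{donor}-{acceptor}"] += 1
    return counts


# Toy sequence: exon ATGGCC, intron GTAAGTTTTCAG, exon GCCTAA.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GCCTAA"
introns = [(7, 18),   # the real intron, canonical GT-AG
           (3, 10)]   # a shifted, implausible intron

counts = splice_site_dinucleotides(genome, introns)
for pattern, number in counts.most_common():
    print(pattern, number)
```
<br />
A large share of pairs other than the canonical GT-AG would hint at problems such as a wrong strandedness setting or unreliable split-read mappings.<br />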
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
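<br />
The three options of the <code>Ambiguity</code> parameter can be illustrated with a small translation sketch. It is not the Extractor implementation: it assumes the standard genetic code, handles only the ambiguity symbol N, and assumes that a codon whose possible translations all agree is not treated as ambiguous. The function names are hypothetical.<br />
<br />
```python
import random
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def possible_amino_acids(codon):
    """Return the sorted amino acids a codon containing N could encode."""
    choices = [BASES if base == "N" else base for base in codon]
    return sorted({CODON_TABLE["".join(c)] for c in product(*choices)})


def translate_cds(cds, ambiguity="AMBIGUOUS", rng=random):
    """Translate a CDS; return None if the transcript would be removed."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        options = possible_amino_acids(cds[i:i + 3].upper())
        if len(options) == 1:
            protein.append(options[0])           # unambiguous translation
        elif ambiguity == "EXCEPTION":
            return None                          # transcript is removed
        elif ambiguity == "AMBIGUOUS":
            protein.append("X")                  # unknown amino acid
        elif ambiguity == "RANDOM":
            protein.append(rng.choice(options))  # pick one possibility
        else:
            raise ValueError(f"unknown ambiguity mode: {ambiguity}")
    return "".join(protein)


print(translate_cds("ATGNNNTAA", "AMBIGUOUS"))
print(translate_cds("ATGNNNTAA", "EXCEPTION"))
print(translate_cds("ATGGANTAA", "RANDOM", random.Random(0)))
```
<br />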
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows avoiding stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
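<br />
A minimal sketch of reading such a GFF file is given below. It selects the lines whose third column equals the <code>tag</code> parameter (default <code>prediction</code>) and parses the attribute column. The function <code>read_predictions</code> and the sample lines, including all IDs and attribute values, are made up for illustration.<br />
<br />
```python
def read_predictions(lines, tag="prediction"):
    """Collect transcript predictions (third GFF column equals tag)."""
    predictions = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != tag:
            continue  # not a transcript prediction line
        attributes = dict(pair.split("=", 1)
                          for pair in columns[8].split(";") if "=" in pair)
        predictions.append({"seqid": columns[0],
                            "start": int(columns[3]),
                            "end": int(columns[4]),
                            "strand": columns[6],
                            "attributes": attributes})
    return predictions


# Hypothetical GFF content; real files contain more attributes.
SAMPLE_GFF = [
    "##gff-version 3\n",
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=transcript1_R1;score=400;AA=500\n",
    "chr1\tGeMoMa\tCDS\t100\t400\t.\t+\t0\tParent=transcript1_R1\n",
]

predictions = read_predictions(SAMPLE_GFF)
for prediction in predictions:
    attributes = prediction["attributes"]
    relative_score = float(attributes["score"]) / float(attributes["AA"])
    print(attributes["ID"], prediction["seqid"], relative_score)
```
<br />
If a different value was used for <code>tag</code> when running GeMoMa, the same value has to be passed to the reader.<br />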
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
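<br />
The repeatable parameters p, w and g are given once per input file. The following Python helper is a minimal sketch of how such a call could be assembled for several reference species. The <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax and the parameter names are taken from the table above; the helper name <code>build_gaf_command</code>, the order p, w, g within each group, the jar name and all file names are assumptions made for illustration.<br />

```python
from typing import List, Optional, Sequence, Tuple


def build_gaf_command(
    inputs: Sequence[Tuple[Optional[str], float, str]],
    filter_expr: Optional[str] = None,
    outdir: str = ".",
    jar: str = "GeMoMa-1.6.1.jar",  # assumed jar name; use your installed version
) -> List[str]:
    """Assemble an argument list for GAF (hypothetical helper).

    Each input is (prefix or None, weight, gff_path). The group p/w/g is
    repeated once per input file; this ordering is an assumption.
    """
    if not inputs:
        raise ValueError("GAF needs at least one gene annotation file (g)")
    cmd = ["java", "-jar", jar, "CLI", "GAF"]
    for prefix, weight, gff in inputs:
        if not 0.0 <= weight <= 1000.0:
            raise ValueError(f"weight {weight} outside the valid range [0.0, 1000.0]")
        if prefix is not None:
            cmd.append(f"p={prefix}")
        cmd.append(f"w={weight}")
        cmd.append(f"g={gff}")
    if filter_expr is not None:
        cmd.append(f"f={filter_expr}")
    cmd.append(f"outdir={outdir}")
    return cmd


if __name__ == "__main__":
    # File names and prefixes below are placeholders.
    command = build_gaf_command(
        [("ref1_", 1.0, "ref1/predicted_annotation.gff"),
         ("ref2_", 2.0, "ref2/predicted_annotation.gff")],
        filter_expr="start=='M' and stop=='*' and score/AA>=0.75",
        outdir="gaf_out",
    )
    print(" ".join(command))
```

Returning a list instead of a single string means the filter expression needs no shell quoting when the list is passed to <code>subprocess.run</code>.<br />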
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
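<br />
To illustrate what the attribute tie expresses, the following Python sketch computes the share of a transcript's introns that are supported by split reads, honouring a minimum read count as in parameter r. It is not the implementation used by AnnotationEvidence: the function name, the intron representation and the handling of single-exon transcripts are assumptions, and the result is returned as a fraction between 0 and 1.<br />

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical intron key: (contig, strand, start, end)
Intron = Tuple[str, str, int, int]


def transcript_intron_evidence(
    transcript_introns: Iterable[Intron],
    split_reads: Dict[Intron, int],
    min_reads: int = 1,
) -> Optional[float]:
    """Fraction of a transcript's introns supported by >= min_reads split reads.

    Returns None for single-exon transcripts, which have no introns to check
    (an assumption of this sketch).
    """
    introns = list(transcript_introns)
    if not introns:
        return None
    supported = sum(1 for intron in introns if split_reads.get(intron, 0) >= min_reads)
    return supported / len(introns)


if __name__ == "__main__":
    introns = [("chr1", "+", 101, 200), ("chr1", "+", 301, 400), ("chr1", "+", 501, 600)]
    reads = {introns[0]: 5, introns[1]: 1}
    print(transcript_intron_evidence(introns, reads, min_reads=1))
    print(transcript_intron_evidence(introns, reads, min_reads=2))
```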
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
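<br />
If you prefer to inspect predictions by hand, the attributes can be read from the last GFF column. The following Python sketch parses that column and mimics the default GAF filter (start=='M' and stop=='*' and score/AA>=0.75) for one prediction. It is an illustration only and not GAF's filter engine: the function names are hypothetical, the example attribute strings are invented, and rejecting predictions with missing values is an assumption.<br />

```python
from typing import Dict


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split a GFF attribute column (key=value pairs separated by ';')."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def passes_default_filter(attributes: Dict[str, str], min_relative_score: float = 0.75) -> bool:
    """Mimic start=='M' and stop=='*' and score/AA>=0.75 for one prediction."""
    try:
        relative_score = float(attributes["score"]) / float(attributes["AA"])
    except (KeyError, ValueError, ZeroDivisionError):
        return False  # assumption: predictions without usable values are rejected
    return (
        attributes.get("start") == "M"
        and attributes.get("stop") == "*"
        and relative_score >= min_relative_score
    )


if __name__ == "__main__":
    # Invented attribute strings for demonstration.
    for column in (
        "ID=gene1_R0;start=M;stop=*;score=412;AA=500",
        "ID=gene2_R0;start=L;stop=*;score=412;AA=500",
        "ID=gene3_R0;start=M;stop=*;score=300;AA=500",
    ):
        attrs = parse_attributes(column)
        print(attrs["ID"], passes_default_filter(attrs))
```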
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
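: The first check can be automated. The following Python sketch lists all sequence names that occur in the annotation but not in the genome file. It is not part of GeMoMa; the function names are hypothetical, and the FASTA name is taken to be the header text up to the first whitespace.<br />

```python
from typing import Iterable, List, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gff_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the first column of all feature lines of a GFF/GTF file."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if line.strip() and not line.startswith("#"):
            seqids.add(line.split("\t", 1)[0])
    return seqids


def missing_in_fasta(gff_lines: Iterable[str], fasta_lines: Iterable[str]) -> List[str]:
    """Sequence names used in the annotation but absent from the genome."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    gff = ["##gff-version 3\n",
           "chr1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=c1\n",
           "Chr2\tsrc\tCDS\t1\t9\t.\t+\t0\tID=c2\n"]
    fasta = [">chr1 some description\n", "ACGT\n", ">chr2\n", "ACGT\n"]
    print(missing_in_fasta(gff, fasta))  # names that need to be harmonized
```

: For real data, pass open file handles instead of the in-memory lists used here.<br />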
; How can I force GeMoMa to make more predictions?<br />
: Several parameters affect the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa first makes a preliminary prediction and uses it to decide whether a contig is promising and should be used to determine the final predictions. Decreasing ct increases the number of predictions that are (internally) computed, whereas increasing p allows GeMoMa to output more of the predictions that have been computed. Hence, you may decrease ct and increase p to have more contigs considered for the final predictions. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions, as it does not filter predictions by quality internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it lacks an initial start codon, a final stop codon, or both. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions for different reference transcripts may overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: The Extractor removes transcripts for several reasons. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
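: To check whether the phases in your GFF are plausible, you can recompute them from the CDS lengths. The following Python sketch follows the general GFF3 convention for the phase column; it is not the repair algorithm of the Extractor. The function name is hypothetical, and the sketch assumes that the CDS parts are given in transcript order (5' to 3') and that the first part starts with a complete codon.<br />

```python
from typing import List, Sequence


def infer_phases(cds_lengths: Sequence[int]) -> List[int]:
    """Return the GFF3 phase of each CDS part, given lengths in 5'->3' order.

    The phase is the number of bases to skip at the start of a CDS part to
    reach the first base of the next complete codon.
    """
    phases = []
    phase = 0  # assumption: the first CDS part starts with a complete codon
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon determine how many
        # bases the next part must contribute to finish that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


if __name__ == "__main__":
    print(infer_phases([10, 9, 11]))
```

: If all CDS entries of a multi-exon transcript carry phase 0 although the recomputed phases differ, the annotation probably shows the problem described above.<br />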
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you are providing a large coverage file (extracted from RNA-seq data). If you do not have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, which is performed if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
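: The VM options have to be placed before <code>-jar</code>, otherwise they are handed to the program instead of the Java VM. The following Python sketch assembles such a call. The helper name <code>java_command</code>, the jar name and the example parameter are assumptions; only -Xms, -Xmx and the <code>CLI</code> call pattern are taken from this page.<br />

```python
from typing import List, Sequence


def java_command(
    jar: str,
    tool: str,
    parameters: Sequence[str],
    initial_heap_gb: int = 5,
    max_heap_gb: int = 50,
) -> List[str]:
    """Assemble a java call with heap settings placed before -jar."""
    if initial_heap_gb <= 0 or max_heap_gb < initial_heap_gb:
        raise ValueError("need 0 < initial heap <= maximal heap")
    return [
        "java",
        f"-Xms{initial_heap_gb}G",  # initially used RAM
        f"-Xmx{max_heap_gb}G",      # maximally used RAM
        "-jar", jar,
        "CLI", tool,
        *parameters,
    ]


if __name__ == "__main__":
    # Jar name and parameter value are placeholders.
    print(" ".join(java_command("GeMoMa-1.6.1.jar", "GeMoMa", ["outdir=results"])))
```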
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
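<br />
Before passing a user-specified genetic code to GeMoMa, it can help to validate the file. The following Python sketch checks the format described above (two tab-separated columns, comma-separated triplets over A, C, G and T) and returns a codon table. It is not GeMoMa's own parser: the function name is hypothetical, and tolerating blank lines, rejecting duplicate triplets and using <code>*</code> for stop codons in the example are assumptions.<br />

```python
from typing import Dict, Iterable

DNA_BASES = set("ACGT")


def parse_genetic_code(lines: Iterable[str]) -> Dict[str, str]:
    """Map each DNA triplet to the one-letter code given in the first column."""
    codon_to_aa: Dict[str, str] = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines (an assumption of this sketch)
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid, triplets = columns[0].strip(), columns[1].strip()
        if len(amino_acid) != 1:
            raise ValueError(f"line {number}: first column must be a single letter")
        for triplet in triplets.split(","):
            triplet = triplet.strip()
            if len(triplet) != 3 or not set(triplet) <= DNA_BASES:
                raise ValueError(f"line {number}: {triplet!r} is not a triplet over A, C, G, T")
            if triplet in codon_to_aa:
                raise ValueError(f"line {number}: {triplet} is assigned twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa


if __name__ == "__main__":
    # A deliberately incomplete excerpt; a real file has to cover all 64 triplets.
    excerpt = ["M\tATG\n", "W\tTGG\n", "*\tTAA,TAG,TGA\n"]
    print(parse_genetic_code(excerpt))
```

A triplet written in the RNA alphabet, such as AUG, is rejected by this check, which catches tables copied directly from the NCBI page without replacing U by T.<br />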
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1060GeMoMa2020-01-17T10:00:25Z<p>Keilwagen: /* Version history */ version 1.6.2</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. This web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
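As a quick check of this requirement, the following sketch reads the version reported by <code>java -version</code> and compares its major number against 8. The helper name <code>java_major</code> is hypothetical and not part of GeMoMa, and the parsing assumes the usual <code>version "..."</code> line printed by common JDKs.

```shell
# Hypothetical helper (not part of GeMoMa): print the major Java version
# found in the output of `java -version`, mapping the legacy "1.x" scheme to x.
java_major() {
  v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$v" in
    1.*) v=${v#1.}; printf '%s\n' "${v%%.*}" ;;
    *)   printf '%s\n' "${v%%[.-]*}" ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes to stderr, so redirect it before parsing
  major=$(java_major "$(java -version 2>&1)")
  if [ "${major:-0}" -ge 8 ]; then
    echo "Java $major found: sufficient for GeMoMa"
  else
    echo "Java 1.8 or later is required" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```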
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
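A call that combines several of the parameters above might look like the following sketch. All file names and IDs are placeholders, and the ordering of the repeated reference blocks (<code>s=own</code> followed by its own <code>i</code>, <code>a</code> and <code>g</code>) is an assumption derived from the nesting shown in the table. The filter passed to <code>GAF.f</code> is the more sophisticated example from the parameter description; it is quoted as a whole because it contains spaces and characters such as <code>*</code>, <code>&gt;</code> and parentheses that the shell would otherwise interpret.

```shell
# All file names below are placeholders; replace them with your own data.
jar=GeMoMa-1.6.1.jar

# Collect the call in the positional parameters so that each argument,
# including the quoted filter expression, stays intact.
set -- java -jar "$jar" CLI GeMoMaPipeline \
  threads=4 outdir=results \
  t=target.fa \
  s=own i=ref1 a=ref1.gff g=ref1.fa \
  s=own i=ref2 a=ref2.gff g=ref2.fa \
  r=MAPPED ERE.s=FR_FIRST_STRAND ERE.m=mapped.bam \
  "GAF.f=start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# Show the command, then run it only if the jar is actually present.
printf '%s\n' "$*"
if [ -f "$jar" ]; then
  "$@"
fi
```

Printing the command first makes it easy to verify the quoting before starting a long-running job.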
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
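<br />
As a hedged illustration, the following shell sketch assembles an ERE call from the parameters listed above. The jar location, the BAM file names and the output directory are hypothetical placeholders, and the command is only printed (dry run) so that it can be inspected before it is executed.<br />

```shell
# Hypothetical inputs: replace the jar path and BAM names with your own files.
jar="GeMoMa-1.6.1.jar"

# m may be given several times, once per mapped reads file.
ere_args="s=FR_FIRST_STRAND"
for bam in sample1.bam sample2.bam; do
  ere_args="$ere_args m=$bam"
done

# c=true additionally writes the coverage; mmq=40 is the documented default.
ere_cmd="java -jar $jar CLI ERE $ere_args mmq=40 c=true outdir=ere_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$ere_cmd"
```

To run the tool for real, execute the printed line in a directory that contains the jar and the BAM files.<br />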
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
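<br />
A minimal sketch of a CheckIntrons call with more than one intron file, using only the parameters from the table above. All file names are hypothetical; the command is printed rather than executed.<br />

```shell
# Hypothetical inputs: adjust the jar path, genome and intron files.
jar="GeMoMa-1.6.1.jar"
target="target_genome.fa"

# i may be repeated, e.g. for intron files from separate RNA-seq experiments.
intron_args=""
for gff in run1_introns.gff run2_introns.gff; do
  intron_args="$intron_args i=$gff"
done

check_cmd="java -jar $jar CLI CheckIntrons t=$target$intron_args v=true outdir=check_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$check_cmd"
```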
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
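<br />
The following sketch shows how an Extractor call might be put together, including a simple check that the chosen Ambiguity value lies in the documented range. The reference file names and the output directory are assumptions made for illustration only.<br />

```shell
# Hypothetical inputs: adjust the jar path and the reference files.
jar="GeMoMa-1.6.1.jar"
ambiguity="AMBIGUOUS"

# Accept only the values documented for the Ambiguity parameter.
case "$ambiguity" in
  EXCEPTION|AMBIGUOUS|RANDOM) ;;
  *) echo "unsupported Ambiguity value: $ambiguity" >&2; exit 1 ;;
esac

# p=true and c=true additionally return complete proteins and CDSs;
# r=true asks the tool to try to repair unparsable transcript annotations.
extractor_cmd="java -jar $jar CLI Extractor a=reference.gff g=reference.fa p=true c=true r=true Ambiguity=$ambiguity outdir=extractor_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$extractor_cmd"
```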
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
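<br />
As a rough sketch, a GeMoMa call with RNA-seq evidence could be assembled as shown below. It assumes that the module keyword is <code>GeMoMa</code>, following the pattern of the other tools on this page, and all file names are hypothetical placeholders. The nested coverage parameters are chosen according to the selected coverage mode, as laid out in the table above.<br />

```shell
# Hypothetical inputs: adjust the jar path and all file names.
jar="GeMoMa-1.6.1.jar"
coverage_mode="UNSTRANDED"

# Each coverage selection has its own nested file parameter(s).
case "$coverage_mode" in
  NO)         coverage_args="coverage=NO" ;;
  UNSTRANDED) coverage_args="coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph" ;;
  STRANDED)   coverage_args="coverage=STRANDED coverage_forward=coverage_forward.bedgraph coverage_reverse=coverage_reverse.bedgraph" ;;
  *) echo "unsupported coverage mode: $coverage_mode" >&2; exit 1 ;;
esac

# r=2 keeps only introns supported by at least two split reads;
# m and p are set to their documented defaults for clarity.
gemoma_cmd="java -jar $jar CLI GeMoMa s=search_results.tabular t=target_genome.fa c=cds-parts.fasta a=assignment.tabular q=proteins.fasta i=introns.gff r=2 $coverage_args m=15000 p=10 outdir=gemoma_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$gemoma_cmd"
```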
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter allows to combine and reduce predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
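<br />
Because the filter expression contains spaces, quotes and an asterisk, it has to reach GAF as one shell argument. The sketch below illustrates one way to do this for predictions from two reference species. The species names, file names and prefixes are hypothetical, and the assumption that each <code>p</code> is paired with the <code>g</code> that follows it should be checked against your GeMoMa version.<br />

```shell
# Hypothetical inputs: adjust the jar path, species names and file names.
jar="GeMoMa-1.6.1.jar"

# The more sophisticated filter quoted in the parameter table above.
filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

# One prefix (p) and one annotation file (g) per reference species.
gaf_inputs=""
for species in speciesA speciesB; do
  gaf_inputs="$gaf_inputs p=${species}_ g=${species}/predicted_annotation.gff"
done

# The embedded double quotes keep the filter together as a single argument.
gaf_cmd="java -jar $jar CLI GAF$gaf_inputs f=\"$filter\" outdir=gaf_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$gaf_cmd"
```

When calling java directly from a script, passing <code>f="$filter"</code> achieves the same quoting.<br />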
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
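<br />
A hedged example of an AnnotationEvidence call with stranded coverage follows. Note that in this tool the coverage selection is named <code>c</code>, while the nested file parameters keep their long names. All file names are hypothetical placeholders.<br />

```shell
# Hypothetical inputs: adjust the jar path and all file names.
jar="GeMoMa-1.6.1.jar"
forward="coverage_forward.bedgraph"
reverse="coverage_reverse.bedgraph"

# c=STRANDED requires both a forward and a reverse coverage file.
coverage_args="c=STRANDED coverage_forward=$forward coverage_reverse=$reverse"

# ao=true returns the annotation with the attributes tie, tpc and AA.
evidence_cmd="java -jar $jar CLI AnnotationEvidence a=annotation.gff g=genome.fa i=introns.gff r=1 $coverage_args ao=true outdir=evidence_out"

# Dry run: print the assembled call instead of executing it.
printf '%s\n' "$evidence_cmd"
```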
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
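<br />
These attributes are stored in the ninth GFF column as semicolon-separated <code>key=value</code> pairs. The following Python sketch shows one way to read them and to keep only predictions that pass chosen thresholds. The function names, the thresholds and the idea of keeping <code>NA</code> values are illustrative assumptions and not part of GeMoMa; GAF provides its own filter.<br />

```python
def parse_attributes(column9):
    """Split a GFF attribute column into a dict of key -> value strings."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def as_number(value):
    """Return the value as float, or None if it is missing or "NA"."""
    if value is None or value == "NA":
        return None
    return float(value)


def filter_predictions(gff_lines, min_tie=0.0, min_avg_cov=0.0, feature="prediction"):
    """Yield (line, attributes) for features that pass the hypothetical thresholds.

    Thresholds use the same units as the values in the file. Predictions
    without a numeric tie or avgCov (missing or "NA") are not rejected.
    """
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature:
            continue
        attributes = parse_attributes(columns[8])
        tie = as_number(attributes.get("tie"))
        if tie is not None and tie < min_tie:
            continue
        avg_cov = as_number(attributes.get("avgCov"))
        if avg_cov is not None and avg_cov < min_avg_cov:
            continue
        yield line, attributes
```

The feature name defaults to <code>prediction</code>, the default of the parameter <code>tag</code>; pass a different value if you changed the tag.<br />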
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
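: The first check can be automated. The following Python sketch compares the sequence names in the first column of the annotation with the FASTA headers (text up to the first whitespace). The function names are hypothetical and the code is not part of GeMoMa.<br />

```python
def fasta_names(fasta_lines):
    """Collect sequence names: the header text up to the first whitespace."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def annotation_names(gff_lines):
    """Collect the sequence names from the first column of a GFF/GTF file."""
    names = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def missing_in_genome(gff_lines, fasta_lines):
    """Return annotation sequence names that have no FASTA record."""
    return sorted(annotation_names(gff_lines) - fasta_names(fasta_lines))
```

: Both functions accept any iterable of lines, for instance open file handles. An empty result means that every contig/chromosome named in the annotation is present in the genome file.<br />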
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. Decreasing ct increases the number of predictions that are computed internally, while increasing p allows GeMoMa to output more of the computed predictions. Hence, you may decrease ct and increase p to obtain more predictions. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
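: To illustrate what inferring phases means, the following Python sketch derives the phase column for the CDS parts of one complete transcript according to the GFF definition of phase (the number of bases to skip at the start of a part to reach the next full codon). It is not the Extractor's implementation; the function name and the assumption that the CDS starts with a complete codon are ours.<br />

```python
def infer_phases(cds_lengths):
    """Return the GFF phase of each CDS part.

    cds_lengths lists the part lengths in transcription order, i.e. 5' to 3'
    of the transcript (for minus-strand genes, highest coordinates first).
    """
    phases = []
    phase = 0  # assumption: the CDS starts with the first base of a codon
    for length in cds_lengths:
        phases.append(phase)
        # bases left over after skipping `phase` bases and all full codons
        leftover = (length - phase) % 3
        # the next part must contribute the bases that complete this codon
        phase = (3 - leftover) % 3
    return phases
```

: For example, parts of length 100, 50 and 30 receive the phases 0, 2 and 0, whereas an annotation with 0 in every phase column would be inconsistent for this transcript.<br />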
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
: Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another memory-intensive step can be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
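: If you start GeMoMa from a script, the VM options have to be placed before <code>-jar</code>. The following Python sketch assembles such a call; the helper name, the default heap sizes and the file names in the usage example are placeholders chosen for illustration.<br />

```python
def gemoma_command(jar, module, parameters, initial_heap="5G", max_heap="50G"):
    """Build the argument list for a GeMoMa CLI call with explicit heap sizes."""
    command = ["java", f"-Xms{initial_heap}", f"-Xmx{max_heap}",
               "-jar", jar, "CLI", module]
    # GeMoMa parameters are passed as <parameter>=<value>; a list of pairs
    # keeps their order and allows parameters that are given multiple times.
    command += [f"{name}={value}" for name, value in parameters]
    return command


if __name__ == "__main__":
    # Hypothetical file names; adapt the jar name to your GeMoMa version.
    command = gemoma_command(
        "GeMoMa-1.9.jar",
        "GeMoMaPipeline",
        [("t", "target.fa"), ("a", "reference.gff"),
         ("g", "reference.fa"), ("outdir", "results")],
    )
    print(" ".join(command))
```

: The resulting list can be passed to <code>subprocess.run</code> from the Python standard library.<br />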
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
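: Before passing your own table to GeMoMa, you may want to check it. The following Python sketch reads the two-column format described above and rejects triplets that are not made of A, C, G and T; the function name and the duplicate check are our assumptions, not behaviour documented for GeMoMa.<br />

```python
def parse_genetic_code(lines):
    """Map each triplet to its amino acid from a two-column, tab-delimited table."""
    codon_to_amino_acid = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # first column: one-letter code, second column: comma-separated triplets
        amino_acid, triplets = line.split("\t")
        for triplet in triplets.split(","):
            triplet = triplet.strip()
            if len(triplet) != 3 or set(triplet) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if triplet in codon_to_amino_acid:
                raise ValueError(f"line {number}: duplicate triplet {triplet!r}")
            codon_to_amino_acid[triplet] = amino_acid
    return codon_to_amino_acid
```

: A table that was copied from an RNA-alphabet source without replacing U by T raises an error here. For a complete code, the returned dictionary should contain all 64 triplets.<br />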
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.2] (17.12.2019)<br />
* Jstacs changes:<br />
** test methods for modules<br />
** live protocol for Galaxy<br />
* new module Denoise: allowing to clean introns extracted by ERE<br />
* new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.<br />
* GAF:<br />
** bugfix for filter using specific attributes if no RNA-seq or query proteins was used<br />
** allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms<br />
* GeMoMa: bugfix for timeout<br />
* GeMoMaPipeline:<br />
** bugfix reporting predicted partial proteins<br />
** improved protocol<br />
** new default value for query proteins (changed from false to true)<br />
** new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.1.zip GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>
Keilwagen
https://www.jstacs.de/index.php?title=GeMoMa&diff=1059
GeMoMa
2019-11-05T06:50:26Z
<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or for the case that no RNA-seq data is used; decides whether such transcripts that overlap other transcripts should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
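The parameters above are passed as <code>name=value</code> pairs. The following is a minimal sketch of such a call, written as a dry run that only prints the command line. All file names are placeholders, and the parameters describing the reference species (<code>s</code> and its sub-parameters, listed in the first part of the table) are deliberately left out and have to be added for a real run.<br />
<br />
```shell
# Dry run: assemble a GeMoMaPipeline call and print it instead of executing it.
# File names (target.fa, rnaseq.bam, gemoma_out) are placeholders.
jar="GeMoMa-1.6.1.jar"   # adjust to the jar you downloaded

# Filter copied from the description of GAF.f. Keeping it in double quotes
# prevents the shell from expanding '*' or treating '>' as a redirection.
gaf_filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

pipeline_cmd="java -jar $jar CLI GeMoMaPipeline t=target.fa"
pipeline_cmd="$pipeline_cmd r=MAPPED ERE.m=rnaseq.bam ERE.s=FR_FIRST_STRAND"
pipeline_cmd="$pipeline_cmd tblastn=false threads=8 outdir=gemoma_out"

# Print the command; the filter is shown as one quoted argument.
printf '%s "GAF.f=%s"\n' "$pipeline_cmd" "$gaf_filter"
```
<br />
When the call is executed for real, the filter has to stay one quoted argument, since it contains spaces, parentheses and comparison operators.<br />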
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
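Since <code>m</code> can be used multiple times, several BAM/SAM files can be passed in one call by repeating the parameter. A minimal sketch, again as a dry run; the BAM file names and the output directory are placeholders:<br />
<br />
```shell
# Dry run: build an ERE call with one m=... parameter per BAM file.
jar="GeMoMa-1.6.1.jar"   # adjust to the jar you downloaded
ere_cmd="java -jar $jar CLI ERE s=FR_FIRST_STRAND c=true mmq=40"

# Placeholder BAM files; each one is added as its own m= parameter.
for bam in leaf.bam root.bam; do
    ere_cmd="$ere_cmd m=$bam"
done

ere_cmd="$ere_cmd outdir=ere_out"
echo "$ere_cmd"
```
<br />
Here <code>c=true</code> additionally requests the coverage output, and <code>mmq=40</code> restates the default minimum mapping quality explicitly.<br />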
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
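A minimal sketch of a CheckIntrons call, written as a dry run. The path of the intron GFF is an assumption for illustration (any intron GFF, for instance one produced by ERE, can be used), as are the other file names:<br />
<br />
```shell
# Dry run: assemble a CheckIntrons call and print it.
jar="GeMoMa-1.6.1.jar"           # adjust to the jar you downloaded
introns="ere_out/introns.gff"    # assumed location of an intron GFF (placeholder)

check_cmd="java -jar $jar CLI CheckIntrons t=target.fa i=$introns v=true outdir=check_out"
echo "$check_cmd"
```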
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
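The <code>s</code> parameter expects a list file whose first column holds transcript IDs. A minimal sketch that writes such a file and assembles a matching Extractor call as a dry run; the transcript IDs and all file names are hypothetical:<br />
<br />
```shell
# Write a list file with one (hypothetical) transcript ID per line.
jar="GeMoMa-1.6.1.jar"   # adjust to the jar you downloaded
selected="selected_transcripts.txt"
printf '%s\n' "transcript_0001" "transcript_0002" > "$selected"

# Dry run: restrict Extractor to the listed transcripts and also return the proteins.
extractor_cmd="java -jar $jar CLI Extractor a=reference.gff g=reference.fa"
extractor_cmd="$extractor_cmd s=$selected p=true Ambiguity=AMBIGUOUS outdir=extractor_out"
echo "$extractor_cmd"
```
<br />
The IDs in the list file have to match the transcript IDs used in the annotation given by <code>a</code>.<br />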
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as gff file.<br />
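<br />
The list file expected by the parameter "selected" can be generated with a short script. The following is a minimal sketch, not part of GeMoMa: it assumes that the columns are tab-separated and that the strand is written as + or -, neither of which is stated in the parameter description. The function name <code>write_selected</code>, the transcript IDs and the region are hypothetical.<br />

```python
import csv
import io


def write_selected(handle, entries):
    """Write one row per transcript: the ID, optionally followed by
    chromosome, strand, start and end of a target region (columns 2 to 5)."""
    # Assumption: tab-separated columns; the documentation does not name the delimiter.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for entry in entries:
        transcript_id, region = entry[0], entry[1:]
        if len(region) not in (0, 4):
            raise ValueError("a region needs chromosome, strand, start and end")
        writer.writerow([transcript_id, *region])


# Hypothetical transcript IDs and region, for illustration only.
buffer = io.StringIO()
write_selected(buffer, [
    ("transcript_A",),
    ("transcript_B", "chr1", "+", 1000, 5000),
])
selected_text = buffer.getvalue()
```

In practice, the handle would be a file opened for writing, and its path would be passed to GeMoMa as <code>selected=&lt;path&gt;</code>.<br />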
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used; decides whether such transcripts that overlap other transcripts should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
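<br />
To illustrate what the filter expressions above select, the following sketch re-implements the default filter and the more sophisticated example filter in Python. It is not the code used by GAF. It assumes that the GFF attributes of one prediction are already available as a dictionary of strings, and all function names are hypothetical.<br />

```python
def as_number(value):
    """Convert an attribute value to float; missing or non-numeric values become NaN."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")


def passes_default_filter(attributes):
    """Sketch of: start=='M' and stop=='*' and score/AA>=0.75"""
    amino_acids = as_number(attributes.get("AA"))
    if not amino_acids > 0:  # also rejects NaN, i.e. a missing AA attribute
        return False
    relative_score = as_number(attributes.get("score")) / amino_acids
    return (
        attributes.get("start") == "M"
        and attributes.get("stop") == "*"
        and relative_score >= 0.75
    )


def passes_extended_filter(attributes):
    """Sketch of the default filter combined with: (evidence>1 or tpc==1.0)"""
    if not passes_default_filter(attributes):
        return False
    # Comparisons with NaN are False, so missing attributes never pass.
    return (
        as_number(attributes.get("evidence")) > 1
        or as_number(attributes.get("tpc")) == 1.0
    )


# Hypothetical attribute values, for illustration only.
complete = {"start": "M", "stop": "*", "score": "400", "AA": "500",
            "evidence": "1", "tpc": "1.0"}
low_score = dict(complete, score="300")
```

With these example values, <code>complete</code> has a relative score of 0.8 and passes both filters, whereas <code>low_score</code> has a relative score of 0.6 and is rejected.<br />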
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
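<br />
The attributes listed above are stored in the ninth column of the GFF and can be read with a few lines of code, for instance to rank predictions by hand. The following is a minimal sketch that assumes standard GFF3 syntax (tab-separated columns, <code>key=value</code> pairs separated by semicolons). The function names and the example rows, including the source column, are hypothetical.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dictionary."""
    attributes = {}
    for field in column9.strip().split(";"):
        key, separator, value = field.partition("=")
        if separator:
            attributes[key.strip()] = value.strip()
    return attributes


def read_features(lines, feature_type="prediction"):
    """Yield (seqid, start, end, strand, attributes) for rows whose third
    column equals feature_type; comments and malformed rows are skipped."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        yield (columns[0], int(columns[3]), int(columns[4]), columns[6],
               parse_attributes(columns[8]))


# Hypothetical GFF rows, for illustration only.
example = [
    "##gff-version 3\n",
    "chr1\tGAF\tgene\t100\t900\t.\t+\t.\tID=g1;maxTie=1;maxEvidence=2\n",
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1;Parent=g1;score=400;tie=1\n",
    "chr1\tGeMoMa\tprediction\t100\t850\t.\t+\t.\tID=t2;Parent=g1;score=450;tie=0.5\n",
]

# Rank the predictions by their GeMoMa score, best first.
ranked = sorted(read_features(example),
                key=lambda feature: float(feature[4]["score"]),
                reverse=True)
ranked_ids = [feature[4]["ID"] for feature in ranked]
```

If a different tag has been set for the third column (cf. parameter "tag"), it has to be passed as <code>feature_type</code>.<br />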
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions lets GeMoMa output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions, as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not adhere strictly to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' you provide a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but not all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given as a two-column, tab-delimited table, where the first column is the one-letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as a template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: Alternative genetic codes are described here using the RNA alphabet:<br />
: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
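The table format described in the last answer can be checked before it is passed to GeMoMa. The following Python snippet is only a sketch and not part of GeMoMa. The function names are hypothetical, and the use of <code>*</code> as the symbol for stop codons is an assumption that should be verified against the default genetic code file linked above.

```python
import itertools

DNA_BASES = "ACGT"


def rna_to_dna(triplet):
    """Convert a triplet from the RNA alphabet (U) to the DNA alphabet (T)."""
    return triplet.upper().replace("U", "T")


def parse_genetic_code(text):
    """Parse a two-column, tab-delimited genetic code table.

    Column 1: one-letter amino acid code; column 2: comma-separated DNA triplets.
    Returns a dict that maps each triplet to its amino acid.
    """
    codon_to_aa = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        amino_acid = fields[0].strip()
        if len(amino_acid) != 1:
            raise ValueError(f"line {line_no}: expected a one-letter amino acid code")
        for triplet in fields[1].split(","):
            triplet = triplet.strip().upper()
            if len(triplet) != 3 or any(base not in DNA_BASES for base in triplet):
                raise ValueError(f"line {line_no}: invalid DNA triplet {triplet!r}")
            if triplet in codon_to_aa:
                raise ValueError(f"line {line_no}: triplet {triplet} is listed twice")
            codon_to_aa[triplet] = amino_acid
    return codon_to_aa


def missing_triplets(codon_to_aa):
    """Return all of the 64 possible triplets that the table does not cover."""
    all_triplets = ("".join(t) for t in itertools.product(DNA_BASES, repeat=3))
    return sorted(t for t in all_triplets if t not in codon_to_aa)


# Tiny, deliberately incomplete example table ('*' for stop is an assumption).
example = "M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n"
code = parse_genetic_code(example)
print(code["ATG"], code["TGA"], len(missing_triplets(code)))
print(rna_to_dna("UGA"))
```

A complete table should leave <code>missing_triplets</code> empty. Triplets copied from a source that uses the RNA alphabet can be converted with <code>rna_to_dna</code> first.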
<br />
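As an illustration of the answer on java.lang.OutOfMemoryError above, the following Python sketch assembles a call with both heap options. The VM options have to be placed before <code>-jar</code>. The helper function and the jar file name are placeholders, and the values 5 and 50 are simply the examples from the answer, not a recommendation.

```python
import shlex


def heap_options(initial_gb, maximum_gb):
    """Return the JVM heap flags; the JVM does not start if -Xms exceeds -Xmx."""
    if initial_gb > maximum_gb:
        raise ValueError("initial heap size must not exceed maximum heap size")
    return [f"-Xms{initial_gb}G", f"-Xmx{maximum_gb}G"]


JAR = "GeMoMa-1.9.jar"  # placeholder: use the jar file of your GeMoMa release

# VM options go between "java" and "-jar"; module parameters follow the module name.
command = ["java", *heap_options(5, 50), "-jar", JAR, "CLI", "GeMoMaPipeline"]
print(shlex.join(command))
```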
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1058GeMoMa2019-11-05T06:45:28Z<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
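Whether the installed Java meets this requirement can be checked from the output of <code>java -version</code>. The following Python function is a sketch and not part of GeMoMa. It assumes the usual version string layout, where the legacy scheme reports Java 8 as <code>1.8</code> and newer releases report the major version first.

```python
import re


def java_major_version(version_output):
    """Extract the major Java version from the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    first, second = int(match.group(1)), match.group(2)
    # Legacy scheme: "1.8.0_292" means Java 8; current scheme: "17.0.1" means Java 17.
    if first == 1 and second is not None:
        return int(second)
    return first


# `java -version` writes to standard error, so a real check could read
# subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
print(java_major_version('openjdk version "1.8.0_292"') >= 8)
print(java_major_version('openjdk version "17.0.1" 2021-10-19') >= 8)
```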
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of a single server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons in a transcript (except for the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
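As a minimal sketch of an ERE call, the following collects the arguments as positional parameters and prints the resulting command line. The BAM file names and the output directory are hypothetical placeholders; the parameter names are taken from the table above, with <code>m</code> given twice because it can be used multiple times.

```shell
# Hypothetical input files and output directory; adjust to your data.
set -- CLI ERE \
  s=FR_FIRST_STRAND \
  m=sample1.bam m=sample2.bam \
  c=true \
  outdir=ere_out

# Show the assembled call; remove "echo" to actually run it.
echo java -jar GeMoMa-1.6.1.jar "$@"
```

Setting <code>c=true</code> requests the coverage output in addition to the introns, which is only needed if coverage is to be passed on to later modules.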
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
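A minimal sketch of a CheckIntrons call, assuming an intron file such as the one produced by ERE is available. The file names and the output directory are hypothetical; only parameters listed in the table above are used.

```shell
# Hypothetical file names; adjust to your data.
set -- CLI CheckIntrons \
  t=target_genome.fasta \
  i=introns.gff \
  outdir=checkintrons_out

# Show the assembled call; remove "echo" to actually run it.
echo java -jar GeMoMa-1.6.1.jar "$@"
```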
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation cannot be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
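A minimal sketch of an Extractor call for one reference species. The annotation and genome file names and the output directory are hypothetical; <code>p=true</code> and <code>c=true</code> additionally request the complete protein and CDS sequences, which are not returned by default.

```shell
# Hypothetical reference files; adjust to your data.
set -- CLI Extractor \
  a=reference_annotation.gff \
  g=reference_genome.fasta \
  p=true c=true \
  outdir=extractor_out

# Show the assembled call; remove "echo" to actually run it.
echo java -jar GeMoMa-1.6.1.jar "$@"
```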
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
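A minimal sketch of a GeMoMa call, assuming the module is invoked as <code>CLI GeMoMa</code> in the same way as the other modules. All file names and the output directory are hypothetical placeholders for the search results, the Extractor output, and the ERE output; the parameter names are taken from the table above.

```shell
# Hypothetical file names; adjust to your data.
set -- CLI GeMoMa \
  s=search_results.tabular \
  t=target_genome.fasta \
  c=cds-parts.fasta \
  a=assignment.tabular \
  i=introns.gff r=2 \
  coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  outdir=gemoma_out

# Show the assembled call; remove "echo" to actually run it.
echo java -jar GeMoMa-1.6.1.jar "$@"
```

Here <code>r=2</code> keeps only introns supported by at least two split reads, and the selection <code>coverage=UNSTRANDED</code> requires the matching <code>coverage_unstranded</code> file.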
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
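The filter expression contains spaces, quotes, parentheses, <code>*</code> and <code>&gt;</code>, all of which a shell would otherwise interpret, so it has to be passed as one quoted argument. The following sketch combines predictions from two reference species and uses the more sophisticated filter quoted in the table above. The prefixes, weights, file names and output directory are hypothetical, and grouping each <code>p</code> and <code>w</code> with its <code>g</code> is an assumption about how the repeated parameters are associated.

```shell
# Hypothetical prefixes, weights and GFF files; adjust to your data.
# The double quotes keep the whole filter as a single argument.
set -- CLI GAF \
  p=ref1_ w=1.0 g=ref1/predicted_annotation.gff \
  p=ref2_ w=0.5 g=ref2/predicted_annotation.gff \
  "f=start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)" \
  outdir=gaf_out

# Print one argument per line to confirm the filter stays intact.
printf '%s\n' "$@"

# To run the tool, use: java -jar GeMoMa-1.6.1.jar "$@"
```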
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g., transcript intron evidence, tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
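A minimal sketch of an AnnotationEvidence call that adds intron and coverage evidence to an existing annotation. The file names and the output directory are hypothetical. Note that in this module the coverage selection is named <code>c</code>, and <code>ao=true</code> requests the annotation with the attributes tie, tpc, and AA.

```shell
# Hypothetical file names; adjust to your data.
set -- CLI AnnotationEvidence \
  a=annotation.gff \
  g=genome.fasta \
  i=introns.gff \
  c=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  ao=true \
  outdir=evidence_out

# Show the assembled call; remove "echo" to actually run it.
echo java -jar GeMoMa-1.6.1.jar "$@"
```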
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
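<br />
As a rough illustration of the <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax, the following Python sketch assembles an AnnotationFinalizer call from the parameters listed in the table above. The helper name <code>build_finalizer_command</code> and all file names are hypothetical, and the sketch assumes that parameters of a selection (for instance <code>p</code> and <code>d</code> for <code>rename=SIMPLE</code>) are passed as plain key=value pairs on the same command line. It only builds and prints the command; it does not run GeMoMa.<br />

```python
import shlex


def build_finalizer_command(jar, genome, annotation, outdir=".", **extra):
    """Assemble the argument list for an AnnotationFinalizer call.

    jar, genome and annotation are paths chosen by the user (hypothetical here);
    extra holds further documented parameters such as rename, p, d or u.
    """
    params = {"g": genome, "a": annotation, "outdir": outdir}
    params.update(extra)
    args = ["java", "-jar", jar, "CLI", "AnnotationFinalizer"]
    # Every parameter is written as <parameter>=<value>.
    args += [f"{key}={value}" for key, value in params.items()]
    return args


# Hypothetical file names; rename=SIMPLE needs a prefix (p) and digits (d).
cmd = build_finalizer_command(
    "GeMoMa-1.6.1.jar",
    "target.fa",
    "filtered_predictions.gff",
    rename="SIMPLE",
    p="GENE",
    d=5,
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to hand over to <code>subprocess.run</code> later without worrying about shell quoting.<br />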
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
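<br />
The attributes in the table above can also be used for filtering by hand. The following Python sketch shows one possible way to read the attribute column of a GFF line and keep only predictions whose <code>tie</code> reaches a threshold. The function names, the example lines, and the threshold are assumptions made for illustration. In particular, please check in your own files on which scale <code>tie</code> is reported before choosing a threshold. The value <code>NA</code> is kept here by choice, as it is used for single coding exon genes.<br />

```python
def parse_attributes(column9):
    """Split the GFF attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def keep_prediction(line, min_tie=1.0, feature="prediction"):
    """Return True if the line is a prediction feature with sufficient tie."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != feature:
        return False
    tie = parse_attributes(cols[8]).get("tie", "NA")
    if tie == "NA":
        # No intron evidence can be evaluated; keeping such lines is a choice.
        return True
    return float(tie) >= min_tie


# Hypothetical example lines, not real GeMoMa output.
lines = [
    "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1;score=250;tie=1;avgCov=12.5",
    "chr1\tGeMoMa\tprediction\t2000\t2900\t.\t-\t.\tID=t2;score=180;tie=0.5",
    "chr1\tGeMoMa\tprediction\t5000\t5300\t.\t+\t.\tID=t3;score=90;tie=NA",
    "chr1\tGeMoMa\tCDS\t100\t300\t.\t+\t0\tParent=t1",
]
kept = [parse_attributes(l.split("\t")[8])["ID"] for l in lines if keep_prediction(l)]
print(kept)
```

For anything beyond such simple checks, GAF is the intended tool, since it additionally handles overlapping and identical predictions.<br />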
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We called a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another source of high memory usage can be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
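<br />
To make the expected format more concrete, here is a minimal Python sketch that reads such a two-column, tab-delimited table into a codon lookup and rejects triplets that are not written with A, C, G, and T. The function name <code>parse_genetic_code</code> and the four-line excerpt are assumptions made for illustration. The excerpt is not the complete default genetic code, so please use the linked file as the template for real runs.<br />

```python
def parse_genetic_code(text):
    """Map each triplet to its one-letter amino acid.

    Expects one line per amino acid: <one-letter code><TAB><comma-separated triplets>.
    """
    codon_to_aa = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        amino_acid, triplets = line.split("\t")
        for codon in triplets.split(","):
            codon = codon.strip().upper()
            # Genomic DNA is expected: only A, C, G and T, no U.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {codon!r}")
            if codon in codon_to_aa:
                raise ValueError(f"line {number}: triplet {codon} listed twice")
            codon_to_aa[codon] = amino_acid.strip()
    return codon_to_aa


# Hypothetical excerpt in the described format (not a complete code table).
excerpt = "M\tATG\nW\tTGG\nF\tTTT,TTC\n"
table = parse_genetic_code(excerpt)
print(table)
```

A check like this can catch common mistakes, such as triplets written with U instead of T, before the file is passed to the Extractor or GeMoMa.<br />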
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort <br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1057GeMoMa2019-11-05T06:43:56Z<p>Keilwagen: /* FAQs */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
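<br />
As a minimal sketch of a pipeline call for one pre-extracted reference species with mapped RNA-seq reads, the following shell snippet assembles the command from the parameters listed above and prints it for inspection. All file names, the ID <code>ref1</code>, the thread count and the output directory are hypothetical placeholders; <code>AnnotationFinalizer.r=NO</code> is chosen here only so that no name prefix has to be supplied.<br />

```shell
# Hypothetical paths and values; replace them with your own data.
GEMOMA_JAR="GeMoMa-1.9.jar"

# One reference species given as pre-extracted CDS parts plus assignment file,
# RNA-seq evidence given as one mapped BAM file.
PIPELINE_CMD="java -jar $GEMOMA_JAR CLI GeMoMaPipeline \
threads=4 outdir=results \
t=target.fa \
s=pre-extracted i=ref1 c=cds-parts.fasta a=assignment.tabular \
r=MAPPED ERE.m=mapped_reads.bam \
AnnotationFinalizer.r=NO"

# Dry run: only print the command so it can be checked before execution.
echo "$PIPELINE_CMD"
```

Once the placeholders point to real files, the printed line can be executed as is, provided the search tool selected by <code>tblastn</code> is installed.<br />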
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair, or the only read in case of single-end data, is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
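<br />
Because <code>m</code> can be given multiple times, several BAM files can be passed in one ERE call. The sketch below, which assumes two hypothetical files <code>sample1.bam</code> and <code>sample2.bam</code> from an FR_FIRST_STRAND library and a hypothetical output directory, builds such a call and prints it without running it.<br />

```shell
# Hypothetical jar location and input files; adjust to your setup.
GEMOMA_JAR="GeMoMa-1.6.1.jar"

# Collect one m=<file> argument per mapped reads file.
MAPPED=""
for bam in sample1.bam sample2.bam; do
  MAPPED="$MAPPED m=$bam"
done

# c=true additionally requests the coverage output.
ERE_CMD="java -jar $GEMOMA_JAR CLI ERE s=FR_FIRST_STRAND$MAPPED c=true outdir=ere_out"

# Dry run: print the assembled command.
echo "$ERE_CMD"
```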
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
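<br />
A CheckIntrons call needs the target genome and at least one intron file. The following sketch uses hypothetical file names (<code>target.fa</code>, <code>introns.gff</code>) and a hypothetical output directory, and only prints the resulting command.<br />

```shell
# Hypothetical inputs: target genome (FASTA) and an intron GFF, e.g. obtained from RNA-seq.
GEMOMA_JAR="GeMoMa-1.6.1.jar"
TARGET="target.fa"
INTRONS="introns.gff"

CHECK_CMD="java -jar $GEMOMA_JAR CLI CheckIntrons t=$TARGET i=$INTRONS outdir=check_out"

# Dry run: print the command instead of executing it.
echo "$CHECK_CMD"
```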
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
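<br />
For illustration, an Extractor call on a reference annotation and genome could look like the sketch below. The file names and output directory are hypothetical; <code>p=true</code> is set only to show how an optional flag is switched on, here to also write the complete protein sequences.<br />

```shell
# Hypothetical reference files; replace with the annotation and genome of your reference species.
GEMOMA_JAR="GeMoMa-1.6.1.jar"
ANNOTATION="reference.gff"
GENOME="reference.fa"

EXTRACTOR_CMD="java -jar $GEMOMA_JAR CLI Extractor a=$ANNOTATION g=$GENOME p=true outdir=extractor_out"

# Dry run: print the command so the parameters can be verified first.
echo "$EXTRACTOR_CMD"
```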
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except as the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to retain only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
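<br />
To make the logic of the default filter expression more tangible, here is a minimal Python sketch that re-implements <code>start=='M' and stop=='*' and score/AA>=0.75</code> on a plain dictionary of GFF attributes. It is only an illustration and not GAF's implementation; the function name, the example values, and the reading of <code>AA</code> as the number of amino acids of the prediction are assumptions.<br />

```python
def passes_default_filter(attrs):
    """Sketch of: start=='M' and stop=='*' and score/AA>=0.75.

    attrs is a dict of GFF attributes (hypothetical helper, not part of GeMoMa).
    AA is assumed to hold the number of amino acids of the prediction.
    """
    return (
        attrs.get("start") == "M"          # prediction begins with methionine
        and attrs.get("stop") == "*"       # prediction ends with a stop codon
        and float(attrs["score"]) / float(attrs["AA"]) >= 0.75  # relative score
    )


# Hypothetical attribute sets for three predictions
complete = {"start": "M", "stop": "*", "score": "450", "AA": "500"}
partial = {"start": "M", "stop": "L", "score": "450", "AA": "500"}
weak = {"start": "M", "stop": "*", "score": "300", "AA": "500"}

for name, attrs in [("complete", complete), ("partial", partial), ("weak", weak)]:
    print(name, passes_default_filter(attrs))
```

Only the first prediction is kept: the second lacks a stop codon and the third has a relative score of 0.6.<br />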
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
For providing RNA-seq evidence (e.g. tie) for existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
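<br />
If you want to inspect these attributes by hand, the following Python sketch shows one way to split the ninth GFF column into key/value pairs. It assumes the usual GFF3 convention of semicolon-separated <code>key=value</code> fields; the example line, its IDs and its values are made up for illustration.<br />

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value;...) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


# Hypothetical prediction line; the third column uses the default tag "prediction"
line = "chr1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=t1_R0;score=450;tie=1.0;evidence=2"

columns = line.split("\t")
attrs = parse_attributes(columns[8])
print(attrs["score"], attrs["tie"], attrs["evidence"])
```

All values are returned as strings, so convert them (e.g. with <code>float</code>) before comparing or ranking predictions.<br />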
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (GFF) and genome file (FASTA) are identical. Ideally, the FASTA comments should only contain the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
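: The first check can be sketched in a few lines of Python. This is a minimal illustration, not part of GeMoMa: the helper names and the example data are hypothetical, and it assumes that the sequence name is the first whitespace-delimited token of each FASTA comment line.<br />

```python
def fasta_names(fasta_lines):
    """Collect the first token of every FASTA header line."""
    names = set()
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gff_names(gff_lines):
    """Collect the first column of every non-comment, non-empty GFF line."""
    names = set()
    for line in gff_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t")[0])
    return names


# Hypothetical inputs; in practice, iterate over the opened files instead
fasta = [">chr1 some description", "ACGT", ">chr2", "GGCC"]
gff = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t4\t.\t+\t0\tParent=t1",
    "Chr3\tsrc\tCDS\t1\t4\t.\t+\t0\tParent=t2",
]

# Names used in the annotation that have no sequence in the genome file
missing = sorted(gff_names(gff) - fasta_names(fasta))
print(missing)  # ['Chr3']
```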
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Decreasing ct increases the number of predictions that are (internally) computed, while increasing p lets GeMoMa output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. too low relative score, missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not rely strictly on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
: Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
; I need to specify the genetic code for my organisms. What is the expected format?<br />
: The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triples. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:<br />
: https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt<br />
: The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.<br />
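
The genetic code format described above can be checked before handing a file to GeMoMa. The following is only a minimal sketch in Python, not part of GeMoMa: the function names <code>parse_genetic_code</code> and <code>codon_lookup</code> are made up for illustration, and the two-line sample is an illustrative excerpt, not the complete default table. It assumes exactly two tab-separated columns per line and rejects any triplet that is not built from A, C, G, and T.

```python
from typing import Dict, List

VALID_BASES = set("ACGT")  # genomic DNA alphabet; U is rejected on purpose


def parse_genetic_code(text: str) -> Dict[str, List[str]]:
    """Parse a two-column, tab-delimited genetic code table.

    Column 1: one-letter amino acid code.
    Column 2: comma-separated list of nucleotide triplets.
    """
    table: Dict[str, List[str]] = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_number}: expected 2 tab-separated columns")
        amino_acid = columns[0].strip()
        codons = [codon.strip() for codon in columns[1].split(",")]
        for codon in codons:
            if len(codon) != 3 or not set(codon) <= VALID_BASES:
                raise ValueError(f"line {line_number}: invalid codon {codon!r}")
        table[amino_acid] = codons
    return table


def codon_lookup(table: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the table (codon -> amino acid) and detect duplicate codons."""
    lookup: Dict[str, str] = {}
    for amino_acid, codons in table.items():
        for codon in codons:
            if codon in lookup:
                raise ValueError(f"codon {codon} is assigned twice")
            lookup[codon] = amino_acid
    return lookup


# Illustrative excerpt only (hypothetical sample, not the full default code).
sample = "M\tATG\nK\tAAA,AAG\n"
print(codon_lookup(parse_genetic_code(sample)))
```

A check like this catches the most common mistake mentioned above, i.e., codons written with U instead of T, before a long run is started.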
<br />
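As mentioned in the FAQ on overlapping predictions, filtering can also be done by hand using the GFF attributes. The following Python sketch re-implements only the default GAF criterion documented for the parameter filter (start=='M' and stop=='*' and score/AA>=0.75). It is not the GAF implementation: the function names are hypothetical, the attribute strings in the usage example are made up, and it assumes that transcript predictions carry the default tag "prediction" in the third GFF column and the attributes score, AA, start, and stop in the ninth column.

```python
from typing import Dict, Iterable, List


def parse_gff_attributes(column9: str) -> Dict[str, str]:
    """Split the ninth GFF column ("key=value;key=value") into a dictionary."""
    attributes: Dict[str, str] = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key.strip()] = value.strip()
    return attributes


def passes_default_filter(attributes: Dict[str, str],
                          min_relative_score: float = 0.75) -> bool:
    """Check start=='M' and stop=='*' and score/AA>=min_relative_score."""
    try:
        score = float(attributes["score"])
        amino_acids = float(attributes["AA"])
    except (KeyError, ValueError):
        return False  # attributes missing or not numeric
    if amino_acids <= 0:
        return False
    return (attributes.get("start") == "M"
            and attributes.get("stop") == "*"
            and score / amino_acids >= min_relative_score)


def filter_predictions(gff_lines: Iterable[str], tag: str = "prediction") -> List[str]:
    """Return only the transcript-level lines that pass the default filter."""
    kept: List[str] = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != tag:
            continue  # not a transcript prediction line
        if passes_default_filter(parse_gff_attributes(columns[8])):
            kept.append(line)
    return kept


# Made-up example lines for illustration.
example = [
    "chr1\tGeMoMa\tprediction\t1\t600\t.\t+\t.\tID=t1;AA=200;score=180;start=M;stop=*",
    "chr1\tGeMoMa\tprediction\t1\t600\t.\t+\t.\tID=t2;AA=200;score=100;start=M;stop=*",
]
print(len(filter_predictions(example)))
```

Unlike GAF, this sketch neither clusters overlapping predictions nor ranks them; it only shows how the attribute-based criterion can be evaluated per prediction.
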
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1039GeMoMa2019-06-05T12:34:30Z<p>Keilwagen: /* Version history */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau.<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available in a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
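
Whether the installed Java meets this requirement can be checked before starting a long run. The following Python sketch is an illustration only: the helper names <code>java_feature_version</code> and <code>is_supported</code> are hypothetical, and it assumes the usual version banner in which the version string is quoted, e.g. <code>java version "1.8.0_292"</code> for the old 1.x scheme or <code>openjdk version "17.0.2"</code> for newer releases.

```python
import re
from typing import Optional


def java_feature_version(version_output: str) -> Optional[int]:
    """Extract the feature version (8, 11, 17, ...) from a Java version banner.

    Handles both the legacy scheme ("1.8.0_292" -> 8) and the current
    scheme ("17.0.2" -> 17). Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))  # legacy 1.x numbering
    return major


def is_supported(version_output: str, minimum: int = 8) -> bool:
    """True if the banner reports at least the required Java version."""
    version = java_feature_version(version_output)
    return version is not None and version >= minimum


print(is_supported('java version "1.8.0_292"'))
```

The banner itself can be obtained with <code>subprocess.run(["java", "-version"], capture_output=True, text=True).stderr</code>, since <code>java -version</code> writes to standard error.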
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks on a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq, you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except as the last amino acid), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
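
As a minimal sketch of how the pipeline parameters above fit together, the following dry run assembles a call and prints it instead of executing it. All file and directory names are placeholders. The reference parameters <code>i</code>, <code>a</code> and <code>g</code> that follow <code>s=own</code> belong to a part of the table not repeated here and are an assumption in this example. <code>AnnotationFinalizer.r=SIMPLE</code> is chosen to show that a selection brings its own parameters (here the prefix <code>AnnotationFinalizer.p</code>).

```shell
# Dry run: assemble the GeMoMaPipeline call as positional parameters and print it.
# Assumptions: all file names are placeholders; i=, a= and g= are the
# reference parameters expected after s=own (not listed in this excerpt).
set -- java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline \
  threads=8 outdir=pipeline_out \
  t=target_genome.fasta \
  s=own i=ref1 a=ref1_annotation.gff g=ref1_genome.fasta \
  AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE \
  p=true pc=true o=true d=false
pipeline_cmd="$*"
echo "$pipeline_cmd"
# To execute the call instead of printing it, run: "$@"
```

The external search tool (tblastn or mmseqs) has to be installed for a real run.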
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
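
A possible ERE call for stranded paired-end data with two BAM files might look as follows. This is a sketch only: the BAM file names and the output directory are placeholders, and the call is printed rather than executed. It illustrates that <code>m</code> may be given several times.

```shell
# Dry run: assemble the ERE call and print it.
# sample1.bam, sample2.bam and ere_out are placeholder names.
set -- java -jar GeMoMa-1.6.1.jar CLI ERE \
  s=FR_FIRST_STRAND \
  m=sample1.bam m=sample2.bam \
  mmq=40 c=true \
  outdir=ere_out
ere_cmd="$*"
echo "$ere_cmd"
# To execute the call instead of printing it, run: "$@"
```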
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
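
For illustration, a CheckIntrons call with two intron files could be assembled like this. The file names are placeholders, and the call is only printed.

```shell
# Dry run: assemble the CheckIntrons call and print it.
# target_genome.fasta and the intron GFF names are placeholders.
set -- java -jar GeMoMa-1.6.1.jar CLI CheckIntrons \
  t=target_genome.fasta \
  i=introns_run1.gff i=introns_run2.gff \
  outdir=checkintrons_out
checkintrons_cmd="$*"
echo "$checkintrons_cmd"
# To execute the call instead of printing it, run: "$@"
```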
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
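
The following sketch shows an Extractor call that additionally returns the complete proteins and CDSs and translates ambiguous codons to X. The reference file names and the output directory are placeholders, and the call is printed rather than executed.

```shell
# Dry run: assemble the Extractor call and print it.
# reference_annotation.gff, reference_genome.fasta and extractor_out are placeholders.
set -- java -jar GeMoMa-1.6.1.jar CLI Extractor \
  a=reference_annotation.gff \
  g=reference_genome.fasta \
  p=true c=true \
  Ambiguity=AMBIGUOUS \
  outdir=extractor_out
extractor_cmd="$*"
echo "$extractor_cmd"
# To execute the call instead of printing it, run: "$@"
```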
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which allows making predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
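
A hedged sketch of a GeMoMa call that combines search results with RNA-seq evidence is given below. It assumes that the module is addressed as <code>GeMoMa</code> after <code>CLI</code>, in analogy to the other modules on this page, and that the jar version matches the other examples. All file names are placeholders for the search output, the Extractor output and the ERE output. The call is printed rather than executed.

```shell
# Dry run: assemble the GeMoMa call and print it.
# Assumptions: the module name after CLI is "GeMoMa"; all file names are placeholders.
set -- java -jar GeMoMa-1.6.1.jar CLI GeMoMa \
  s=search_results.tabular \
  t=target_genome.fasta \
  c=cds-parts.fasta \
  a=assignment.tabular \
  i=introns.gff r=2 \
  coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph \
  m=20000 \
  outdir=gemoma_out
gemoma_cmd="$*"
echo "$gemoma_cmd"
# To execute the call instead of printing it, run: "$@"
```

Here <code>r=2</code> keeps only introns supported by at least two split reads, and <code>m=20000</code> raises the maximum intron length above its default of 15000.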
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
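
Because the filter contains spaces, quotes, parentheses and <code>&gt;</code>, it has to reach GAF as one shell argument. The sketch below passes the more sophisticated filter from the table together with two prediction files. The file names, prefixes and weights are placeholders, and the order of <code>p</code>, <code>w</code> and <code>g</code> within each group is an assumption. The arguments are printed one per line instead of running the tool.

```shell
# Dry run: assemble the GAF call and print one argument per line.
# Double quotes keep the whole filter expression in a single argument.
filter="start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"
# Assumptions: file names, prefixes and weights are placeholders;
# p= and w= are written before the g= they are meant to describe.
set -- java -jar GeMoMa-1.6.1.jar CLI GAF \
  p=ref1_ w=1.0 g=ref1_predictions.gff \
  p=ref2_ w=0.5 g=ref2_predictions.gff \
  "f=$filter" \
  outdir=gaf_out
gaf_argc=$#
printf '%s\n' "$@"
echo "$gaf_argc arguments"
# To execute the call instead of printing it, run: "$@"
```

Without the quotes the shell would split the filter at each space and interpret <code>&gt;</code> as an output redirection.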
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool predicts UTRs and renames predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
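<br />
The parameter table above can be turned into a command line mechanically. The following is a minimal Python sketch that only assembles the argument list; it does not run GeMoMa. The function name <code>annotation_finalizer_command</code> and all file names are hypothetical, and the order in which the nested selection parameters (<code>u</code>, <code>i</code>, <code>c</code>, <code>coverage_unstranded</code>) are listed is an assumption based on the <code>&lt;parameter&gt;=&lt;value&gt;</code> syntax shown above.<br />

```python
from typing import List, Optional, Sequence


def annotation_finalizer_command(
    jar: str,
    genome: str,
    annotation: str,
    prefix: str,
    introns: Sequence[str] = (),
    coverage_unstranded: Optional[str] = None,
    outdir: str = ".",
) -> List[str]:
    """Assemble (but do not run) an AnnotationFinalizer call.

    Parameter keys (g, a, u, i, c, coverage_unstranded, rename, p, outdir)
    are taken from the table above; everything else is illustrative.
    """
    cmd = ["java", "-jar", jar, "CLI", "AnnotationFinalizer",
           f"g={genome}", f"a={annotation}"]
    # UTR prediction needs RNA-seq evidence, so u=YES is only added
    # if at least one introns or coverage file is supplied.
    if introns or coverage_unstranded:
        cmd.append("u=YES")
        # "i" may be given multiple times, once per introns file.
        cmd += [f"i={path}" for path in introns]
        if coverage_unstranded:
            cmd += ["c=UNSTRANDED",
                    f"coverage_unstranded={coverage_unstranded}"]
    # rename=COMPOSED is the default; the prefix has no default value.
    cmd += ["rename=COMPOSED", f"p={prefix}", f"outdir={outdir}"]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(annotation_finalizer_command(
        "GeMoMa-1.6.1.jar", "target.fa", "filtered_predictions.gff",
        prefix="XY", introns=["introns.gff"])))
```

Building the list first makes it easy to inspect the call before handing it to a job scheduler or to <code>subprocess.run</code>.<br />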
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
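<br />
The attributes in the table above are stored as <code>key=value</code> pairs in the ninth GFF column, so they can be read with a few lines of code. The following Python sketch is an illustration only: the function names, the example line, and the threshold are hypothetical. It treats <code>NA</code> (reported for tae, tde, tie and minSplitReads of single coding exon genes) as "no information" rather than as a failure.<br />

```python
from typing import Dict, Optional


def parse_attributes(column9: str) -> Dict[str, str]:
    """Split the ninth GFF column into a key -> value dictionary."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def numeric_attribute(attrs: Dict[str, str], name: str) -> Optional[float]:
    """Return the attribute as float, or None if it is missing, NA or non-numeric."""
    value = attrs.get(name)
    if value is None or value == "NA":
        return None
    try:
        return float(value)
    except ValueError:
        return None


def passes_tie_filter(gff_line: str, min_tie: float) -> bool:
    """Keep a prediction if tie is unavailable or at least min_tie."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return False  # not a complete GFF feature line
    tie = numeric_attribute(parse_attributes(columns[8]), "tie")
    return tie is None or tie >= min_tie


if __name__ == "__main__":
    # Hypothetical prediction line, for illustration only.
    line = "contig1\tGeMoMa\tprediction\t100\t900\t.\t+\t.\tID=gene1_R0;tie=1;minSplitReads=7"
    print(passes_tie_filter(line, min_tie=1.0))
```

Whether a missing tie should count as passing depends on the analysis; the choice here merely avoids discarding single coding exon genes.<br />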
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name followed by additional information separated by a space are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are computed internally. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions, because it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., too low relative score, missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it lacks an initial start codon, a final stop codon, or both. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially allocated RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
<br />
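The first check suggested in the FAQs, i.e., whether the contig/chromosome names in the annotation also occur in the genome file, can be automated. The following Python sketch is an illustration under simple assumptions: the function names are hypothetical, the FASTA name is taken as the first whitespace-separated word of the comment line, and both files are small enough to be read line by line without special handling of compressed input.<br />

```python
from typing import Iterable, List, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names: first word after '>' of each comment line."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the first column of all non-empty, non-comment GFF/GTF lines."""
    seqids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0].strip())
    return seqids


def missing_in_fasta(gff_lines: Iterable[str],
                     fasta_lines: Iterable[str]) -> List[str]:
    """Names used in the annotation that do not occur in the genome file."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    # Hypothetical in-memory data; with real files, pass open file handles.
    gff = ["##gff-version 3",
           "chr1\t.\tCDS\t1\t9\t.\t+\t0\tParent=t1",
           "Chr2\t.\tCDS\t1\t9\t.\t+\t0\tParent=t2"]
    fasta = [">chr1 some description", "ACGTACGTA", ">chr2", "ACGTACGTA"]
    print(missing_in_fasta(gff, fasta))
```

A non-empty result points to naming differences (for instance, upper versus lower case) that would prevent the Extractor from finding the corresponding sequences.<br />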
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts: <br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set <br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1038GeMoMa2019-06-05T12:33:25Z<p>Keilwagen: /* Version history */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau<br />
[https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web-server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web-server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that the prediction should overlap, if columns 2 to 5 contain chromosome, strand, start and end of the region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag that avoids stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows generating generic gene and transcript names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
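As a rough illustration of how the parameters in the table above combine, the following sketch assembles a pipeline call for one reference species of type "own" with mapped RNA-seq reads. All file names, the thread count and the jar path are placeholders, and the `run` helper with its `GEMOMA_JAR` and `DRY_RUN` variables is a hypothetical convenience that is not part of GeMoMa. By default the script only prints the command.

```shell
#!/bin/sh
# Dry-run sketch of a GeMoMaPipeline call; every file name is a placeholder.
JAR="${GEMOMA_JAR:-GeMoMa.jar}"   # placeholder path to the GeMoMa jar

# Hypothetical helper: print the command unless DRY_RUN=0, then execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One reference species (s=own with i, a, g), mapped reads as RNA-seq
# evidence (r=MAPPED with ERE.m), and mmseqs instead of tblastn.
run java -jar "$JAR" CLI GeMoMaPipeline \
  threads=4 outdir=pipeline_out \
  t=target.fa \
  s=own i=ref1 a=ref1.gff g=ref1.fa \
  r=MAPPED ERE.m=rnaseq.bam \
  tblastn=false \
  AnnotationFinalizer.r=NO
```

Setting `DRY_RUN=0` executes the call instead of printing it. With `tblastn=false`, mmseqs has to be installed, as noted in the table.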
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
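The parameter m may be given several times, once per BAM/SAM file. The sketch below shows one way to expand a list of files into repeated `m=` arguments. The function name `ere_command` and all file names are hypothetical, and the function only prints the resulting call.

```shell
#!/bin/sh
# Sketch only: builds and prints an ERE call, it does not run ERE.
# $1 = GeMoMa jar, remaining arguments = mapped read files (BAM/SAM).
# Assumes file names without whitespace.
ere_command() {
  jar=$1
  shift
  cmd="java -jar $jar CLI ERE s=FR_FIRST_STRAND c=true"
  for bam in "$@"; do
    cmd="$cmd m=$bam"        # repeat m once per input file
  done
  printf '%s\n' "$cmd outdir=ere_out"
}

# sample1.bam and sample2.bam are placeholder file names.
ere_command GeMoMa-1.6.1.jar sample1.bam sample2.bam
```

Here `s=FR_FIRST_STRAND` and `c=true` are example choices. Omitting them falls back to the defaults listed in the table.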
<br />
=== CheckIntrons ===<br />
<br />
This tool checks whether the extracted introns show the expected dinucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
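A minimal sketch of a CheckIntrons call with the parameters listed above. The function `check_introns_cmd` and the file names are placeholders chosen for illustration, and the function prints the command rather than executing it.

```shell
#!/bin/sh
# Sketch only: prints a CheckIntrons call instead of running it.
# $1 = GeMoMa jar, $2 = target genome (FASTA), $3 = introns (GFF)
check_introns_cmd() {
  printf '%s\n' "java -jar $1 CLI CheckIntrons t=$2 i=$3 v=true outdir=checkintrons_out"
}

# target.fa and introns.gff are placeholder file names.
check_introns_cmd GeMoMa-1.6.1.jar target.fa introns.gff
```

The introns file would typically be one produced from RNA-seq data, and `v=true` requests the additional verbose output.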
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file that restricts predictions to the transcript IDs it contains. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
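The following sketch shows an Extractor call that additionally requests protein and CDS output and enables the repair option. The function `extractor_cmd` and the file names are assumptions made for illustration, and the command is printed, not executed.

```shell
#!/bin/sh
# Sketch only: prints an Extractor call instead of running it.
# $1 = GeMoMa jar, $2 = reference annotation (GFF or GTF),
# $3 = reference genome (FASTA), $4 = output directory
extractor_cmd() {
  printf '%s\n' "java -jar $1 CLI Extractor a=$2 g=$3 p=true c=true r=true outdir=$4"
}

# ref.gff, ref.fa and extractor_out are placeholder names.
extractor_cmd GeMoMa-1.6.1.jar ref.gff ref.fa extractor_out
```

All parameters not given here, such as f and Ambiguity, keep the defaults from the table.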
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file that restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. If columns 2 to 5 contain chromosome, strand, start and end of a region, they define a target region that the prediction should overlap, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag to avoid stop codons within a transcript (except at the last amino acid position), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned GFF. It might be beneficial to set this to a specific value for some genome browsers, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript, valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
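The list file expected by the parameter <code>selected</code> can be prepared with a few lines of Python. The following is only a minimal sketch: it assumes that the columns are tab-separated (the description above only speaks of "columns"), and the transcript IDs and coordinates are hypothetical placeholders.<br />

```python
import csv
import io


def write_selected(rows, handle):
    """Write rows of a 'selected' list file, one transcript per line.

    Each row is either (transcript_id,) or
    (transcript_id, chromosome, strand, start, end).
    """
    # Assumption: tab-separated columns; the parameter description
    # does not name the delimiter.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        if len(row) not in (1, 5):
            raise ValueError(f"expected 1 or 5 columns, got {len(row)}: {row!r}")
        writer.writerow(row)


# Hypothetical transcript IDs and coordinates, for illustration only.
buffer = io.StringIO()
write_selected(
    [
        ("transcriptA",),                          # ID only, no target region
        ("transcriptB", "chr1", "+", 1000, 2500),  # ID plus target region
    ],
    buffer,
)
selected_text = buffer.getvalue()
print(selected_text, end="")
```

Writing to a real file instead of the in-memory buffer only requires passing an opened file handle to <code>write_selected</code>.<br />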
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter (GAF) combines and reduces predictions from GeMoMa into a single final prediction. It can handle predictions from different reference species as well as overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter applied to single-exon transcripts or when no RNA-seq data is used; it decides whether transcripts overlapping other transcripts are used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could, for instance, use the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
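To illustrate what the default filter expression of GAF checks, the following Python sketch evaluates the same conditions on a dictionary of GFF attributes. It is not GAF's implementation, only a re-statement of the filter text <code>start=='M' and stop=='*' and score/AA>=0.75</code> and of the extended example with <code>(evidence>1 or tpc==1.0)</code>; the attribute values are hypothetical.<br />

```python
def passes_default_filter(attrs, min_relative_score=0.75):
    """Mimic: start=='M' and stop=='*' and score/AA>=min_relative_score."""
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    aa = float(attrs.get("AA", 0))
    if not complete or aa <= 0:
        return False
    return float(attrs["score"]) / aa >= min_relative_score


def passes_extended_filter(attrs):
    """Default filter plus the clause (evidence>1 or tpc==1.0)."""
    supported = (
        int(attrs.get("evidence", 0)) > 1
        or float(attrs.get("tpc", 0.0)) == 1.0
    )
    return passes_default_filter(attrs) and supported


# Hypothetical attribute values, for illustration only.
complete_good = {"start": "M", "stop": "*", "score": "540", "AA": "600",
                 "evidence": "2", "tpc": "0.8"}
partial = dict(complete_good, start="L")        # no start codon
low_score = dict(complete_good, score="300")    # relative score 0.5
single_ref = dict(complete_good, evidence="1")  # one reference, tpc < 1.0

for name, attrs in [("complete_good", complete_good), ("partial", partial),
                    ("low_score", low_score), ("single_ref", single_ref)]:
    print(name, passes_default_filter(attrs), passes_extended_filter(attrs))
```

The last case shows why the extended filter is stricter: the prediction is complete and scores well, but it is supported by only one reference and not fully covered by RNA-seq.<br />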
<br />
=== CompareTranscripts ===<br />
<br />
To compare gene models from GeMoMa predictions with an existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool allows to predict UTRs and to rename predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned GFF. It might be beneficial to set this to a specific value for some genome browsers, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
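Because some AnnotationFinalizer parameters only apply after a selection (e.g. <code>i</code> and <code>c</code> after <code>u=YES</code>, or <code>p</code> and <code>d</code> after <code>rename=SIMPLE</code>), it can help to assemble the call programmatically. The following Python sketch only builds and prints the command line; it assumes that dependent parameters are passed as additional <code>key=value</code> pairs following their selection, and all file names are hypothetical. Please check the help output of the tool before relying on it.<br />

```python
import shlex


def annotation_finalizer_args(jar, genome, annotation, prefix,
                              introns=(), unstranded_coverage=None,
                              outdir="."):
    """Build the argument list for an AnnotationFinalizer call."""
    args = ["java", "-jar", jar, "CLI", "AnnotationFinalizer",
            f"g={genome}", f"a={annotation}"]
    if introns or unstranded_coverage:
        # Assumption: parameters of selection "YES" follow u=YES.
        args.append("u=YES")
        args.extend(f"i={path}" for path in introns)  # i may be repeated
        if unstranded_coverage:
            args += ["c=UNSTRANDED",
                     f"coverage_unstranded={unstranded_coverage}"]
    # rename=SIMPLE needs only the prefix (digits keeps its default).
    args += ["rename=SIMPLE", f"p={prefix}", f"outdir={outdir}"]
    return args


# Hypothetical file names, for illustration only.
command = annotation_finalizer_args(
    jar="GeMoMa-1.6.1.jar",
    genome="target.fa",
    annotation="filtered_predictions.gff",
    prefix="GENE",
    introns=["introns.gff"],
    unstranded_coverage="coverage.bedgraph",
    outdir="results",
)
print(shlex.join(command))
```

The returned list can be handed to <code>subprocess.run</code> unchanged, which avoids quoting problems with paths that contain spaces.<br />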
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
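The attributes listed above are stored in the ninth column of the GFF as <code>key=value</code> pairs separated by semicolons. The following Python sketch shows one way to read them and to compute the relative score (score/AA) used by the GAF filter. The GFF line is a made-up example, and the parser is a simplification that does not cover every corner of the GFF3 format.<br />

```python
from urllib.parse import unquote


def parse_attributes(column9):
    """Parse the GFF3 attribute column into a dict of strings."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


# Made-up prediction line, for illustration only.
gff_line = (
    "chr1\tGAF\tprediction\t1000\t2500\t.\t+\t.\t"
    "ID=example_R0;score=540;AA=600;tie=1;tpc=0.8;evidence=2"
)
columns = gff_line.split("\t")
attrs = parse_attributes(columns[8])
relative_score = float(attrs["score"]) / float(attrs["AA"])
print(attrs["ID"], relative_score, attrs["tie"], attrs["evidence"])
```

All values are kept as strings, so numeric attributes such as score, tie or tpc need an explicit conversion before they are compared or sorted.<br />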
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses it to decide whether a contig is promising and should be used to determine the final predictions. Decreasing ct increases the number of predictions that are (internally) computed, while increasing p allows GeMoMa to output more of the computed predictions. Hence, you may decrease ct and increase p to obtain predictions on more contigs. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g., due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions of different reference transcripts may overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons why the Extractor removes transcripts. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we have sometimes received GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would otherwise be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns, adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that is divisible by 3, which preserves the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
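: The reading-frame condition mentioned above is simple arithmetic. A sketch in Python, assuming 1-based inclusive coordinates as in GFF; the function name is hypothetical.<br />

```python
def preserves_reading_frame(intron_start, intron_end):
    """Return True if the intron length is a multiple of 3.

    Coordinates are 1-based and inclusive, as in GFF. Only such introns
    can be gained or lost without shifting the downstream reading frame.
    """
    length = intron_end - intron_start + 1
    return length % 3 == 0


print(preserves_reading_frame(1001, 1090))  # length 90
print(preserves_reading_frame(1001, 1091))  # length 91
```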
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
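: The effect of such a threshold filter can be sketched in Python as follows. The tuple layout and the function name are assumptions for illustration and do not reflect GeMoMa's internal data structures.<br />

```python
def usable_introns(introns, min_reads=1):
    """Keep every intron with at least `min_reads` supporting split reads.

    `introns` is a list of (start, end, split_reads) tuples. The read count
    only decides whether an intron passes; it is dropped afterwards, so the
    surviving introns cannot be distinguished by abundance.
    """
    return [(start, end) for start, end, reads in introns if reads >= min_reads]


# Two overlapping introns with very different support (made-up numbers).
candidates = [(500, 800, 120), (500, 830, 3)]
print(usable_introns(candidates, min_reads=1))
print(usable_introns(candidates, min_reads=5))
```

: With a threshold of 1, both introns are kept and treated equally; raising the threshold would remove the weakly supported intron before the prediction starts.<br />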
; Does GeMoMa predict multiple transcripts per gene?<br />
: In principle, GeMoMa can predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you provide a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but not all statistics. Another cause could be the protein alignment, if you use the optional parameter query proteins. Again, you can run GeMoMa without this parameter, which will return the same predictions but fewer statistics.<br />
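: As a sketch, the following Python snippet assembles such a call and prints it. The helper name gemoma_command and the file name target.fa are hypothetical, and a real pipeline run needs further parameters. The point is only that the VM options are placed before -jar.<br />

```python
import shlex


def gemoma_command(jar, module, params, xms="5G", xmx="50G"):
    """Build a GeMoMa call with explicit JVM heap settings.

    JVM options such as -Xms/-Xmx must come before -jar; everything after
    the jar file is passed to GeMoMa itself.
    """
    return ["java", f"-Xms{xms}", f"-Xmx{xmx}", "-jar", jar, "CLI", module, *params]


# target.fa is a placeholder file name.
command = gemoma_command("GeMoMa-1.6.1.jar", "GeMoMaPipeline", ["t=target.fa"])
print(shlex.join(command))
```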
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts:<br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set<br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagenhttps://www.jstacs.de/index.php?title=GeMoMa&diff=1037GeMoMa2019-06-05T12:32:56Z<p>Keilwagen: /* Version history */</p>
<hr />
<div>'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.<br />
<br />
[[File:GeMoMa-schema.png|thumb|right|350px|Schema of GeMoMa algorithm]]<br />
<br />
== Paper ==<br />
If you use GeMoMa, please cite<br />
<br />
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092<br />
<br />
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5<br />
<br />
== Download ==<br />
<br />
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for <br />
<ul><br />
<li>creating the XML file needed for the Galaxy integration</li><br />
<li>running the command line interface (CLI) version.</li><br />
</ul><br />
<br />
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps for the analysis.<br />
<br />
== Galaxy ==<br />
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.<br />
<br />
[[File:GeMoMa-Workflow.png|thumb|center|700px|GeMoMa workflow adapted from Galaxy]]<br />
<br />
== Running the command line application ==<br />
<br />
For running the command line application, Java v1.8 or later is required.<br />
<br />
=== GeMoMaPipeline ===<br />
<br />
If you would like to run the complete GeMoMa pipeline on a server as a single job, you can use the module GeMoMaPipeline, which exploits the full compute power of the server via multi-threading. However, GeMoMaPipeline does not distribute tasks across a compute cluster.<br />
You can run GeMoMaPipeline from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GeMoMaPipeline [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
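A call with one reference species could be assembled as in the following Python sketch. The parameter names are taken from the parameter table of this module, but the file names, the helper format_parameters and the order of the nested reference parameters are assumptions; please check the output of the tool itself before relying on them.<br />

```python
def format_parameters(pairs):
    """Turn (name, value) pairs into 'name=value' command-line arguments.

    A list of pairs is used instead of a dict because some parameters
    (for example the reference species block) can be given multiple times.
    """
    return [f"{name}={value}" for name, value in pairs]


# All file names below are placeholders.
arguments = format_parameters([
    ("t", "target.fa"),            # target genome (FASTA)
    ("s", "own"),                  # reference given as annotation + genome
    ("i", "ref1"),                 # optional ID of the reference species
    ("a", "ref1_annotation.gff"),  # reference annotation (GFF or GTF)
    ("g", "ref1_genome.fa"),       # reference genome (FASTA)
    ("Extractor.r", "true"),       # try to repair broken CDS phases
])
print(" ".join(["java", "-jar", "GeMoMa-1.6.1.jar", "CLI", "GeMoMaPipeline", *arguments]))
```

A second reference species would be added by appending another block of s, i, a and g pairs to the list.<br />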
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (Target genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>species (data for reference species, range={own, pre-extracted}, default = own)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;own&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;pre-extracted&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>ID (ID to distinguish the different reference species, default = , OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;MAPPED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.u</font></td><br />
<td>use secondary alignments (allows to filter flags in the SAM or BAM, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ERE.mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;EXTRACTED&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">introns</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tblastn</font></td><br />
<td>tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.a</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.s</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Extractor.f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.s</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.g</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.i</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.c</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.a</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AS), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.t</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GeMoMa.Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">GAF.f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;YES&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.r</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.i</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">AnnotationFinalizer.d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pc</font></td><br />
<td>predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">pgr</font></td><br />
<td>predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">o</font></td><br />
<td>output individual predictions (If *true*, returns the predictions for each reference species, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">threads</font></td><br />
<td>The number of threads used for the tool, defaults to 1</td><br />
<td>INT</td><br />
</tr><br />
</table><br />
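
As a minimal sketch of how the module-prefixed parameters above combine on one command line, the following dry run only prints a pipeline call. The file name <code>target.fa</code>, the output directory, and the prefix <code>GENE_</code> are hypothetical. The reference-species parameters documented earlier in this table are deliberately left out, so the printed command is incomplete until they are added.

```shell
# Hypothetical inputs; replace with your own paths.
target=target.fa
outdir=pipeline_out

# Module-specific options carry the module name as a prefix (e.g. GeMoMa.Score),
# pipeline-wide options (t, o, threads, outdir) do not.
cmd="java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline"
cmd="$cmd threads=4 outdir=$outdir t=$target"
cmd="$cmd GeMoMa.Score=ReAlign"
# Selecting SIMPLE renaming requires the nested prefix parameter.
cmd="$cmd AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=GENE_"
cmd="$cmd o=true"

# Dry run: print the command instead of executing it.
echo "$cmd"
```

Choosing <code>AnnotationFinalizer.r=SIMPLE</code> makes <code>AnnotationFinalizer.p</code> necessary, because the table lists no default for the prefix.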
<br />
=== Extract RNA-seq Evidence (ERE) ===<br />
<br />
For post-processing the mapped RNA-seq data, we provide the tool ExtractRNAseqEvidence (ERE). You can run ERE from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI ERE [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on the forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>mapped reads file (BAM/SAM files containing the mapped reads)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>use secondary alignments (whether reads flagged as secondary alignments in the SAM or BAM are used or filtered out, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage (allows to output the coverage, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mmq</font></td><br />
<td>minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 254], default = 40)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
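
A minimal sketch of an ERE call for stranded data with two mapped-read files, assuming the hypothetical file names <code>rep1.bam</code> and <code>rep2.bam</code> and a hypothetical output directory. It only prints the command (dry run) and uses the repeatable parameter <code>m</code> once per file.

```shell
# Hypothetical BAM files with mapped RNA-seq reads.
bam1=rep1.bam
bam2=rep2.bam

cmd="java -jar GeMoMa-1.6.1.jar CLI ERE"
cmd="$cmd s=FR_FIRST_STRAND"        # library type, see parameter s
cmd="$cmd m=$bam1 m=$bam2"          # m may be given multiple times
cmd="$cmd c=true outdir=ere_out"    # also write the coverage

# Dry run: print the command instead of executing it.
echo "$cmd"
```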
<br />
=== CheckIntrons===<br />
<br />
This tool checks whether the extracted introns show the expected di-nucleotide patterns at the splice sites. You can run CheckIntrons from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CheckIntrons [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== Extractor ===<br />
<br />
For preparing the data, we provide the tool Extractor. You can run Extractor from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI Extractor [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (Reference genome file (FASTA))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>proteins (whether the complete protein sequences should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds (whether the complete CDSs should be returned as output, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">genomic</font></td><br />
<td>genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Ambiguity</font></td><br />
<td>Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sefc</font></td><br />
<td>stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
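
A minimal sketch of an Extractor call, assuming the hypothetical reference files <code>reference.gff</code> and <code>reference.fa</code>. It requests the protein sequences in addition to the default output and enables the repair mode. The command is only printed (dry run).

```shell
# Hypothetical reference annotation and genome.
annotation=reference.gff
genome=reference.fa

cmd="java -jar GeMoMa-1.6.1.jar CLI Extractor"
cmd="$cmd a=$annotation g=$genome"
cmd="$cmd p=true"                 # also return complete protein sequences
cmd="$cmd r=true"                 # try to repair unparsable transcript annotations
cmd="$cmd outdir=extractor_out"

# Dry run: print the command instead of executing it.
echo "$cmd"
```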
<br />
=== Gene Model Mapper (GeMoMa) ===<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>search results (The search results, e.g., from tblastn or mmseqs)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">q</font></td><br />
<td>query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">splice</font></td><br />
<td>splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage</font></td><br />
<td>coverage (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sm</font></td><br />
<td>substitution matrix (optional user-specified substitution matrix, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">go</font></td><br />
<td>gap opening (The gap opening cost in the alignment, default = 11)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ge</font></td><br />
<td>gap extension (The gap extension cost in the alignment, default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>maximum intron length (The maximum length of an intron, default = 15000)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">intron-loss-gain-penalty</font></td><br />
<td>intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">e</font></td><br />
<td>e-value (The e-value for filtering blast results, default = 100.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ct</font></td><br />
<td>contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rt</font></td><br />
<td>region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">h</font></td><br />
<td>hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>predictions (The (maximal) number of predictions per transcript, default = 10)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">selected</font></td><br />
<td>selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">as</font></td><br />
<td>avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AA), default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">approx</font></td><br />
<td>approx (whether an approximation is used to compute the score for intron gain, default = true)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">prefix</font></td><br />
<td>prefix (A prefix to be used for naming the predictions, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">tag</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">v</font></td><br />
<td>verbose (A flag which allows to output a wealth of additional information per transcript, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">timeout</font></td><br />
<td>timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)</td><br />
<td style="width:100px;">LONG</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">sort</font></td><br />
<td>sort (A flag which allows to sort the search results, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">Score</font></td><br />
<td>Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
GeMoMa returns the predicted annotation as a GFF file.<br />
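
A minimal sketch of a GeMoMa call that combines search results with RNA-seq evidence. The module name <code>GeMoMa</code> after <code>CLI</code> is an assumption made by analogy with the other modules, and all file names are hypothetical placeholders for the search output, the Extractor output, and the ERE output. The command is only printed (dry run).

```shell
# Hypothetical input files.
search=tblastn.txt            # search results (tblastn or mmseqs)
target=target.fa              # target genome
cds=cds-parts.fasta           # CDS parts that were used as queries
assignment=assignment.tabular # maps CDS parts to transcripts
introns=introns.gff           # introns from RNA-seq
coverage=coverage.bedgraph    # unstranded coverage from RNA-seq

# Assumption: the module is addressed as "GeMoMa" on the command line.
cmd="java -jar GeMoMa-1.6.1.jar CLI GeMoMa"
cmd="$cmd s=$search t=$target c=$cds a=$assignment"
cmd="$cmd i=$introns"
# Selecting UNSTRANDED requires the nested parameter coverage_unstranded.
cmd="$cmd coverage=UNSTRANDED coverage_unstranded=$coverage"
cmd="$cmd outdir=gemoma_out"

# Dry run: print the command instead of executing it.
echo "$cmd"
```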
<br />
=== GeMoMa Annotation Filter (GAF) ===<br />
<br />
The GeMoMa Annotation Filter combines and reduces predictions from GeMoMa into a single final prediction. It is able to handle predictions from different reference species. It also handles overlapping or identical predictions.<br />
You can run GAF from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI GAF [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (the tag used to read the GeMoMa annotations, default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">m</font></td><br />
<td>missing intron evidence filter (the filter for single-exon transcripts or if no RNA-seq data is used, decides for overlapping other transcripts whether they should be used (=true) or discarded (=false), default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>intron evidence filter (the filter on the intron evidence given by RNA-seq-data for overlapping transcripts, valid range = [0.0, 1.0], default = 1.0)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">mnotpg</font></td><br />
<td>maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">w</font></td><br />
<td>weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)</td><br />
<td style="width:100px;">DOUBLE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">f</font></td><br />
<td>filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/AA>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/AA>=0.75, OPTIONAL)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
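
A minimal sketch of a GAF call that merges predictions from two reference species and applies the more sophisticated filter quoted in the table. The prefixes, file names, and output directory are hypothetical, and pairing each <code>p</code> with the <code>g</code> that follows it is an assumption about how repeated parameters are grouped. The filter is wrapped in double quotes so that the shell passes it as one argument. The command is only printed (dry run).

```shell
# Filter expression taken from the parameter description of f.
filter="start=='M' and stop=='*' and score/AA>=0.75 and (evidence>1 or tpc==1.0)"

cmd="java -jar GeMoMa-1.6.1.jar CLI GAF"
# Hypothetical prediction files, one per reference species.
cmd="$cmd p=A_ g=speciesA/predicted_annotation.gff"
cmd="$cmd p=B_ g=speciesB/predicted_annotation.gff"
# Escaped double quotes keep the filter together when the line is pasted.
cmd="$cmd f=\"$filter\""
cmd="$cmd outdir=gaf_out"

# Dry run: print the command instead of executing it.
echo "$cmd"
```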
<br />
=== CompareTranscripts ===<br />
<br />
For comparing gene models from GeMoMa predictions with existing annotation, we provide the tool CompareTranscripts. You can run CompareTranscripts from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI CompareTranscripts [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prediction (The predicted annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The true annotation)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
<br />
=== AnnotationEvidence ===<br />
<br />
To add RNA-seq evidence (e.g., tie, the transcript intron evidence) to an existing annotation, we provide the tool AnnotationEvidence. You can run AnnotationEvidence from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationEvidence [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">ao</font></td><br />
<td>annotation output (if the annotation should be returned with attributes tie, tpc, and AA, default = false)</td><br />
<td style="width:100px;">BOOLEAN</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">gc</font></td><br />
<td>genetic code (optional user-specified genetic code, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
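
As a minimal sketch of how the parameters above combine on the command line, the following wrapper adds evidence from one intron file and one unstranded coverage file to an existing annotation. The function name, the variables <code>GEMOMA_JAVA</code> and <code>GEMOMA_JAR</code>, and all file names are assumptions made for illustration; only the parameter names are taken from the table.

```shell
# Hypothetical wrapper around AnnotationEvidence; not part of GeMoMa itself.
# GEMOMA_JAVA and GEMOMA_JAR are assumed variables for the launcher and the jar path.
annotation_evidence() {
  # $1 annotation (GFF), $2 genome (FASTA), $3 introns (GFF),
  # $4 unstranded coverage file, $5 output directory
  "${GEMOMA_JAVA:-java}" -jar "${GEMOMA_JAR:-GeMoMa-1.6.1.jar}" CLI AnnotationEvidence \
    a="$1" g="$2" i="$3" \
    c=UNSTRANDED coverage_unstranded="$4" \
    ao=true outdir="$5"
}
```

A call could look like <code>annotation_evidence annotation.gff genome.fa introns.gff coverage.bedgraph evidence</code>. With <code>ao=true</code>, the annotation is returned with the attributes tie, tpc, and AA.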
<br />
=== AnnotationFinalizer ===<br />
<br />
This tool allows predicting UTRs and renaming predictions. You can run AnnotationFinalizer from the command line with<br/><br />
<code>java -jar GeMoMa-1.6.1.jar CLI AnnotationFinalizer [&lt;parameter&gt;=&lt;value&gt; ...]</code><br/><br />
The parameters comprise:<br />
<br />
<table border=0 cellpadding=10 align="center" width="100%"><br />
<tr><br />
<td>name</td><br />
<td>comment</td><br />
<td>type</td><br />
</tr><br />
<tr><td colspan=3><hr></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">g</font></td><br />
<td>genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">a</font></td><br />
<td>annotation (The predicted genome annotation file (GFF))</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">t</font></td><br />
<td>tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">u</font></td><br />
<td>UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;YES&quot;:</b></td></tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">i</font></td><br />
<td>introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table><br />
</td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">r</font></td><br />
<td>reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3>The following parameter(s) can be used multiple times:</td></tr><br />
<tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr style="vertical-align:top"><br />
<td><font color="green">c</font></td><br />
<td>coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;UNSTRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_unstranded</font></td><br />
<td>coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;STRANDED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_forward</font></td><br />
<td>coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">coverage_reverse</font></td><br />
<td>coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)</td><br />
<td style="width:100px;">FILE</td><br />
</tr><br />
</table></td></tr><br />
</table><br />
</td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">rename</font></td><br />
<td>rename (allows to generate generic gene and transcripts names (cf. attribute &quot;Name&quot;), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)</td><br />
<td style="width:100px;"></td></tr><tr><td></td><td colspan=2><table border=0 cellpadding=0 align="center" width="100%"><br />
<tr><td colspan=3><b>Parameters for selection &quot;COMPOSED&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">infix</font></td><br />
<td>infix (the infix of the generic name, default = G)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">s</font></td><br />
<td>suffix (the suffix of the generic name, default = 0)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">di</font></td><br />
<td>delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr><td colspan=3><b>Parameters for selection &quot;SIMPLE&quot;:</b></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">p</font></td><br />
<td>prefix (the prefix of the generic name)</td><br />
<td style="width:100px;">STRING</td><br />
</tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">d</font></td><br />
<td>digits (the number of informative digits, valid range = [4, 10], default = 5)</td><br />
<td style="width:100px;">INT</td><br />
</tr><br />
<tr><td colspan=3><b>No parameters for selection &quot;NO&quot;</b></td></tr><br />
</table></td></tr><br />
<tr style="vertical-align:top"><br />
<td><font color="green">outdir</font></td><br />
<td>The output directory, defaults to the current working directory (.)</td><br />
<td>STRING</td><br />
</tr><br />
</table><br />
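
The nested selections above (UTR prediction and renaming) can be sketched as a single call. This example assumes that the parameters of a selection are passed as further <code>key=value</code> pairs after the selection itself; the function name, the variables <code>GEMOMA_JAVA</code> and <code>GEMOMA_JAR</code>, and all file names are hypothetical.

```shell
# Hypothetical wrapper around AnnotationFinalizer; not part of GeMoMa itself.
# GEMOMA_JAVA and GEMOMA_JAR are assumed variables for the launcher and the jar path.
annotation_finalizer() {
  # $1 genome (FASTA), $2 predicted annotation (GFF), $3 introns (GFF),
  # $4 unstranded coverage file, $5 name prefix, $6 output directory
  "${GEMOMA_JAVA:-java}" -jar "${GEMOMA_JAR:-GeMoMa-1.6.1.jar}" CLI AnnotationFinalizer \
    g="$1" a="$2" \
    u=YES i="$3" c=UNSTRANDED coverage_unstranded="$4" \
    rename=SIMPLE p="$5" d=5 \
    outdir="$6"
}
```

Here <code>rename=SIMPLE</code> only needs the prefix <code>p</code> and the number of digits <code>d</code>, whereas <code>rename=COMPOSED</code> additionally accepts infix, suffix, and delete infix.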
<br />
== GFF attributes ==<br />
<br />
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.<br />
<br />
{| class="wikitable sortable"<br />
|-<br />
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description<br />
|-<br />
| score || GeMoMa score || GeMoMa || || prediction || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties<br />
|-<br />
| minCov || minimal coverage || GeMoMa || coverage, ... || prediction || minimal coverage of any base of the prediction given RNA-seq evidence<br />
|-<br />
| avgCov || average coverage || GeMoMa || coverage, ... || prediction || average coverage of all bases of the prediction given RNA-seq evidence<br />
|-<br />
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || prediction || percentage of covered bases per predicted transcript given RNA-seq evidence<br />
|-<br />
| tae || transcript acceptor evidence || GeMoMa || introns || prediction || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tde || transcript donor evidence || GeMoMa || introns || prediction || percentage of predicted donor sites per predicted transcript with RNA-seq evidence<br />
|-<br />
| tie || transcript intron evidence || GeMoMa || introns || prediction || percentage of predicted introns per predicted transcript with RNA-seq evidence<br />
|-<br />
| minSplitReads || minimal split reads || GeMoMa || introns || prediction || minimal number of split reads for any of the predicted introns per predicted transcript<br />
|-<br />
| iAA || identical amino acid || GeMoMa || query proteins || prediction || percentage of identical amino acids between reference transcript and prediction<br />
|-<br />
| pAA || positive amino acid || GeMoMa || query proteins || prediction || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix<br />
|-<br />
| evidence || || GAF || || prediction || number of reference organisms that have a transcript yielding this prediction<br />
|-<br />
| alternative || || GAF || || prediction || alternative gene ID(s) leading to the same prediction<br />
|-<br />
| sumWeight || || GAF || || prediction || the sum of the weights of the references that perfectly support this prediction<br />
|-<br />
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene<br />
|-<br />
| maxEvidence || maximal evidence || GAF || || gene || maximal evidence of all transcripts of this gene<br />
|-<br />
|}<br />
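
To inspect or rank predictions by hand, it is often enough to list one attribute per prediction. The following sketch assumes a GFF in which the features of interest carry the label <code>prediction</code> in the third column and semicolon-separated <code>key=value</code> attributes in the ninth column; the function name <code>gff_attr</code> is hypothetical.

```shell
# Hypothetical helper: print "ID<TAB>value" for one attribute of every matching feature.
gff_attr() {
  # $1 GFF file, $2 attribute name (e.g. tie), $3 feature type (default: prediction)
  awk -F'\t' -v key="$2" -v type="${3:-prediction}" '
    $3 == type {
      id = ""; val = ""
      n = split($9, kv, ";")
      for (i = 1; i <= n; i++) {
        pos = index(kv[i], "=")
        if (pos == 0) continue          # skip entries without key=value form
        name = substr(kv[i], 1, pos - 1)
        if (name == "ID") id = substr(kv[i], pos + 1)
        if (name == key) val = substr(kv[i], pos + 1)
      }
      if (val != "") print id "\t" val  # features lacking the attribute are omitted
    }' "$1"
}
```

For example, <code>gff_attr predictions.gff tie</code> lists the transcript intron evidence of each prediction. Non-numeric values such as NA are printed unchanged, so they need to be handled before numeric sorting.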
<br />
== FAQs ==<br />
<br />
; Why does the Extractor not return a single CDS-part, protein, ...?<br />
: First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.<br />
; How can I force GeMoMa to make more predictions?<br />
: There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. Decreasing ct increases the number of contigs that are considered and thus the number of predictions that are (internally) computed, while increasing p allows GeMoMa to output more of the predictions that have been computed. Increasing p to a very large number without decreasing ct does not help.<br />
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?<br />
: By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:<br />
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").<br />
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).<br />
; Is it mandatory to use RNA-seq data?<br />
: No, GeMoMa is able to make predictions with and without RNA-seq evidence.<br />
; Is it possible to use multiple reference organisms?<br />
: It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.<br />
; Why do some reference genes not lead to a prediction in the target genome?<br />
: Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).<br />
: If the genes have been discarded, there are two possibilities:<br />
:* The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.<br />
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.<br />
: If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:<br />
:* GeMoMa stopped the prediction of a reference gene because it did not return a result within the given time (cf. parameter "timeout").<br />
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.<br />
:* GeMoMa did find a prediction, but it was filtered out by GAF, e.g. due to a too low relative score or a missing start or stop codon (cf. GAF parameters).<br />
; What does "partial gene model" mean in the context of GeMoMa?<br />
: We call a gene model partial if it does not contain both an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.<br />
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?<br />
: GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).<br />
; A lot of transcripts have been filtered out by the Extractor. What can I do?<br />
: There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.<br />
; Is GeMoMa able to predict pseudo-genes/ncRNA?<br />
: No, currently not.<br />
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?<br />
: GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.<br />
:Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.<br />
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?<br />
: GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.<br />
; Does GeMoMa predict multiple transcripts per gene?<br />
: GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.<br />
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?<br />
: Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set '''-Xms''', the initially used RAM, e.g. to 5 GB (-Xms5G), and '''-Xmx''', the maximally used RAM, e.g. to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome ''and'' if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.<br />
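
For the first question above, the contig/chromosome names of the annotation and the genome can be compared with standard tools before running the Extractor. The following is a minimal sketch assuming an uncompressed FASTA file and a tab-separated GFF/GTF file; the function name <code>missing_contigs</code> is hypothetical.

```shell
# Hypothetical helper: print sequence names that occur in the annotation but not in the FASTA.
missing_contigs() {
  # $1 annotation (GFF/GTF), $2 genome (uncompressed FASTA)
  gff_names=$(mktemp)
  fasta_names=$(mktemp)
  # Sequence names from the first column of all non-comment annotation lines.
  awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$1" | sort -u > "$gff_names"
  # FASTA header names: drop the leading ">" and everything after the first whitespace.
  grep '^>' "$2" | sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u > "$fasta_names"
  # Keep lines unique to the first list, i.e. names found only in the annotation.
  comm -23 "$gff_names" "$fasta_names"
  rm -f "$gff_names" "$fasta_names"
}
```

An empty output of <code>missing_contigs annotation.gff genome.fa</code> means that every sequence name used in the annotation also occurs in the genome file.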
<br />
== Version history ==<br />
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.6.1] (4.06.2019)<br />
* createGalaxyIntegration.sh: bugfix for GeMoMaPipeline<br />
* new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)<br />
* AnnotationFinalizer: bugfix for sequence IDs with large numbers<br />
* CompareTranscripts:<br />
** bugfix for prefix of ref-gene<br />
** allow no transcript info, but making assignment non-optional if a transcript info is set<br />
* GAF: bugfix for Galaxy integration<br />
* GeMoMaPipeline:<br />
** improved output in case of Exceptions<br />
** new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result<br />
** new parameter "weight" allows weights for reference species (cf. GAF)<br />
* ERE: new parameter "minimum mapping quality"<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.6.zip GeMoMa 1.6] (2.04.2019)<br />
* allow to use mmseqs as alternative to tblastn<br />
* AnnotationEvidence:<br />
** allows to add attributes to the input gff: tie, tpc, AA, start, stop<br />
** new parameter for gff output<br />
* AnnotationFinalizer: new tool for predicting UTRs and renaming predictions<br />
* GAF:<br />
** relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof<br />
** sorting criteria of the predictions within clusters can now be user-specified<br />
** new attribute for genes: combinedEvidence<br />
** new attribute for predictions: sumWeight<br />
** allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation<br />
** bugfix for predictions from multiple reference organisms<br />
** improved statistic output<br />
* GeMoMa<br />
** renamed the parameter tblastn results to search results<br />
** new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort<br />
** new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)<br />
** bugfix: threshold for introns from multiple files<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.3] (23.07.2018)<br />
* improved parameter description and presentation<br />
* GeMoMaPipeline:<br />
** removed unnecessary parameters<br />
* GeMoMa:<br />
** bugfix: reading coverage file<br />
** removed parameter genomic (cf. Extractor)<br />
** removed protein output (cf. Extractor)<br />
* GAF:<br />
** bugfix: prefix<br />
* Extractor:<br />
** new parameter genomic<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.2.zip GeMoMa 1.5.2] (31.5.2018)<br />
* GAF:<br />
** new parameter that allows to restrict the maximal number of transcript predictions per gene<br />
** altered behavior of the evidence filter from percentages to absolute values<br />
** bugfix: nested genes <br />
** checking for duplicates in prediction IDs<br />
* GeMoMa:<br />
** warning if RNA-seq data does not match with target genome<br />
* GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading<br />
* folder for temporary files of GeMoMa<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.5.zip GeMoMa 1.5] (13.02.2018)<br />
* AnnotationEvidence: add chromosome to output<br />
* CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF<br />
* Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons<br />
* ExtractRNASeqEvidence: <br />
** print intron length stats<br />
** include program infos in introns.gff3<br />
* GeMoMa:<br />
** new attribute pAA in gff output if query protein is given<br />
** include program infos in predicted_annotation.gff3<br />
** minor bugfix<br />
* GAF:<br />
** new parameter that allows to specify a prefix for each input gff<br />
** collect and print program infos to filtered_prediction.gff3<br />
** improved statistics output<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.2.zip GeMoMa 1.4.2] (21.07.2017)<br />
* automatic searching for available updates<br />
* AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
* Extractor: bugfix (files that are not zipped)<br />
* GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.1.zip GeMoMa 1.4.1] (30.05.2017)<br />
* CompareTranscripts: bugfix (NullPointerException)<br />
* Extractor: reference genome can be .*fa.gz and .*fasta.gz<br />
* GeMoMa: bugfix (shutdown problem after timeout)<br />
* modified additional scripts and documentation<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.4.zip GeMoMa 1.4] (03.05.2017)<br />
* AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)<br />
* CompareTranscripts: new tool comparing predicted and given annotation (gff)<br />
* Extractor:<br />
** reading CDS with no parent tag (cf. discontinuous feature)<br />
** automatic recognition of GFF or GTF annotation<br />
** Warning if sequences mentioned in the annotation are not included in the reference sequence<br />
* GeMoMa: <br />
** allowing for multiple intron and coverage files (= using different library types at the same time)<br />
** NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes<br />
** new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)<br />
** bugfix (write pc and minCov if possible for last CDS part in predicted annotation)<br />
** bugfix (ref-gene name if no assignment is used)<br />
** bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)<br />
* GAF:<br />
** nested genes on the same strand<br />
** bugfix (if nothing passes the filter)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.2.zip GeMoMa 1.3.2] (18.01.2017)<br />
* Extractor: new parameter repair for broken transcript annotations<br />
* GeMoMa: bugfixes (splice site computation)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa_1.3.1.zip GeMoMa 1.3.1] (09.12.2016)<br />
* GeMoMa bugfix (finding start/stop codon for very small exons)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.3.zip GeMoMa 1.3] (06.12.2016)<br />
* ERE: new tool for extracting RNA-seq evidence<br />
* Extractor: offers options for <br />
** partial gene models<br />
** ambiguities<br />
* GeMoMa:<br />
** RNA-seq<br />
*** defining splice sites<br />
*** additional feature in GFF and output<br />
**** transcript intron evidence (tie)<br />
**** transcript acceptor evidence (tae)<br />
**** transcript donor evidence (tde)<br />
**** transcript percentage coverage (tpc)<br />
**** ...<br />
** improved GFF<br />
** simplified the command line parameters<br />
** IMPORTANT: parameter names changed for some parameters<br />
* GAF: new tool for filtering and combining different predictions (especially of different reference organisms)<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.3.zip GeMoMa 1.1.3] (06.06.2016)<br />
* minor modifications to the Extractor tool<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.2.zip GeMoMa 1.1.2] (05.02.2016)<br />
* GeMoMa bugfix (upstream, downstream sequence for splice site detection)<br />
* Extractor: new parameter s for selecting transcripts<br />
* improved Galaxy integration<br />
<br />
[http://www.jstacs.de/downloads/GeMoMa-1.1.1.zip GeMoMa 1.1.1] (01.02.2016)<br />
* initial release for paper</div>Keilwagen