GeMoMa-Docs
This page describes the parameters of all GeMoMa modules.
If you have any questions, comments or bug reports, please check the FAQs, our GitHub page or contact Jens Keilwagen.
GeMoMa pipeline
This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It essentially runs: Extract RNA-seq Evidence (ERE), DenoiseIntrons, Extractor, an external search (tblastn or mmseqs), Gene Model Mapper (GeMoMa), GeMoMa Annotation Filter (GAF), and AnnotationFinalizer.
GeMoMa pipeline may be called with
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline
and has the following parameters
name | comment | type |
t | target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) | FILE |
The following parameter(s) can be used zero or multiple times:
The following parameter(s) can be used zero or multiple times:
selected | selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, type = tabular,txt, OPTIONAL) | FILE |
gc | genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) | FILE |
tblastn | tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = false) | BOOLEAN |
tag | tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) | STRING |
r | RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO) | STRING |
d | denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE) | STRING |
Extractor.u | upcase IDs (whether the IDs in the GFF should be upcased, default = false) | BOOLEAN |
Extractor.r | repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) | BOOLEAN |
Extractor.a | Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS) | STRING |
Extractor.d | discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true) | BOOLEAN |
Extractor.s | stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) | BOOLEAN |
Extractor.f | full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) | BOOLEAN |
Extractor.l | long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false) | BOOLEAN |
GeMoMa.r | reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) | INT |
GeMoMa.s | splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true) | BOOLEAN |
GeMoMa.sm | substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL) | FILE |
GeMoMa.g | gap opening (The gap opening cost in the alignment, default = 11) | INT |
GeMoMa.ge | gap extension (The gap extension cost in the alignment, default = 1) | INT |
GeMoMa.m | maximum intron length (The maximum length of an intron, default = 15000) | INT |
GeMoMa.sil | static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true) | BOOLEAN |
GeMoMa.i | intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25) | INT |
GeMoMa.e | e-value (The e-value for filtering blast results, default = 100.0) | DOUBLE |
GeMoMa.c | contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4) | DOUBLE |
GeMoMa.rt | region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9) | DOUBLE |
GeMoMa.h | hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9) | DOUBLE |
GeMoMa.p | predictions (The (maximal) number of predictions per transcript, default = 10) | INT |
GeMoMa.a | avoid stop (A flag which allows to avoid (additional) pre-mature stop codons in a transcript, default = true) | BOOLEAN |
GeMoMa.approx | approx (whether an approximation is used to compute the score for intron gain, default = true) | BOOLEAN |
GeMoMa.pa | protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long this can increase the needed runtime and memory (RAM)., default = true) | BOOLEAN |
GeMoMa.prefix | prefix (A prefix to be used for naming the predictions, default = ) | STRING |
GeMoMa.v | verbose (A flag which allows to output a wealth of additional information per transcript, default = false) | BOOLEAN |
GeMoMa.t | timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600) | LONG |
GeMoMa.ru | replace unknown (Replace unknown amino acid symbols by X, default = false) | BOOLEAN |
GeMoMa.Score | Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign) | STRING |
GAF.d | default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce) | STRING |
GAF.f | filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL) | STRING |
GAF.s | sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score) | STRING |
GAF.a | alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could be applied combining several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL) | STRING |
GAF.c | common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75) | DOUBLE |
GAF.m | maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647) | INT |
GAF.aat | add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false) | BOOLEAN |
GAF.t | transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false) | BOOLEAN |
AnnotationFinalizer.u | UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO) | STRING |
AnnotationFinalizer.r | rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED) | STRING |
AnnotationFinalizer.n | name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true) | BOOLEAN |
sc | synteny check (run SyntenyChecker if possible, default = true) | BOOLEAN |
p | predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true) | BOOLEAN |
pc | predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false) | BOOLEAN |
pgr | predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false) | BOOLEAN |
o | output individual predictions (If *true*, returns the predictions for each reference species, default = false) | BOOLEAN |
debug | debug (If *false*, removes all temporary files even if the job exits unexpectedly, default = true) | BOOLEAN |
restart | restart (can be used to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug), default = false) | BOOLEAN |
b | BLAST_PATH (allows to set a path to the blast binaries if not set in the environment, default = , OPTIONAL) | STRING |
m | MMSEQS_PATH (allows to set a path to the mmseqs binaries if not set in the environment, default = , OPTIONAL) | STRING |
outdir | The output directory, defaults to the current working directory (.) | STRING |
threads | The number of threads used for the tool, defaults to 1 | INT |
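The layout of the list file for the parameter selected can be illustrated with a small, made-up file. This is only a sketch: the transcript IDs, coordinates and the file name are hypothetical, and writing the strand as "+" or "-" is an assumption, since the description above does not state how the strand is encoded.

```shell
# Sketch of a 'selected' list; all IDs and coordinates are made up.
# Column 1: transcript ID as given in the reference annotation.
# Columns 2-5 (optional): chromosome, strand, start and end of the target region.
# Assumption: the strand is written as '+' or '-'.
printf '%s\t%s\t%s\t%s\t%s\n' \
  "transcriptA.1" "chr1" "+" "10000" "25000" \
  "transcriptB.1" "chr2" "-" "5000" "9000" > selected.txt

# Sanity check: every line must have exactly 5 tab-separated fields.
awk -F'\t' 'NF != 5 { bad = 1 } END { exit bad }' selected.txt \
  && echo "selected.txt has 5 columns on every line"
```

The file would then be passed to the pipeline as selected=selected.txt.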
Example:
java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline a=<reference_annotation> g=<reference_genome> t=<target_genome> AnnotationFinalizer.p=<prefix>
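Filter expressions such as the one described for GAF.f contain spaces, single quotes, parentheses and an asterisk, all of which a shell would interpret unless the whole key=value pair is quoted. The following dry-run sketch assembles such a call and prints the arguments instead of executing them. The file names and the values for threads and outdir are hypothetical placeholders; the filter string is the "more sophisticated" example from the parameter table.

```shell
# Dry-run sketch: assemble a GeMoMaPipeline call and print it without running it.
# All file names are hypothetical placeholders.
filter="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"

# Each key=value pair is one shell word; the double quotes around GAF.f keep
# the spaces, single quotes, parentheses and '*' of the filter intact.
set -- java -jar GeMoMa-1.8.jar CLI GeMoMaPipeline \
  t=target.fasta a=reference.gff g=reference.fasta \
  "GAF.f=${filter}" threads=4 outdir=results

# Print one argument per line to verify the quoting.
printf '[%s]\n' "$@"

# To actually run the pipeline (requires Java and GeMoMa-1.8.jar): "$@"
```

If the filter shows up as a single bracketed line in the output, it will reach GeMoMa as one argument.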
Extract RNA-seq Evidence
This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool DenoiseIntrons. Introns and coverage results can be used in GeMoMa to improve the predictions and might help to select better gene models in GAF. In addition, introns and coverage can be used to predict UTRs by AnnotationFinalizer.
Extract RNA-seq Evidence may be called with
java -jar GeMoMa-1.8.jar CLI ERE
and has the following parameters
name | comment | type |
s | Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) | STRING |
The following parameter(s) can be used multiple times:
v | ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT) | STRING |
u | use secondary alignments (allows to filter flags in the SAM or BAM, default = true) | BOOLEAN |
c | coverage (allows to output the coverage, default = true) | BOOLEAN |
mmq | minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40) | INT |
mc | minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1) | INT |
maximumcoverage | maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL) | INT |
f | filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO) | STRING |
e | evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0) | DOUBLE |
mil | minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0) | INT |
repositioning | repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL) | FILE |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI ERE m=<mapped_reads_file>
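The three-column repositioning file described for the parameter repositioning can be generated with a few lines of awk. The sketch below assumes a hypothetical chromosome chr1 of made-up length that was cut into consecutive chunks named chr1_part1, chr1_part2, and so on; it also assumes that the offset is the 0-based start of each chunk in the original chromosome, which the description above does not specify.

```shell
# Sketch: write a repositioning file for one chromosome that was split
# into fixed-size chunks before mapping. Names and lengths are made up.
chunk=500000000       # hypothetical chunk size in bp
chr_len=1200000000    # hypothetical length of the original chromosome

# Columns: split_chr_name, original_chr_name, offset_in_original_chr
# Assumption: the offset is the 0-based start of the chunk in the original.
awk -v len="$chr_len" -v chunk="$chunk" 'BEGIN {
  OFS = "\t"
  i = 1
  for (off = 0; off < len; off += chunk) {
    print "chr1_part" i, "chr1", off
    i++
  }
}' > repositioning.tabular

cat repositioning.tabular
```

The resulting file would be passed to ERE as repositioning=repositioning.tabular.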
CheckIntrons
The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.
CheckIntrons may be called with
java -jar GeMoMa-1.8.jar CLI CheckIntrons
and has the following parameters
name | comment | type |
t | target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta) | FILE |
The following parameter(s) can be used multiple times:
v | verbose (A flag which allows to output a wealth of additional information per transcript, default = false) | BOOLEAN |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI CheckIntrons t=<target_genome> i=<introns>
DenoiseIntrons
This module analyzes introns extracted by ERE. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules GeMoMa, AnnotationEvidence, and AnnotationFinalizer.
DenoiseIntrons may be called with
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons
and has the following parameters
name | comment | type |
The following parameter(s) can be used multiple times:
The following parameter(s) can be used multiple times:
m | maximum intron length (The maximum length of an intron, default = 15000) | INT |
me | minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01) | DOUBLE |
context | context (The context upstream of a donor and downstream of an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10) | INT |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded>
NCBI Reference Retriever
This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start GeMoMaPipeline or Extractor.
NCBI Reference Retriever may be called with
java -jar GeMoMa-1.8.jar CLI NRR
and has the following parameters
name | comment | type |
r | reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/) | STRING |
n | number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10) | INT |
rl | reference list (a list of reference organisms, type = txt) | FILE |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI NRR rl=<reference_list>
Extractor
This tool can be used to create input files for GeMoMa, i.e., it creates at least a fasta file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and of CDS parts to transcripts. In addition, Extractor can be used to create several additional files from the final prediction, e.g., proteins, CDSs, etc. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you would like to set a user-specific genetic code, please use a tab-delimited file with two columns: the first column contains the amino acid in one-letter code, the second a list of triplets.
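A two-column genetic code file of the kind described above might look like the following sketch, which writes the standard genetic code. Two points are assumptions that the text does not settle: that the triplets within the second column are separated by commas, and that the stop codons are listed under the symbol "*". The file name is hypothetical as well.

```shell
# Sketch: write the standard genetic code as a two-column, tab-delimited file.
# Assumptions: triplets are comma-separated and '*' denotes stop codons.
# The table is typed with a single space and converted to tabs by tr.
tr ' ' '\t' > genetic_code.tabular <<'EOF'
A GCT,GCC,GCA,GCG
R CGT,CGC,CGA,CGG,AGA,AGG
N AAT,AAC
D GAT,GAC
C TGT,TGC
Q CAA,CAG
E GAA,GAG
G GGT,GGC,GGA,GGG
H CAT,CAC
I ATT,ATC,ATA
L TTA,TTG,CTT,CTC,CTA,CTG
K AAA,AAG
M ATG
F TTT,TTC
P CCT,CCC,CCA,CCG
S TCT,TCC,TCA,TCG,AGT,AGC
T ACT,ACC,ACA,ACG
W TGG
Y TAT,TAC
V GTT,GTC,GTA,GTG
* TAA,TAG,TGA
EOF

# Sanity check: a complete code must cover all 64 distinct triplets.
awk -F'\t' '{ n = split($2, c, ","); for (i = 1; i <= n; i++) seen[c[i]]++ }
  END { for (k in seen) total++; print total " distinct triplets" }' genetic_code.tabular
```

A modified code would change only the affected rows; the count of 64 distinct triplets is a quick way to catch a codon that was dropped or listed twice under different amino acids.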
Extractor may be called with
java -jar GeMoMa-1.8.jar CLI Extractor
and has the following parameters
name | comment | type |
a | annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) | FILE |
g | genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) | FILE |
gc | genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) | FILE |
p | proteins (whether the complete protein sequences should be returned as output, default = false) | BOOLEAN |
c | cds (whether the complete CDSs should be returned as output, default = false) | BOOLEAN |
genomic | genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false) | BOOLEAN |
i | introns (whether introns should be extracted from annotation, that might be used for test cases, default = false) | BOOLEAN |
identical | identical (if the CDS of several transcripts is identical, Extractor only uses one transcript. This parameter allows to return a table that lists in the first column the used transcript and in the second column the discarded transcript. If no transcript is discarded, the list is empty., default = false) | BOOLEAN |
u | upcase IDs (whether the IDs in the GFF should be upcased, default = false) | BOOLEAN |
r | repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false) | BOOLEAN |
s | selected (The path to a list file, which restricts the predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., type = tabular,txt, OPTIONAL) | FILE |
Ambiguity | Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION) | STRING |
d | discard pre-mature stop (if *true* transcripts with pre-mature stop codon are discarded as they often indicate misannotation, default = true) | BOOLEAN |
sefc | stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false) | BOOLEAN |
f | full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true) | BOOLEAN |
l | long fasta comment (whether a short (transcript ID) or a long (transcript ID, gene ID, chromosome, strand, interval) fasta comment should be written for proteins, CDSs, and genomic regions, default = false) | BOOLEAN |
v | verbose (A flag which allows to output a wealth of additional information, default = false) | BOOLEAN |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI Extractor a=<annotation> g=<genome>
GeneModelMapper
This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).
As a first step, you should run Extractor to obtain cds parts and assignment. Second, you should run a search algorithm, e.g. tblastn or mmseqs, with cds parts as query. Finally, these search results are used in GeMoMa. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter sort. If you like to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in query cds parts and leave assignment unselected.
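Sorting tabular search results by their first column can also be done outside GeMoMa with standard tools. The sketch below uses a made-up, three-column stand-in for real search output (actual tblastn or mmseqs tables have more columns), and the query IDs are hypothetical; only the sort call itself is the point.

```shell
# Made-up, unsorted tabular search results; the query ID is in column 1.
printf 'geneB_0\tchr2\t88.0\ngeneA_0\tchr1\t95.0\ngeneB_1\tchr2\t91.5\ngeneA_1\tchr1\t90.2\n' \
  > search_results.tabular

# Sort on column 1 only. -s keeps hits of the same query in their original
# relative order (supported by GNU and BSD sort); LC_ALL=C makes the byte-wise
# ordering independent of the locale.
LC_ALL=C sort -t "$(printf '\t')" -s -k1,1 search_results.tabular \
  > search_results.sorted.tabular

cut -f1 search_results.sorted.tabular
```

After sorting, all hits that share a query ID are adjacent, which is the clustering the paragraph above asks for.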
If you like to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run ERE on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend to run DenoiseIntrons to remove such spurious introns. Finally, you can use the obtained introns (and coverage) in GeMoMa.
If you like to obtain multiple predictions per gene model of the reference organism, you should set predictions accordingly. In addition, we suggest to decrease the value of contig threshold allowing GeMoMa to evaluate more candidate contigs/chromosomes.
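As a concrete illustration of this advice, the following dry-run sketch prints a GeMoMa call that sets the number of predictions (p) and lowers the contig threshold (ct) below its default of 0.4. The file names and the values 20 and 0.2 are hypothetical and chosen only for illustration; suitable values depend on the data.

```shell
# Dry-run sketch: print a GeMoMa call with explicit values for p and ct.
# File names and parameter values are hypothetical.
predictions=20         # p: (maximal) number of predictions per transcript
contig_threshold=0.2   # ct: below the default 0.4, so more contigs are evaluated

set -- java -jar GeMoMa-1.8.jar CLI GeMoMa \
  s=search_results.tabular t=target.fasta \
  c=cds-parts.fasta a=assignment.tabular \
  "p=${predictions}" "ct=${contig_threshold}"

# Show the command line; replace this by "$@" to actually run it
# (requires Java and GeMoMa-1.8.jar).
echo "$*"
```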
If you change the values of contig threshold, region threshold and hit threshold, this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.
You can filter your predictions using GAF, which also allows for combining predictions from different reference organisms.
Finally, you can predict UTRs and rename predictions using AnnotationFinalizer.
If you would like to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module GeMoMaPipeline.
GeneModelMapper may be called with
java -jar GeMoMa-1.8.jar CLI GeMoMa
and has the following parameters
name | comment | type |
s | search results (The search results, e.g., from tblastn or mmseqs, type = tabular) | FILE |
t | target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) | FILE |
c | cds parts (The query CDS parts file (protein FASTA), i.e., the CDS parts that have been searched in the target genome using for instance BLAST or mmseqs, type = fasta,fa,fas,fna) | FILE |
a | assignment (The assignment file, which combines CDS parts to proteins, type = tabular, OPTIONAL) | FILE |
The following parameter(s) can be used zero or multiple times:
r | reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) | INT |
splice | splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true) | BOOLEAN |
The following parameter(s) can be used zero or multiple times:
g | genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) | FILE |
sm | substitution matrix (optional user-specified substitution matrix, type = tabular, OPTIONAL) | FILE |
go | gap opening (The gap opening cost in the alignment, default = 11) | INT |
ge | gap extension (The gap extension cost in the alignment, default = 1) | INT |
m | maximum intron length (The maximum length of an intron, default = 15000) | INT |
sil | static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true) | BOOLEAN |
intron-loss-gain-penalty | intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25) | INT |
e | e-value (The e-value for filtering blast results, default = 100.0) | DOUBLE |
ct | contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4) | DOUBLE |
rt | region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9) | DOUBLE |
h | hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9) | DOUBLE |
p | predictions (The (maximal) number of predictions per transcript, default = 10) | INT |
selected | selected (The path to a list file, which restricts predictions to the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. The remaining columns can be used to define a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of the region, type = tabular,txt, OPTIONAL) | FILE | |||||||||||||||||||||
as | avoid stop (A flag which allows to avoid (additional) premature stop codons in a transcript, default = true) | BOOLEAN | |||||||||||||||||||||
approx | approx (whether an approximation is used to compute the score for intron gain, default = true) | BOOLEAN | |||||||||||||||||||||
pa | protein alignment (whether a protein alignment between the prediction and the reference transcript should be computed. If so, two additional attributes (iAA, pAA) will be added to predictions in the gff output. These might be used in GAF. However, since some transcripts are very long, this can increase the needed runtime and memory (RAM), default = true) | BOOLEAN | |||||||||||||||||||||
prefix | prefix (A prefix to be used for naming the predictions, default = ) | STRING | |||||||||||||||||||||
tag | tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) | STRING | |||||||||||||||||||||
v | verbose (A flag which allows to output a wealth of additional information per transcript, default = false) | BOOLEAN | |||||||||||||||||||||
timeout | timeout (The (maximal) number of seconds to be used for the predictions of one transcript; if exceeded, GeMoMa does not output a prediction for this transcript, valid range = [0, 604800], default = 3600) | LONG | |||||||||||||||||||||
sort | sort (A flag which allows to sort the search results, default = false) | BOOLEAN | |||||||||||||||||||||
ru | replace unknown (Replace unknown amino acid symbols by X, default = false) | BOOLEAN | |||||||||||||||||||||
Score | Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust) | STRING | |||||||||||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts>
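The difference between static and dynamic intron length (parameters "maximum intron length" and "static intron length") can be illustrated with a small sketch. This is only a literal reading of the parameter description above, not GeMoMa's actual code; the function name max_intron_length and the reference value 42000 are hypothetical.

```python
def max_intron_length(user_max, reference_gene_max=0, static=True):
    """Return the intron length limit that would apply to one gene.

    static=True  -> the user-given maximum, identical for all genes
    static=False -> gene-specific maximum intron length in the reference
                    organism plus the user-given maximum
    (hypothetical helper, assumed from the parameter description)
    """
    if static:
        return user_max
    return reference_gene_max + user_max


# Documented default for the maximum intron length is 15000.
static_limit = max_intron_length(15000)
# 42000 is a made-up reference value for illustration only.
dynamic_limit = max_intron_length(15000, reference_gene_max=42000, static=False)
print(static_limit, dynamic_limit)
```

Under this reading, the dynamic setting never yields a smaller limit than the static one for the same user-given maximum.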
GeMoMa Annotation Filter
This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.
The algorithm does the following: First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced). Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation. Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.
Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript. Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.
Initially, GAF was built to combine gene predictions obtained from GeMoMa. It allows to combine the predictions from multiple reference organisms, but also works using only one reference organism. However, GAF also allows to integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running AnnotationEvidence for each of these input files to add additional attributes that can be used for sorting and filtering within GAF. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.
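The filter step and the role of default values for missing attributes can be sketched as follows. The sketch re-implements the documented default filter, start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), in plain Python. It is an illustration only and not GAF's own expression evaluator; the helper names and the example attribute strings are hypothetical.

```python
import math


def parse_attributes(column9, defaults=("score",)):
    """Parse a GFF attribute column ('key=value;key=value') into a dict.

    Attributes listed in `defaults` are set to NaN if missing, mirroring
    the idea of the 'default attributes' parameter.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    for name in defaults:
        attrs.setdefault(name, "NaN")
    return attrs


def passes_default_filter(attrs):
    """start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)"""
    score = float(attrs["score"])
    aa = float(attrs.get("aa", "nan"))
    complete = attrs.get("start") == "M" and attrs.get("stop") == "*"
    return complete and (math.isnan(score) or score / aa >= 0.75)


# Hypothetical attribute columns for four predictions.
examples = {
    "good": "ID=t1;start=M;stop=*;score=400;aa=500",      # 0.80 per aa
    "weak": "ID=t2;start=M;stop=*;score=300;aa=500",      # 0.60 per aa
    "external": "ID=t3;start=M;stop=*;aa=500",            # no score attribute
    "partial": "ID=t4;start=L;stop=*;score=490;aa=500",   # no start codon
}
kept = {name: passes_default_filter(parse_attributes(text))
        for name, text in examples.items()}
print(kept)
```

In this sketch the prediction without a score attribute passes because its score defaults to NaN, which is the case the default filter reserves for external annotations.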
GeMoMa Annotation Filter may be called with
java -jar GeMoMa-1.8.jar CLI GAF
and has the following parameters
name | comment | type | ||||||||||||
t | tag (the tag used to read the GeMoMa annotations, default = mRNA) | STRING | ||||||||||||
The following parameter(s) can be used multiple times: | ||||||||||||||
| ||||||||||||||
d | default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce) | STRING | ||||||||||||
f | filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. In addition, predictions without score (isNaN(score)) will be used as external annotations do not have the attribute score. Different criteria can be combined using 'and' and 'or'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75), OPTIONAL) | STRING | ||||||||||||
s | sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = sumWeight,score) | STRING | ||||||||||||
atf | alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. A reasonable filter could be based on RNA-seq data (tie==1) or on sumWeight (sumWeight>1). A more sophisticated filter could combine several criteria: tie==1 or sumWeight>1, default = tie==1 or sumWeight>1, OPTIONAL) | STRING | ||||||||||||
c | common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75) | DOUBLE | ||||||||||||
m | maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647) | INT | ||||||||||||
aat | add alternative transcripts (a switch to decide whether the IDs of alternative transcripts that have the same CDS should be listed for each prediction, default = false) | BOOLEAN | ||||||||||||
tf | transfer features (if true, additional features like UTRs will be transferred from the input to the output. Only features of the representatives will be transferred. The UTRs of identical CDS predictions listed in "alternative" will not be transferred or combined, default = false) | BOOLEAN | ||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI GAF g=<gene_annotation_file>
AnnotationFinalizer
This tool finalizes an annotation. It allows to predict UTRs for annotated coding sequences and to generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., too long predictions) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use ERE to preprocess the mapped reads.
AnnotationFinalizer may be called with
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer
and has the following parameters
name | comment | type | |||||||||||||||||||||||||||||||||||||||||||||||||||
g | genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz) | FILE | |||||||||||||||||||||||||||||||||||||||||||||||||||
a | annotation (The predicted genome annotation file (GFF), type = gff,gff3) | FILE | |||||||||||||||||||||||||||||||||||||||||||||||||||
t | tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) | STRING | |||||||||||||||||||||||||||||||||||||||||||||||||||
u | UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO) | STRING | |||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||
rename | rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED) | STRING | |||||||||||||||||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||
n | name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true) | BOOLEAN | |||||||||||||||||||||||||||||||||||||||||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix>
Annotation evidence
This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in GAF. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use ERE to preprocess the mapped reads.
Annotation evidence may be called with
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence
and has the following parameters
name | comment | type | ||||||||||||||||||||||||
a | annotation (The genome annotation file (GFF,GTF), type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) | FILE | ||||||||||||||||||||||||
t | tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = mRNA) | STRING | ||||||||||||||||||||||||
g | genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) | FILE | ||||||||||||||||||||||||
The following parameter(s) can be used multiple times: | ||||||||||||||||||||||||||
| ||||||||||||||||||||||||||
r | reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1) | INT | ||||||||||||||||||||||||
The following parameter(s) can be used multiple times: | ||||||||||||||||||||||||||
| ||||||||||||||||||||||||||
ao | annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = true) | BOOLEAN | ||||||||||||||||||||||||
gc | genetic code (optional user-specified genetic code, type = tabular, OPTIONAL) | FILE | ||||||||||||||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI AnnotationEvidence a=<annotation> g=<genome>
Synteny checker
This tool can be used to determine syntenic regions between target organism and reference organism based on similarity of genes. The tool returns a table of reference genes per predicted gene. This table can be easily visualized with an R script that is included in the GeMoMa package.
Synteny checker may be called with
java -jar GeMoMa-1.8.jar CLI SyntenyChecker
and has the following parameters
name | comment | type | ||||||
t | tag (the tag used to read the GeMoMa annotations, default = mRNA) | STRING | ||||||
The following parameter(s) can be used multiple times: | ||||||||
| ||||||||
g | gene annotation file (GFF file containing the gene annotations predicted by GAF, type = gff,gff3) | FILE | ||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI SyntenyChecker a=<assignment> g=<gene_annotation_file>
AddAttribute
This tool allows to add an additional attribute to specific features of an annotation.
These additional attributes might be used in GAF for filtering or sorting or might be displayed in genome browsers like IGV or WebApollo. The user can choose binary attributes (true or false) or attributes with values according to a given tab-delimited table.
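The two attribute types can be illustrated with a short sketch that appends an attribute to the ninth column of matching GFF features. It is not the tool's implementation: the function add_attribute, the attribute name "family" and the example records are hypothetical, and the BINARY behaviour (true if the feature ID is listed in the table, false otherwise) is an assumption.

```python
def add_attribute(gff_lines, feature, attribute, table, binary=False):
    """Append `attribute` to all GFF lines whose third column equals `feature`.

    table  : dict mapping feature ID -> value (read from a tab-delimited file)
    binary : if True, write true/false depending on whether the ID is listed
             (assumed semantics); otherwise write the value from the table
    """
    result = []
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) != 9 or columns[2] != feature:
            result.append(line.rstrip("\n"))
            continue
        attrs = dict(f.split("=", 1) for f in columns[8].split(";") if "=" in f)
        feature_id = attrs.get("ID")
        if binary:
            value = "true" if feature_id in table else "false"
        else:
            value = table.get(feature_id)  # None if the ID is not listed
        if value is not None:
            columns[8] = columns[8].rstrip(";") + f";{attribute}={value}"
        result.append("\t".join(columns))
    return result


# Hypothetical annotation and table.
gff = [
    "chr1\tGAF\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tGAF\tmRNA\t200\t300\t.\t+\t.\tID=t2;Parent=g2",
    "chr1\tGAF\tCDS\t1\t100\t.\t+\t0\tParent=t1",
]
table = {"t1": "kinase"}

with_values = add_attribute(gff, "mRNA", "family", table)
with_binary = add_attribute(gff, "mRNA", "family", table, binary=True)
print("\n".join(with_values))
```

Features of other types (here the CDS line) are passed through unchanged in both modes.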
AddAttribute may be called with
java -jar GeMoMa-1.8.jar CLI AddAttribute
and has the following parameters
name | comment | type | |||||||||
a | annotation (annotation file, type = gff,gff3) | FILE | |||||||||
f | feature (a feature of the annotation, e.g., gene, transcript or mRNA, default = mRNA) | STRING | |||||||||
attribute | attribute (the name of the attribute that is added to the annotation) | STRING | |||||||||
t | table (a tab-delimited file containing IDs and additional attribute, type = tabular) | FILE | |||||||||
i | ID column (the ID column in the tab-delimited file, valid range = [0, 2147483647]) | INT | |||||||||
type | type (type of additional attribute, range={VALUES, BINARY}, default = VALUES) | STRING | |||||||||
| |||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI AddAttribute a=<annotation> attribute=<attribute> t=<table> i=<ID_column> ac=<attribute_column>
GAFComparison
This tool allows to compare results from GAF based on the attributes ref-gene and alternative. Hence, you can compare the annotation of different genomes or the effect of different parameters on the annotation of one genome.
GAFComparison may be called with
java -jar GeMoMa-1.8.jar CLI GAFComparison
and has the following parameters
name | comment | type | ||||||
t | tag (the tag used to read the GAF annotations, default = mRNA) | STRING | ||||||
The following parameter(s) can be used multiple times: | ||||||||
| ||||||||
s | split prefix (a switch to decide whether the prefix should be split and written in a separate column, default = false) | BOOLEAN | ||||||
d | differences (a switch to decide whether only genes with difference should be returned, default = true) | BOOLEAN | ||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI GAFComparison n=<name> g=<gene_annotation_file>
Analyzer
This tool allows to compare true annotation with predicted annotation as it is frequently done in benchmark studies. Furthermore, it can return a detailed table comparing true annotation and predicted annotation which might help to identify systematic errors or biases in the predictions. Hence, this tool might help to detect weaknesses of the prediction algorithm.
True and predicted transcripts are evaluated based on nucleotide F1 measure. For each predicted transcript, the true transcript with the highest nucleotide F1 measure is listed. A negative value in an F1 measure column indicates that there is a predicted transcript that matches the true transcript with an F1 measure value that is the absolute value of this entry, but there is another true transcript that matches this predicted transcript with an even better F1. In addition, true and predicted transcripts are listed that do not overlap with any transcript from the predicted and true annotation, respectively. The table contains the attributes of the true and the predicted annotation besides some additional columns allowing to easily filter interesting examples and to do statistics.
The evaluation can be based on CDS (default) or exon features. The tool also reports sensitivity and precision for the categories gene and transcript.
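A nucleotide F1 measure between one true and one predicted transcript can be sketched as below. The sketch assumes the usual definition (harmonic mean of nucleotide precision and sensitivity over the covered positions) and 1-based inclusive coordinates; it is not the Analyzer's code, and the function names and coordinates are hypothetical.

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_f1(true_intervals, predicted_intervals):
    """Harmonic mean of nucleotide precision and sensitivity."""
    truth = covered_positions(true_intervals)
    predicted = covered_positions(predicted_intervals)
    shared = len(truth & predicted)
    if shared == 0:
        return 0.0
    precision = shared / len(predicted)
    sensitivity = shared / len(truth)
    return 2 * precision * sensitivity / (precision + sensitivity)


def best_true_match(predicted_intervals, true_transcripts):
    """Return (name, F1) of the true transcript with the highest F1."""
    scored = ((name, nucleotide_f1(intervals, predicted_intervals))
              for name, intervals in true_transcripts.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical CDS coordinates.
truth = {
    "T1": [(100, 199), (300, 399)],   # 200 nt
    "T2": [(300, 399)],               # 100 nt
}
prediction = [(100, 199), (300, 349)]  # 150 nt

best_name, best_f1 = best_true_match(prediction, truth)
print(best_name, round(best_f1, 3))
```

Expanding intervals into position sets keeps the sketch short; for whole-genome annotations an interval-based overlap computation would be more memory-efficient.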
Analyzer may be called with
java -jar GeMoMa-1.8.jar CLI Analyzer
and has the following parameters
name | comment | type | |||||||||
t | truth (the true annotation, type = gff,gff3,gff.gz,gff3.gz) | FILE | |||||||||
The following parameter(s) can be used multiple times: | |||||||||||
| |||||||||||
c | CDS (if true CDS features are used otherwise exon features, default = true) | BOOLEAN | |||||||||
w | write (write detailed table comparing the true and the predicted annotation, range={NO, YES}, default = NO) | STRING | |||||||||
| |||||||||||
r | reliable (additionally evaluate sensitivity for reliable transcripts, range={NO, YES}, default = NO) | STRING | |||||||||
| |||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI Analyzer t=<truth> p=<predicted_annotation>
BUSCORecomputer
This tool can be used to compute BUSCO statistics for genes instead of transcripts. Proteins of an annotation file can be extracted with Extractor. These proteins can be used to compute BUSCO statistics with BUSCO. The full BUSCO table and the assignment file from the Extractor can be used as input for this tool. Alternatively, a table can be generated from the annotation file that can be used instead of the assignment file.
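The idea of moving from transcript-level to gene-level BUSCO statistics can be sketched as follows. This is an assumed simplification, not the tool's implementation: a BUSCO whose complete hits all belong to transcripts of one gene is counted as Complete, and as Duplicated only if the hits fall into more than one gene. The function names, the simplified (BUSCO ID, status, sequence ID) rows and all IDs are hypothetical; the ID table layout (gene ID in the first column, transcript/protein ID in the second) follows the parameter description.

```python
from collections import defaultdict


def read_id_table(lines):
    """Map transcript/protein ID (column 2) to gene ID (column 1)."""
    transcript_to_gene = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        transcript_to_gene[columns[1]] = columns[0]
    return transcript_to_gene


def gene_level_status(busco_rows, transcript_to_gene):
    """Recompute one status per BUSCO ID on gene level (assumed logic).

    busco_rows: list of (busco_id, status, sequence_id or None) taken from
                the transcript/protein-based BUSCO full table
    """
    complete_genes = defaultdict(set)
    status_by_busco = {}
    for busco_id, status, sequence_id in busco_rows:
        status_by_busco.setdefault(busco_id, "Missing")
        if status in ("Complete", "Duplicated"):
            gene = transcript_to_gene.get(sequence_id, sequence_id)
            complete_genes[busco_id].add(gene)
        elif status == "Fragmented":
            status_by_busco[busco_id] = "Fragmented"
    for busco_id, genes in complete_genes.items():
        status_by_busco[busco_id] = "Complete" if len(genes) == 1 else "Duplicated"
    return status_by_busco


# Hypothetical input: B1 hits two isoforms of one gene, B2 hits two genes.
id_table = read_id_table(["g1\tg1.t1\n", "g1\tg1.t2\n", "g2\tg2.t1\n", "g3\tg3.t1\n"])
rows = [
    ("B1", "Duplicated", "g1.t1"),
    ("B1", "Duplicated", "g1.t2"),
    ("B2", "Duplicated", "g2.t1"),
    ("B2", "Duplicated", "g3.t1"),
    ("B3", "Missing", None),
]
result = gene_level_status(rows, id_table)
print(result)
```

In this example B1 is no longer reported as duplicated, because both hits are isoforms of the same gene.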
BUSCORecomputer may be called with
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer
and has the following parameters
name | comment | type |
b | BUSCO (the BUSCO full table based on transcripts/proteins, type = tabular) | FILE |
i | IDs (a table with at least two columns, the first is the gene ID, the second is the transcript/protein ID. The assignment file from the Extractor can be used or a table can be derived by the user from the gene annotation file (gff,gtf), type = tabular) | FILE |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar GeMoMa-1.8.jar CLI BUSCORecomputer b=<BUSCO> i=<IDs>