GeMoMa

Revision as of 05:23, 20 May 2020

Gene Model Mapper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To this end, GeMoMa utilizes the conservation of amino acid sequences and intron positions. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.

GeMoMa is available on a public web server at galaxy.informatik.uni-halle.de. The web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own Galaxy instance.

Schema of GeMoMa algorithm

Installation

GeMoMa is available via bioconda. To install the package with conda, run:

conda install -c bioconda gemoma

However, you can also install GeMoMa manually.

Requirements

To run GeMoMa, you need the following software on your computer:

Java v1.8 or later
blast or mmseqs

Download

GeMoMa is implemented in Java using Jstacs. You can download a zip file containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for

creating the XML file needed for the Galaxy integration
running the command line interface (CLI) version.

You can also download a small manual for GeMoMa which explains the main steps for the analysis.

In a nutshell

GeMoMa is a modular, homology-based gene prediction program with great flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are starting GeMoMa for the first time, we recommend using GeMoMaPipeline like this:

java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome>

Several parameters, indicated by <foo>, need to be set. You can specify

the number of threads
the output directory
the target genome
and the reference ID (optional), annotation and genome. If you have several references, just repeat the parameter tags i, a, g with the corresponding values.
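As a rough sketch of repeating the reference tags, the following dry run assembles a GeMoMaPipeline call for two references and only prints it. The IDs, file names, thread count and output directory are hypothetical placeholders, not values shipped with GeMoMa.

```shell
#!/usr/bin/env bash
# Dry run: assemble a GeMoMaPipeline call for several reference species.
# All IDs and file names below are hypothetical placeholders.
set -euo pipefail

args=(java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline
      threads=8 outdir=gemoma_out t=target.fasta)

# One "ID annotation genome" triple per reference species.
references=(
  "ref1 ref1.gff ref1.fasta"
  "ref2 ref2.gff ref2.fasta"
)

for ref in "${references[@]}"; do
  read -r id annotation genome <<< "$ref"
  # Repeat the i, a and g tags once per reference.
  args+=("i=$id" "a=$annotation" "g=$genome")
done

# Print the command; replace the echo line with "${args[@]}" to run it.
echo "${args[@]}"
```

Keeping the references in a small list means that adding a further species only requires one more line, while the order i, a, g stays the same for each reference.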

In addition, we recommend to set several parameters:

tblastn=false: use mmseqs instead of tblastn, since mmseqs is faster
GeMoMa.Score=ReAlign: states that the score from mmseqs should be recomputed as mmseqs uses an approximation
AnnotationFinalizer.r=NO: do not rename genes and transcripts
o=true: output individual predictions for each reference as a separate file allowing to rerun the combination step (GAF) very easily and quickly

If you want to specify the maximum intron length, please consider the parameters GeMoMa.m and GeMoMa.sil. If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. NCBI SRA, EMBL-EBI ENA), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section DenoiseIntrons.
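A minimal sketch of how mapped RNA-seq reads and a custom maximum intron length might be passed to the pipeline is given below. It uses the parameters r, ERE.m, DenoiseIntrons.m and GeMoMa.m from the GeMoMaPipeline parameter list; the file names and the value 20000 are assumptions chosen for illustration only, and the script prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Dry run: GeMoMaPipeline with mapped RNA-seq reads and a custom maximum
# intron length. File names and the value 20000 are hypothetical.
set -euo pipefail

max_intron=20000

cmd=(java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline
     threads=8 outdir=gemoma_out t=target.fasta
     i=ref1 a=ref1.gff g=ref1.fasta
     tblastn=false GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true)

# RNA-seq evidence: reads mapped against the target genome (BAM or SAM).
cmd+=(r=MAPPED ERE.m=mapped_reads.bam)

# Use the same intron length limit for denoising and for gene model building.
cmd+=("DenoiseIntrons.m=$max_intron" "GeMoMa.m=$max_intron")

# Print the command; replace the echo line with "${cmd[@]}" to run it.
echo "${cmd[@]}"
```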

Tools

GeMoMa pipeline

This tool can be used to run the complete GeMoMa pipeline. The tool is multi-threaded and can utilize all compute cores on one machine, but it cannot be distributed across several machines, as for instance in a compute cluster. It basically runs: Extract RNA-seq evidence (ERE), DenoiseIntrons, Extractor, external search (tblastn or mmseqs), Gene Model Mapper (GeMoMa), GeMoMa Annotation Filter (GAF), and AnnotationFinalizer.

GeMoMa pipeline may be called with

java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline

and has the following parameters

name	comment	type

t	target genome (Target genome file (FASTA))	FILE

The following parameter(s) can be used multiple times:
s	species (data for reference species, range={own, pre-extracted}, default = own)
Parameters for selection "own":
i	ID (ID to distinguish the different reference species, default = , OPTIONAL)	STRING
a	annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)	FILE
g	genome (Reference genome file (FASTA))	FILE
w	weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)	DOUBLE
ai	annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)	FILE
Parameters for selection "pre-extracted":
i	ID (ID to distinguish the different reference species, default = , OPTIONAL)	STRING
c	cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)	FILE
a	assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)	FILE
q	query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)	FILE
w	weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)	DOUBLE
ai	annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)	FILE

selected	selected (The path to list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)	FILE
gc	genetic code (optional user-specified genetic code, OPTIONAL)	FILE
tblastn	tblastn (if *true* tblastn is used as search algorithm, otherwise mmseqs is used. Tblastn and mmseqs need to be installed to use the corresponding option, default = true)	BOOLEAN
tag	tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)	STRING

r	RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)
No parameters for selection "NO"
Parameters for selection "MAPPED":
ERE.s	Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)	STRING

The following parameter(s) can be used multiple times:
ERE.m	mapped reads file (BAM/SAM files containing the mapped reads)	FILE

ERE.v	ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)	STRING
ERE.u	use secondary alignments (allows to filter flags in the SAM or BAM, default = true)	BOOLEAN
ERE.c	coverage (allows to output the coverage, default = false)	BOOLEAN
ERE.mmq	minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)	INT
Parameters for selection "EXTRACTED":

The following parameter(s) can be used multiple times:
introns	introns (Introns (GFF), which might be obtained from RNA-seq)	FILE

The following parameter(s) can be used zero or multiple times:
coverage	coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)
Parameters for selection "UNSTRANDED":
coverage_unstranded	coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
Parameters for selection "STRANDED":
coverage_forward	coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
coverage_reverse	coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE

d	denoise (removing questionable introns that have been extracted by ERE, range={DENOISE, RAW}, default = DENOISE)
Parameters for selection "DENOISE":
DenoiseIntrons.m	maximum intron length (The maximum length of an intron, default = 15000)	INT
DenoiseIntrons.me	minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)	DOUBLE
DenoiseIntrons.c	context (The context upstream a donor and downstream an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)	INT
No parameters for selection "RAW"

Extractor.p	proteins (whether the complete protein sequences should be returned as output, default = true)	BOOLEAN
Extractor.r	repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)	BOOLEAN
Extractor.a	Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = AMBIGUOUS)	STRING
Extractor.s	stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)	BOOLEAN
Extractor.f	full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)	BOOLEAN
GeMoMa.r	reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)	INT
GeMoMa.s	splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)	BOOLEAN
GeMoMa.sm	substitution matrix (optional user-specified substitution matrix, OPTIONAL)	FILE
GeMoMa.g	gap opening (The gap opening cost in the alignment, default = 11)	INT
GeMoMa.ge	gap extension (The gap extension cost in the alignment, default = 1)	INT
GeMoMa.m	maximum intron length (The maximum length of an intron, default = 15000)	INT
GeMoMa.sil	static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)	BOOLEAN
GeMoMa.i	intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)	INT
GeMoMa.e	e-value (The e-value for filtering blast results, default = 100.0)	DOUBLE
GeMoMa.c	contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)	DOUBLE
GeMoMa.rt	region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)	DOUBLE
GeMoMa.h	hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)	DOUBLE
GeMoMa.p	predictions (The (maximal) number of predictions per transcript, default = 10)	INT
GeMoMa.a	avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AS), default = true)	BOOLEAN
GeMoMa.approx	approx (whether an approximation is used to compute the score for intron gain, default = true)	BOOLEAN
GeMoMa.prefix	prefix (A prefix to be used for naming the predictions, default = )	STRING
GeMoMa.t	timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)	LONG
GeMoMa.Score	Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)	STRING
GAF.c	common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)	DOUBLE
GAF.m	maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)	INT
GAF.d	default attributes (Comma-separated list of attributes that is set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)	STRING
GAF.f	filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)	STRING
GAF.s	sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)	STRING
GAF.a	alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)	STRING

AnnotationFinalizer.u	UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)
No parameters for selection "NO"
No parameters for selection "YES"

AnnotationFinalizer.r	rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)
Parameters for selection "COMPOSED":
AnnotationFinalizer.p	prefix (the prefix of the generic name)	STRING
AnnotationFinalizer.i	infix (the infix of the generic name, default = G)	STRING
AnnotationFinalizer.s	suffix (the suffix of the generic name, default = 0)	STRING
AnnotationFinalizer.d	digits (the number of informative digits, valid range = [4, 10], default = 5)	INT
AnnotationFinalizer.di	delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )	STRING
Parameters for selection "SIMPLE":
AnnotationFinalizer.p	prefix (the prefix of the generic name)	STRING
AnnotationFinalizer.d	digits (the number of informative digits, valid range = [4, 10], default = 5)	INT
No parameters for selection "NO"

AnnotationFinalizer.n	name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true)	BOOLEAN
p	predicted proteins (If *true*, returns the predicted proteins of the target organism as fastA file, default = true)	BOOLEAN
pc	predicted CDSs (If *true*, returns the predicted CDSs of the target organism as fastA file, default = false)	BOOLEAN
pgr	predicted genomic regions (If *true*, returns the genomic regions of predicted gene models of the target organism as fastA file, default = false)	BOOLEAN
o	output individual predictions (If *true*, returns the predictions for each reference species, default = false)	BOOLEAN
debug	debug (If *false* removes all temporary files even if the job exits unexpectedly, default = true)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING
threads	The number of threads used for the tool, defaults to 1	INT

Example:

java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline t=<target_genome> a=<annotation> g=<genome> AnnotationFinalizer.p=<prefix>
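The filter expressions for GAF.f and GAF.a contain spaces, quotes and parentheses, so on a shell they have to be passed as a single quoted argument. The following sketch shows one way to do this with the more sophisticated filter quoted in the parameter list. The file names are hypothetical, and the script only prints the arguments, one per line, instead of starting GeMoMa.

```shell
#!/usr/bin/env bash
# Dry run: pass GAF filter expressions to GeMoMaPipeline as single arguments.
# File names are hypothetical placeholders.
set -euo pipefail

# Filter expressions as given in the descriptions of GAF.f and GAF.a.
filter="start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)"
alt_filter="tie==1 or evidence>1"

args=(java -jar GeMoMa-1.6.4.jar CLI GeMoMaPipeline
      t=target.fasta a=ref.gff g=ref.fasta
      r=MAPPED ERE.m=mapped_reads.bam
      "GAF.f=$filter" "GAF.a=$alt_filter")

# One argument per line shows that each filter stays a single argument.
printf '%s\n' "${args[@]}"
```

Storing the command in an array and expanding it as "${args[@]}" keeps each filter intact; an unquoted expansion would split the expressions at every space.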

Extract RNA-seq Evidence

This tool extracts introns and coverage from mapped RNA-seq reads. Introns might be denoised by the tool DenoiseIntrons. Introns and coverage results can be used in GeMoMa to improve the predictions and might help to select better gene models in GAF. In addition, introns and coverage can be used to predict UTRs by AnnotationFinalizer.

Extract RNA-seq Evidence may be called with

java -jar GeMoMa-1.6.4.jar CLI ERE

and has the following parameters

name	comment	type

s	Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)	STRING

The following parameter(s) can be used multiple times:
m	mapped reads file (BAM/SAM files containing the mapped reads)	FILE

v	ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT)	STRING
u	use secondary alignments (allows to filter flags in the SAM or BAM, default = true)	BOOLEAN
c	coverage (allows to output the coverage, default = false)	BOOLEAN
mmq	minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI ERE m=<mapped_reads_file>

CheckIntrons

The tool checks the distribution of introns on the strands and the dinucleotide distribution at splice sites.

CheckIntrons may be called with

java -jar GeMoMa-1.6.4.jar CLI CheckIntrons

and has the following parameters

name	comment	type

t	target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)	FILE

The following parameter(s) can be used multiple times:
i	introns (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)	FILE

v	verbose (A flag which allows to output a wealth of additional information per transcript, default = false)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI CheckIntrons t=<target_genome>

DenoiseIntrons

This module analyzes introns extracted by ERE. Introns with a large intron size or a low relative expression are possibly artefacts and will be removed. The result of this module can be used in the modules GeMoMa, AnnotationEvidence, and AnnotationFinalizer.

DenoiseIntrons may be called with

java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons

and has the following parameters

name	comment	type

The following parameter(s) can be used multiple times:
i	introns (Introns (GFF), which might be obtained from RNA-seq)	FILE

The following parameter(s) can be used multiple times:
c	coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)
Parameters for selection "UNSTRANDED":
coverage_unstranded	coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
Parameters for selection "STRANDED":
coverage_forward	coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
coverage_reverse	coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE

m	maximum intron length (The maximum length of an intron, default = 15000)	INT
me	minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01)	DOUBLE
context	context (The context upstream a donor and downstream an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI DenoiseIntrons i=<introns> coverage_unstranded=<coverage_unstranded>
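The two RNA-seq modules can be chained, with the output of ERE serving as input of DenoiseIntrons. The sketch below prints both calls for unstranded data. The input file name and the output directories are hypothetical, and the names of the ERE result files (introns.gff, coverage.bedgraph) are assumptions that should be checked against the actual ERE output.

```shell
#!/usr/bin/env bash
# Dry run: extract RNA-seq evidence with ERE, then denoise the introns.
# File names are hypothetical; the ERE output names are assumptions.
set -euo pipefail

jar=GeMoMa-1.6.4.jar

# Step 1: extract introns and, with c=true, the coverage from mapped reads.
ere=(java -jar "$jar" CLI ERE m=mapped_reads.bam c=true outdir=ere_out)

# Step 2: remove questionable introns using the introns and the coverage.
denoise=(java -jar "$jar" CLI DenoiseIntrons
         i=ere_out/introns.gff
         coverage_unstranded=ere_out/coverage.bedgraph
         outdir=denoise_out)

# Print both commands; expand the arrays directly to run them.
echo "${ere[@]}"
echo "${denoise[@]}"
```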

NCBI Reference Retriever

This tool can be used to download or update assembly and annotation files of reference organisms from NCBI. This makes it easy to collect all data necessary to start GeMoMaPipeline or Extractor.

NCBI Reference Retriever may be called with

java -jar GeMoMa-1.6.4.jar CLI NRR

and has the following parameters

name	comment	type

r	reference directory (the directory where the genome and annotation files of the reference organisms should be stored, default = references/)	STRING
n	number of tries (the number of tries for downloading a reference file, valid range = [1, 100], default = 10)	INT
rl	reference list (a list of reference organisms)	FILE
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI NRR rl=<reference_list>

Extractor

This tool can be used to create input files for GeMoMa, i.e., it creates at least a FASTA file containing the translated parts of the CDS and a tabular file containing the assignment of transcripts to genes and parts of CDS to transcripts. In addition, Extractor can be used to create several additional files from the final prediction, e.g., proteins and CDSs. Two inputs are mandatory: the genome as fasta or fasta.gz and the corresponding annotation as gff or gff.gz. The gff file should be sorted. If you want to set a user-specific genetic code, please use a tab-delimited file with two columns. The first column contains the amino acid in one-letter code, the second a list of triplets.

Extractor may be called with

java -jar GeMoMa-1.6.4.jar CLI Extractor

and has the following parameters

name	comment	type

a	annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome)	FILE
g	genome (Reference genome file (FASTA))	FILE
gc	genetic code (optional user-specified genetic code, OPTIONAL)	FILE
p	proteins (whether the complete protein sequences should be returned as output, default = false)	BOOLEAN
c	cds (whether the complete CDSs should be returned as output, default = false)	BOOLEAN
genomic	genomic (whether the genomic regions should be returned (upper case = coding, lower case = non coding), default = false)	BOOLEAN
r	repair (if a transcript annotation can not be parsed, the program will try to infer the phase of the CDS parts to repair the annotation, default = false)	BOOLEAN
s	selected (The path to list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns will be ignored., OPTIONAL)	FILE
Ambiguity	Ambiguity (This parameter defines how to deal with ambiguities in the DNA. There are 3 options: EXCEPTION, which will remove the corresponding transcript, AMBIGUOUS, which will use an X for the corresponding amino acid, and RANDOM, which will randomly select an amino acid from the list of possibilities., range={EXCEPTION, AMBIGUOUS, RANDOM}, default = EXCEPTION)	STRING
sefc	stop-codon excluded from CDS (A flag that states whether the reference annotation contains the stop codon in the CDS annotation or not, default = false)	BOOLEAN
f	full-length (A flag which allows for choosing between only full-length and all (i.e., full-length and partial) transcripts, default = true)	BOOLEAN
v	verbose (A flag which allows to output a wealth of additional information, default = false)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI Extractor a=<annotation> g=<genome>

GeneModelMapper

This tool is the main part of GeMoMa, a homology-based gene prediction tool. GeMoMa builds gene models from search results (e.g. tblastn or mmseqs).

As a first step, you should run Extractor to obtain the cds parts and the assignment. Second, you should run a search algorithm, e.g. tblastn or mmseqs, with the cds parts as query. Finally, these search results are used in GeMoMa. Search results should be clustered according to the reference genes. The easiest way is to sort the search results according to the first column. If the search results are not sorted by default (e.g. mmseqs), you should use the parameter sort. If you want to run GeMoMa ignoring intron position conservation, you should blast protein sequences and feed the results in query cds parts and leave assignment unselected.

If you want to run GeMoMa using RNA-seq evidence, you should map your RNA-seq reads to the genome and run ERE on the mapped reads. For several reasons, spurious introns can be extracted from RNA-seq data. Hence, we recommend running DenoiseIntrons to remove such spurious introns. Finally, you can use the obtained introns (and coverage) in GeMoMa.

If you want to obtain multiple predictions per gene model of the reference organism, you should set predictions accordingly. In addition, we suggest decreasing the value of contig threshold, allowing GeMoMa to evaluate more candidate contigs/chromosomes.

If you change the values of contig threshold, region threshold and hit threshold, this will influence the predictions as well as the runtime of the algorithm. The lower the values are, the slower the algorithm is.

You can filter your predictions using GAF, which also allows for combining predictions from different reference organisms.

Finally, you can predict UTRs and rename predictions using AnnotationFinalizer.

If you want to run the complete GeMoMa pipeline and not only specific modules, you can run the multi-threaded module GeMoMaPipeline.
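The manual workflow described above can be sketched as a dry run that prints the Extractor and GeMoMa calls. All file names are hypothetical. In particular, the names of the Extractor outputs (cds-parts.fasta, assignment.tabular) and of the search result file are assumptions, and the external search call is left out because it depends on the chosen search tool.

```shell
#!/usr/bin/env bash
# Dry run of the manual workflow: Extractor, external search, GeMoMa.
# File names are hypothetical; the Extractor output names are assumptions.
set -euo pipefail

jar=GeMoMa-1.6.4.jar

# Step 1: extract the CDS parts and the assignment from the reference.
step1=(java -jar "$jar" CLI Extractor a=ref.gff g=ref.fasta outdir=extractor_out)

# Step 2: run tblastn or mmseqs with the CDS parts as query against the
# target genome. The call is omitted here; its result is assumed to be
# stored in search.txt.

# Step 3: build gene models from the search results. sort=true asks GeMoMa
# to sort results that are not grouped by reference gene (e.g. from mmseqs).
step3=(java -jar "$jar" CLI GeMoMa s=search.txt t=target.fasta
       c=extractor_out/cds-parts.fasta a=extractor_out/assignment.tabular
       sort=true outdir=gemoma_out)

# Print both commands; expand the arrays directly to run them.
echo "${step1[@]}"
echo "${step3[@]}"
```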

GeneModelMapper may be called with

java -jar GeMoMa-1.6.4.jar CLI GeMoMa

and has the following parameters

name	comment	type

s	search results (The search results, e.g., from tblastn or mmseqs)	FILE
t	target genome (The target genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)	FILE
c	cds parts (The query cds parts file (FASTA), i.e., the cds parts that have been blasted)	FILE
a	assignment (The assignment file, which combines parts of the CDS to transcripts, OPTIONAL)	FILE
q	query proteins (optional query protein file (FASTA) for computing the optimal alignment score against complete protein prediction, OPTIONAL)	FILE

The following parameter(s) can be used zero or multiple times:
i	introns (Introns (GFF), which might be obtained from RNA-seq)	FILE

r	reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)	INT
splice	splice (if no intron is given by RNA-seq, compute candidate splice sites or not, default = true)	BOOLEAN

The following parameter(s) can be used zero or multiple times:
coverage	coverage (experimental coverage (RNA-seq), range={UNSTRANDED, STRANDED}, default = UNSTRANDED)
Parameters for selection "UNSTRANDED":
coverage_unstranded	coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
Parameters for selection "STRANDED":
coverage_forward	coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
coverage_reverse	coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE

g	genetic code (optional user-specified genetic code, OPTIONAL)	FILE
sm	substitution matrix (optional user-specified substitution matrix, OPTIONAL)	FILE
go	gap opening (The gap opening cost in the alignment, default = 11)	INT
ge	gap extension (The gap extension cost in the alignment, default = 1)	INT
m	maximum intron length (The maximum length of an intron, default = 15000)	INT
sil	static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)	BOOLEAN
intron-loss-gain-penalty	intron-loss-gain-penalty (The penalty used for intron loss and gain, default = 25)	INT
e	e-value (The e-value for filtering blast results, default = 100.0)	DOUBLE
ct	contig threshold (The threshold for evaluating contigs, valid range = [0.0, 1.0], default = 0.4)	DOUBLE
rt	region threshold (The threshold for evaluating regions, valid range = [0.0, 1.0], default = 0.9)	DOUBLE
h	hit threshold (The threshold for adding additional hits, valid range = [0.0, 1.0], default = 0.9)	DOUBLE
p	predictions (The (maximal) number of predictions per transcript, default = 10)	INT
selected	selected (The path to list file, which allows to make predictions only for the contained transcript IDs. The first column should contain transcript IDs as given in the annotation. Remaining columns can be used to determine a target region that should be overlapped by the prediction, if columns 2 to 5 contain chromosome, strand, start and end of region, OPTIONAL)	FILE
as	avoid stop (A flag which allows to avoid stop codons in a transcript (except the last AS), default = true)	BOOLEAN
approx	approx (whether an approximation is used to compute the score for intron gain, default = true)	BOOLEAN
prefix	prefix (A prefix to be used for naming the predictions, default = )	STRING
tag	tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)	STRING
v	verbose (A flag which allows to output a wealth of additional information per transcript, default = false)	BOOLEAN
timeout	timeout (The (maximal) number of seconds to be used for the predictions of one transcript, if exceeded GeMoMa does not output a prediction for this transcript., valid range = [0, 604800], default = 3600)	LONG
sort	sort (A flag which allows to sort the search results, default = false)	BOOLEAN
Score	Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = Trust)	STRING
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI GeMoMa s=<search_results> t=<target_genome> c=<cds_parts> a=<assignment>

GeMoMa Annotation Filter

This tool combines and filters gene predictions from different sources yielding a common gene prediction. The tool does not modify the predictions, but filters redundant and low-quality predictions and selects relevant predictions. In addition, it adds attributes to the annotation of transcript predictions.

The algorithm does the following: First, redundant predictions are identified (and additional attributes (evidence, sumWeight) are introduced). Second, the predictions are filtered using the user-specified criterion based on the attributes from the annotation. Third, clusters of overlapping predictions are determined, the predictions are sorted within the cluster and relevant predictions are extracted.
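To make the second step more concrete, the following sketch applies the default filter criterion (start=='M' and stop=='*' and score/aa>=0.75) to GFF lines with awk. It only illustrates what the criterion means and is not how GAF is implemented. The function name passes_default_filter and the three example predictions are hypothetical, and the sketch assumes that the attributes start, stop, score and aa are stored as key=value pairs in the ninth GFF column.

```shell
#!/usr/bin/env bash
# Sketch of the default GAF filter criterion applied to GFF lines.
# Illustration only; GAF itself may evaluate the filter differently.
set -euo pipefail

passes_default_filter() {
  awk -F'\t' '
    $3 == "prediction" {
      # Collect the key=value pairs of the attribute column (column 9).
      split("", attr)
      n = split($9, pairs, ";")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        if (eq > 0) attr[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
      }
      # Force numeric interpretation before computing the relative score.
      aa = attr["aa"] + 0
      score = attr["score"] + 0
      if (attr["start"] == "M" && attr["stop"] == "*" && aa > 0 && score / aa >= 0.75) print
    }'
}

# Three hypothetical predictions; only t1 is complete with a score/aa of 0.9.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 GeMoMa prediction 100 900 . + . 'ID=t1;aa=200;score=180;start=M;stop=*' \
  chr1 GeMoMa prediction 2000 2900 . + . 'ID=t2;aa=200;score=100;start=M;stop=*' \
  chr2 GeMoMa prediction 50 650 . - . 'ID=t3;aa=200;score=190;start=L;stop=*' |
  passes_default_filter
```

In this example, t2 is dropped because its relative score is 0.5, and t3 is dropped because it does not start with a methionine.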

Optionally, annotation info can be added for each reference organism enabling a functional prediction for predicted transcripts based on the function of the reference transcript. Phytozome provides annotation info tables for plants, but annotation info can be used from any source as long as they are tab-delimited files with at least the following columns: transcriptName, GO and .*defline.

Initially, GAF was built to combine gene predictions obtained from GeMoMa. It can combine the predictions from multiple reference organisms, but also works with only one reference organism. However, GAF can also integrate predictions from ab-initio or purely RNA-seq-based gene predictors as well as manually curated annotation. If you would like to do so, we recommend running AnnotationEvidence for each of these input files to add additional attributes that can be used for sorting and filtering within GAF. The sort and filter criteria need to be carefully revised in this case. Default values can be set for missing attributes.

GeMoMa Annotation Filter may be called with

java -jar GeMoMa-1.6.4.jar CLI GAF

and has the following parameters

name	comment	type
t	tag (the tag used to read the GeMoMa annotations, default = prediction)	STRING
c	common border filter (the filter on the common borders of transcripts, the lower the more transcripts will be checked as alternative splice isoforms, valid range = [0.0, 1.0], default = 0.75)	DOUBLE
m	maximal number of transcripts per gene (the maximal number of allowed transcript predictions per gene, valid range = [1, 2147483647], default = 2147483647)	INT

The following parameter(s) can be used multiple times:

p	prefix (the prefix can be used to distinguish predictions from different input files, OPTIONAL)	STRING
w	weight (the weight can be used to prioritize predictions from different input files; each prediction will get an additional attribute sumWeight that can be used in the filter, valid range = [0.0, 1000.0], default = 1.0, OPTIONAL)	DOUBLE
g	gene annotation file (GFF files containing the gene annotations (predicted by GeMoMa))	FILE
a	annotation info (annotation information of the reference, tab-delimited file containing at least the columns transcriptName, GO and .*defline, OPTIONAL)	FILE

d	default attributes (Comma-separated list of attributes that are set to NaN if they are not given in the annotation file. This allows to use these attributes for sorting or filter criteria. It is especially meaningful if the gene annotation files were received from different software packages (e.g., GeMoMa, Braker, ...) having different attributes., default = tie,tde,tae,iAA,pAA,score)	STRING
f	filter (A filter can be applied to transcript predictions to receive only reasonable predictions. The filter is applied to the GFF attributes. The default filter decides based on the completeness of the prediction (start=='M' and stop=='*') and the relative score (score/aa>=0.75) whether a prediction is used or not. Different criteria can be combined using 'and' and 'or'. In addition, you can check for NaN, e.g., 'isNaN(score)'. A more sophisticated filter could be applied for instance using the completeness, the relative score, the evidence as well as statistics based on RNA-seq: start=='M' and stop =='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0), default = start=='M' and stop =='*' and score/aa>=0.75, OPTIONAL)	STRING
s	sorting (comma-separated list that defines the way predictions are sorted within a cluster, default = evidence,score)	STRING
atf	alternative transcript filter (If a prediction is suggested as an alternative transcript, this additional filter will be applied. The filter works syntactically like the 'filter' parameter and allows you to keep the number of alternative transcripts small based on meaningful criteria. Reasonable filter could be based on RNA-seq data (tie==1) or on evidence (evidence>1). A more sophisticated filter could be applied combining several criteria: tie==1 or evidence>1, default = tie==1 or evidence>1, OPTIONAL)	STRING
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI GAF g=<gene_annotation_file>

AnnotationFinalizer

This tool finalizes an annotation. It can predict UTRs for annotated coding sequences and generate generic gene and transcript names. UTR prediction might be negatively influenced (i.e., predictions become too long) by genomic contamination of RNA-seq libraries, overlapping genes or genes in close proximity as well as unstranded RNA-seq libraries. Please use ERE to preprocess the mapped reads.

AnnotationFinalizer may be called with

java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer

and has the following parameters

name	comment	type
g	genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)	FILE
a	annotation (The predicted genome annotation file (GFF))	FILE
t	tag (A user-specified tag for transcript predictions in the third column of the returned gff. It might be beneficial to set this to a specific value for some genome browsers., default = prediction)	STRING
u	UTR (allows to predict UTRs using RNA-seq data, range={NO, YES}, default = NO)

No parameters for selection "NO"

Parameters for selection "YES":

The following parameter(s) can be used multiple times:

i	introns file (Introns (GFF), which might be obtained from RNA-seq)	FILE

r	reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)	INT

The following parameter(s) can be used multiple times:

c	coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)

No parameters for selection "NO"
Parameters for selection "UNSTRANDED":
coverage_unstranded	coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
Parameters for selection "STRANDED":
coverage_forward	coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
coverage_reverse	coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE

rename	rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)

Parameters for selection "COMPOSED":
p	prefix (the prefix of the generic name)	STRING
infix	infix (the infix of the generic name, default = G)	STRING
s	suffix (the suffix of the generic name, default = 0)	STRING
d	digits (the number of informative digits, valid range = [4, 10], default = 5)	INT
di	delete infix (a comma-separated list of infixes that is deleted from the sequence names before building the gene/transcript name, default = )	STRING
Parameters for selection "SIMPLE":
p	prefix (the prefix of the generic name)	STRING
d	digits (the number of informative digits, valid range = [4, 10], default = 5)	INT
No parameters for selection "NO"

n	name attribute (if true the new name is added as new attribute "Name", otherwise "Parent" and "ID" values are modified accordingly, default = true)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI AnnotationFinalizer g=<genome> a=<annotation> p=<prefix>

Annotation evidence

This tool adds attributes to the annotation, e.g., tie, tpc, aa, start, stop. These attributes can be used, for instance, if the annotation is used in GAF. All predictions of the annotation are used. The predictions are not filtered for internal stop codons, missing start or stop codons, frame-shifts, etc. Please use ERE to preprocess the mapped reads.

Annotation evidence may be called with

java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence

and has the following parameters

name	comment	type
a	annotation (The genome annotation file (GFF))	FILE
g	genome (The genome file (FASTA), i.e., the target sequences in the blast run. Should be in IUPAC code)	FILE

The following parameter(s) can be used multiple times:

i	introns file (Introns (GFF), which might be obtained from RNA-seq, OPTIONAL)	FILE

r	reads (if introns are given by a GFF, only use those which have at least this number of supporting split reads, valid range = [1, 2147483647], default = 1)	INT

The following parameter(s) can be used multiple times:

c	coverage file (experimental coverage (RNA-seq), range={NO, UNSTRANDED, STRANDED}, default = NO)

No parameters for selection "NO"
Parameters for selection "UNSTRANDED":
coverage_unstranded	coverage_unstranded (The coverage file contains the unstranded coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
Parameters for selection "STRANDED":
coverage_forward	coverage_forward (The coverage file contains the forward coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE
coverage_reverse	coverage_reverse (The coverage file contains the reverse coverage of the genome per interval. Intervals with coverage 0 (zero) can be left out.)	FILE

ao	annotation output (if the annotation should be returned with attributes tie, tpc, and aa, default = false)	BOOLEAN
gc	genetic code (optional user-specified genetic code, OPTIONAL)	FILE
outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI AnnotationEvidence a=<annotation> g=<genome>

Compare transcripts

This tool compares a predicted annotation with a given annotation in terms of the F1 measure. If the F1 measure is 1, both annotations are in perfect agreement for this transcript. The smaller the value, the lower the agreement. If it is NA, there is no overlapping annotation.
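As an illustration of the F1 measure, the following Python sketch compares two transcripts on the nucleotide level. This page does not state on which level CompareTranscripts computes F1, so treating each coding position as one item is an assumption, as are the function names; `None` stands in for NA.

```python
def coding_positions(exons):
    """Set of genomic positions covered by (start, end) exons, 1-based inclusive."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def f1_measure(predicted_exons, annotated_exons):
    """F1 = 2 * precision * recall / (precision + recall); None if nothing overlaps."""
    predicted = coding_positions(predicted_exons)
    annotated = coding_positions(annotated_exons)
    shared = len(predicted & annotated)
    if shared == 0:
        return None
    precision = shared / len(predicted)   # fraction of predicted positions that are annotated
    recall = shared / len(annotated)      # fraction of annotated positions that are predicted
    return 2 * precision * recall / (precision + recall)


# Made-up coordinates for illustration.
identical = f1_measure([(100, 199), (300, 399)], [(100, 199), (300, 399)])
partial = f1_measure([(100, 199), (300, 399)], [(100, 199), (300, 349)])
disjoint = f1_measure([(100, 199)], [(500, 599)])
print(identical, partial, disjoint)
```

In the second case, the prediction covers all 150 annotated positions but adds 50 more, so precision is 0.75, recall is 1 and F1 is 6/7.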

Compare transcripts may be called with

java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts

and has the following parameters

name	comment	type
p	prediction (The predicted annotation)	FILE
a	annotation (The true annotation)	FILE

The following parameter(s) can be used zero or multiple times:

prefix	prefix (the prefix can be used to distinguish predictions from different input files, default = , OPTIONAL)	STRING
assignment	assignment (the transcript info for the reference of the prediction)	FILE

outdir	The output directory, defaults to the current working directory (.)	STRING

Example:

java -jar GeMoMa-1.6.4.jar CLI CompareTranscripts p=<prediction> a=<annotation>

GFF attributes

Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.

Attribute	Long name	Tool	Necessary parameter	Feature	Description
aa	amino acids	GeMoMa		prediction	the number of amino acids in the protein
score	GeMoMa score	GeMoMa		prediction	score computed by GeMoMa using the substitution matrix, gap costs and additional penalties
minCov	minimal coverage	GeMoMa	coverage, ...	prediction	minimal coverage of any base of the prediction given RNA-seq evidence
avgCov	average coverage	GeMoMa	coverage, ...	prediction	average coverage of all bases of the prediction given RNA-seq evidence
tpc	transcript percentage coverage	GeMoMa	coverage, ...	prediction	percentage of covered bases per predicted transcript given RNA-seq evidence
tae	transcript acceptor evidence	GeMoMa	introns	prediction	percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence
tde	transcript donor evidence	GeMoMa	introns	prediction	percentage of predicted donor sites per predicted transcript with RNA-seq evidence
tie	transcript intron evidence	GeMoMa	introns	prediction	percentage of predicted introns per predicted transcript with RNA-seq evidence
minSplitReads	minimal split reads	GeMoMa	introns	prediction	minimal number of split reads for any of the predicted introns per predicted transcript
iAA	identical amino acid	GeMoMa	query proteins	prediction	percentage of identical amino acids between reference transcript and prediction
pAA	positive amino acid	GeMoMa	query proteins	prediction	percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix
evidence		GAF		prediction	number of reference organisms that have a transcript yielding this prediction
alternative		GAF		prediction	alternative gene ID(s) leading to the same prediction
sumWeight		GAF		prediction	the sum of the weights of the references that perfectly support this prediction
maxTie	maximal tie	GAF		gene	maximal tie of all transcripts of this gene
maxEvidence	maximal evidence	GAF		gene	maximal evidence of all transcripts of this gene

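If you would like to inspect these attributes by hand, a few lines of Python are sufficient. The sketch below parses the attribute column of a single GFF line and evaluates the more sophisticated filter quoted in the GAF section. The example line and its values are made up, the helper names are hypothetical, and mapping missing or NA values to NaN only mirrors the idea of GAF's "default attributes" parameter.

```python
import math


def parse_attributes(column9):
    """Split the ninth GFF column ('key=value;key=value') into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if field:
            key, _, value = field.partition("=")
            attributes[key.strip()] = value.strip()
    return attributes


def numeric(attributes, key):
    """Return the attribute as float; missing values and 'NA' become NaN."""
    value = attributes.get(key, "NA")
    return float("nan") if value == "NA" else float(value)


# Made-up example line; the values are illustrative, not real GeMoMa output.
line = ("chr1\tGeMoMa\tprediction\t100\t400\t.\t+\t.\t"
        "ID=gene1_R0;aa=67;score=180;tie=1;tpc=1.0;evidence=2;start=M;stop=*")

attrs = parse_attributes(line.split("\t")[8])
relative_score = numeric(attrs, "score") / numeric(attrs, "aa")

# start=='M' and stop=='*' and score/aa>=0.75 and (evidence>1 or tpc==1.0)
keep = (
    attrs.get("start") == "M"
    and attrs.get("stop") == "*"
    and relative_score >= 0.75
    and (numeric(attrs, "evidence") > 1 or numeric(attrs, "tpc") == 1.0)
)

# Attributes that are absent or NA (e.g., tie for single coding exon genes) become NaN.
missing_tie = math.isnan(numeric({"tie": "NA"}, "tie"))
print(round(relative_score, 2), keep, missing_tie)
```

Comparisons with NaN are always false, so a prediction without the respective attribute would not pass a criterion such as tie==1 in this sketch.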
FAQs

Why does the Extractor not return a single CDS-part, protein, ...?

First, please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) Second, please check whether you have a valid GFF/GTF file. Valid GFF files should have a valid "ID" or "Parent" entry in the attributes column. Valid GTF files should have a valid "gene_id" and "transcript_id" entry. Finally, please check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.

How can I force GeMoMa to make more predictions?

There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing p allows GeMoMa to output more of the predictions that have been computed. Decreasing ct increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.

Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?

By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:

Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").
Filter the predictions using GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF).

Is it mandatory to use RNA-seq data?

No, GeMoMa is able to make predictions with and without RNA-seq evidence.

Is it possible to use multiple reference organisms?

It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to combine these annotations.

Why do some reference genes not lead to a prediction in the target genome?

Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).

If the genes have been discarded, there are two possibilities:

The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.
There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.

If the reference genes passed the Extractor, there are several possible explanations for this behavior. The most prominent are:

GeMoMa stopped the prediction of a reference gene because it did not return a result within the given time (cf. parameter "timeout").
GeMoMa simply did not find a prediction matching the remaining quality criteria.
GeMoMa did find a prediction, but it was filtered out by GAF, e.g., too low relative score, missing start or stop codon (cf. GAF parameters).

What does "partial gene model" mean in the context of GeMoMa?

We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.

For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?

GeMoMa makes the predictions for each reference transcript independently. Hence, it can occur that some of the predictions of different reference transcripts overlap or are identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).

A lot of transcripts have been filtered out by the Extractor. What can I do?

There are several reasons for removing transcripts by the Extractor. In at least two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.

Is GeMoMa able to predict pseudo-genes/ncRNA?

No, currently not.

My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?

GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with similar exon-intron structure in the target species and does not stick too much to RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns adding some additional costs (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame.

Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.

My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?

GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.

Does GeMoMa predict multiple transcripts per gene?

GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.

GeMoMa failed with java.lang.OutOfMemoryError. What can I do?

Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions, but does not include all statistics. Another point could be the protein alignment, if you use the optional parameter query protein. Again, you can run GeMoMa without this parameter, which will return the same predictions, but fewer statistics.

I need to specify the genetic code for my organisms. What is the expected format?

The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:

https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt

Alternative genetic codes are described here using the RNA alphabet:

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.

I like to accelerate GeMoMa. What can I do?

If you like to improve the runtime, you can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.

In addition, you can change the search algorithm that is used in GeMoMa. Tblastn is used by default as search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs which is typically much faster. If you run the GeMoMaPipeline you just have to select tblastn=false. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.

If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.

Is there a way to use the GeMoMa code to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?

There are at least two ways to do this. If you use GeMoMaPipeline you can

(A) Use the parameter "selected" to select specific gene models (=transcripts/proteins) from the annotation instead of using all or
(B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.

Using one of these options you can look for a single or few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.
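Regarding the genetic code format described in the FAQ above, a file can be checked with a short script before it is passed to GeMoMa or the Extractor. The following Python sketch parses the two column tab-delimited layout and rejects anything that is not a DNA triplet. The function name, the error handling and the use of `*` for stop codons in the example are assumptions; please consult the linked default genetic code for the authoritative layout.

```python
def parse_genetic_code(text):
    """Parse 'amino_acid<TAB>codon,codon,...' lines into a codon -> amino acid dict."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        amino_acid, codons = line.split("\t")
        for codon in codons.split(","):
            codon = codon.strip().upper()
            # GeMoMa works on genomic DNA, so U (RNA alphabet) is rejected here.
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"invalid codon {codon!r} for {amino_acid!r}")
            if codon in table:
                raise ValueError(f"codon {codon} is assigned twice")
            table[codon] = amino_acid
    return table


# Fragment of a table for illustration only; a real file lists all 64 triplets.
example = "M\tATG\nW\tTGG\n*\tTAA,TAG,TGA\n"
table = parse_genetic_code(example)
print(table)

# A table copied from the RNA alphabet would fail the check:
try:
    parse_genetic_code("M\tAUG\n")
    rna_accepted = True
except ValueError:
    rna_accepted = False
```

If you start from a table given in the RNA alphabet, replacing every U by T in the second column yields the DNA spelling expected here.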

References

If you use GeMoMa, please cite

J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 2016. doi: 10.1093/nar/gkw092

J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 2018. doi: 10.1186/s12859-018-2203-5

Version history

GeMoMa 1.6.4 (24.04.2020)

improved help section
change gff attribute "AA" to "aa"
GAF:
- bugfix overlapping genes
- accelerated computation
GeMoMa:
- bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs
- change GFF attribute AA to aa
AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming

GeMoMa 1.6.3 (05.03.2020)

Jstacs changes:
- CLI: bugfix ExpandableParameterSet
python wrapper (for *conda)
updated tests.sh, run.sh, pipeline.sh
rename Denoise to DenoiseIntrons
AnnotationEvidence: write phase (as given) to gff
GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files
GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false
GeMoMaPipeline:
- bugfix: time-out
- improve output
- separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)

GeMoMa 1.6.2 (17.12.2019)

Jstacs changes:
- test methods for modules
- live protocol for Galaxy
new module Denoise: allowing to clean introns extracted by ERE
new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.
GAF:
- bugfix for filter using specific attributes if no RNA-seq or query proteins was used
- allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms
GeMoMa: bugfix for timeout
GeMoMaPipeline:
- bugfix reporting predicted partial proteins
- improved protocol
- new default value for query proteins (changed from false to true)
- new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)

GeMoMa 1.6.1 (4.06.2019)

createGalaxyIntegration.sh: bugfix for GeMoMaPipeline
new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)
AnnotationFinalizer: bugfix for sequence IDs with large numbers
CompareTranscripts:
- bugfix for prefix of ref-gene
- allow no transcript info, but making assignment non-optional if a transcript info is set
GAF: bugfix for Galaxy integration
GeMoMaPipeline:
- improved output in case of Exceptions
- new parameter "output individual predictions" allows to in- or exclude individual predictions from each reference organism in the final result
- new parameter "weight" allows weights for reference species (cf. GAF)
ERE: new parameter "minimum mapping quality"

GeMoMa 1.6 (2.04.2019)

allow to use mmseqs as alternative to tblastn
AnnotationEvidence:
- allows to add attributes to the input gff: tie, tpc, AA, start, stop
- new parameter for gff output
AnnotationFinalizer: new tool for predicting UTRs and renaming predictions
GAF:
- relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof
- sorting criteria of the predictions within clusters can now be user-specified
- new attribute for genes: combinedEvidence
- new attribute for predictions: sumWeight
- allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation
- bugfix for predictions from multiple reference organisms
- improved statistic output
GeMoMa
- renamed the parameter tblastn results to search results
- new parameter for sorting the results of the similarity search (tblastn or mmseqs), if you use mmseqs for the similarity search you have to use sort
- new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)
- bugfix: threshold for introns from multiple files

GeMoMa 1.5.3 (23.07.2018)

improved parameter description and presentation
GeMoMaPipeline:
- removed unnecessary parameters
GeMoMa:
- bugfix: reading coverage file
- removed parameter genomic (cf. Extractor)
- removed protein output (cf. Extractor)
GAF:
- bugfix: prefix
Extractor:
- new parameter genomic

GeMoMa 1.5.2 (31.5.2018)

GAF:
- new parameter that allows to restrict the maximal number of transcript predictions per gene
- altered behavior of the evidence filter from percentages to absolute values
- bugfix: nested genes
- checking for duplicates in prediction IDs
GeMoMa:
- warning if RNA-seq data does not match with target genome
GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading
folder for temporary files of GeMoMa

GeMoMa 1.5 (13.02.2018)

AnnotationEvidence: add chromosome to output
CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF
Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons
ExtractRNASeqEvidence:
- print intron length stats
- include program infos in introns.gff3
GeMoMa:
- new attribute pAA in gff output if query protein is given
- include program infos in predicted_annotation.gff3
- minor bugfix
GAF:
- new parameter that allows to specify a prefix for each input gff
- collect and print program infos to filtered_prediction.gff3
- improved statistics output

GeMoMa 1.4.2 (21.07.2017)

automatic searching for available updates
AnnotationEvidence: bugfix (tie computation: Arrays.binarysearch does not find first match)
Extractor: bugfix (files that are not zipped)
GeMoMa: bugfix (tie computation: Arrays.binarysearch does not find first match)

GeMoMa 1.4.1 (30.05.2017)

CompareTranscripts: bugfix (NullPointerException)
Extractor: reference genome can be .*fa.gz and .*fasta.gz
GeMoMa: bugfix (shutdown problem after timeout)
modified additional scripts and documentation

GeMoMa 1.4 (03.05.2017)

AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)
CompareTranscripts: new tool comparing predicted and given annotation (gff)
Extractor:
- reading CDS with no parent tag (cf. discontinuous feature)
- automatic recognition of GFF or GTF annotation
- Warning if sequences mentioned in the annotation are not included in the reference sequence
GeMoMa:
- allowing for multiple intron and coverage files (= using different library types at the same time)
- NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes
- new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)
- bugfix (write pc and minCov if possible for last CDS part in predicted annotation)
- bugfix (ref-gene name if no assignment is used)
- bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)
GAF:
- nested genes on the same strand
- bugfix (if nothing passes the filter)

GeMoMa 1.3.2 (18.01.2017)

Extractor: new parameter repair for broken transcript annotations
GeMoMa: bugfixes (splice site computation)

GeMoMa 1.3.1 (09.12.2016)

GeMoMa bugfix (finding start/stop codon for very small exons)

GeMoMa 1.3 (06.12.2016)

ERE: new tool for extracting RNA-seq evidence
Extractor: offers options for
- partial gene models
- ambiguities
GeMoMa:
- RNA-seq
  - defining splice sites
  - additional feature in GFF and output
    - transcript intron evidence (tie)
    - transcript acceptor evidence (tae)
    - transcript donor evidence (tde)
    - transcript percentage coverage (tpc)
    - ...
- improved GFF
- simplified the command line parameters
- IMPORTANT: parameter names changed for some parameters
GAF: new tool for filtering and combining different predictions (especially of different reference organisms)

GeMoMa 1.1.3 (06.06.2016)

minor modifications to the Extractor tool

GeMoMa 1.1.2 (05.02.2016)

GeMoMa bugfix (upstream, downstream sequence for splice site detection)
Extractor: new parameter s for selecting transcripts
improved Galaxy integration

GeMoMa 1.1.1 (01.02.2016)

initial release for paper

