GeMoMa: Difference between revisions
Line 1: | Line 1: | ||
'''Ge'''ne '''Mo'''del '''Ma'''pper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. To do so, GeMoMa exploits the conservation of amino acid sequences and intron positions. In addition, GeMoMa can incorporate RNA-seq evidence for splice site prediction.
GeMoMa is available on a public web server at [http://galaxy.informatik.uni-halle.de galaxy.informatik.uni-halle.de]. The provided web server only allows a limited number of reference genes and uses a timeout of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa into your own [https://galaxyproject.org/ Galaxy] instance.
{|
|__TOC__
|[[File:GeMoMa-schema.png|thumb|right|450px|Schema of GeMoMa algorithm]]
|}
== Installation ==
GeMoMa is now available via bioconda. Here is the [https://anaconda.org/bioconda/gemoma direct link to the package]. To install this package with conda, run:
 conda install -c bioconda gemoma
However, you can also install GeMoMa manually.
=== Requirements ===
To run GeMoMa, you need the following software on your computer:
* Java v1.8 or later
* [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ blast] or [https://github.com/soedinglab/MMseqs2 mmseqs]
=== Download ===
GeMoMa is implemented in Java using Jstacs. You can [http://www.jstacs.de/download.php?which=GeMoMa download a zip file] containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for
Line 19: | Line 28: | ||
</ul>
== In a nutshell ==
GeMoMa is a modular, homology-based gene prediction program with huge flexibility. However, we also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using the GeMoMaPipeline like this:
 java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome>
Several parameters, indicated by '''<'''foo'''>''', need to be set. You can specify
* the number of threads
* the output directory
* the target genome
* and the reference ID (optional), annotation and genome. If you have several references, just repeat <code>s=own</code> and the parameter tags <code>i</code>, <code>a</code>, <code>g</code> with the corresponding values.
In addition, we recommend setting several parameters:
* <code>GeMoMa.Score=ReAlign</code>: states that the score from mmseqs should be recomputed, as mmseqs uses an approximation
* <code>AnnotationFinalizer.r=NO</code>: do not rename genes and transcripts
* <code>o=true</code>: output individual predictions for each reference as a separate file, which allows rerunning the combination step ('''GAF''') easily and quickly
If you would like to specify the maximum intron length, please consider the parameters <code>GeMoMa.m</code> and <code>GeMoMa.sil</code>.
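The call described above can also be assembled by a small script, which is convenient when several references are used. The following Python sketch only uses the parameters named in this section. The function <code>build_pipeline_command</code>, the type <code>Reference</code> and all file names are hypothetical, and the sketch assumes that <code>s=own</code> may precede every reference block, including the first one.

```python
from typing import List, NamedTuple, Optional


class Reference(NamedTuple):
    """One reference organism (hypothetical helper type)."""
    annotation: str                 # value for a=
    genome: str                     # value for g=
    ref_id: Optional[str] = None    # value for i= (optional)


def build_pipeline_command(jar: str, target_genome: str,
                           references: List[Reference],
                           threads: int = 1,
                           outdir: str = "results") -> List[str]:
    """Return the argument list for a GeMoMaPipeline run."""
    if not references:
        raise ValueError("at least one reference is required")
    cmd = [
        "java", "-jar", jar, "CLI", "GeMoMaPipeline",
        f"threads={threads}",
        f"outdir={outdir}",
        "GeMoMa.Score=ReAlign",
        "AnnotationFinalizer.r=NO",
        "o=true",
        f"t={target_genome}",
    ]
    # Assumption: each reference is introduced by s=own, followed by i, a, g.
    for ref in references:
        cmd.append("s=own")
        if ref.ref_id is not None:
            cmd.append(f"i={ref.ref_id}")
        cmd.append(f"a={ref.annotation}")
        cmd.append(f"g={ref.genome}")
    return cmd


command = build_pipeline_command(
    "GeMoMa-1.9.jar", "target.fa",
    [Reference("ref1.gff", "ref1.fa", "ref1"),
     Reference("ref2.gff", "ref2.fa", "ref2")],
    threads=4,
)
print(" ".join(command))
```

The resulting list can be passed to <code>subprocess.run</code> without shell quoting issues. Please compare the printed command with [[GeMoMa-Docs]] before running it.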
If you have RNA-seq data, either from your own experiments or from publicly available data sets (cf. [https://www.ncbi.nlm.nih.gov/sra NCBI SRA], [https://www.ebi.ac.uk/ena EMBL-EBI ENA]), we recommend using them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend checking the parameters of the section '''DenoiseIntrons'''.
The complete documentation describing all GeMoMa modules and all parameters can be accessed at [[GeMoMa-Docs]].
You can also [[:File:GeMoMa-manual.pdf|download a small manual for GeMoMa]] which explains the main steps of the analysis.
== GFF attributes ==
Line 1,255: | Line 57: | ||
!Attribute !!Long name !!Tool !!Necessary parameter !!Feature !!Description
|-
| aa || amino acids || GeMoMa || || mRNA || the number of amino acids in the predicted protein
|-
| raa || reference amino acids || GeMoMa || || mRNA || the number of amino acids in the reference protein
|-
| score || GeMoMa score || GeMoMa || || mRNA || score computed by GeMoMa using the substitution matrix, gap costs and additional penalties
|-
| maxScore || maximal GeMoMa score || GeMoMa || || mRNA || maximal score which will be obtained by a prediction that is identical to the reference transcript
|-
| bestScore || best GeMoMa score || GeMoMa || || mRNA || score of the best GeMoMa prediction of this transcript and this target organism
|-
| maxGap || maximal gap || GeMoMa || || mRNA || length of the longest gap in the alignment between predicted and reference protein
|-
| lpm || longest positive match || GeMoMa || || mRNA || length of the longest positive scoring match in the alignment between predicted and reference protein, i.e., each pair of amino acids in the match has a positive score
|-
| nps || number of premature stops || GeMoMa || || mRNA || the number of premature stop codons in the prediction
|-
| ce || coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the prediction
|-
| rce || reference coding exons || GeMoMa || assignment || mRNA || the number of coding exons of the reference transcript
|-
| minCov || minimal coverage || GeMoMa || coverage, ... || mRNA || minimal coverage of any base of the prediction given RNA-seq evidence
|-
| avgCov || average coverage || GeMoMa || coverage, ... || mRNA || average coverage of all bases of the prediction given RNA-seq evidence
|-
| tpc || transcript percentage coverage || GeMoMa || coverage, ... || mRNA || percentage of covered bases per predicted transcript given RNA-seq evidence
|-
| tae || transcript acceptor evidence || GeMoMa || introns || mRNA || percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence
|-
| tde || transcript donor evidence || GeMoMa || introns || mRNA || percentage of predicted donor sites per predicted transcript with RNA-seq evidence
|-
| tie || transcript intron evidence || GeMoMa || introns || mRNA || percentage of predicted introns per predicted transcript with RNA-seq evidence
|-
| minSplitReads || minimal split reads || GeMoMa || introns || mRNA || minimal number of split reads for any of the predicted introns per predicted transcript
|-
| iAA || identical amino acid || GeMoMa || protein alignment || mRNA || percentage of identical amino acids between reference transcript and prediction
|-
| pAA || positive amino acid || GeMoMa || protein alignment || mRNA || percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix
|-
| evidence || || GAF || || mRNA || number of reference organisms that have a transcript yielding this prediction
|-
| alternative || || GAF || || mRNA || alternative gene ID(s) leading to the same prediction
|-
| sumWeight || || GAF || || mRNA || the sum of the weights of the references that perfectly support this prediction
|-
| maxTie || maximal tie || GAF || || gene || maximal tie of all transcripts of this gene
Line 1,287: | Line 107: | ||
|}
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7, the default value of tag was "prediction" instead of "mRNA".
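The attributes listed above are stored in the ninth column of the GFF file and can be read with a few lines of code. The following Python sketch parses this column and keeps transcripts above an identity cutoff. The example lines, the IDs and the cutoff of 0.6 are made up for illustration, and the sketch assumes that <code>iAA</code> is written as a fraction between 0 and 1; please adjust the cutoff if your files use a different scale.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_attributes(column: str) -> Dict[str, str]:
    """Split the ninth GFF column (key=value pairs separated by ';') into a dict."""
    attributes = {}
    for field in column.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def transcripts(lines: Iterable[str],
                tag: str = "mRNA") -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (ID, attributes) for each feature whose type equals the given tag."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9 and columns[2] == tag:
            attributes = parse_attributes(columns[8])
            yield attributes.get("ID", ""), attributes


# Made-up example lines; real GeMoMa output contains more attributes.
example = [
    "##gff-version 3",
    "chr1\tGeMoMa\tgene\t100\t900\t.\t+\t.\tID=gene_1;maxTie=1",
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=gene_1_R0;Parent=gene_1;aa=120;raa=125;iAA=0.92;tie=1",
    "chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=gene_1_R1;Parent=gene_1;aa=80;raa=125;iAA=0.41;tie=0.5",
]

# Keep transcripts with an iAA of at least 0.6 (hypothetical cutoff).
kept = [tid for tid, att in transcripts(example)
        if float(att.get("iAA", "0")) >= 0.6]
print(kept)
```

For files written before version 1.7, <code>tag="prediction"</code> would be the matching feature name.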
== Frequently asked questions ==
; Why does the Extractor not return a single CDS-part, protein, ...?
:Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should ideally contain only the contig/chromosome name. (Since GeMoMa 1.4, comments that contain the contig/chromosome name and some additional information separated by a space are also fine.) In addition, check the statistics that are given by the Extractor. They list how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.
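The name check described in this answer can be automated. The following Python sketch compares the sequence names of the first GFF column with the fasta headers, using only the part of each header before the first space. The function names and the example data are hypothetical, and the sketch is independent of GeMoMa.

```python
from typing import Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Collect sequence names (text before the first space) from fasta headers."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def gff_seqids(lines: Iterable[str]) -> Set[str]:
    """Collect the contig/chromosome names from the first GFF column."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        names.add(line.split("\t", 1)[0])
    return names


# Made-up example data.
fasta = [">chr1 assembly v1", "ACGT", ">chr2", "GGCC"]
gff = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t1",
    "Chr3\tsrc\tCDS\t1\t3\t.\t+\t0\tParent=t2",
]

# Names used in the annotation that do not occur in the genome file.
missing = sorted(gff_seqids(gff) - fasta_names(fasta))
print(missing)
```

Any name reported here points to annotation entries for which no sequence can be found in the genome file.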
; How can I force GeMoMa to make more predictions?
:There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more of the predictions that have been computed. Decreasing the contig threshold increases the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.
; Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?
:By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions, as it does not filter for the quality of the predictions internally. There are two options to handle this:
:* Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").
:* Filter the predictions using GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>).
; Is it mandatory to use RNA-seq data?
:No, GeMoMa is able to make predictions with and without RNA-seq evidence.
; Is it possible to use multiple reference organisms?
:It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to combine these annotations.
; Why do some reference genes not lead to a prediction in the target genome?
:Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).
:If the genes have been discarded, there are two possibilities:
:* The CDS might be redundant, i.e., the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.
:* There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.
:If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:
:* GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").
:* GeMoMa simply did not find a prediction matching the remaining quality criteria.
; What does "partial gene model" mean in the context of GeMoMa?
:We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.
; For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?
:GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions for different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. <code>java -jar GeMoMa-<version>.jar CLI GAF</code>) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).
; A lot of transcripts have been filtered out by the Extractor. What can I do?
:There are several reasons why the Extractor removes transcripts. In at least two cases, you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, we sometimes receive GFFs that contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r", which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.
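To illustrate what inferring phases means, the following Python sketch computes the phase column for the CDS parts of one transcript from their lengths, following the definition of the phase in the GFF3 format. It is not the implementation used by the Extractor. The function <code>infer_phases</code> is hypothetical, and the sketch assumes that the CDS parts are given in transcript order (5' to 3') and that the first part starts with a complete codon.

```python
from typing import List


def infer_phases(cds_lengths: List[int], first_phase: int = 0) -> List[int]:
    """Return the GFF3 phase of each CDS part, given the lengths in transcript order.

    The phase is the number of bases that have to be skipped at the start of
    a CDS part to reach the first base of the next complete codon.
    """
    phases = []
    phase = first_phase
    for length in cds_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon determine how many
        # bases of the next part are needed to finish that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


print(infer_phases([10, 5, 9]))
```

For the lengths 10, 5 and 9, the first part leaves one base of an unfinished codon, so the second part starts with phase 2, and the third part starts with phase 0 again.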
; Is GeMoMa able to predict pseudo-genes/ncRNA?
:No, currently not.
; My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?
:GeMoMa is mainly based on the assumption that amino acid sequences and intron positions are conserved between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the length of the additional/missed intron is divisible by 3, which preserves the reading frame. Since the available RNA-seq data only reflect a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.
; My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?
:GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.
; Does GeMoMa predict multiple transcripts per gene?
:GeMoMa can in principle predict multiple transcripts per gene if corresponding transcripts are given in the reference species or if multiple reference species are used.
; GeMoMa failed with java.lang.OutOfMemoryError. What can I do?
:Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you’re providing a large coverage file (extracted from RNA-seq data). If you don’t have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (below version 1.; or "protein alignment" (since version 1.;. Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics.
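The VM options have to be placed directly after <code>java</code> and before <code>-jar</code>; otherwise they are passed to the program instead of the virtual machine. The following Python sketch adds both options to an existing command. The function <code>with_memory_options</code> is hypothetical, and the values are the examples from this answer, not recommendations for a specific data set.

```python
from typing import List


def with_memory_options(command: List[str], xms: str = "5G",
                        xmx: str = "50G") -> List[str]:
    """Insert -Xms/-Xmx directly after the 'java' executable."""
    if not command or command[0] != "java":
        raise ValueError("expected a command starting with 'java'")
    return [command[0], f"-Xms{xms}", f"-Xmx{xmx}"] + command[1:]


cmd = with_memory_options(
    ["java", "-jar", "GeMoMa-1.9.jar", "CLI", "GeMoMaPipeline"]
)
print(" ".join(cmd))
```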
; I need to specify the genetic code for my organisms. What is the expected format?
:The genetic code is given in a two-column tab-delimited table, where the first column is the one-letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as a template:
:https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt
:Alternative genetic codes are described here using the RNA alphabet:
:https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
:The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.
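Before using a custom table, it can help to validate it. The following Python sketch reads the two-column format described in this answer and converts triplets written in the RNA alphabet to DNA. The function <code>parse_genetic_code</code> and the three example lines are hypothetical, and the use of <code>*</code> for stop codons is an assumption; please compare with the default genetic code linked above.

```python
from typing import Dict


def parse_genetic_code(text: str) -> Dict[str, str]:
    """Return a dict mapping each DNA triplet to its one-letter amino acid."""
    codon_to_aa: Dict[str, str] = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {number}: expected two tab-separated columns")
        amino_acid = columns[0].strip()
        for triplet in columns[1].split(","):
            # Tables in the RNA alphabet use U, which is translated to T here.
            codon = triplet.strip().upper().replace("U", "T")
            if len(codon) != 3 or set(codon) - set("ACGT"):
                raise ValueError(f"line {number}: invalid triplet {triplet!r}")
            if codon in codon_to_aa:
                raise ValueError(f"line {number}: triplet {codon} listed twice")
            codon_to_aa[codon] = amino_acid
    return codon_to_aa


# Made-up excerpt; '*' as stop symbol is an assumption.
example = "M\tATG\nW\tTGG\n*\tUAA,UAG,UGA\n"
table = parse_genetic_code(example)
print(table["ATG"], table["TGA"])
```

A complete table should contain all 64 triplets, which can be checked with <code>len(table) == 64</code>.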
; I would like to accelerate GeMoMa. What can I do?
:You can use several threads for the computation. If you run the GeMoMaPipeline, you just have to set threads=<your_number>.
:In addition, you can change the search algorithm that is used in GeMoMa. For historical reasons, tblastn was the default search algorithm in GeMoMa until version 1.6.4. However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to set tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize these effects using specific parameters for the search algorithms.
:If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.
; Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?
:There are at least two ways to do this. If you use GeMoMaPipeline, you can
: (A) use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all, or
: (B) use s=pre-extracted, provide a fasta file with the proteins for the parameter cds-parts, and leave assignment unset.
:Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.
; Can I determine synteny based on GeMoMa predictions?
:Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point, and synteny can be determined quite fast.
; How can I add additional attributes to the annotation?
:Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.
; Can structural gene annotation provided by GeMoMa be submitted to NCBI?
:Yes, NCBI accepts structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.
; Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?
:Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart.
:If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.
:If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.
:A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.
If you have any further questions, comments or bug reports, please check the [[GeMoMa-Docs]], [https://github.com/Jstacs/Jstacs/labels/GeMoMa our github page] or contact [mailto:jens.keilwagen@julius-kuehn.de Jens Keilwagen].
== References ==
If you use GeMoMa, please cite
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. [https://nar.oxfordjournals.org/content/44/9/e89 Using intron position conservation for homology-based gene prediction]. ''Nucleic Acids Research'', 2016. doi: 10.1093/nar/gkw092
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. [https://rdcu.be/QbKc Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi]. ''BMC Bioinformatics'', 2018. doi: 10.1186/s12859-018-2203-5
== Version history ==
[http://www.jstacs.de/download.php?which=GeMoMa GeMoMa 1.9] (15.07.2022)
* improved handling of warnings in Galaxy
* new modules: Attribute2Table, GFFAttributes and TranscribedCluster
* AnnotationFinalizer:
** new parameters: transfer features, additional source suffix
** changed renaming using regex
** adding oldID if renaming IDs
** do not re-sort transcripts of a gene
* BUSCORecomputer:
** add FileExistsValidator
** extend to polyploid organisms
** new result: BUSCO parsed full table
** bugfix last duplicated
* CombineCoverageFiles:
** make it more memory efficient
* CombineIntronFiles:
** make it more memory efficient
* GAF:
** new parameters allowing gene set specific kmeans using global or local detrending
** new parameter "intermediate result" allowing to retrieve intermediate results
** new parameter "length difference" allowing to discard predictions that deviate too much from the representative transcript at a locus
* GeMoMa:
** new parameter options for the amount of predictions per reference transcript: STATIC(=default) or DYNAMIC
** delete unnecessary parameter "region threshold"
** improved verbose output
** improve fasta header parsing
* GeMoMaPipeline:
** removed long fasta comment parameter
** improved behaviour if errors occur if restart=true
** shifted prefix from GAF to GeMoMa module where possible
** bug fix: SyntenyChecker if assignment is not used
* Extractor:
** new category for discarded annotation: non-linear transcripts
** bugfix longest intron=0
* ExtractRNAseqEvidence:
** improved protocol if errors with the repositioning occur
[http://www.jstacs.de/downloads/GeMoMa-1.8.zip GeMoMa 1.8] (07.10.2021)
* extended manual
* new module Analyzer: for benchmarking
* new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates
* AnnotationEvidence:
** bugfix: gene borders if only one gene is on the contig
** discard genes that do not code for a protein
* AnnotationFinalizer:
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...
** implemented check of regular expression for prefix
** bugfix if score==NA
* CheckIntrons: introns are not optional
* ERE:
** new parameters for handling spurious split reads
** new parameter for repositioning that is needed for genomes with huge chromosomes due to limitations of BAM/SAM
** bugfix last intron
** improved protocol
* Extractor:
** new parameter for long fasta comment
** new parameter identical
** more verbose output in case of problems
** finding errors if CDS parts have different strands
** changed optional intron output
** bugfix for exons with DNA but no AA
* GAF:
** new parameter allowing to output the transcript names of redundant predictions as GFF attribute
** new parameter "transfer feature" allowing to keep additional features like UTRs, ...
** bugfix: missing entries for alternative
** changed default value for atf and sorting
** implemented check of regular expression for prefix
** changed handling of transcript within clusters
** changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features
* GeMoMa:
** new parameter for replacing unknown AA by X
** handling missing GeMoMa.ini.xml
** additional GFF attributes: lpm, maxScore, maxGap, bestScore
** improved error handling and protocol
** changed heuristic for identifying multiple transcript predictions on one contig/chromosome
* GeMoMaPipeline:
** new parameter "check synteny" allowing to run SyntenyChecker
** implemented check of regular expression for prefix
** removed unnecessary parameter
** improved handling of exceptions
** bugfix for stranded RNA-seq evidence
** allow re-start only for same version
** improved protocol if threads==1
* SyntenyChecker: implemented check of regular expression for prefix
[http://www.jstacs.de/downloads/GeMoMa-1.7,1.zip GeMoMa 1.7.1] (07.09.2020) | |||
*GeMoMa: | |||
**bugfix if assignment == null | |||
**bugfix remove toUpperCase | |||
*GeMoMaPipeline | |||
**Galaxy integration bugfix for hidden parameter restart | |||
**hide BLAST_PATH and MMSEQS_PATH from Galaxy integration | |||
**improved protocol output if threads=1 | |||
**add additional test to GeMoMaPipeline | |||
[http://www.jstacs.de/downloads/GeMoMa-1.7.zip GeMoMa 1.7] (29.07.2020) | |||
*improved manual including new module and runtime | |||
*check whether input files exist before execution | |||
*partially checking MIME types in CLI before execution | |||
*changed homepage from http to https | |||
*new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed genome browsers like IGV or WebApollo | |||
*new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism | |||
*changed default value of parameter "tag" from "prediction" to "mRNA" | |||
*AnnotationEvidence: | |||
**additional attributes: avgCov, minCov, nps, ce | |||
**changed default value of "annotation output" to true | |||
**bugfix: transcript start and end | |||
*ERE: | |||
**changed default value of coverage to "true" | |||
**new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts | |||
*Extractor: | |||
**bugfix splitAA if coding exon is very short | |||
**improved verbose mode | |||
**new parameter "upcase IDs" | |||
**new parameter "introns" allowing to extract introns from the reference (only for test cases) | |||
**new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop | |||
**improved handling of corrupt annotations | |||
*GAF: | |||
**bugfix missing transcripts | |||
**slightly changed the default value of "filter" | |||
*GeMoMa: | |||
**replaced parameter "query proteins" by "protein alignment" | |||
**using splitAA for scoring predictions | |||
**new gff attributes: | |||
*** ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively | |||
*** nps for the number of premature stop codons (if avoid stop is false) | |||
**slightly changed the meaning of the parameter "avoid stop" | |||
*GeMoMaPipeline: | |||
**changed the default value of tblastn to false, hence mmseqs is used as search algorithm | |||
**changed the default value of score to ReAlign | |||
**remove "--dont-split-seq-by-len" from mmseqs createdb | |||
**new optional parameter BLAST_PATH | |||
**new optional parameter MMSEQS_PATH | |||
**new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction | |||
**new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug) | |||
[http://www.jstacs.de/downloads/GeMoMa-1.6.4.zip GeMoMa 1.6.4] (24.04.2020) | |||
* improved help section | |||
* change gff attribute "AA" to "aa" | |||
* GAF: | |||
** bugfix overlapping genes | |||
** accelerated computation | |||
* GeMoMa: | |||
** bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs | |||
** change GFF attribute AA to aa | |||
* AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming | |||
[http://www.jstacs.de/downloads/GeMoMa-1.6.3.zip GeMoMa 1.6.3] (05.03.2020) | |||
* Jstacs changes: | * Jstacs changes: | ||
** CLI: bugfix ExpandableParameterSet | ** CLI: bugfix ExpandableParameterSet |
Latest revision as of 20:02, 16 July 2022
Gene Model Mapper (GeMoMa) is a homology-based gene prediction program. GeMoMa uses the annotation of protein-coding genes in a reference genome to infer the annotation of protein-coding genes in a target genome. Thereby, GeMoMa utilizes amino acid sequence and intron position conservation. In addition, GeMoMa allows to incorporate RNA-seq evidence for splice site prediction.
GeMoMa is available in a public web-server at galaxy.informatik.uni-halle.de. The provided web-server only allows a limited number of reference genes and uses a time out of 2 minutes per transcript prediction. For unlimited use, please use the command line program or integrate GeMoMa in your own Galaxy instance.
Installation
GeMoMa is now available via bioconda. Here is the direct link to the package. To install this package with conda run:
conda install -c bioconda gemoma
However, you can also install GeMoMa manually.
Requirements
To run GeMoMa, you need the following software on your computer:
- Java v1.8 or later
- blast or mmseqs
Download
GeMoMa is implemented in Java using Jstacs. You can download a zip file containing a readme, the GeMoMa jar file and some tiny scripts for running GeMoMa. The jar file allows for
- creating the XML file needed for the Galaxy integration
- running the command line interface (CLI) version.
In a nutshell
GeMoMa is a modular, homology-based gene prediction program with great flexibility. We also provide a pipeline that makes GeMoMa easy to use. If you are running GeMoMa for the first time, we recommend using GeMoMaPipeline like this:
java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=<threads> outdir=<outdir> GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=<target_genome> i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome>
Several parameters, indicated by <foo>, need to be set. You can specify
- the number of threads
- the output directory
- the target genome
- and the reference ID (optional), annotation and genome. If you have several references, just repeat s=own and the parameter tags i, a, g with the corresponding values.
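As a sketch of the repetition described above, the following shell snippet assembles a GeMoMaPipeline call for two reference organisms. The file names, reference IDs, thread count and output directory are placeholders chosen for illustration, not values from this page. The command is only printed, so it can be inspected before it is run.

```shell
# Placeholder values (assumptions): replace them with your own files and IDs.
JAR=GeMoMa-1.9.jar
# One block per reference organism: repeat s=own together with the tags i, a, g.
REF1="s=own i=ref1 a=ref1.gff g=ref1.fa"
REF2="s=own i=ref2 a=ref2.gff g=ref2.fa"

CMD="java -jar $JAR CLI GeMoMaPipeline threads=4 outdir=results GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO o=true t=target.fa $REF1 $REF2"

# Print the assembled command line for inspection.
echo "$CMD"
```

Building the command as a plain string only works for paths without spaces. For real data it is probably simpler to paste the printed line into the terminal and adjust it there.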
In addition, we recommend to set several parameters:
- GeMoMa.Score=ReAlign: states that the score from mmseqs should be recomputed as mmseqs uses an approximation
- AnnotationFinalizer.r=NO: do not rename genes and transcripts
- o=true: output individual predictions for each reference as a separate file allowing to rerun the combination step (GAF) very easily and quickly
If you would like to specify the maximum intron length, please consider the parameters GeMoMa.m and GeMoMa.sil.
If you have RNA-seq data either from own experiments or publicly available data sets (cf. NCBI SRA, EMBL-EBI ENA), we recommend to use them. You need to map the data against the target genome with your favorite read mapper. In addition, we recommend to check the parameters of the section DenoiseIntrons.
The complete documentation describing all GeMoMa modules and all parameters can be accessed at GeMoMa-Docs.
You can also download a small manual for GeMoMa which explains the main steps for the analysis.
GFF attributes
Using GeMoMa and GAF, you'll obtain GFFs containing some special attributes. We briefly explain the most prominent attributes in the following table.
Attribute | Long name | Tool | Necessary parameter | Feature | Description |
---|---|---|---|---|---|
aa | amino acids | GeMoMa | | mRNA | the number of amino acids in the predicted protein |
raa | reference amino acids | GeMoMa | | mRNA | the number of amino acids in the reference protein |
score | GeMoMa score | GeMoMa | | mRNA | score computed by GeMoMa using the substitution matrix, gap costs and additional penalties |
maxScore | maximal GeMoMa score | GeMoMa | | mRNA | maximal score which will be obtained by a prediction that is identical to the reference transcript |
bestScore | best GeMoMa score | GeMoMa | | mRNA | score of the best GeMoMa prediction of this transcript and this target organism |
maxGap | maximal gap | GeMoMa | | mRNA | length of the longest gap in the alignment between predicted and reference protein |
lpm | longest positive match | GeMoMa | | mRNA | length of the longest positive scoring match in the alignment between predicted and reference protein, i.e., each pair of amino acids in the match has a positive score |
nps | number of premature stops | GeMoMa | | mRNA | the number of premature stop codons in the prediction |
ce | coding exons | GeMoMa | assignment | mRNA | the number of coding exons of the prediction |
rce | reference coding exons | GeMoMa | assignment | mRNA | the number of coding exons of the reference transcript |
minCov | minimal coverage | GeMoMa | coverage, ... | mRNA | minimal coverage of any base of the prediction given RNA-seq evidence |
avgCov | average coverage | GeMoMa | coverage, ... | mRNA | average coverage of all bases of the prediction given RNA-seq evidence |
tpc | transcript percentage coverage | GeMoMa | coverage, ... | mRNA | percentage of covered bases per predicted transcript given RNA-seq evidence |
tae | transcript acceptor evidence | GeMoMa | introns | mRNA | percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence |
tde | transcript donor evidence | GeMoMa | introns | mRNA | percentage of predicted donor sites per predicted transcript with RNA-seq evidence |
tie | transcript intron evidence | GeMoMa | introns | mRNA | percentage of predicted introns per predicted transcript with RNA-seq evidence |
minSplitReads | minimal split reads | GeMoMa | introns | mRNA | minimal number of split reads for any of the predicted introns per predicted transcript |
iAA | identical amino acid | GeMoMa | protein alignment | mRNA | percentage of identical amino acids between reference transcript and prediction |
pAA | positive amino acid | GeMoMa | protein alignment | mRNA | percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix |
evidence | | GAF | | mRNA | number of reference organisms that have a transcript yielding this prediction |
alternative | | GAF | | mRNA | alternative gene ID(s) leading to the same prediction |
sumWeight | | GAF | | mRNA | the sum of the weights of the references that perfectly support this prediction |
maxTie | maximal tie | GAF | | gene | maximal tie of all transcripts of this gene |
maxEvidence | maximal evidence | GAF | | gene | maximal evidence of all transcripts of this gene |
The name of the feature describing a transcript prediction can be altered using the parameter "tag". Before version 1.7 the default value of tag was "prediction" instead of "mRNA".
Frequently asked questions
- Why does the Extractor not return a single CDS-part, protein, ...?
- Please check whether the names of your contigs/chromosomes in your annotation (gff) and genome file (fasta) are identical. The fasta comments should at best only contain the contig/chromosome name. (Since GeMoMa 1.4, comments, which contain the contig/chromosome name and some additional information separated by a space, are also fine.) In addition, check the statistics that are given by the Extractor. It lists how many genes have been read and how many genes have been removed for different reasons. One common problem is that some annotation files do not include the stop codon in the CDS annotation.
- How can I force GeMoMa to make more predictions?
- There are several parameters affecting the number of predictions. The most prominent are the number of predictions (p) and the contig threshold (ct). For each reference transcript/CDS, GeMoMa initially makes a preliminary prediction and uses this prediction to determine whether a contig is promising and should be used to determine the final predictions. You may decrease ct and increase p to have more contigs in the final prediction. Increasing the number of predictions allows GeMoMa to output more predictions that have been computed. Decreasing the contig threshold allows to increase the number of predictions that are (internally) computed. Increasing p to a very large number without decreasing ct does not help.
- Running GeMoMa on a single contig of my assembly yields thousands of weird predictions. What went wrong?
- By default, GeMoMa is not built to be run on a single contig. GeMoMa tries to make predictions for all given reference CDS in the given target sequence(s). If the given target sequence is only a fraction of the complete target genome/assembly, GeMoMa will produce weird predictions as it does not filter for the quality of the predictions internally. There are two options to handle this:
- Use a list of gene models that you expect to be located on this contig (cf. parameter "selected").
- Filter the predictions using GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF).
- Is it mandatory to use RNA-seq data?
- No, GeMoMa is able to make predictions with and without RNA-seq evidence.
- Is it possible to use multiple reference organisms?
- It is possible to use multiple reference organisms for GeMoMa. Just run GeMoMa on each reference organism separately. Finally, you can employ GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to combine these annotations.
- Why do some reference genes not lead to a prediction in the target genome?
- Please first check whether your reference genes have been discarded by the Extractor (cf. assignment file).
- If the genes have been discarded, there are two possibilities:
- The CDS might be redundant, i.e. the coding exons are identical to those of another transcript. In this case, only one CDS is further evaluated.
- There might be something wrong with your reference genes, e.g., missing start codon, missing stop codon, premature stop codon, ambiguous nucleotides, ... and you should check the options of Extractor or the annotation.
- If the reference genes passed the Extractor, there are several possible explanations for this behavior. The two most prominent are:
- GeMoMa stopped the prediction for a reference gene because it did not return a result within the given time (cf. parameter "timeout").
- GeMoMa simply did not find a prediction matching the remaining quality criteria.
- What does "partial gene model" mean in the context of GeMoMa?
- We call a gene model partial if it does not contain an initial start codon and a final stop codon. However, this does not mean that the gene model is located at or close to the border of a chromosome or contig.
- For two different reference transcripts, the predictions of GeMoMa overlap or are identical. What should I do with those?
- GeMoMa makes the predictions for each reference transcript independently. Hence, some of the predictions for different reference transcripts can overlap or be identical, especially in gene families. Typically, you might like to filter or rank these predictions. We have implemented GAF (cf. java -jar GeMoMa-<version>.jar CLI GAF) to do this automatically. However, you can also do it by hand using the GFF attributes. Using RNA-seq data in GeMoMa yields additional fields in the annotation that can be used, e.g., average coverage (avgCov).
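Filtering by hand using the GFF attributes could look like the following awk sketch. The two GFF lines and the threshold are invented for illustration; only the attribute names (score, tie, avgCov) and the feature name mRNA come from the table above. If you changed the parameter "tag", adapt the feature name accordingly.

```shell
# Two made-up mRNA lines with GeMoMa-style attributes (all values are invented).
printf 'chr1\tGeMoMa\tmRNA\t100\t900\t.\t+\t.\tID=t1;score=350;tie=1;avgCov=12.5\n' >  predictions.gff
printf 'chr1\tGeMoMa\tmRNA\t150\t800\t.\t+\t.\tID=t2;score=120;tie=0.5;avgCov=0.8\n' >> predictions.gff

# Keep mRNA features whose avgCov attribute is at least MIN (arbitrary threshold).
MIN=5
kept=$(awk -F'\t' -v min="$MIN" '
  $3 == "mRNA" {
    # Column 9 holds the attributes as key=value pairs separated by ";".
    n = split($9, kv, ";")
    for (i = 1; i <= n; i++) {
      split(kv[i], p, "=")
      if (p[1] == "avgCov" && p[2] + 0 >= min) { print; break }
    }
  }' predictions.gff)

echo "$kept"
```

The same pattern works for any other numeric attribute by changing the key that is compared. For anything beyond a quick check, GAF is the intended tool.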
- A lot of transcripts have been filtered out by the Extractor. What can I do?
- There are several reasons for removing transcripts by the Extractor. At least in two cases you can try to get more transcripts by setting specific parameter values. First, if the transcript contains ambiguous nucleotides, please test the parameter "Ambiguity". Second, sometimes we received GFFs which contain wrong phases for CDS entries (e.g., 0 for all CDS entries in the phase column of the GFF). Since version 1.3.2, we provide the option "r" which stands for repair. If r=true is chosen, the Extractor tries to infer all phases for transcripts that show an error and would be filtered out.
- Is GeMoMa able to predict pseudo-genes/ncRNA?
- No, currently not.
- My RNA-seq data indicates there is an additional intron in a transcript, but GeMoMa does not predict this. Or vice versa, GeMoMa predicts an intron that is not supported by RNA-seq data. What's the reason?
- GeMoMa is mainly based on the assumptions of amino acid and intron position conservation between reference and target species. Hence, GeMoMa tries to predict a gene model with a similar exon-intron structure in the target species and does not rely too heavily on RNA-seq data. Although intron position conservation can be observed in most cases, sometimes new introns evolve or others vanish. For this reason, GeMoMa also allows for the inclusion or exclusion of introns at some additional cost (cf. GeMoMa parameter intron-loss-gain-penalty). However, the behaviour of GeMoMa depends on the parameter settings (especially intron-loss-gain-penalty, sm (substitution matrix), go (gap opening), ge (gap extension)) and the length of the missed/additional intron. Nevertheless, such cases can only occur if the additional/missed intron has a length that can be divided by 3, preserving the reading frame. Since the available RNA-seq data only reflects a fraction of tissues/environmental conditions/..., missing RNA-seq evidence does not necessarily mean that the prediction is wrong.
- My RNA-seq data indicates two alternative, highly overlapping introns. Interestingly, GeMoMa does not take the intron that is more abundant. Why?
- GeMoMa reads the introns from the input file using some filter (cf. GeMoMa parameter r (reads)). All introns that pass the filter are used and treated equally. Hence, GeMoMa uses the intron that matches the expectation of intron position and amino acid conservation compared to the reference transcript.
- Does GeMoMa predict multiple transcripts per gene?
- GeMoMa in principle allows to predict multiple transcripts per gene, if corresponding transcripts are given in the reference species or if multiple reference species are used.
- GeMoMa failed with java.lang.OutOfMemoryError. What can I do?
- Whenever you see a java.lang.OutOfMemoryError, you should rerun the program with Java virtual machine (VM) options. More specifically, you should set -Xms, the initially used RAM, e.g., to 5 GB (-Xms5G), and -Xmx, the maximally used RAM, e.g., to 50 GB (-Xmx50G). GeMoMa often needs more memory if you have a large genome and if you're providing a large coverage file (extracted from RNA-seq data). If you don't have a compute node with enough memory, you can run GeMoMa without coverage, which will return the same predictions but does not include all statistics. Another cause could be the protein alignment, if you use the optional parameter "query protein" (before version 1.7) or "protein alignment" (since version 1.7). Again, you can run GeMoMa without protein alignment, which will return the same predictions but fewer statistics.
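A minimal sketch of where the VM options go on the command line: they must appear before -jar, otherwise Java passes them to GeMoMa as ordinary arguments. The heap sizes are the example values from the answer above. The remaining arguments are placeholders, not recommended settings.

```shell
# Heap sizes from the example in the text; adjust them to your machine.
JAVA_OPTS="-Xms5G -Xmx50G"

# Placeholder pipeline arguments (assumptions).
CMD="java $JAVA_OPTS -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=4 outdir=results t=target.fa a=ref.gff g=ref.fa"

# Print the command line; the VM options sit between "java" and "-jar".
echo "$CMD"
```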
- I need to specify the genetic code for my organisms. What is the expected format?
- The genetic code is given in a two column tab-delimited table, where the first column is the one letter code of the amino acid and the second column is a comma-separated list of triplets. As we are working on genomic DNA, GeMoMa expects the bases A, C, G, and T, and not U (as expected in mRNA). Here is the link to the default genetic code, which might be used as template:
- https://github.com/Jstacs/Jstacs/blob/master/projects/gemoma/test_data/genetic_code.txt
- Alternative genetic codes are described here using the RNA alphabet:
- https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
- The genetic code might be specified for a reference organism in the module Extractor or for a target organism in the module GeMoMa.
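The format described above can be checked with a short awk sketch before a custom table is passed to GeMoMa. The file name and the three sample lines are made up for illustration; a real table lists every amino acid. The check only covers the layout stated in the text (two tab-separated columns, a one-letter code, comma-separated triplets over A, C, G, T) and is not GeMoMa's own validation.

```shell
# Hypothetical three-line excerpt of a genetic code table.
printf 'M\tATG\nW\tTGG\nF\tTTT,TTC\n' > my_genetic_code.txt

# Collect lines that break the described layout.
bad=$(awk -F'\t' '
  NF != 2 || length($1) != 1 || $2 !~ /^[ACGT][ACGT][ACGT](,[ACGT][ACGT][ACGT])*$/ {
    print NR ": " $0
  }' my_genetic_code.txt)

if [ -z "$bad" ]; then
  echo "format looks fine"
else
  echo "suspicious lines:"
  echo "$bad"
fi
```

A table copied from a source that uses the RNA alphabet would be flagged here, because triplets containing U do not match the pattern.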
- I like to accelerate GeMoMa. What can I do?
- You can use several threads for the computation. If you run the GeMoMaPipeline you just have to select threads=<your_number>.
- In addition, you can change the search algorithm that is used in GeMoMa. Until version 1.6.4, tblastn was the default search algorithm in GeMoMa (for historical reasons). However, tblastn can be replaced by mmseqs, which is typically much faster. If you run the GeMoMaPipeline, you just have to select tblastn=false, which is the default since version 1.7. However, changing the search algorithm can also affect the results. We try to minimize this effect using specific parameters for the search algorithms.
- If you modify other parameters, you will probably receive results that differ to a larger extent from those received using the default parameters.
- Is there a way to use GeMoMa to search a single CDS or protein sequence against a genome and return the predicted gene model (CDS fasta, protein fasta, GFF) similar to exonerate?
- There are at least two ways to do this. If you use GeMoMaPipeline you can
- (A) Use the parameter “selected” to select specific gene models (=transcripts/proteins) from the annotation instead of using all or
- (B) Use s=pre-extracted, use a fasta file with the proteins for the parameter cds-parts and leave assignment unset.
- Using one of these options, you can look for a single or a few transcripts/proteins either with (A) or without (B) intron-position conservation. In addition, you can use RNA-seq data to improve the predictions, which should not be possible with exonerate.
- Can I determine synteny based on GeMoMa predictions?
- Yes, since version 1.7 we provide the module SyntenyChecker and an R script that can be used for this purpose. It exploits the fact that the reference gene and the alternative are known. Hence, no alignment is needed at this point and synteny can be determined quite fast.
- How can I add additional attributes to the annotation?
- Additional attributes, e.g., functional annotation from InterProScan, can be added to the structural gene annotation using the module AddAttribute, which has been included since version 1.7. Such additional attributes might be used in GAF for filtering and sorting and can also be displayed in genome browsers like IGV or WebApollo.
- Can structural gene annotation provided by GeMoMa be submitted to NCBI?
- Yes, NCBI allows to submit structural gene annotation in GFF format (https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/). If you run GeMoMaPipeline or AnnotationFinalizer, the GFF should be valid for conversion.
- Running GeMoMaPipeline throws an exception. Can I restart GeMoMaPipeline using intermediate results?
- Yes, since version 1.7 we have a new parameter in GeMoMaPipeline called restart.
- If you want to restart the last broken GeMoMaPipeline run, you have to execute GeMoMaPipeline with the same command line as before and add restart=true.
- If necessary, you can also slightly change the other parameters. However, if the parameters differ too much from those used before, GeMoMaPipeline will decide to perform a new independent run.
- A restart of GeMoMaPipeline is particularly useful if the time-consuming search (tblastn or mmseqs) was successful, since this can save runtime.
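The restart procedure described above amounts to repeating the earlier call with one extra argument. In this sketch the original command line is a placeholder; only restart=true comes from the text.

```shell
# The command line of the broken run (placeholder arguments, assumptions).
ORIG="java -jar GeMoMa-1.9.jar CLI GeMoMaPipeline threads=4 outdir=results t=target.fa a=ref.gff g=ref.fa"

# Same command line as before, with restart=true appended.
RESTART="$ORIG restart=true"

echo "$RESTART"
```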
If you have any further questions, comments or bugs, please check the GeMoMa-Docs, our github page or contact Jens Keilwagen.
References
If you use GeMoMa, please cite
J. Keilwagen, M. Wenk, J. L. Erickson, M. H. Schattat, J. Grau, and F. Hartung. Using intron position conservation for homology-based gene prediction. Nucleic Acids Research, 2016. doi: 10.1093/nar/gkw092
J. Keilwagen, F. Hartung, M. Paulini, S. O. Twardziok, and J. Grau. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics, 2018. doi: 10.1186/s12859-018-2203-5
Version history
GeMoMa 1.9 (15.07.2022)
- improved handling of warnings in Galaxy
- new modules: Attribute2Table, GFFAttributes and TranscribedCluster
- AnnotationFinalizer:
- new parameters: transfer features, additional source suffix
- changed renaming using regex
- adding oldID if renaming IDs
- do not re-sort transcripts of a gene
- BUSCORecomputer:
- add FileExistsValidator
- extend to polyploid organisms
- new result: BUSCO parsed full table
- bugfix last duplicated
- CombineCoverageFiles:
- make it more memory efficient
- CombineIntronFiles:
- make it more memory efficient
- GAF:
- new parameters allowing gene set specific kmeans using global or local detrending
- new parameter "intermediate result" allowing to retrieve intermediate results
- new parameter "length difference" allowing to discard predictions that deviate too much from the representative transcript at a locus
- GeMoMa:
- new parameter options for the amount of predictions per reference transcript: STATIC(=default) or DYNAMIC
- delete unnecessary parameter "region threshold"
- improved verbose output
- improve fasta header parsing
- GeMoMaPipeline:
- removed long fasta comment parameter
- improved behaviour if errors occur if restart=true
- shifted prefix from GAF to GeMoMa module where possible
- bug fix: SyntenyChecker if assignment is not used
- Extractor:
- new category for discarded annotation: non-linear transcripts
- bugfix longest intron=0
- ExtractRNAseqEvidence:
- improved protocol if errors with the repositioning occur
GeMoMa 1.8 (07.10.2021)
- extended manual
- new module Analyzer: for benchmarking
- new module BUSCORecomputer: allowing to recompute BUSCO stats based on geneID instead of transcriptID avoiding to overestimate the number of duplicates
- AnnotationEvidence
- bugfix: gene borders if only one gene is on the contig
- discard genes that do not code for a protein
- AnnotationFinalizer:
- new parameter "transfer feature" allowing to keep additional features like UTRs, ...
- implemented check of regular expression for prefix
- bugfix if score==NA
- CheckIntrons: introns are not optional
- ERE:
- new parameters for handling spurious split reads
- new parameter for repositioning that is needed for genomes with huge chromosome due to limitations of BAM/SAM
- bugfix last intron
- improved protocol
- Extractor:
- new parameter for long fasta comment
- new parameter identical
- more verbose output in case of problems
- finding errors if CDS parts have different strands
- changed optional intron output
- bugfix for exons with DNA but no AA
- GAF:
- new parameter allowing to output the transcript names of redundant predictions as GFF attribute
- new parameter "transfer feature" allowing to keep additional features like UTRs, ...
- bugfix: missing entries for alternative
- changed default value for atf and sorting
- implemented check of regular expression for prefix
- changed handling of transcript within clusters
- changed output order in gff: now for each gene the gene feature is reported first and subsequently the mRNA and CDS features
- GeMoMa:
- new parameter for replacing unknown AA by X
- handling missing GeMoMa.ini.xml
- additional GFF attributes: lpm, maxScore, maxGap, bestScore
- improved error handling and protocol
- changed heuristic for identifying multiple transcripts predictions on one contig/chromosome
- GeMoMaPipeline:
- new parameter "check synteny" allowing to run SyntenyChecker
- implemented check of regular expression for prefix
- removed unnecessary parameter
- improved handling of exceptions
- bugfix for stranded RNA-seq evidence
- allow re-start only for same version
- improved protocol if threads==1
- SyntenyChecker: implemented check of regular expression for prefix
GeMoMa 1.7.1 (07.09.2020)
- GeMoMa:
- bugfix if assignment == null
- bugfix remove toUpperCase
- GeMoMaPipeline
- Galaxy integration bugfix for hidden parameter restart
- hide BLAST_PATH and MMSEQS_PATH from Galaxy integration
- improved protocol output if threads=1
- add additional test to GeMoMaPipeline
GeMoMa 1.7 (29.07.2020)
- improved manual including new module and runtime
- check whether input files exist before execution
- partially checking MIME types in CLI before execution
- changed homepage from http to https
- new module AddAttribute: allows to add attributes (like functional annotation from InterProScan) to gene annotation files that might be used in GAF or displayed genome browsers like IGV or WebApollo
- new module SyntenyChecker: creates a table that can be used to create dot plots between the annotation of the target and reference organism
- changed default value of parameter "tag" from "prediction" to "mRNA"
- AnnotationEvidence:
- additional attributes: avgCov, minCov, nps, ce
- changed default value of "annotation output" to true
- bugfix: transcript start and end
- ERE:
- changed default value of coverage to "true"
- new parameter "minimum context": allows to discard introns if all split reads have short aligned contexts
- Extractor:
- bugfix splitAA if coding exon is very short
- improved verbose mode
- new parameter "upcase IDs"
- new parameter "introns" allowing to extract introns from the reference (only for test cases)
- new parameter "discard pre-mature stop" allowing to discard or use transcripts with pre-mature stop
- improved handling of corrupt annotations
- GAF:
- bugfix missing transcripts
- slightly changed the default value of "filter"
- GeMoMa:
- replaced parameter "query proteins" by "protein alignment"
- using splitAA for scoring predictions
- new gff attributes:
- ce and rce for the feature prediction indicating the number of coding exons for the prediction and the reference, respectively
- nps for the number of premature stop codons (if avoid stop is false)
- slightly changed the meaning of the parameter "avoid stop"
- GeMoMaPipeline:
- changed the default value of tblastn to false, hence mmseqs is used as search algorithm
- changed the default value of score to ReAlign
- remove "--dont-split-seq-by-len" from mmseqs createdb
- new optional parameter BLAST_PATH
- new optional parameter MMSEQS_PATH
- new option to allow for incorporation of external annotation, e.g., from ab-initio gene prediction
- new parameter restart allowing to restart the latest GeMoMaPipeline run, which was finished without results, with very similar parameters, e.g., after an exception was thrown (cf. parameter debug)
GeMoMa 1.6.4 (24.04.2020)
- improved help section
- change gff attribute "AA" to "aa"
- GAF:
- bugfix overlapping genes
- accelerated computation
- GeMoMa:
- bugfix: if no assignment file is used and protein ID are prefixes of other protein IDs
- change GFF attribute AA to aa
- AnnotationFinalizer: new parameter "name attribute" allowing to decide whether a name attribute or the Parent and ID attributes should be used for renaming
GeMoMa 1.6.3 (05.03.2020)
- Jstacs changes:
- CLI: bugfix ExpandableParameterSet
- python wrapper (for *conda)
- updated tests.sh, run.sh, pipeline.sh
- rename Denoise to DenoiseIntrons
- AnnotationEvidence: write phase (as given) to gff
- GAF: new parameter: default attributes allows to set attributes that are not included in some gene annotation files
- GeMoMa: new parameter: static intron length allowing to use dynamic intron length if set to false
- GeMoMaPipeline:
- bugfix: time-out
- improve output
- separate parameters for maximum intron length (DenoiseIntrons, GeMoMa)
GeMoMa 1.6.2 (17.12.2019)
- Jstacs changes:
- test methods for modules
- live protocol for Galaxy
- new module Denoise: allowing to clean introns extracted by ERE
- new module NCBIReferenceRetriever: allowing to retrieve data for reference organisms easily from NCBI.
- GAF:
- bugfix for filter using specific attributes if no RNA-seq or query proteins were used
- allow to add annotation info (as for instance provided by Phytozome) based on the reference organisms
- GeMoMa: bugfix for timeout
- GeMoMaPipeline:
- bugfix reporting predicted partial proteins
- improved protocol
- new default value for query proteins (changed from false to true)
- new default value for Ambiguity (changed from EXCEPTION to AMBIGUOUS)
GeMoMa 1.6.1 (4.06.2019)
- createGalaxyIntegration.sh: bugfix for GeMoMaPipeline
- new module CheckIntrons: allowing to create statistics for introns (extracted by ERE)
- AnnotationFinalizer: bugfix for sequence IDs with large numbers
- CompareTranscripts:
- bugfix for prefix of ref-gene
- allow no transcript info, but making assignment non-optional if a transcript info is set
- GAF: bugfix for Galaxy integration
- GeMoMaPipeline:
- improved output in case of Exceptions
- new parameter "output individual predictions" allows including or excluding individual predictions from each reference organism in the final result
- new parameter "weight" allows weights for reference species (cf. GAF)
- ERE: new parameter "minimum mapping quality"
GeMoMa 1.6 (2.04.2019)
- allow to use mmseqs as alternative to tblastn
- AnnotationEvidence:
- allows to add attributes to the input gff: tie, tpc, AA, start, stop
- new parameter for gff output
- AnnotationFinalizer: new tool for predicting UTRs and renaming predictions
- GAF:
- relative score filter and evidence filter are replaced by a flexible filter that allows to filter by relative score, evidence or other GFF attributes as well as combinations thereof
- sorting criteria of the predictions within clusters can now be user-specified
- new attribute for genes: combinedEvidence
- new attribute for predictions: sumWeight
- allows to use gene predictions from all sources, including for instance ab-initio gene predictors, purely RNA-seq based gene prediction and manual curation
- bugfix for predictions from multiple reference organisms
- improved statistic output
- GeMoMa
- renamed the parameter tblastn results to search results
- new parameter for sorting the results of the similarity search (tblastn or mmseqs); if you use mmseqs for the similarity search, you have to use sort
- new parameter for score of the search results: three options: Trust (as is), ReScore (use aligned sequence, but recompute score), and ReAlign (use detected sequence for realignment and rescore)
- bugfix: threshold for introns from multiple files
GeMoMa 1.5.3 (23.07.2018)
- improved parameter description and presentation
- GeMoMaPipeline:
- removed unnecessary parameters
- GeMoMa:
- bugfix: reading coverage file
- removed parameter genomic (cf. Extractor)
- removed protein output (cf. Extractor)
- GAF:
- bugfix: prefix
- Extractor:
- new parameter genomic
GeMoMa 1.5.2 (31.5.2018)
- GAF:
- new parameter that allows to restrict the maximal number of transcript predictions per gene
- altered behavior of the evidence filter from percentages to absolute values
- bugfix: nested genes
- checking for duplicates in prediction IDs
- GeMoMa:
- warning if RNA-seq data does not match with target genome
- GeMoMaPipeline: new tool for running the complete GeMoMa pipeline at once allowing multi-threading
- folder for temporary files of GeMoMa
GeMoMa 1.5 (13.02.2018)
- AnnotationEvidence: add chromosome to output
- CompareTranscripts: new parameter that allows to remove prefixes introduced by GAF
- Extractor: new parameter "stop-codon excluded from CDS" that might be used if the annotation does not contain the stop codons
- ExtractRNASeqEvidence:
- print intron length stats
- include program infos in introns.gff3
- GeMoMa:
- new attribute pAA in gff output if query protein is given
- include program infos in predicted_annotation.gff3
- minor bugfix
- GAF:
- new parameter that allows to specify a prefix for each input gff
- collect and print program infos to filtered_prediction.gff3
- improved statistics output
GeMoMa 1.4.2 (21.07.2017)
- automatic searching for available updates
- AnnotationEvidence: bugfix (tie computation: Arrays.binarySearch does not find first match)
- Extractor: bugfix (files that are not zipped)
- GeMoMa: bugfix (tie computation: Arrays.binarySearch does not find first match)
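The two tie bugfixes above concern a general property of Java's Arrays.binarySearch: if the sorted array contains the key several times, there is no guarantee which of the matching positions is returned. The following is only a minimal sketch of that pitfall and one possible workaround, not the actual GeMoMa code; the class name FirstMatch, the method firstIndexOf, and the example array are hypothetical.

```java
import java.util.Arrays;

public class FirstMatch {

    /**
     * Returns the index of the first occurrence of key in a sorted array,
     * or -1 if the key is absent. (Hypothetical helper, for illustration only.)
     */
    static int firstIndexOf(int[] sorted, int key) {
        // binarySearch may return any matching index if the key occurs repeatedly
        int idx = Arrays.binarySearch(sorted, key);
        if (idx < 0) {
            return -1; // negative value encodes the insertion point: key not found
        }
        // walk left until the element before idx no longer equals the key
        while (idx > 0 && sorted[idx - 1] == key) {
            idx--;
        }
        return idx;
    }

    public static void main(String[] args) {
        // assumed example: sorted positions with a repeated value
        int[] positions = {100, 250, 250, 250, 900};
        System.out.println("binarySearch index: " + Arrays.binarySearch(positions, 250));
        System.out.println("first index:        " + firstIndexOf(positions, 250));
    }
}
```

The walk to the left is linear in the number of duplicates; for long runs of equal values, a lower-bound binary search would be the more efficient choice.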
GeMoMa 1.4.1 (30.05.2017)
- CompareTranscripts: bugfix (NullPointerException)
- Extractor: reference genome can be .*fa.gz and .*fasta.gz
- GeMoMa: bugfix (shutdown problem after timeout)
- modified additional scripts and documentation
GeMoMa 1.4 (03.05.2017)
- AnnotationEvidence: new tool computing tie and tpc for given annotation (gff)
- CompareTranscripts: new tool comparing predicted and given annotation (gff)
- Extractor:
- reading CDS with no parent tag (cf. discontinuous feature)
- automatic recognition of GFF or GTF annotation
- warning if sequences mentioned in the annotation are not included in the reference sequence
- GeMoMa:
- allowing for multiple intron and coverage files (= using different library types at the same time)
- NA instead of "?" for tae, tde, tie, minSplitReads of single coding exon genes
- new default values for the parameters: predictions (10 instead of 1) and contig threshold (0.4 instead of 0.9)
- bugfix (write pc and minCov if possible for last CDS part in predicted annotation)
- bugfix (ref-gene name if no assignment is used)
- bugfix (minSplitReads, minCov, tpc, avgCov if no coverage available)
- GAF:
- nested genes on the same strand
- bugfix (if nothing passes the filter)
GeMoMa 1.3.2 (18.01.2017)
- Extractor: new parameter repair for broken transcript annotations
- GeMoMa: bugfixes (splice site computation)
GeMoMa 1.3.1 (09.12.2016)
- GeMoMa bugfix (finding start/stop codon for very small exons)
GeMoMa 1.3 (06.12.2016)
- ERE: new tool for extracting RNA-seq evidence
- Extractor: offers options for
- partial gene models
- ambiguities
- GeMoMa:
- RNA-seq
- defining splice sites
- additional feature in GFF and output
- transcript intron evidence (tie)
- transcript acceptor evidence (tae)
- transcript donor evidence (tde)
- transcript percentage coverage (tpc)
- ...
- improved GFF
- simplified the command line parameters
- IMPORTANT: parameter names changed for some parameters
- RNA-seq
- GAF: new tool for filtering and combining different predictions (especially of different reference organisms)
GeMoMa 1.1.3 (06.06.2016)
- minor modifications to the Extractor tool
GeMoMa 1.1.2 (05.02.2016)
- GeMoMa bugfix (upstream, downstream sequence for splice site detection)
- Extractor: new parameter s for selecting transcripts
- improved Galaxy integration
GeMoMa 1.1.1 (01.02.2016)
- initial release for paper