GeMoSeq: Difference between revisions

(19 intermediate revisions by the same user not shown)
Line 1: Line 1:
GeMoRNA reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.
GeMoSeq reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.


It is intended as a companion for the homology-based gene prediction program [[GeMoMa]].
It is intended as a companion for the homology-based gene prediction program [[GeMoMa]].


In a typical workflow, predictions of transcript models may be obtained from GeMoRNA for a collection of BAM files individually and subsequently merged using the [[GeMoMa]] Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using [[GeMoMa]] and the resulting GFF files may be merged using the [[#Merge|Merge]] tool of GeMoRNA.
In a typical workflow, predictions of transcript models may be obtained from GeMoSeq for a collection of BAM files individually and subsequently merged using the [[GeMoMa]] Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using [[GeMoMa]] and the resulting GFF files may be merged using the [[#Merge|Merge]] tool of GeMoSeq.




== Command line tool ==
== Command line tool ==


''GeMoRNA is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''
''GeMoSeq is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''


GeMoRNA and auxiliary tools are packaged in one [http://www.jstacs.de/downloads/GeMoRNA-1.0.jar runnable JAR] that may be run from the command line with
GeMoSeq and auxiliary tools are packaged in one [http://www.jstacs.de/downloads/GeMoSeq-1.2.3.jar runnable JAR] that may be run from the command line with
  java -jar GeMoRNA-1.1.jar
  java -jar GeMoSeq-1.2.3.jar


which lists the tools available and usage information
which lists the tools available and usage information


  Available tools:
Available tools:
   
   
  gemorna - GeMoRNA
  gemoseq - GeMoSeq
  predictCDS - Predict CDS from GFF
  predictCDS - Predict CDS from GFF
  GAF - GeMoMa Annotation Filter
  GAF - GeMoMa Annotation Filter
Line 23: Line 23:
  merge - Merge
  merge - Merge
   
   
  Syntax: java -jar GeMoRNA-1.1.jar <toolname> [<parameter=value> ...]
  Syntax: java -jar GeMoSeq-1.2.3.jar <toolname> [<parameter=value> ...]
   
   
  Further info about the tools is given with
  Further info about the tools is given with
  java -jar GeMoRNA-1.1.jar <toolname> info
  java -jar GeMoSeq-1.2.3.jar <toolname> info
   
   
  For tests of individual tools:
  For tests of individual tools:
  java -jar GeMoRNA-1.1.jar <toolname> test [<verbose>]
  java -jar GeMoSeq-1.2.3.jar <toolname> test [<verbose>]
   
   
  Tool parameters are listed with
  Tool parameters are listed with
  java -jar GeMoRNA-1.1.jar <toolname>
  java -jar GeMoSeq-1.2.3.jar <toolname>


You get a list of the tool parameters by calling GeMoRNA-1.0.jar with the corresponding tool name, e.g.,


  java -jar GeMoRNA-1.1.jar gemorna
You get a list of the tool parameters by calling GeMoSeq-1.2.3.jar with the corresponding tool name, e.g.,
 
  java -jar GeMoSeq-1.2.3.jar gemoseq


The meaning of the individual tool parameters is described below.
The meaning of the individual tool parameters is described below.
Line 43: Line 44:
== Source code ==
== Source code ==


The source code of GeMoRNA is available from the [https://github.com/Jstacs/Jstacs/tree/master/projects/gemorna Jstacs GitHub repository].
The source code of GeMoSeq is available from the [https://github.com/Jstacs/Jstacs/tree/master/projects/gemoseq Jstacs GitHub repository].
 
== Examples ==
 
We give examples for applying GeMoSeq to a single sequencing library and for a larger-scale, integrated genome annotation together with GeMoMa [[GeMoSeq-Examples|on a separate wiki page]].


== GeMoRNA ==
== GeMoSeq ==


Prediction of transcript models using GeMoRNA.
Prediction of transcript models using GeMoSeq.




''GeMoRNA'' may be called with
''GeMoSeq'' may be called with


  java -jar GeMoRNA-1.1.jar gemorna
  java -jar GeMoSeq-1.2.3.jar gemoseq


and has the following parameters
and has the following parameters
Line 142: Line 147:
<td>Maximum region length (Maximum length of a region considered before it is split, default = 750000)</td>
<td>Maximum region length (Maximum length of a region considered before it is split, default = 750000)</td>
<td style="width:100px;">INT</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">mrc</font></td>
<td>Maximum region coverage (Maximum coverage in a region before reads are down-sampled, valid range = [0.0, Infinity], default = 100.0)</td>
<td style="width:100px;">DOUBLE</td>
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
Line 175: Line 185:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td><font color="green">threads</font></td>
<td>The number of threads used for the tool, defaults to 1. Currently, I/O of GeMoRNA runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoRNA with a large number of threads is not recommended. On our infrastructure, a number of 6 threads has been the sweet spot.</td>
<td>The number of threads used for the tool, defaults to 1. Currently, I/O of GeMoSeq runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoSeq with a large number of threads is not recommended. On our infrastructure, a number of 6 threads has been the sweet spot.</td>
<td>INT</td>
<td>INT</td>
</tr>
</tr>
Line 182: Line 192:
'''Example:'''
'''Example:'''


  java -jar GeMoRNA-1.1.jar gemorna g=&lt;Genome&gt; m=&lt;Mapped_reads&gt;
  java -jar GeMoSeq-1.2.3.jar gemoseq g=&lt;Genome&gt; m=&lt;Mapped_reads&gt;


== Predict CDS from GFF ==
== Predict CDS from GFF ==


Prediction of CDSs using the longest-ORF heuristic based on an existing GFF or GTF file.




''Predict CDS from GFF'' may be called with
''Predict CDS from GFF'' may be called with


  java -jar GeMoRNA-1.1.jar predictCDS
  java -jar GeMoSeq-1.2.3.jar predictCDS


and has the following parameters
and has the following parameters
Line 226: Line 235:
'''Example:'''
'''Example:'''


  java -jar GeMoRNA-1.1.jar predictCDS g=&lt;Genome&gt; p=&lt;predicted_annotation&gt;
  java -jar GeMoSeq-1.2.3.jar predictCDS g=&lt;Genome&gt; p=&lt;predicted_annotation&gt;
 


== Merge ==
== Merge ==


Merging GeMoRNA and GeMoMa predictions.
 


''Merge'' may be called with
''Merge'' may be called with


  java -jar GeMoRNA-1.1.jar merge
  java -jar GeMoSeq-1.2.3.jar merge


and has the following parameters
and has the following parameters
Line 251: Line 261:
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA</font></td>
<td><font color="green">GeMoSeq</font></td>
<td>GeMoRNA (GeMoRNA predictions, type = gff,gff3)</td>
<td>GeMoSeq (GeMoSeq predictions, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 269: Line 279:
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA-strict</font></td>
<td><font color="green">GeMoSeq-strict</font></td>
<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
<td>GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 280: Line 290:
</tr>
</tr>
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">GeMoRNA-strict</font></td>
<td><font color="green">GeMoSeq-strict</font></td>
<td>GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3)</td>
<td>GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3)</td>
<td style="width:100px;">FILE</td>
<td style="width:100px;">FILE</td>
</tr>
</tr>
Line 299: Line 309:
'''Example:'''
'''Example:'''


  java -jar GeMoRNA-1.0.jar merge g=&lt;GeMoMa&gt; GeMoRNA=&lt;GeMoRNA&gt;
  java -jar GeMoSeq-1.2.3.jar merge g=&lt;GeMoMa&gt; GeMoSeq=&lt;GeMoSeq&gt;
 
 
 
 
== Version history ==
 
* Version 1.2.3 (2025/11/11): Renamed the tool to GeMoSeq and improved prediction from long-read data
* [http://www.jstacs.de/downloads/GeMoRNA-1.2.1.jar Version 1.2.1] (2025/05/28): improved handling of exceptions in multi-thread mode
* [http://www.jstacs.de/downloads/GeMoRNA-1.2.jar Version 1.2] (2025/05/12): changes in the following tools
** gemorna: fixed a problem where (incomplete) CDS would be predicted in transcripts without any proper stop codon
* [http://www.jstacs.de/downloads/GeMoRNA-1.1.jar Version 1.1] (2025/04/15): changes in the following tools
** merge: include flag if low-confidence predictions will be included in "annotate" mode
** gemorna: allow to provide custom prefix for gene names and to include the chromosome into the gene names
* [http://www.jstacs.de/downloads/GeMoRNA-1.0.jar Version 1.0]: initial version of GeMoRNA

Latest revision as of 16:08, 18 November 2025

GeMoSeq reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.

It is intended as a companion for the homology-based gene prediction program GeMoMa.

In a typical workflow, predictions of transcript models may be obtained from GeMoSeq for a collection of BAM files individually and subsequently merged using the GeMoMa Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using GeMoMa and the resulting GFF files may be merged using the Merge tool of GeMoSeq.
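
The per-library part of this workflow can be sketched as a small shell loop. This is only a sketch under stated assumptions, not an official pipeline: the file names (genome.fa, libA.bam, libB.bam) and the results/ directory layout are hypothetical, and the function merely prints the commands so that they can be inspected before anything is run. The subsequent GAF step is not shown because its parameters are not described on this page; they can be listed with java -jar GeMoSeq-1.2.3.jar GAF.

```shell
#!/bin/sh
# Hypothetical inputs: adjust the JAR location, genome and BAM files to your setup.
JAR="GeMoSeq-1.2.3.jar"
GENOME="genome.fa"

# Print the GeMoSeq command for one coordinate-sorted BAM file.
# Each library gets its own output directory (an assumed layout) so that
# the outputs of different libraries are kept apart.
# Assumption: create results/<sample> beforehand if the tool does not create it.
gemoseq_cmd() {
  sample=$(basename "$1" .bam)
  echo "java -jar $JAR gemoseq g=$GENOME m=$1 outdir=results/$sample"
}

# Dry run: print one command per library; pipe the output to `sh` to execute.
for bam in libA.bam libB.bam; do
  gemoseq_cmd "$bam"
done
```

Printing the commands first keeps the sketch safe to try without the JAR or any data in place.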


Command line tool

GeMoSeq is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

GeMoSeq and auxiliary tools are packaged in one runnable JAR that may be run from the command line with

java -jar GeMoSeq-1.2.3.jar

which lists the tools available and usage information

Available tools:

	gemoseq - GeMoSeq
	predictCDS - Predict CDS from GFF
	GAF - GeMoMa Annotation Filter
	Analyzer - Analyzer
	merge - Merge

Syntax: java -jar GeMoSeq-1.2.3.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar GeMoSeq-1.2.3.jar <toolname> info

For tests of individual tools:
	java -jar GeMoSeq-1.2.3.jar <toolname> test [<verbose>]

Tool parameters are listed with
	java -jar GeMoSeq-1.2.3.jar <toolname>


You get a list of the tool parameters by calling GeMoSeq-1.2.3.jar with the corresponding tool name, e.g.,

java -jar GeMoSeq-1.2.3.jar gemoseq

The meaning of the individual tool parameters is described below. For convenience, we also include the GeMoMa tools Analyzer and GAF.

Source code

The source code of GeMoSeq is available from the Jstacs GitHub repository.

Examples

We give examples for applying GeMoSeq to a single sequencing library and for a larger-scale, integrated genome annotation together with GeMoMa on a separate wiki page.

GeMoSeq

Prediction of transcript models using GeMoSeq.


GeMoSeq may be called with

java -jar GeMoSeq-1.2.3.jar gemoseq

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna,fasta) FILE
m Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam) FILE
s Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) STRING
l Longest intron length (Length of the longest intron reported, default = 100000) INT
sil Shortest intron length (Length of the shortest intron considered, default = 10) INT
lr Long reads (Long-read mode, default = false) BOOLEAN
mnor Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0) DOUBLE
mfor Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01) DOUBLE
mnoir Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0) DOUBLE
mfoir Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01) DOUBLE
p Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9) DOUBLE
mrpg Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0) DOUBLE
mrpt Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0) DOUBLE
pa Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05) DOUBLE
sf Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0) DOUBLE
mrl Maximum region length (Maximum length of a region considered before it is split, default = 750000) INT
mrc Maximum region coverage (Maximum coverage in a region before reads are down-sampled, valid range = [0.0, Infinity], default = 100.0) DOUBLE
mfgl Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50) INT
q Quality filter (Minimum mapping quality required for a read to be considered, default = 40) INT
mpl Minimum protein length (Minimum length of protein in AA, default = 70) INT
gp Gene prefix (Prefix to add to all gene names, default = G) STRING
gnwc Gene names with chromosome (If true, gene names will be constructed as <Gene prefix><chr>.<geneNumber>. Gene numbers will be assigned successively across all chromosomes., default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1. Currently, I/O of GeMoSeq runs on a single thread and runtime is limited by I/O performance. Hence, running GeMoSeq with a large number of threads is not recommended. On our infrastructure, a number of 6 threads has been the sweet spot. INT

Example:

java -jar GeMoSeq-1.2.3.jar gemoseq g=<Genome> m=<Mapped_reads>
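
Optional parameters are appended in the same parameter=value form. The following sketch assembles a call that sets the strandedness, the number of threads and the output directory; the file names and the output directory are hypothetical, and FR_FIRST_STRAND is chosen only for illustration, so please check which setting matches your library protocol. The thread count of 6 follows the recommendation in the table above.

```shell
#!/bin/sh
# Hypothetical input files; replace with your own genome and sorted BAM file.
GENOME="genome.fa"
BAM="sample.sorted.bam"

# Print a GeMoSeq call with non-default strandedness (s), thread count (threads)
# and output directory (outdir). All other parameters keep their defaults.
gemoseq_stranded_cmd() {
  printf 'java -jar GeMoSeq-1.2.3.jar gemoseq g=%s m=%s s=%s threads=%s outdir=%s\n' \
    "$GENOME" "$BAM" "FR_FIRST_STRAND" "6" "gemoseq_out"
}

# Dry run: show the command instead of executing it.
gemoseq_stranded_cmd
```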

Predict CDS from GFF

Predict CDS from GFF may be called with

java -jar GeMoSeq-1.2.3.jar predictCDS

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna,fasta) FILE
p predicted annotation ("GFF or GTF file containing the predicted annotation", type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz) FILE
m Minimum protein length (Minimum length of protein in AA, default = 70) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoSeq-1.2.3.jar predictCDS g=<Genome> p=<predicted_annotation>
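
As a further illustration, the sketch below prints a call that raises the minimum protein length and writes to a separate output directory. The input names (genome.fa, transcripts.gtf.gz) and the values passed to the function are hypothetical; a gzipped GTF file is used only because gtf.gz is among the listed input types.

```shell
#!/bin/sh
# Hypothetical inputs: a genome FASTA and an existing (gzipped) GTF annotation.
GENOME="genome.fa"
ANNOTATION="transcripts.gtf.gz"

# Print a predictCDS call.
#   $1: minimum protein length in amino acids (parameter m, default 70)
#   $2: output directory (parameter outdir)
predict_cds_cmd() {
  echo "java -jar GeMoSeq-1.2.3.jar predictCDS g=$GENOME p=$ANNOTATION m=$1 outdir=$2"
}

# Dry run: show the command instead of executing it.
predict_cds_cmd 100 cds_out
```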


Merge

Merge may be called with

java -jar GeMoSeq-1.2.3.jar merge

and has the following parameters

name comment type

g GeMoMa (GeMoMa predictions, type = gff,gff3) FILE
GeMoSeq GeMoSeq (GeMoSeq predictions, type = gff,gff3) FILE
m Mode (range={intersect, union, intermediate, annotate}, default = intersect) STRING
	No parameters for selection "intersect"
	No parameters for selection "union"
	Parameters for selection "intermediate":
		GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
		GeMoSeq-strict GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3) FILE
	Parameters for selection "annotate":
		GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
		GeMoSeq-strict GeMoSeq-strict (GeMoSeq predictions with strict settings, type = gff,gff3) FILE
		l Low-confidence (include low-confidence predictions, default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoSeq-1.2.3.jar merge g=<GeMoMa> GeMoSeq=<GeMoSeq>
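
The modes "intermediate" and "annotate" need the additional strict prediction files. The sketch below prints a Merge call for each mode. It assumes that the mode-specific parameters are passed with the same parameter=value syntax as all other parameters, and all GFF file names are hypothetical placeholders for the output of earlier GeMoMa and GeMoSeq runs.

```shell
#!/bin/sh
# Hypothetical GFF files from earlier GeMoMa and GeMoSeq runs.
GEMOMA="gemoma.gff"
GEMOSEQ="gemoseq.gff"
GEMOMA_STRICT="gemoma_strict.gff"
GEMOSEQ_STRICT="gemoseq_strict.gff"

# Print a Merge command for the mode given as $1.
merge_cmd() {
  base="java -jar GeMoSeq-1.2.3.jar merge g=$GEMOMA GeMoSeq=$GEMOSEQ m=$1"
  case "$1" in
    intersect|union)
      # These modes take no further parameters.
      echo "$base" ;;
    intermediate)
      echo "$base GeMoMa-strict=$GEMOMA_STRICT GeMoSeq-strict=$GEMOSEQ_STRICT" ;;
    annotate)
      # l=false excludes low-confidence predictions (the default is true).
      echo "$base GeMoMa-strict=$GEMOMA_STRICT GeMoSeq-strict=$GEMOSEQ_STRICT l=false" ;;
    *)
      echo "unknown mode: $1" >&2
      return 1 ;;
  esac
}

# Dry run: show two of the possible calls instead of executing them.
merge_cmd union
merge_cmd annotate
```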



Version history

  • Version 1.2.3 (2025/11/11): Renamed the tool to GeMoSeq and improved prediction from long-read data
  • Version 1.2.1 (2025/05/28): improved handling of exceptions in multi-thread mode
  • Version 1.2 (2025/05/12): changes in the following tools
    • gemorna: fixed a problem where (incomplete) CDS would be predicted in transcripts without any proper stop codon
  • Version 1.1 (2025/04/15): changes in the following tools
    • merge: added a flag that controls whether low-confidence predictions are included in "annotate" mode
    • gemorna: added options to provide a custom prefix for gene names and to include the chromosome in the gene names
  • Version 1.0: initial version of GeMoRNA