GeMoRNA

Revision as of 16:32, 8 November 2024 by Grau (talk | contribs)

GeMoRNA reconstructs genes and transcript models from mapped RNA-seq reads (in coordinate-sorted BAM format) and reports these in GFF format.

It is intended as a companion for the homology-based gene prediction program GeMoMa.

In a typical workflow, transcript models are predicted with GeMoRNA for each BAM file of a collection individually and subsequently merged using the GeMoMa Annotation Filter (GAF). Optionally, homology-based gene prediction may be performed using GeMoMa, and the resulting GeMoMa predictions may be merged with the GeMoRNA predictions using the Merge tool of GeMoRNA.
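
The first step of this workflow, one GeMoRNA run per BAM file, might be scripted as in the following sketch. The file names (genome.fa, leaf.bam, root.bam) and the per-sample output directories are hypothetical; only the parameters g, m and outdir are taken from the parameter list of the gemorna tool. The subsequent GAF step is not shown because its parameters are not described on this page; they can be listed with java -jar GeMoRNA-1.0.jar GAF.

```shell
# Hypothetical inputs; replace with your own files.
JAR=GeMoRNA-1.0.jar
GENOME=genome.fa
BAMS="leaf.bam root.bam"   # coordinate-sorted BAM files, one per sample

for BAM in $BAMS; do
  # Derive a sample name from the file name, e.g. leaf.bam -> leaf.
  SAMPLE=$(basename "$BAM" .bam)

  # One GeMoRNA run per BAM file, each with its own output directory.
  CMD="java -jar $JAR gemorna g=$GENOME m=$BAM outdir=gemorna_$SAMPLE"
  echo "$CMD"

  # Execute only if the JAR is present (paths must not contain spaces).
  if [ -f "$JAR" ]; then
    $CMD
  else
    echo "JAR not found, command printed only"
  fi
done
```

Writing each run to a separate output directory keeps the per-sample predictions apart until they are merged.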


Command line tool

GeMoRNA is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

GeMoRNA and auxiliary tools are packaged in one runnable JAR that may be run from the command line with

java -jar GeMoRNA-1.0.jar

which lists the available tools and usage information:

 Available tools:

	gemorna - GeMoRNA
	predictCDS - Predict CDS from GFF
	GAF - GeMoMa Annotation Filter
	Analyzer - Analyzer
	merge - Merge

Syntax: java -jar GeMoRNA-1.0.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar GeMoRNA-1.0.jar <toolname> info

For tests of individual tools:
	java -jar GeMoRNA-1.0.jar <toolname> test [<verbose>]

Tool parameters are listed with
	java -jar GeMoRNA-1.0.jar <toolname>

To list the parameters of a tool, call GeMoRNA-1.0.jar with the corresponding tool name, e.g.,

java -jar GeMoRNA-1.0.jar gemorna

The meaning of the individual tool parameters is described below. For convenience, we also include the GeMoMa tools Analyzer and GAF.

Source code

The source code of GeMoRNA is available from the Jstacs GitHub repository.

GeMoRNA

Prediction of transcript models using GeMoRNA.


GeMoRNA may be called with

java -jar GeMoRNA-1.0.jar gemorna

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna,fasta) FILE
m Mapped reads (Mapped Reads in BAM format, coordinate sorted, type = bam) FILE
s Stranded (Library strandedness, range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) STRING
l Longest intron length (Length of the longest intron reported, default = 100000) INT
sil Shortest intron length (Length of the shortest intron considered, default = 10) INT
lr Long reads (Long-read mode, default = false) BOOLEAN
mnor Minimum number of reads (Minimum number of reads required for an edge in the read graph, default = 1.0) DOUBLE
mfor Minimum fraction of reads (Minimum fraction of reads relative to adjacent exons that must support an intron in the enumeration, default = 0.01) DOUBLE
mnoir Minimum number of intron reads (Minimum number of reads required for an intron, default = 1.0) DOUBLE
mfoir Minimum fraction of intron reads (Minimum fraction of reads relative to adjacent exons for an intron to be considered, default = 0.01) DOUBLE
p Percent explained (Percent of abundance that must be explained by transcript models after quantification, default = 0.9) DOUBLE
mrpg Minimum reads per gene (Minimum abundance required for a gene to be reported, default = 40.0) DOUBLE
mrpt Minimum reads per transcript (Minimum abundance required for a transcript to be reported, default = 20.0) DOUBLE
pa Percent abundance (Minimum relative abundance required for a transcript to be reported, default = 0.05) DOUBLE
sf Successive fraction (Factor of the drop in abundance between successive transcript models, default = 20.0) DOUBLE
mrl Maximum region length (Maximum length of a region considered before it is split, default = 750000) INT
mfgl Maximum filled gap length (Maximum length of a gap filled by dummy reads, default = 50) INT
q Quality filter (Minimum mapping quality required for a read to be considered, default = 40) INT
mpl Minimum protein length (Minimum length of protein in AA, default = 70) INT
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 INT

Example:

java -jar GeMoRNA-1.0.jar gemorna g=<Genome> m=<Mapped_reads>
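
A call that sets some of the optional parameters from the table above might look like the following sketch. The file names and the output directory are hypothetical, and the chosen values (FR_FIRST_STRAND, 4 threads) are only illustrations; the appropriate strandedness depends on the library protocol.

```shell
# Hypothetical inputs; replace with your own files.
JAR=GeMoRNA-1.0.jar
GENOME=genome.fa
BAM=sample.sorted.bam      # must be coordinate sorted

# s, threads and outdir are optional; all other parameters keep their defaults.
CMD="java -jar $JAR gemorna g=$GENOME m=$BAM s=FR_FIRST_STRAND threads=4 outdir=gemorna_out"
echo "$CMD"

# Execute only if the JAR is present (paths must not contain spaces).
if [ -f "$JAR" ]; then
  $CMD
else
  echo "JAR not found, command printed only"
fi
```

For long-read data, lr=true could be added in the same key=value form.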


Predict CDS from GFF

Prediction of CDSs using the longest-ORF heuristic based on an existing GFF or GTF file.


Predict CDS from GFF may be called with

java -jar GeMoRNA-1.0.jar predictCDS

and has the following parameters

name comment type

g Genome (Genome sequence as FastA, type = fa,fna,fasta) FILE
p Predicted annotation (GFF or GTF file containing the predicted annotation, type = gff,gff3,gff.gz,gff3.gz,gtf,gtf.gz) FILE
m Minimum protein length (Minimum length of protein in AA, default = 70) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar predictCDS g=<Genome> p=<predicted_annotation>
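
As an illustration, the following sketch adds CDS predictions to an existing GTF file and raises the minimum protein length from the default of 70 to 100 AA. The file names and the output directory are hypothetical; the parameters g, p, m and outdir are those listed above.

```shell
# Hypothetical inputs; replace with your own files.
JAR=GeMoRNA-1.0.jar
GENOME=genome.fa
ANNOTATION=transcripts.gtf   # existing GFF or GTF annotation without CDS

# m=100 requires predicted proteins of at least 100 AA.
CMD="java -jar $JAR predictCDS g=$GENOME p=$ANNOTATION m=100 outdir=cds_out"
echo "$CMD"

# Execute only if the JAR is present (paths must not contain spaces).
if [ -f "$JAR" ]; then
  $CMD
else
  echo "JAR not found, command printed only"
fi
```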


Merge

Merging GeMoRNA and GeMoMa predictions.

Merge may be called with

java -jar GeMoRNA-1.0.jar merge

and has the following parameters

name comment type

g GeMoMa (GeMoMa predictions, type = gff,gff3) FILE
GeMoRNA GeMoRNA (GeMoRNA predictions, type = gff,gff3) FILE
m Mode (range={intersect, union, intermediate, annotate}, default = intersect) STRING
	No parameters for selection "intersect"
	No parameters for selection "union"
	Parameters for selection "intermediate":
		GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
		GeMoRNA-strict GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3) FILE
	Parameters for selection "annotate":
		GeMoMa-strict GeMoMa-strict (GeMoMa predictions with strict settings, type = gff,gff3) FILE
		GeMoRNA-strict GeMoRNA-strict (GeMoRNA predictions with strict settings, type = gff,gff3) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar GeMoRNA-1.0.jar merge g=<GeMoMa> GeMoRNA=<GeMoRNA>
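
The modes "intermediate" and "annotate" additionally expect the two strict prediction files. A call in "intermediate" mode might look like the following sketch; all file names and the output directory are hypothetical, while the parameter names are taken from the table above.

```shell
# Hypothetical inputs; replace with your own files.
JAR=GeMoRNA-1.0.jar
GEMOMA=gemoma.gff                  # GeMoMa predictions
GEMORNA=gemorna.gff                # GeMoRNA predictions
GEMOMA_STRICT=gemoma_strict.gff    # GeMoMa predictions with strict settings
GEMORNA_STRICT=gemorna_strict.gff  # GeMoRNA predictions with strict settings

# m=intermediate requires the GeMoMa-strict and GeMoRNA-strict parameters.
CMD="java -jar $JAR merge g=$GEMOMA GeMoRNA=$GEMORNA m=intermediate GeMoMa-strict=$GEMOMA_STRICT GeMoRNA-strict=$GEMORNA_STRICT outdir=merged"
echo "$CMD"

# Execute only if the JAR is present (paths must not contain spaces).
if [ -f "$JAR" ]; then
  $CMD
else
  echo "JAR not found, command printed only"
fi
```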