GeMoSeq-Examples

From Jstacs
Jump to navigationJump to search

GeMoSeq example run

We use a data set for yeast for an example run of GeMoSeq. As input of GeMoSeq, we need a reference genome and a library of reads mapped to that reference. Here, we use the S. cerevisiae S288C reference genome R64 available from NCBI. For later evaluation of GeMoSeq prediction results, we also download the reference annotation. As a sequencing library, we use the library with accession SRR8552464 available from the European Nucleotide archive or NCBI Sequence Read Archive.

Mapping

If you just want to test GeMoSeq, you can skip this step and proceed with the download provided at the end of this section.

Before proceeding with GeMoSeq, these reads first need to be mapped to the reference genome. For yeast and other eukaryotes, we should use a splicing-aware mapping tool. Our current favourite is STAR also available from bioconda.

First, we need to index the reference genome with

STAR --runMode genomeGenerate --genomeDir yeast_index --genomeFastaFiles GCF_000146045.2_R64_genomic.fna --genomeSAindexNbases 10

Here, we specify --genomeSAindexNbases 10 since the yeast genome is rather small.

Assuming reads have been stored in files SRR8552464_1.fastq.gz and SRR8552464_2.fastq.gz, we run STAR with the following command

STAR --genomeDir yeast_index --readFilesIn SRR8552464_1.fastq.gz SRR8552464_2.fastq.gz --readFilesCommand gunzip -c --outFileNamePrefix star/SRR8552464 --alignIntronMax 20000 --alignMatesGapMax 20000 --outSAMtype BAM SortedByCoordinate

Here, we limit the maximum intron size to 20000, since larger introns do not exist in yeast, but STAR tends to produce spurios long-split mappings, and let STAR output an already coordinate-sorted BAM file.

For convenience, we provide the resulting mapping file for download.

Running GeMoSeq

The resulting mapping file and the reference genome used for mapping are input of GeMoSeq in the next step. For the next steps, we assume that the BAM file has been moved to a file SRR8552464.bam in the working directory (or has been downloaded from the provided link).

We leave most arguments at their defaults, calling GeMoSeq with

java -jar GeMoSeq-1.2.3.jar gemoseq g=GCF_000146045.2_R64_genomic.fna m=SRR8552464.bam s=FR_FIRST_STRAND threads=6 outdir=gemoseq

Here, we use the parameter s=FR_FIRST_STRAND to specify that this is a stranded library and set the number of threads to 6. The output directory is set to gemoseq. At execution, GeMoSeq creates a temporary file with predictions in the working directory, which is later copied to the specified output directory together with a protocol of the GeMoSeq run. In the output directory, you will find the final predictions as a file Transcript_Predictions.gff3.

Evaluating prediction performance

To evaluate the accuracy of GeMoSeq predictions, we compare the GeMoSeq GFF against the yeast reference annotation. For this purpose, the GeMoSeq Jar file provides the GeMoMa Analyzer tool, which can be run with

java -jar GeMoSeq-1.2.3.jar Analyzer t=GCF_000146045.2_R64_genomic.gff p=gemoseq/Transcript_Predictions.gff3

The Analyzer tools outputs several metrics, including CDS sensitivity and precision, which are in this case 72.13 and 90.67, respectively.

Using multiple libraries in combination with homology-based gene prediction

We use a recently published genome of a Nicotiana benthamiana lab strain ([1]) to illustrate how GeMoSeq can be used to re-annotate a genome based on multiple sequencing libraries and in combination with the homology-based gene prediction of GeMoMa.

In the GeMoSeq preprint, we sampled 319 sequencing libraries for N. benthamiana for this purpose, and we provide the corresponding ENA accession table for download.

All of these libraries need to be downloaded, mapped to the genome sequence, and strandedness needs to be determined. For mapping, we follow the same procedure as described for the GeMoSeq example run above. To determine strandedness, we use featureCounts applied to the previous genome annotation with flags for either strand orientation and choose the one with the larger number of assigned reads, or unstranded if both yield roughly equal numbers of assigned reads. The strand orientation that has been determined by this procedure is listed in column "type" of the accession table.

Applying GeMoSeq to individual libraries

Assuming that the mapped reads are stored in directories named by the accession of the sequencing run, namely mappings/<accession>/Aligned.sortedByCoord.out.bam, e.g., mappings/DRR377420/Aligned.sortedByCoord.out.bam, we need to call GeMoSeq for each of these mapping files independently. This can be achieved by a (bash) for loop over all BAM files or in parallel on a compute cluster using a scheduling system like slurm.

For instance, library DRR377420 has strand orientation FR_FIRST_STRAND (see above), and the corresponding GeMoSeq call would be

java -jar GeMoSeq-1.2.3.jar gemoseq g=NbLab360.genome.fasta m=mappings/DRR377420/Aligned.sortedByCoord.out.bam s=FR_FIRST_STRAND threads=6 outdir=gemoseq/DRR377420

where we, again, choose the name of the output directory based on the accession of the sequencing run. Within that directory, each GeMoSeq run will produce a file Transcript_Predictions.gff3 with the predicted gene models.

After GeMoSeq has been run for all mapping files, the resulting prediction files are located in gemoseq/<accession>/Transcript_Predictions.gff3.

Joining individual predictions

To yield a joint GeMoSeq prediction across all libraries, we join the individual predictions using the GeMoMa Annotation Filter (GAF) tool of GeMoMa, which also comes packaged with the GeMoSeq JAR file. For future merging with homology-based GeMoMa predictions, we run the GAF tool twice, once with looser and once with stricter filtering criteria.

For the looser filtering, we specify

java -jar GeMoSeq-1.2.3.jar GAF p=DRR377420 g=gemoseq/DRR377420/Transcript_Predictions.gff3 ... p=SRR29507616 g=gemoseq/SRR29507616/Transcript_Predictions.gff3 f="start=='M' and stop=='*' and sumWeight>7 and (isNaN(tie) or tie==1)" atf="sumWeight>23" mnotpg=20 tf=true outdir=gemoseq_loose

where ... indicates that the arguments p and g need to be repeated for each of the individual predictions. The filtering criteria (parameter f) specify that each tranferred transcript must have a proper start and stop codon, must appear in more than 7 (of 319) individual predictions and may either be a single-exon transcript (isNaN(tie)) or have perfect support for all splice sites (tie==1). The parameter atf defines that alternative transcripts are only transferred, if they occur in at least 24 of 319 libraries, parameter mnotpg specifies the maximum number of transcripts per gene, and tf=true indicates that additional features (i.e., exons + CDS) should be tranferred to the final prediction. After the GAF run, the joint predictions are stored in gemoseq_loose/filtered_predictions.gff.

For the stricter filtering, we use adapted parameters

java -jar GeMoSeq-1.2.3.jar GAF p=DRR377420 g=gemoseq/DRR377420/Transcript_Predictions.gff3 ... p=SRR29507616 g=gemoseq/SRR29507616/Transcript_Predictions.gff3 f="start=='M' and stop=='*' and sumWeight>30 and (isNaN(tie) or tie==1)" atf="sumWeight>70" mnotpg=20 tf=true outdir=gemoseq_strict

and obtain joint predictions in gemoseq_strict/filtered_predictions.gff.

Homology-based gene prediction

For homology-based gene prediction, we use GeMoMa for the reference species A. thaliana, N. tabacum, S. lycopersicum, and two previous N. benthamiana genome versions (1.0.1 and 2.6.1). These predictions are also joined using the GAF tool.

For details about using GeMoMa for homology-based gene prediction, please refer to GeMoMa-Docs.

Here, we assume that for the homology-based predictions, we also obtain a loosely and a strictly filtered prediction file as gemoma_loose/filtered_predictions.gff and gemoma_strict/filtered_predictions.gff, respectively.


Merging RNA-seq-based and homology-based predictions

We finally merge the four previously obtained prediction files (loose and strict for GeMoSeq and GeMoMa, respectively) using the GeMoSeq Merge tool. We call

java -jar GeMoSeq-1.2.3.jar merge g=gemoma_loose/filtered_predictions.gff GeMoSeq=gemoseq_loose/filtered_predictions.gff m=annotate GeMoMa-strict=gemoma_strict/filtered_predictions.gff GeMoSeq-strict=gemoseq_strict/filtered_predictions.gff l=false outdir=final

where parameter m specifies that merged predictions should be annotated with their confidence (high, medium) and l=false indicates that low-confidence predictions should be excluded from the output. After running the Merge tool, final predictions can be found in final/Merged_Predictions.gff.