MeDeMo: Difference between revisions

From Jstacs
Jump to navigationJump to search
Line 12: Line 12:
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.
Source code is available from the [https://github.com/Jstacs/Jstacs Jstacs github page] in package <code>projects.methyl</code>.


Examples data (also used for the code examples below) is [http://www.jstacs.de/downloads/MeDeMo-examples.tgz available for download].


== Tools ==
== Tools ==

Revision as of 13:15, 19 March 2020

Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.

We present MeDeMo (Methylation and Dependencies in Motifs) a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending Slim models. We show that dependencies between nucleotides, captured by MeDeMo are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.


Download

MeDeMo is available as

Source code is available from the Jstacs github page in package projects.methyl.

Examples data (also used for the code examples below) is available for download.

Tools

The description of tools and tool parameters refers to the command line version, but the same parameters are also present in the GUI version. Additional help may be requested in the GUI version by clicking on the "?" button.


Data Extractor

Data Extractor prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter "Width".

In case of ChIP data, the center position could for instance be the peak summit. An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:

> peak: 50; signal: 515
ggccatgtgtatttttttaaatttccac...
> peak: 50; signal: 199
GGTCCCCTGGGAGGATGGGGACGTGCTG...
...

where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.


If you experience problems using Data Extractor, please contact us.


Data Extractor may be called with

java -jar MeDeMo-1.0.jar extract

and has the following parameters

name comment type

g Genome (The FastA containing all chromosome sequences, may be gzipped) FILE
p Peaks (The file containing the peaks in tabular format) FILE
c Chromosome column (The column of the peaks file containing the chromosome, default = 1) INT
s Start column (The column of the peaks file containing the start position relative to the chromsome start, default = 2) INT
pp Peak position (The kind how the peak is specified, range={Peak center, End of peak}, default = End of peak)
Parameters for selection "Peak center":
cc Center column (The column of the peaks file containing the peak center relative to the start position) INT
Parameters for selection "End of peak":
e End column (The column of the peaks file containing the end position relative to the chromsome start, default = 3) INT
w Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000) INT
sc Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar MeDeMo-1.0.jar extract g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz p=0data/HepG2/NRF1_ENCFF313RFR_train1.bed outdir=1extracted/NRF1_HepG2_train1

Methyl SlimDimont

Methyl SlimDimont is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.

Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool. Input sequences may also obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest. Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. In case of ChIP data, this anchor position could for instance be the peak summit. An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:

> peak: 50; signal: 515
ggccatgtgtatttttttaaatttccac...
> peak: 50; signal: 199
GGTCCCCTGGGAGGATGGGGACGTGCTG...
...

where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively. The FastA comment may contain additional annotations of the format key1 : value1; key2: value2;....

Accordingly, you would need to set the parameter "Position tag" to peak and the parameter "Value tag" to signal for the input file (default values). The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.

For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. However, you may want adjust these parameters to meet your prior information.

The parameter "Markov order of the motif model" sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to 0, you obtain a position weight matrix (PWM) model. If it is set to 1, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most 3.

The parameter "Markov order of the background model" sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. If this parameter is set to -1, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to 4 resulted in an increased prediction performance in our case studies. The maximum allowed value is 5.

The parameter "Weighting factor" defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of 0.2 typically works well. For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. 0.01.

The "Equivalent sample size" reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.

The parameter "Delete BSs from profile" defines if BSs of already discovered motifs should be deleted, i.e., "blanked out", from the sequence before searching for futher motifs.

You can also install this web-application within your local Galaxy server. Instructions can be found at the Dimont_ page of Jstacs. There you can also download a command line version of Dimont.

If you experience problems using Methyl SlimDimont, please contact us.


Methyl SlimDimont may be called with

java -jar MeDeMo-1.0.jar slimdimont

and has the following parameters

name comment type

a Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM) STRING
i Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool) FILE
b Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input)
Parameters for selection "background file":
bf Background file (The file name of the file containing background sequences in annotated FastA format., OPTIONAL) FILE
No parameters for selection "shuffled input"
p Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak) STRING
v Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal) STRING
s Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0) DOUBLE
w Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2) DOUBLE
Starts Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20) INT
imw Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20) INT
m Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model)
Parameters for selection "LSlim model":
md Maximum distance (The maximum distance considered in the LSlim model, valid range = [1, 2147483647], default = 5) INT
Parameters for selection "Markov model":
o Order (The order of the Markov model, valid range = [0, 5], default = 0) INT
moobm Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1) INT
e Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0) DOUBLE
d Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for futher motifs., default = true) BOOLEAN
afs Adjust for shifts (Adjust for shifts of the motif., default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 INT

Example:

java -jar MeDeMo-1.0.jar slimdimont i=1extracted/NRF1_HepG2_train1/Extracted_sequences.fasta m="Markov model" outdir=2train threads=8

Sequence Scoring

Sequence Scoring scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by "Methyl SlimDimont" and provides per sequence information of i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastaA header) of the sequence.

The purpose of this tool mainly is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.

If you experience problems using Sequence Scoring, please contact us.


Sequence Scoring may be called with

java -jar MeDeMo-1.0.jar score

and has the following parameters

name comment type

i Input sequences (Input sequences in FastA format) FILE
m Model (Model XML as output by Methyl SlimDimont) FILE
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar MeDeMo-1.0.jar score i=1extracted/NRF1_GM12878_test1/Extracted_sequences.fasta m=2train/Motif_1/SlimDimont_1.xml outdir=3score/NRF1_GM12878

Evaluate Scoring

Evaluate Scoring computes the area under the ROC curve and under the precision recall curve based on the scoring of a positive and a negative set of sequences. Optionally, also the curves may be drawn.

Evaluate Scoring may be called with

java -jar MeDeMo-1.0.jar eval

and has the following parameters

name comment type

p Positives (Output of "Sequence Scoring" for positive test sequences.) FILE
n Negatives (Output of "Sequence Scoring" for negative test sequences.) FILE
c Curves (Also compute and draw curves, default = false) BOOLEAN
u Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar MeDeMo-1.0.jar eval p=3score/NRF1_GM12878/Predictions.tsv n=3score/negatives/Predictions.tsv c=true outdir=4eval

Motif scores

Motif scores computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file Motif_scores.tsv.gz containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.

Motif scores may be called with

java -jar MeDeMo-1.0.jar motif

and has the following parameters

name comment type

m Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)
Parameters for selection "Dimont":
d Dimont motif (Dimont motif model description) FILE
Parameters for selection "HOCOMOCO":
h HOCOMOCO PWM (PWM from the HOCOMOCO database) FILE
Parameters for selection "Jaspar":
j Jaspar PFM (PFM in Jaspar format) FILE
g Genome (Genome as FastA file) FILE
f FAI of genome (FastA index file of the genome) FILE
b Bin width (The width of the genomic bins considered) INT
l Low-memory mode (Use slower mode with a smaller memory footprint, default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 INT

Example:

java -jar MeDeMo-1.0.jar motif d=2train/Motif_1/SlimDimont_1.xml g=0data/genomes/HepG2_converted_genome.unmasked.fa.gz f=0data/genomes/HepG2_converted_genome.unmasked.fa.fai outdir=7scores b=50

Quick Prediction Tool

Quick Prediction Tool predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont.

The tool outputs a list of predictions including, for every prediction, the IDof the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter "Background sample").

If you experience problems using Quick Prediction Tool, please contact us.


Quick Prediction Tool may be called with

java -jar MeDeMo-1.0.jar quickpred

and has the following parameters

name comment type

d Dimont model (The model returned by Dimont (in XML format)) FILE
s Sequences (The sequences (e.g., a genome) to scan for binding sites) FILE
b Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample)
No parameters for selection "sub-sample"
Parameters for selection "background sequences":
bs Background sequences (The sequences (e.g., a genome) for determining the prediction threshold) FILE
t Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level)
Parameters for selection "significance level":
sl Significance level (The significance level for determining the prediction threshold, valid range = [0.0, 1.0E-4], default = 1.0E-6) DOUBLE
Parameters for selection "number of sites":
n Number of sites (The number of expected binding sites for determining the prediction threshold, valid range = [1, 1000000], default = 10000) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar MeDeMo-1.0.jar quickpred d=2train/Motif_1/SlimDimont_1.xml s=0data/genomes/HepG2_converted_genome.unmasked.fa.gz sl=1e-5 outdir=6predict

Methylation Sensitivity

Methylation Sensitivity determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by "Methyl SlimDimont", and a prediction file as output from the corresponding training run.

Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.

If you experience problems using Methylation Sensitivity, please contact us.


Methylation Sensitivity may be called with

java -jar MeDeMo-1.0.jar msens

and has the following parameters

name comment type

m Model (The XML file containing the model) FILE
p Predictions (The file containing the predictions from the training run) FILE
v Verbose (Output MpH sensitivity profile for every input sequence, default = false) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar MeDeMo-1.0.jar msens m=2train/Motif_1/SlimDimont_1.xml p=2train/Motif_1/Predictions_for_motif_1.tsv outdir=5msens