MeDeMo: Difference between revisions
(Created page with "=== Data Extractor === '''Data Extractor''' prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabula...") |
No edit summary |
||
Line 1: | Line 1: | ||
Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation. | |||
We present MeDeMo (Methylation and Dependencies in Motifs) a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending [[Slim]] models. We show that dependencies between nucleotides, captured by MeDeMo are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding. | |||
=== Data Extractor === | === Data Extractor === | ||
Revision as of 12:59, 19 March 2020
Accurate models describing the binding specificity of transcription factors (TFs) are essential for a better understanding of transcriptional regulation. Aside from chromatin accessibility and sequence specificity, several studies suggested that DNA methylation influences TF binding in both activating and repressive ways. However, currently available TF motif inference and TF binding site prediction approaches do not adequately incorporate DNA methylation.
We present MeDeMo (Methylation and Dependencies in Motifs) a novel framework for TF motif discovery and TFBS prediction that incorporates DNA methylation by extending Slim models. We show that dependencies between nucleotides, captured by MeDeMo are essential to represent DNA methylation and that MeDeMo achieves superior prediction performance compared to related approaches. The inferred TF motifs are highly interpretable and can provide new insights into the relation between DNA methylation and TF binding.
Data Extractor
Data Extractor prepares an annotated FastA file as required by Dimont from a genome (in FastA format, including methylated variants) and a tabular file (e.g., BED, GTF, narrowPeak,...). The regions specified in the tabular file are used to determine the center of the extracted sequences. All extracted sequences have the same length as specified by parameter "Width".
In case of ChIP data, the center position could for instance be the peak summit. An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit might look like:
> peak: 50; signal: 515 ggccatgtgtatttttttaaatttccac... > peak: 50; signal: 199 GGTCCCCTGGGAGGATGGGGACGTGCTG... ...
where the center is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.
If you experience problems using Data Extractor, please contact us.
Data Extractor may be called with
java -jar MeDeMo-1.0.jar extract
and has the following parameters
name | comment | type | ||||||||||||
g | Genome (The FastA containing all chromosome sequences, may be gzipped) | FILE | ||||||||||||
p | Peaks (The file containing the peaks in tabular format) | FILE | ||||||||||||
c | Chromosome column (The column of the peaks file containing the chromosome, default = 1) | INT | ||||||||||||
s | Start column (The column of the peaks file containing the start position relative to the chromsome start, default = 2) | INT | ||||||||||||
pp | Peak position (The kind how the peak is specified, range={Peak center, End of peak}, default = End of peak) | |||||||||||||
| ||||||||||||||
w | Width (The fixed width of all extracted regions, valid range = [1, 10000], default = 1000) | INT | ||||||||||||
sc | Statistics column (The column of the peaks file containing the peak statistic or a similar measure of confidence, default = 7) | INT | ||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar MeDeMo-1.0.jar extract
Methyl SlimDimont
Methyl SlimDimont is a tool for de-novo motif discovery from DNA sequences including extended, e.g., methylation-aware alphabets.
Input sequences must be supplied in an annotated FastA format as generated by the Data Extractor tool. Input sequences may also obtained from other sources. In this case, the annotation of each sequence needs to provide a value that reflects the confidence that this sequence is bound by the factor of interest. Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. In case of ChIP data, this anchor position could for instance be the peak summit. An annotated FastA file for ChIP-seq data comprising sequences of length 100 centered around the peak summit could look like:
> peak: 50; signal: 515 ggccatgtgtatttttttaaatttccac... > peak: 50; signal: 199 GGTCCCCTGGGAGGATGGGGACGTGCTG... ...
where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively.
The FastA comment may contain additional annotations of the format key1 : value1; key2: value2;...
.
Accordingly, you would need to set the parameter "Position tag" to peak
and the parameter "Value tag" to signal
for the input file (default values).
The parameter Alphabet specifies the symbols of the (extended) alphabet and their complementary symbols. Default is standard DNA alphabet.
For the standard deviation of the position prior, the initial motif length and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. However, you may want adjust these parameters to meet your prior information.
The parameter "Markov order of the motif model" sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to 0
, you obtain a position weight matrix (PWM) model.
If it is set to 1
, you obtain a weight array matrix (WAM) model. You can set the order of the motif model to at most 3
.
The parameter "Markov order of the background model" sets the order of the homogeneous Markov model used for modeling positions not covered by a motif.
If this parameter is set to -1
, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to 4
resulted in an increased prediction performance in our case studies. The maximum allowed value is 5
.
The parameter "Weighting factor" defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of 0.2
typically works well.
For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. 0.01
.
The "Equivalent sample size" reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.
The parameter "Delete BSs from profile" defines if BSs of already discovered motifs should be deleted, i.e., "blanked out", from the sequence before searching for futher motifs.
You can also install this web-application within your local Galaxy server. Instructions can be found at the Dimont_ page of Jstacs. There you can also download a command line version of Dimont.
If you experience problems using Methyl SlimDimont, please contact us.
Methyl SlimDimont may be called with
java -jar MeDeMo-1.0.jar slimdimont
and has the following parameters
name | comment | type | ||||||||||||
a | Alphabet (Characters of the alphabet as a string of unseparated characters, first listing the symbols in forward orientation and then their complement in the same order. For instance, a methylation-aware alphabet would be specified as ACGTMH,TGCAHM and a standard DNA alphabet as ACGT,TGCA, default = ACGTMH,TGCAHM) | STRING | ||||||||||||
i | Input file (The file name of the file containing the input sequences in annotated FastA format as generated by the Data Extractor tool) | FILE | ||||||||||||
b | Background sample (Background sample containing negative examples, may be di-nucleotide shuffled input sequences, range={background file, shuffled input}, default = shuffled input) | |||||||||||||
| ||||||||||||||
p | Position tag (The tag for the position information in the FastA-annotation of the input file, default = peak) | STRING | ||||||||||||
v | Value tag (The tag for the value information in the FastA-annotation of the input file, default = signal) | STRING | ||||||||||||
s | Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0) | DOUBLE | ||||||||||||
w | Weighting factor (The value for weighting the data, between 0 and 1, valid range = [0.0, 1.0], default = 0.2) | DOUBLE | ||||||||||||
Starts | Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20) | INT | ||||||||||||
imw | Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 20) | INT | ||||||||||||
m | Model type (The type of the motif model; a PWM model corresponds to a Markov model of order 0., range={LSlim model, Markov model}, default = LSlim model) | |||||||||||||
| ||||||||||||||
moobm | Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1) | INT | ||||||||||||
e | Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0) | DOUBLE | ||||||||||||
d | Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for futher motifs., default = true) | BOOLEAN | ||||||||||||
afs | Adjust for shifts (Adjust for shifts of the motif., default = true) | BOOLEAN | ||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING | ||||||||||||
threads | The number of threads used for the tool, defaults to 1 | INT |
Example:
java -jar MeDeMo-1.0.jar slimdimont
Sequence Scoring
Sequence Scoring scans a set of input sequences (e.g., sequences under ChIP-seq peaks) for a given motif model (provided as XML as output by "Methyl SlimDimont" and provides per sequence information of i) the start position and strand of the best motif match, ii) the corresponding maximum score, iii) the log-sum occupancy score, iv) the matching sequence, and v) the ID (FastaA header) of the sequence.
The purpose of this tool mainly is to determine per-sequence scores for classification, for instance, distinguishing bound from unbound sequences.
If you experience problems using Sequence Scoring, please contact us.
Sequence Scoring may be called with
java -jar MeDeMo-1.0.jar score
and has the following parameters
name | comment | type |
i | Input sequences (Input sequences in FastA format) | FILE |
m | Model (Model XML as output by Methyl SlimDimont) | FILE |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar MeDeMo-1.0.jar score
Evaluate Scoring
Evaluate Scoring computes the area under the ROC curve and under the precision recall curve based on the scoring of a positive and a negative set of sequences. Optionally, also the curves may be drawn.
Evaluate Scoring may be called with
java -jar MeDeMo-1.0.jar eval
and has the following parameters
name | comment | type |
p | Positives (Output of "Sequence Scoring" for positive test sequences.) | FILE |
n | Negatives (Output of "Sequence Scoring" for negative test sequences.) | FILE |
c | Curves (Also compute and draw curves, default = false) | BOOLEAN |
u | Use sum-occupancy (Use log-sum occupancy score instead of maximum, default = false) | BOOLEAN |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar MeDeMo-1.0.jar eval
Motif scores
Motif scores computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file Motif_scores.tsv.gz containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.
Motif scores may be called with
java -jar MeDeMo-1.0.jar motif
and has the following parameters
name | comment | type | ||||||||||||||||||
m | Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont) | |||||||||||||||||||
| ||||||||||||||||||||
g | Genome (Genome as FastA file) | FILE | ||||||||||||||||||
f | FAI of genome (FastA index file of the genome) | FILE | ||||||||||||||||||
b | Bin width (The width of the genomic bins considered) | INT | ||||||||||||||||||
l | Low-memory mode (Use slower mode with a smaller memory footprint, default = true) | BOOLEAN | ||||||||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING | ||||||||||||||||||
threads | The number of threads used for the tool, defaults to 1 | INT |
Example:
java -jar MeDeMo-1.0.jar motif
Quick Prediction Tool
Quick Prediction Tool predicts binding sites of a transcription factor based on a motif model and is also suited for genome-wide predictions. The motif model is provided as the XML output of (Slim) Dimont.
The tool outputs a list of predictions including, for every prediction, the IDof the sequence (e.g., chromosome) containing the binding site, position and strand of the matching sub-sequence, its score according to the model, the sub-sequence itself (in strand orientation according to the model), and a p-value from a normal distribution fitted to the score distribution of the provided negative examples or a sub-sample of the input data (parameter "Background sample").
If you experience problems using Quick Prediction Tool, please contact us.
Quick Prediction Tool may be called with
java -jar MeDeMo-1.0.jar quickpred
and has the following parameters
name | comment | type | ||||||||||||
d | Dimont model (The model returned by Dimont (in XML format)) | FILE | ||||||||||||
s | Sequences (The sequences (e.g., a genome) to scan for binding sites) | FILE | ||||||||||||
b | Background sample (The sequences for determining the prediction threshold. Either a sub-sample of the input sequences or a dedicated background data set., range={sub-sample, background sequences}, default = sub-sample) | |||||||||||||
| ||||||||||||||
t | Threshold specification (The way of defining the prediction threshold. Either by explicitly defining a significance level or by specifying the number of expected sites, range={significance level, number of sites}, default = significance level) | |||||||||||||
| ||||||||||||||
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar MeDeMo-1.0.jar quickpred
Methylation Sensitivity
Methylation Sensitivity determines average methylation sensitivity profiles for CpG dinucleotides converted to MpG, CpH, and MpH. As input, it needs a model XML as generated by "Methyl SlimDimont", and a prediction file as output from the corresponding training run.
Optionally, Methylation Sensitivity also generates per-sequence methylation sensitivity profiles for the MpH context.
If you experience problems using Methylation Sensitivity, please contact us.
Methylation Sensitivity may be called with
java -jar MeDeMo-1.0.jar msens
and has the following parameters
name | comment | type |
m | Model (The XML file containing the model) | FILE |
p | Predictions (The file containing the predictions from the training run) | FILE |
v | Verbose (Output MpH sensitivity profile for every input sequence, default = false) | BOOLEAN |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Example:
java -jar MeDeMo-1.0.jar msens