Slim

From Jstacs
Revision as of 14:58, 2 July 2015 by Grau (talk | contribs) (→‎Paper)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigationJump to search

by Jens Keilwagen Jan Grau

Sparse local inhomogeneous mixture (Slim) models are statistical models for discrete sequences (as for instance DNA sequences) that allow for simultanous discriminative learning of features and model parameters. Slim models can be used in combination with Dimont for de-novo motif discovery.

Abstract

Binding of transcription factors to the DNA is one of the keystones of gene regulation. While the existence of statistical dependencies between binding site positions is widely accepted, their relevance for computational predictions has been debated.

Building probabilistic models of binding sites that may capture dependencies is still a challenge, since the most successful, discriminative approaches for motif discovery require numerical optimization techniques, which are not suited for selecting dependency structures. To overcome this issue, we propose sparse local inhomogeneous mixture (Slim) models that employ soft feature selection combining putative dependency structures in a weighted manner allowing for numerical optimization of dependency structure and model parameters simultaneously.

We find that Slim models yield a substantially better prediction performance than previous models on genomic context PBM data sets and on a large collection of ChIP-seq data sets from the ENCODE project. To elucidate the reasons for the improved performance, we develop dependency logos, which allow for a visual inspection of dependency structures within binding sites. We find that the dependency structures discovered by Slim models are highly diverse and highly transcription factor-specific, which emphasizes the need for flexible dependency models. The dependency structures we find range from broad heterogeneities to sparse dependencies between neighboring and non-neighboring binding site positions, which might be related to the transcription factor binding characteristics.

Paper

The paper Varying levels of complexity in transcription factor binding motifs presenting Slim models and dependency logos has been published in NAR.

Slim web-server

Slim models are available in a public web-server at galaxy.informatik.uni-halle.de.

Download

Slim models are implemented in Java using Jstacs.

You can download the command line application for learning Slim models from ChIP-seq data within the Dimont framework as a Jar.

In addition, we provide

for installing it in your local Galaxy server.

Slim models and the corresponding applications will be part of the next public release of the Jstacs library.

Running the command line application

For running the command line application, Java v1.6 or later is required.

The arguments of the command line application have the following meaning:

name comment type

home Home directory (The path to the directory containing the input file. Output files are written to this directory as well., default = ./) String
data Input file (The file name of the file containing the input sequences in annotated FastA format (see below)) String
infix Infix (a infix to be used for all output files (model, sequence logos, predicted binding sites)) String
position Position tag (The tag for the position information in the FastA-annotation of the input file) String
value Value tag (The tag for the value information in the FastA-annotation of the input file) String
sd Standard deviation (The standard deviation of the position distribution centered at the position specified by the position tag, valid range = [1.0, 10000.0], default = 75.0) Double
weightingFactor Weighting factor (The value for weighting the data; either a value between 0 and 1, or a description relative to the standard deviation (e.g. +4sd), default = 0.2) Double
starts Starts (The number of pre-optimization runs., valid range = [1, 100], default = 20) Integer
motifWidth Initial motif width (The motif width that is used initially, may be adjusted during optimization., valid range = [1, 50], default = 15) Integer
motifOrder Markov order of motif model or negative distance of Slim model (The Markov order of the model or negative distance of Slim model for the motif., valid range = [-2147483648, 3], default = 0) Integer
bgOrder Markov order of background model (The Markov order of the model for the background sequence and the background sequence, -1 defines uniform distribution., valid range = [-1, 5], default = -1) Integer
ess Equivalent sample size (Reflects the strength of the prior on the model parameters., valid range = [0.0, Infinity], default = 4.0) Double
delete Delete BSs from profile (A switch for deleting binding site positions of discovered motifs from the profile before searching for futher motifs., default = true) Boolean
modify Adjust for shifts (Adjust for shifts of the motif., default = true) Boolean
threads Compute threads (The number of threads that are use to evaluate the objective function and its gradient., valid range = [1, 128], OPTIONAL) Integer

Input sequences must be supplied in an annotated FastA format. In the annotation of each sequence, you need to provide a value that reflects the confidence that this sequence is bound by the factor of interest. Such confidences may be peak statistics (e.g., number of fragments under a peak) for ChIP data or signal intensities for PBM data. In addition, you need to provide an anchor position within the sequence. In case of ChIP data, this anchor position could for instance be the peak summit. For instance, an annotated FastA file for ChIP-exo data comprising sequences of length 100 centered around the peak summit could look like:

> peak: 50; signal: 515
ggccatgtgtatttttttaaatttccac...
> peak: 50; signal: 199
GGTCCCCTGGGAGGATGGGGACGTGCTG...
...

where the anchor point is given as 50 for the first two sequences, and the confidence amounts to 515 and 199, respectively. The FastA comment may contain additional annotations of the format key1 : value1; key2: value2;..... We also provide an example input file.

Accordingly, you would need to set the parameter "position" to peak and the parameter "value" to signal for the input file.

For the initial motif width and the number of pre-optimization runs, we provide default values that worked well in our studies on ChIP and PBM data. However, you may want adjust these parameters to meet your prior information.

The parameter "motifOrder" sets the order of the inhomogeneous Markov model used for modeling the motif. If this parameter is set to 0, you obtain a position weight matrix (PWM) model. If it is set to 1, you obtain a weight array matrix (WAM) model. For negative values, you obtain LSlim models with distance equal to the negative value given as "motifOrder". Hence, if motifOrder is set to -motifWidth, you obtain a full Slim model.

The parameter "bgOrder" sets the order of the homogeneous Markov model used for modeling positions not covered by a motif. If this parameter is set to -1, you obtain a uniform distribution, which worked well for ChIP data. For PBM data, orders of up to 4 resulted in an increased prediction performance in our case studies. The maximum allowed value is 5.

The parameter "weightingFactor" defines the proportion of sequences that you expect to be bound by the targeted factor with high confidence. For ChIP data, the default value of 0.2 typically works well. For PBM data, containing a large number of unspecific probes, this parameter should be set to a lower value, e.g. 0.01.

The "ess" reflects the strength of the influence of the prior on the model parameters, where higher values smooth out the parameters to a greater extent.

The parameter "delete" defines if BSs of already discovered motifs should be deleted, i.e., "blanked out", from the sequence before searching for futher motifs.

Using the parameter "threads" you can set the number of threads that are used by SlimDimont. If you do not provide a value for this parameter, all available threads are used.

For instance, you can start SlimDimont on the example input file by calling

java -jar SlimDimont.jar home=./ data=dimont-example.fa infix=example position=peak value=signal threads=2 motifOrder=-5

using a LSlim(5) model as motif model, a uniform distribution as background model and two threads for optimization.

For larger data sets, the standard memory allocation of Java may not be sufficient. You can increase the amount of RAM allocated by the virtual machine using the VM parameters -Xms and -Xmx. To start SlimDimont using initially 512MB and a maximum of 2GB of RAM, you call

java -Xms512M -Xmx2G -jar SlimDimont.jar home=./ data=dimont-example.fa infix=example position=peak value=signal threads=2 motifOrder=-5

Data preparation

We provide a Perl script for generating an annotated FastA file in the format required by SlimDimont from chromosome sequences and a BED file. The chromosome sequences should be provided in a directory containing one FastA file for each chromosome, where the filename should match the chromosome identifier of the BED file with extension .fa. For instance, if chromosome 1 is identified as chr1 in the BED file, the corresponding sequence should be available in a file chr1.fa. The complete list of parameters of this Perl script is

perl extract_data.pl <chromDir> <bedfile> <chromcol> <poscol> <widthcol> <statcol> <outfile>
   
   <chromDir>: the directory containing the chromosome FastAs (as 
               typically downloaded from UCSC as "chromFa"): one FastA per
               chromosome, one sequence per file
   <bedfile>:  the file containing the peaks in tabular format, 
               e.g., bed, gff, narrowPeak
   <chromcol>: the column of <bedfile> containing the chromosome
   <poscol>:   the column of <bedfile> containing the position relative to
               the chromosome start
   <widthcol>: either i) <int> the column of <bedfile> containing the width of 
                         the peak or any region to be extracted 
                         symmetrically around <poscol>
               or    ii) f<int> a fixed width of all regions, centered at the
                         position given in <poscol>, e.g., f100 for a fixed
                         width of 100 bp
   <statcol>:  the column of <bedfile> containing the peak statistic
               or a similar measure of confidence
   <outfile>:  the path to the output file, written as FastA