Catchitt: Difference between revisions

From Jstacs
Jump to navigationJump to search
 
(60 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.
Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011).
The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent [https://doi.org/10.1186/s13059-018-1614-y paper].
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straight-forward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.
The implementation in Catchitt has been streamlined and slightly simplified to make its application more straight-forward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.


== Chatchitt tools ==
== Catchitt tools ==


Chatchitt comprises five tools for the individual steps of the pipeline. The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.
Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks.
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.
The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.
The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.
Line 11: Line 11:
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.
The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.


== Availability ==
== Downloads ==


We provide Catchitt as a pre-compiled JAR file and also publish its sources under GPL 3. For compiling Chatchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.
We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.


* JAR download
''Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.''
* Source download and Jstacs [[Downloads]]
 
* [http://www.jstacs.de/downloads/Catchitt-0.1.4.jar JAR download]
* the source code of Catchitt is available from [https://github.com/Jstacs/Jstacs github] in package projects.encodedream.
* [http://www.jstacs.de/downloads/motifs.tgz motifs] used in the ENCODE-DREAM challenge
 
== Citation ==
 
If you use Catchitt in your research, please cite
 
J. Keilwagen, S. Posch, and J. Grau. [https://doi.org/10.1186/s13059-018-1614-y Accurate prediction of cell type-specific transcription factor binding]. ''Genome Biology'', 20(1):9, 2019.


== Usage ==
== Usage ==
In the following <code>Catchitt.jar</code> stands for the Catchitt binary in its current version, which currently would be 0.1.4. So every occurrence of <code>Catchitt.jar</code> needs to be replaced by <code>Catchitt-0.1.4.jar</code> when running code examples with the current Catchitt binary version.


Catchitt can be started by calling
Catchitt can be started by calling
Line 29: Line 41:
   
   
  access - Chromatin accessibility
  access - Chromatin accessibility
methyl - Methylation levels
  motif - Motif scores
  motif - Motif scores
  labels - Derive labels
  labels - Derive labels
Line 153: Line 166:


  java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase
  java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase
=== Methylation levels ===
''Methylation levels'' may be called with
java -jar Catchitt.jar methyl
and has the following parameters
<table border=0 cellpadding=10 align="center" width="100%">
<tr>
<td>name</td>
<td>comment</td>
<td>type</td>
</tr>
<tr><td colspan=3><hr></td></tr>
<tr style="vertical-align:top">
<td><font color="green">i</font></td>
<td>Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">f</font></td>
<td>FastA index (The genome index, mime = fai)</td>
<td style="width:100px;">FILE</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">b</font></td>
<td>Bin width (The width of the genomic bins considered)</td>
<td style="width:100px;">INT</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
</table>
'''Example:'''
java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50


=== Motif scores ===
=== Motif scores ===


''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.
''Motif scores'' computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as [[Dimont]] motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.




Line 212: Line 267:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">l</font></td>
<td><font color="green">l</font></td>
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = false)</td>
<td>Low-memory mode (Use slower mode with a smaller memory footprint, default = true)</td>
<td>BOOLEAN</td>
<td>BOOLEAN</td>
</tr>
</tr>
Line 218: Line 273:
<td><font color="green">outdir</font></td>
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td>The number of threads used for the tool, defaults to 1</td>
<td>STRING</td>
<td>STRING</td>
</tr>
</tr>
Line 228: Line 288:
=== Iterative Training ===
=== Iterative Training ===


''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif featuture files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the bin width parameter. Output is an XML file containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.
''Iterative Training'' performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the bin width parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.


''Iterative Trainin'' may be called with
''Iterative Training'' may be called with


  java -jar Catchitt.jar itrain
  java -jar Catchitt.jar itrain
Line 306: Line 366:
<td><font color="green">outdir</font></td>
<td><font color="green">outdir</font></td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>The output directory, defaults to the current working directory (.)</td>
<td>STRING</td>
</tr>
<tr style="vertical-align:top">
<td><font color="green">threads</font></td>
<td>The number of threads used for the tool, defaults to 1</td>
<td>STRING</td>
<td>STRING</td>
</tr>
</tr>
Line 315: Line 380:


=== Prediction ===
=== Prediction ===
''Prediction'' predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format, the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.
''Prediction'' may be called with
java -jar Catchitt.jar predict
and has the following parameters


<table border=0 cellpadding=10 align="center">
<table border=0 cellpadding=10 align="center">
Line 335: Line 408:
<tr style="vertical-align:top">
<tr style="vertical-align:top">
<td><font color="green">m</font></td>
<td><font color="green">m</font></td>
<td>Motif (File containing motif features)</td>
<td>Motif (File containing motif features) MAY BE USED MULTIPLE TIMES</td>
<td>FILE</td>
<td>FILE</td>
</tr>
</tr>
Line 369: Line 442:
</tr>
</tr>
</table>
</table>
'''Example'''
java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict
== Standard pipeline ==
The standard Catchitt pipeline would comprise the following steps
* for a training cell type, collect ChIP-seq peak files (preferably ''conservative'' and ''relaxed'' peaks) in narrowPeak format and derive labels for genomic regions (''Derive labels'')
* for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (''Chromatin accessibility'')
* collect or learn (e.g., using [[Dimont]] a set of motif models for the transcription factor of interest, and scan the genome using these motif models (''Motif scores'')
* perform iterative training given the labels and feature files (''Iterative Training'')
* predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (''Prediction'')
== Tutorial using ENCODE data ==
We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines.
This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.
For best performance, we would further recommend
* to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our [http://www.jstacs.de/downloads/motifs.tgz motif collection] of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks
* to use replicate information for DNase data, for instance using the [https://github.com/kundajelab/atac_dnase_pipelines pipeline of the Kundaje lab]
In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.
=== Obtaining training and test data ===
First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from [https://www.encodeproject.org ENCODE] at
https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
After download, the genome needs to be gunzipped and indexed using the [http://www.htslib.org samtools] faidx command:
gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
In the following, we assume that genome FastA and index are in the base directory.
In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from [https://www.encodeproject.org ENCODE] under accessions ENCSR000ENB and ENCSR000EOR, respectively.
Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:
https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig
For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following:
The corresponding "conservative" and "relaxed" peak files for astrocytes are available from
https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz
Again, the peak files need to be gunzipped for the following steps.
Finally, we need a motif model for CTCF, which we download from [http://hocomoco11.autosome.ru HOCOMOCO] in this case
http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm
We organize all these files (and the Catchitt JAR) in the following directory structure
.:
Catchitt.jar
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
./astrocytes:
ENCFF183YLB.bed
ENCFF600CYD.bed
ENCFF901UBX.bigWig
./fibroblasts:
ENCFF652HJH.bigWig
./motifs/CTCF/:
CTCF_HUMAN.H11MO.0.A.pwm
=== Deriving labels ===
As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.
For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling
java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\
    r=astrocytes/ENCFF600CYD.bed\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 rw=200 outdir=astrocytes/labels
Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which prediction are made, may be found in the [https://doi.org/10.1186/s13059-018-1614-y Catchitt paper].
The result is a file astrocytes/labels/Labels.tsv.gz with the following format
chr1    0      U
chr1    50      U
chr1    100    U
chr1    150    U
chr1    200    U
chr1    250    U
where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.
=== Preparing DNase data from bigwig format ===
We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.
java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=astrocytes/access
The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format
chr1    1033400 0.03954650089144707    0.05627769976854324    0.009126120246946812    0.030420400202274323    0.06692489981651306    1.03125 3.0    1.0    0.0
chr1    1033450 0.030420400202274323    0.03650449961423874    0.009126120246946812    0.030420400202274323    0.045630600303411484    1.03125 2.0    0.0    0.0
chr1    1033500 0.024336300790309906    0.03346240147948265    0.009126120246946812    0.030420400202274323    0.045630600303411484    1.03125 2.0    1.0    0.0
chr1    1033550 0.01825219951570034    0.024336300790309906    0.009126120246946812    0.024336300790309906    0.060840800404548645    1.03125 2.0    0.0    1.0
where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are
* minimum DNase value in bin,
* median DNase value in bin,
* minimum in 1000 bp after bin  start,
* minimum in 1000 bp before bin  start,
* maximum in 1000 bp after bin  start,
* maximum in 1000 bp before bin  start,
* the number of steps in the bin profile,
* the length of the longest monotonically increasing range in the bin,
* the length of the longest monotonically decreasing range in the bin.
=== Preparing motif scores ===
We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:
java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3
The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format
chr1    46950  -4.996643      -4.9543528358429105
chr1    47000  -5.984124      -5.451674735652041
chr1    47050  -0.8633305      -0.4596223585537509
chr1    47100  -4.9379983      -4.813470561120627
where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are
* the maximum motif score within the bin,
* the logarithm of the exponentials of the individual scores with the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.
=== Iterative training ===
With all the feature files prepared, we may now run the iterative training procedure. Here, we use all main chromosomes for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters.
''At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.''
We start iterative training by calling
java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\
    l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr17,chr18,chr19,chr20,chr22'\
    itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8
which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.
=== Predicting binding in new cell types ===
Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":
java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    p="chr8" outdir=fibroblasts/predict
This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region.
This file has three columns, corresponding to chromosome, starting position, and binding probability:
chr8    265850  0.9866555574053496
chr8    265900  0.9865107771922306
chr8    265950  0.9864837006927715
chr8    266000  0.8041139249973046
chr8    266050  0.19870629729482686
chr8    266100  0.1302269536110939
chr8    266150  0.09693322015563202
=== Using DNase-seq BAM files and multiple motifs ===
Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing.
Here, we download the BAM files corresponding to the previous bigwig files from ENCODE
https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam
and sort them into the directory structure.
In addition, we use four motifs from the ''used-for-all-TFs'' directory of our [http://www.jstacs.de/downloads/motifs.tgz motif collection].
Afterwards, the directory structure should look like
.:
Catchitt.jar
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
./astrocytes:
ENCFF183YLB.bed
ENCFF600CYD.bed
ENCFF901UBX.bigWig
ENCFF384CCQ.bam
./fibroblasts:
ENCFF652HJH.bigWig
ENCFF368XNE.bam
./motifs/CTCF/:
CTCF_HUMAN.H11MO.0.A.pwm
./motifs/CTCF_Slim:
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml
./motifs/JUND_Slim:
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml
./motifs/MAX_Slim:
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml
./motifs/SP1:
ENCSR000BHK_SP1-human_1_hg19-model-2.xml
Now, we first compute the DNase-seq features from the BAM files using the "access" tool:
java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/
We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:
java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/SP1 threads=3
The remaining motif models are [[Slim]] models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires to increase the memory reserved by Java:
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/CTCF_Slim l=false threads=3
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/JUND_Slim l=false threads=3
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/MAX_Slim l=false threads=3
Finally, we start the iterative training using the new feature files:
java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\
    m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\
    t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr17,chr18,chr19,chr20,chr22'\
    itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8
Please note that we used the parameter "m" multiple times to specify the different motif-based features files.
It is important to specify these motifs in the same order when calling the "predict" afterwards, i.e.
java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\
    m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs
The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.
== Version history ==
* Catchitt v0.1.4: Fixing a bug that lead to an Exception if chromosomes did not appear in lexicographical order in the input files
* [http://www.jstacs.de/downloads/Catchitt-0.1.3.jar Catchitt v0.1.3]: Bugfix to load Catchitt classifiers learned with older Catchitt versions
* [http://www.jstacs.de/downloads/Catchitt-0.1.2.jar Catchitt v0.1.2]: Bugfixes, new experimental tools for handling methylation levels
* [http://www.jstacs.de/downloads/Catchitt_0.1.1.jar Catchitt v0.1.1]: Bugfixes for border cases; reduced debugging output
* Catchitt v0.1: [http://www.jstacs.de/downloads/Catchitt_0.1.jar Initial release]

Latest revision as of 20:51, 7 October 2021

Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays. The initial implementation of this methodology has been one of the winning approaches of the ENCODE-DREAM challenge ([1]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011) and a recent paper. The implementation in Catchitt has been streamlined and slightly simplified to make its application more straight-forward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.

Catchitt tools

Catchitt comprises five tools for the individual steps of the pipeline (see below). The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks. The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads. The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from Dimont, including Slim models. The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features. The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.

Downloads

We provide Catchitt as a pre-compiled JAR file and also publish its source code under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.

Catchitt is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

  • JAR download
  • the source code of Catchitt is available from github in package projects.encodedream.
  • motifs used in the ENCODE-DREAM challenge

Citation

If you use Catchitt in your research, please cite

J. Keilwagen, S. Posch, and J. Grau. Accurate prediction of cell type-specific transcription factor binding. Genome Biology, 20(1):9, 2019.

Usage

In the following Catchitt.jar stands for the Catchitt binary in its current version, which currently would be 0.1.4. So every occurrence of Catchitt.jar needs to be replaced by Catchitt-0.1.4.jar when running code examples with the current Catchitt binary version.


Catchitt can be started by calling

java -jar Catchitt.jar

on the command line. This lists the names of the available tools with a short description:

Available tools:

	access - Chromatin accessibility
	methyl - Methylation levels
	motif - Motif scores
	labels - Derive labels
	itrain - Iterative Training
	predict - Prediction

Syntax: java -jar Catchitt.jar <toolname> [<parameter=value> ...]

Further info about the tools is given with
	java -jar Catchitt.jar <toolname> info

Tool parameters are listed with
	java -jar Catchitt.jar <toolname>

Tools

Derive labels

Derive labels computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of genomic regions that is assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) is the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label. This output file together with a protocol of the tool run is saved to the specified output directory.

Derive labels may be called with

java -jar Catchitt.jar labels

and has the following parameters

name comment type

c Conservative peaks (NarrowPeak file containing the conservative peaks) FILE
r Relaxed peaks (NarrowPeak file containing the relaxed peaks) FILE
f FAI of genome (FastA index file of the genome) FILE
b Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50) INT
rw Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar Catchitt.jar labels c=conservative.narrowPeak r=relaxed.narrowPeak f=hg19.fa.fai b=50 rw=200 outdir=labels


Chromatin accessibility

Chromatin accessibility computes several chromatin accessibility features from DNase-seq or ATAC-seq data provided as fold-enrichment tracks or SAM/BAM files of mapped reads. Features a computed with a certain resolution defined by the bin width parameter. Setting this parameter to 50, for instance, features are computed for non-overlapping 50 bp bins along the genome. If input data are provided as SAM/BAM file, coverage information is extracted and normalized locally in a similar fashion as proposed for the MACS peak caller. Output is provided as a gzipped file 'Chromatin_accessibility.tsv.gz' with columns chromosome, start position of the bin, minimum coverage and median coverage in the current bin, minimum coverage in 1000 bp regions before and after the current bin, maximum coverage in 1000 bp regions before and after the current bin, the number of steps in the coverage profile, and the number of monotonically increasing and decreasing steps in the coverage profile of the current bin. This output file together with a protocol of the tool run is saved to the specified output directory.

Chromatin accessibility may be called with

java -jar Catchitt.jar access

and has the following parameters


name comment type

d Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)
Parameters for selection "BAM/SAM":
i Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads) FILE
Parameters for selection "Bigwig":
i Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads) FILE
f FastA index (The genome index) FILE
b Bin width (The width of the genomic bins considered) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar Catchitt.jar access d="Bigwig" i=fold_enrich.bw f=hg19.fa.fai b=50 outdir=dnase


Methylation levels

Methylation levels may be called with

java -jar Catchitt.jar methyl

and has the following parameters

name comment type

i Input Bed.gz (The bedMethyl file (gzipped) containing the methylation levels, mime = bed.gz) FILE
f FastA index (The genome index, mime = fai) FILE
b Bin width (The width of the genomic bins considered) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example:

java -jar Catchitt.jar methyl i=Input_Bed.gz f=hg19.fa.fai b=50


Motif scores

Motif scores computes features based on motif scores of a given motif model scanning sub-sequences along the genome. Motif scores are aggregated in bins of the specified width as maximum score and log of the average exponential score (i.e., average log-likelihood in case of statistical models). The motif model may be provided as PWMs in HOCOMOCO or PFMs in Jaspar format, or as Dimont motif models in XML format. For more complex motif models like Slim models, the current implementation uses several indexes to speed-up the scanning process. However, computation of these indexes is rather memory-consuming and often not reasonable for simple PWM models. Hence, a low-memory variant of the tool is available, which is typically only slightly slower for PWM models but substantially slower for Slim models. Output is provided as a gzipped file 'Motif_scores.tsv.gz' containing columns chromosome, start position, maximum and average score. This output file together with a protocol of the tool run is saved to the specified output directory.


Motif scores may be called with

java -jar Catchitt.jar motif

and has the following parameters

name comment type

m Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)
Parameters for selection "Dimont":
d Dimont motif (Dimont motif model description) FILE
Parameters for selection "HOCOMOCO":
h HOCOMOCO PWM (PWM from the HOCOMOCO database) FILE
Parameters for selection "Jaspar":
j Jaspar PFM (PFM in Jaspar format) FILE
g Genome (Genome as FastA file) FILE
f FAI of genome (FastA index file of the genome) FILE
b Bin width (The width of the genomic bins considered) INT
l Low-memory mode (Use slower mode with a smaller memory footprint, default = true) BOOLEAN
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 STRING

Example

java -jar Catchitt.jar motif m=HOCOMOCO h=motif.pwm g=hg19.fa f=hg19.fa.fai b=50 outdir=motifs

Iterative Training

Iterative Training performs an iterative training with the specified number of iterations to obtain a series of classifiers that may be used for predictions in the same cell type or in other cell types based on a corresponding set of feature files. The tool requires as input labels for the training chromosomes, a chromatin accessibility feature file and a set of motif feature files. From the labels, an initial set of training regions is extracted containing all positive examples labeled as 'S' (summit) and a sub-sample of negative examples of regions labeled as 'U' (unbound). During the iterations, the initial negative examples are complemented with additional negatives obtaining large binding probabilities, i.e., putative false positive predictions. As these additional negative examples are derived from predictions of the current set of classifiers, the number of bins used for aggregation needs to be specified and should be identical to those used for predictions later. Training chromosomes and chromosomes used for predictions in the iterative training may be specified, as well as the percentile of the scores of positive (i.e., summit or bound regions) that should be used to identify putative false positives. The specified bin width must be identical to the bin width specified when computing the corresponding feature files. Feature vectors for training regions may span several adjacent bins as specified by the bin width parameter. Output is an XML file Classifiers.xml containing the set of trained classifiers. This output file together with a protocol of the tool run is saved to the specified output directory.

Iterative Training may be called with

java -jar Catchitt.jar itrain

and has the following parameters

name comment type

a Accessibility (File containing accessibility features) FILE
m Motif (File containing motif features), MAY BE USED MULTIPLE TIMES FILE
l Labels (File containing the labels) FILE
f FAI of genome (FastA index file of the genome) FILE
b Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50) INT
n Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5) INT
abb Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1) INT
aba Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4) INT
i Iterations (The number of iterations of the interative training, valid range = [1, 20], default = 5) INT
t Training chromosomes (Training chromosomes, separated by commas, OPTIONAL) STRING
itc Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL) STRING
p Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.01) DOUBLE
outdir The output directory, defaults to the current working directory (.) STRING
threads The number of threads used for the tool, defaults to 1 STRING

Example

java -jar Catchitt.jar itrain a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz l=labels/Labels.tsv.gz f=hg19.fa.fai b=50 n=5 abb=1 aba=4 i=5 t="chr1,chr2,chr3" itc="chr1,chr2" p=0.01 outdir=cls

Prediction

Prediction predicts binding probabilities of genomic regions as specified during training of the set of classifiers in iterative training. As input, Prediction requires a set of trained classifiers in XML format, the same (type of) feature files as used in training (motif files must be specified in the same order!). In addition, the chromosomes for which predictions are made may be specified, and the number of bins used for aggregation may be specified to deviate from those used during training. If these bin numbers are not specified, those from the training run are used. Finally, it is possible to restrict the number of classifiers considered to the first n ones. Output is provided as a gzipped file 'Predictions.tsv.gz' with columns chromosome, start position, binding probability. This output file together with a protocol of the tool run is saved to the specified output directory.

Prediction may be called with

java -jar Catchitt.jar predict

and has the following parameters

name comment type

c Classifiers (The classifiers trained by iterative training) FILE
a Accessibility (File containing accessibility features) FILE
m Motif (File containing motif features) MAY BE USED MULTIPLE TIMES FILE
f FAI of genome (FastA index file of the genome) FILE
p Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL) STRING
abb Aggregation: bins before (Number of bins before the current one considered for aggregation., OPTIONAL) INT
aba Aggregation: bins after (Number of bins after the current one considered for aggregation., OPTIONAL) INT
n Number of classifiers (Use only the first k classifiers for predictions., OPTIONAL) INT
outdir The output directory, defaults to the current working directory (.) STRING

Example

java -jar Catchitt.jar predict c=cls/Classifiers.xml a=dnase/Chromatin_accessibility.tsv.gz m=motif1/Motif_scores.tsv.gz m=motif2/Motif_scores.tsv.gz f=hg19.fa.fai p="chr8,chr21" abb=1 aba=4 n=3 outdir=predict

Standard pipeline

The standard Catchitt pipeline would comprise the following steps

  • for a training cell type, collect ChIP-seq peak files (preferably conservative and relaxed peaks) in narrowPeak format and derive labels for genomic regions (Derive labels)
  • for the same cell type, collect chromatin accessibility data (DNase-seq or ATAC-seq) as fold-enrichment tracks or mapping files, and derive chromatin accessibility features from those data (Chromatin accessibility)
  • collect or learn (e.g., using Dimont a set of motif models for the transcription factor of interest, and scan the genome using these motif models (Motif scores)
  • perform iterative training given the labels and feature files (Iterative Training)
  • predict binding probabilities of genomic regions in the same cell type or in other cell types. In the latter case, additional chromatin accessibility data for these target cell types need to be collected and features need to be derived as in step 2. (Prediction)


Tutorial using ENCODE data

We describe a typical Catchitt pipeline using public ENCODE data for the transcription factor CTCF in two cell lines. This tutorial uses real-world data on the whole ENCODE GRCh38 human genome version, illustrating different DNase-seq input formats and different motif sources. Please note that this realistic scenario also comes at the expense of real-world runtimes of the individual Catchitt steps.

For best performance, we would further recommend

  • to use multiple motifs from different sources, including motifs derived from DNase-seq (available in our motif collection of the ENCODE-DREAM challenge in directory de-novo/DNase-peaks
  • to use replicate information for DNase data, for instance using the pipeline of the Kundaje lab

In this tutorial, we concentrate on the Catchitt pipeline and illustrate its usage based on readily available data.

Obtaining training and test data

First, we need the GRCh38 genome version used by ENCODE. This genome is available as a gzipped FastA file from ENCODE at

https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz

After download, the genome needs to be gunzipped and indexed using the samtools faidx command:

gunzip GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz
samtools faidx GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta

In the following, we assume that genome FastA and index are in the base directory.

In addition, we need the DNase-seq data. We consider two cell lines ("astrocyte of the spinal cord" and "fibroblast of villous mesenchyme"). The corresponding DNase-seq data are available from ENCODE under accessions ENCSR000ENB and ENCSR000EOR, respectively. Here, we first consider the Bigwig files of the first replicate for each cell line, which can be downloaded from the following URLs:

https://www.encodeproject.org/files/ENCFF901UBX/@@download/ENCFF901UBX.bigWig
https://www.encodeproject.org/files/ENCFF652HJH/@@download/ENCFF652HJH.bigWig

For obtaining labels for CTCF binding, we further need ChIP-seq peaks. Here, we consider the ChIP-seq experiment with accession ENCSR000DSU for the astrocytes, which will become our training data in the following: The corresponding "conservative" and "relaxed" peak files for astrocytes are available from

https://www.encodeproject.org/files/ENCFF183YLB/@@download/ENCFF183YLB.bed.gz
https://www.encodeproject.org/files/ENCFF600CYD/@@download/ENCFF600CYD.bed.gz

Again, the peak files need to be gunzipped for the following steps.

Finally, we need a motif model for CTCF, which we download from HOCOMOCO in this case

http://hocomoco11.autosome.ru/final_bundle/hocomoco11/full/HUMAN/mono/pwm/CTCF_HUMAN.H11MO.0.A.pwm

We organize all these files (and the Catchitt JAR) in the following directory structure

.:
Catchitt.jar
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

./astrocytes:
ENCFF183YLB.bed
ENCFF600CYD.bed
ENCFF901UBX.bigWig

./fibroblasts:
ENCFF652HJH.bigWig

./motifs/CTCF/:
CTCF_HUMAN.H11MO.0.A.pwm

Deriving labels

As we use supervised training of model parameters, we need labels for the genomic regions, qualifying these as bound (B) or unbound (U). Besides, we have additional labels for bound regions at the peak summit (S) and ambiguous regions (A) that are (partly) covered by relaxed but not by conservative peaks.

For training purposes, we need to derive labels from the astrocyte ChIP-seq peaks by calling

java -jar Catchitt.jar labels c=astrocytes/ENCFF183YLB.bed\
    r=astrocytes/ENCFF600CYD.bed\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 rw=200 outdir=astrocytes/labels

Here, we use a bin width of 50 bp (i.e., we resolve any feature or binding event with 50 bp resolution) and a region width of 200 bp as used in ENCODE-DREAM. A detailed description of the partitioning of the genome into non-overlapping bins and the logic behind the regions for which prediction are made, may be found in the Catchitt paper. The result is a file astrocytes/labels/Labels.tsv.gz with the following format

chr1    0       U
chr1    50      U
chr1    100     U
chr1    150     U
chr1    200     U
chr1    250     U

where the columns contain chromosome, bin starting position, and corresponding label, and are separated by tabs.

Preparing DNase data from bigwig format

We further derive DNase-seq features from the bigwig file that we downloaded in the first step. Again, we specify a bin width of 50 bp.

java -jar Catchitt.jar access d="Bigwig" i=astrocytes/ENCFF901UBX.bigWig f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=astrocytes/access

The result is a file astrocytes/access/Chromatin_accessibility.tsv.gz with the following format

chr1    1033400 0.03954650089144707     0.05627769976854324     0.009126120246946812    0.030420400202274323    0.06692489981651306     1.03125 3.0     1.0     0.0
chr1    1033450 0.030420400202274323    0.03650449961423874     0.009126120246946812    0.030420400202274323    0.045630600303411484    1.03125 2.0     0.0     0.0
chr1    1033500 0.024336300790309906    0.03346240147948265     0.009126120246946812    0.030420400202274323    0.045630600303411484    1.03125 2.0     1.0     0.0
chr1    1033550 0.01825219951570034     0.024336300790309906    0.009126120246946812    0.024336300790309906    0.060840800404548645    1.03125 2.0     0.0     1.0

where the first two columns, again, correspond to chromosome and starting position, and the remaining columns are

  • minimum DNase value in bin,
  • median DNase value in bin,
  • minimum in 1000 bp after bin start,
  • minimum in 1000 bp before bin start,
  • maximum in 1000 bp after bin start,
  • maximum in 1000 bp before bin start,
  • the number of steps in the bin profile,
  • the length of the longest monotonically increasing range in the bin,
  • the length of the longest monotonically decreasing range in the bin.

Preparing motif scores

We also compute motif scores along the genome for the PWM we downloaded from HOCOMOCO:

java -jar Catchitt.jar motif m="HOCOMOCO" h=motifs/CTCF/CTCF_HUMAN.H11MO.0.A.pwm g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50 outdir=motifs/CTCF threads=3

The result is a file motifs/CTCF/Motif_scores.tsv.gz with the following format

chr1    46950   -4.996643       -4.9543528358429105
chr1    47000   -5.984124       -5.451674735652041
chr1    47050   -0.8633305      -0.4596223585537509
chr1    47100   -4.9379983      -4.813470561120627

where the first two columns, again, correspond to chromosome and starting position, and the remaining two columns are

  • the maximum motif score within the bin,
  • the logarithm of the exponentials of the individual scores with the bin; for scores that are log-likelihoods, this is proportional to the log-likelihood of the complete sequence.

Iterative training

With all the feature files prepared, we may now run the iterative training procedure. Here, we use all main chromosomes for training, use five of those chromosomes also for generating new negative examples in each of the iterations, and use 8 computation threads for the numeric optimization of model parameters. At this stage, it is critical that all feature files have been generated from the same reference. This way, we may sweep in parallel over all feature files that, at each line, represent the identical genomic location. Otherwise, the iterative training will throw an error stating that the chromosomes do not match at a certain line of the input files.

We start iterative training by calling

java -jar Catchitt.jar itrain a=astrocytes/access/Chromatin_accessibility.tsv.gz m=motifs/CTCF/Motif_scores.tsv.gz\
    l=astrocytes/labels/Labels.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr17,chr18,chr19,chr20,chr22'\
    itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain threads=8

which results in a file astrocytes/itrain/Classifiers.xml containing the trained classifiers.

Predicting binding in new cell types

Using the trained classifier from the previous step and the DNase data for fibroblasts prepared before, we may now predict binding in the fibroblast cell type. In the example, we generate predictions only for chromosome 8, which could be extended to other chromosomes using parameter "p":

java -jar Catchitt.jar predict c=astrocytes/itrain/Classifiers.xml a=fibroblasts/access/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    p="chr8" outdir=fibroblasts/predict

This finally results in a file fibroblasts/predict/Predictions.tsv.gz containing the predicted binding probabilities per region. This file has three columns, corresponding to chromosome, starting position, and binding probability:

chr8    265850  0.9866555574053496
chr8    265900  0.9865107771922306
chr8    265950  0.9864837006927715
chr8    266000  0.8041139249973046
chr8    266050  0.19870629729482686
chr8    266100  0.1302269536110939
chr8    266150  0.09693322015563202


Using DNase-seq BAM files and multiple motifs

Instead of bigwig files, the "access" tool of Catchitt also accepts BAM files of mapped DNase-seq (or ATAC-seq) data. Internally, this tool counts 5' ends of reads, and performs local normalization of read depth and average smoothing. Here, we download the BAM files corresponding to the previous bigwig files from ENCODE

https://www.encodeproject.org/files/ENCFF384CCQ/@@download/ENCFF384CCQ.bam
https://www.encodeproject.org/files/ENCFF368XNE/@@download/ENCFF368XNE.bam

and sort them into the directory structure.

In addition, we use four motifs from the used-for-all-TFs directory of our motif collection.

Afterwards, the directory structure should look like

.:
Catchitt.jar
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai

./astrocytes:
ENCFF183YLB.bed
ENCFF600CYD.bed
ENCFF901UBX.bigWig
ENCFF384CCQ.bam

./fibroblasts:
ENCFF652HJH.bigWig
ENCFF368XNE.bam

./motifs/CTCF/:
CTCF_HUMAN.H11MO.0.A.pwm

./motifs/CTCF_Slim:
Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml

./motifs/JUND_Slim:
Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml

./motifs/MAX_Slim:
Max_K562_shift20_bdeu_order-20_comp1-model-1.xml

./motifs/SP1:
ENCSR000BHK_SP1-human_1_hg19-model-2.xml


Now, we first compute the DNase-seq features from the BAM files using the "access" tool:

java -jar Catchitt.jar access i=astrocytes/ENCFF384CCQ.bam b=50 outdir=astrocytes/access_bam/
java -jar Catchitt.jar access i=fibroblasts/ENCFF368XNE.bam b=50 outdir=fibroblasts/access_bam/

We also compute the motif-based features from the additional motif files. For the PWM model of SP1, we switch the input format to Dimont XMLs but still use the low-memory version of "motif" that we also used for the HOCOMOCO PWM:

java -jar Catchitt.jar motif d=motifs/SP1/ENCSR000BHK_SP1-human_1_hg19-model-2.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/SP1 threads=3

The remaining motif models are Slim models, which are substantially more complex than PWMs. While scans for these models could be accomplished by the low-memory version of "motif" as well, this would require substantial runtime. Hence, we switch off the low-memory option in this case, which, in turn, requires to increase the memory reserved by Java:

java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/CTCF_Slim/Ctcf_H1hesc_shift20_bdeu_order-20_comp1-model-1.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/CTCF_Slim l=false threads=3
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/JUND_Slim/Jund_K562_shift20_bdeu_order-20_comp1-model-1.xml\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/JUND_Slim l=false threads=3
java -jar -Xms512M -Xmx64G Catchitt.jar motif d=motifs/MAX_Slim/Max_K562_shift20_bdeu_order-20_comp1-model-1.xml\\
    g=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai\
    b=50 outdir=motifs/MAX_Slim l=false threads=3

Finally, we start the iterative training using the new feature files:

java -jar Catchitt.jar itrain a=astrocytes/access_bam/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\
    m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz l=astrocytes/labels/Labels.tsv.gz\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai b=50\
    t='chr2,chr3,chr4,chr5,chr6,chr7,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr17,chr18,chr19,chr20,chr22'\
    itc='chr10,chr11,chr12,chr13,chr14' outdir=astrocytes/itrain_bam_5motifs threads=8

Please note that we used the parameter "m" multiple times to specify the different motif-based features files.

It is important to specify these motifs in the same order when calling the "predict" afterwards, i.e.

java -jar Catchitt.jar predict c=astrocytes/itrain_bam_5motifs/Classifiers.xml a=fibroblasts/access_bam/Chromatin_accessibility.tsv.gz\
    m=motifs/CTCF/Motif_scores.tsv.gz m=motifs/CTCF_Slim/Motif_scores.tsv.gz m=motifs/JUND_Slim/Motif_scores.tsv.gz\
    m=motifs/MAX_Slim/Motif_scores.tsv.gz m=motifs/SP1/Motif_scores.tsv.gz\
    f=GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai p="chr8" outdir=fibroblasts/predict_bam_5motifs

The predictions based on the BAM files and the five motifs are then available from the file fibroblasts/predict_bam_5motifs/Predictions.tsv.gz in the format explained previously.

Version history

  • Catchitt v0.1.4: Fixing a bug that lead to an Exception if chromosomes did not appear in lexicographical order in the input files
  • Catchitt v0.1.3: Bugfix to load Catchitt classifiers learned with older Catchitt versions
  • Catchitt v0.1.2: Bugfixes, new experimental tools for handling methylation levels