Catchitt: Difference between revisions
Revision as of 12:21, 16 May 2018
Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays. The initial implementation of this methodology was one of the winning approaches of the ENCODE-DREAM challenge ([1]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011 doi: 10.1101/230011). The implementation in Catchitt has been streamlined and slightly simplified to make its application more straightforward. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.
Catchitt tools
Catchitt comprises five tools for the individual steps of the pipeline. The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" ChIP-seq peaks. The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads. The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from Dimont, including Slim models. The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features. The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The feature files may either be measured on the training cell type (e.g., other chromosomes, "within cell type" case) or on a different cell type.
Availability
We provide Catchitt as a pre-compiled JAR file and also publish its sources under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 and later) and the corresponding external libraries are required.
- JAR download
- Source download and Jstacs Downloads
Usage
Catchitt can be started by calling
java -jar Catchitt.jar
on the command line. This lists the names of the available tools with a short description:
Available tools:
access - Chromatin accessibility
motif - Motif scores
labels - Derive labels
itrain - Iterative Training
predict - Prediction
Syntax: java -jar EncodeDREAM.jar <toolname> [<parameter=value> ...]
Further info about the tools is given with java -jar EncodeDREAM.jar <toolname> info
Tool parameters are listed with java -jar EncodeDREAM.jar <toolname>
Tools
Derive labels
Derive labels computes labels for genomic regions based on ChIP-seq peak files. The input ChIP-seq peak files must be provided in narrowPeak format and may come in 'conservative', i.e., IDR-thresholded, and 'relaxed' flavors. In case only a single peak file is available, both of the corresponding parameters may be set to this one peak file. The parameter for the bin width defines the resolution of the genomic regions that are assigned a label, while the parameter for the region width defines the size of the regions considered. If, for instance, the bin width is set to 50 and the region width to 100, regions of 100 bp shifted by 50 bp along the genome are labeled. The labels assigned may be 'S' (summit) if the current bin contains the annotated summit of a conservative peak, 'B' (bound) if the current region overlaps a conservative peak by at least half the region width, 'A' (ambiguous) if the current region overlaps a relaxed peak by at least 1 bp, or 'U' (unbound) if it overlaps with none of the peaks. The output is provided as a gzipped file 'Labels.tsv.gz' with columns chromosome, start position, and label.
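The labeling rule described above can be sketched in a few lines of Python. This is only an illustration and not the Catchitt implementation: the function names, the half-open coordinates, the assumption that a region starts at the start of its bin, and the precedence S before B before A before U are all assumptions made for this sketch. Peaks are given as (start, end, summit) with an absolute summit position; in a narrowPeak file the summit is stored as an offset from the peak start.

```python
from typing import List, Tuple

# Hypothetical peak representation: (start, end, summit), 0-based, half-open,
# with the summit as an absolute position (narrowPeak start + summit offset).
Peak = Tuple[int, int, int]


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def label_bin(bin_start: int, bin_width: int, region_width: int,
              conservative: List[Peak], relaxed: List[Peak]) -> str:
    """Assign 'S', 'B', 'A' or 'U' to one bin on a single chromosome."""
    bin_end = bin_start + bin_width
    # Assumption: the region considered for overlaps starts at the bin start.
    region_start, region_end = bin_start, bin_start + region_width

    # 'S': the bin contains the summit of a conservative peak.
    if any(bin_start <= summit < bin_end for _, _, summit in conservative):
        return "S"
    # 'B': the region overlaps a conservative peak by at least half its width.
    if any(2 * overlap(region_start, region_end, s, e) >= region_width
           for s, e, _ in conservative):
        return "B"
    # 'A': the region overlaps a relaxed peak by at least 1 bp.
    if any(overlap(region_start, region_end, s, e) >= 1 for s, e, _ in relaxed):
        return "A"
    # 'U': no overlap with any peak.
    return "U"


if __name__ == "__main__":
    conservative = [(1000, 1200, 1080)]
    relaxed = [(900, 1300, 1080)]
    for start in range(900, 1351, 50):
        print(start, label_bin(start, 50, 100, conservative, relaxed))
```

With a bin width of 50 and a region width of 100, as in the example in the text, consecutive calls advance `bin_start` by 50 while each region spans 100 bp.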
Derive labels may be called with
java -jar Catchitt.jar labels
and has the following parameters
name | comment | type |
c | Conservative peaks (NarrowPeak file containing the conservative peaks) | FILE |
r | Relaxed peaks (NarrowPeak file containing the relaxed peaks) | FILE |
f | FAI of genome (FastA index file of the genome) | FILE |
b | Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50) | INT |
rw | Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50) | INT |
outdir | The output directory, defaults to the current working directory (.) | STRING |
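As an illustration of the `<parameter=value>` syntax printed by the tool listing, the following Python sketch assembles a command line for the "labels" tool and checks the two integer ranges given in the table above. The helper name `build_labels_command` and all file names are hypothetical; only the parameter names, the ranges, and the general call syntax are taken from this page.

```python
from typing import Dict, List

# Valid ranges for the integer parameters, as listed in the parameter table.
INT_RANGES = {"b": (1, 10000), "rw": (1, 10000)}


def build_labels_command(params: Dict[str, object],
                         jar: str = "Catchitt.jar") -> List[str]:
    """Build the argument list for 'java -jar <jar> labels key=value ...'."""
    for key, (low, high) in INT_RANGES.items():
        if key in params and not low <= int(params[key]) <= high:
            raise ValueError(f"{key}={params[key]} is outside [{low}, {high}]")
    return ["java", "-jar", jar, "labels"] + [f"{k}={v}" for k, v in params.items()]


if __name__ == "__main__":
    # Hypothetical input files; replace with real paths.
    command = build_labels_command({
        "c": "conservative.narrowPeak",
        "r": "relaxed.narrowPeak",
        "f": "genome.fa.fai",
        "b": 50,
        "rw": 100,
        "outdir": "labels_out",
    })
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the JAR and the input files are in place.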
Chromatin accessibility
Chromatin accessibility may be called with
java -jar Catchitt.jar access
and has the following parameters
name | comment | type |
d | Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM) |
b | Bin width (The width of the genomic bins considered) | INT |
outdir | The output directory, defaults to the current working directory (.) | STRING |
Motif scores
Motif scores may be called with
java -jar Catchitt.jar motif
and has the following parameters
name | comment | type |
m | Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont) |
g | Genome (Genome as FastA file) | FILE |
f | FAI of genome (FastA index file of the genome) | FILE |
b | Bin width (The width of the genomic bins considered) | INT |
l | Low-memory mode (Use slower mode with a smaller memory footprint, default = false) | BOOLEAN |
outdir | The output directory, defaults to the current working directory (.) | STRING |
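Iterative Training
Iterative Training may be called with
java -jar Catchitt.jar itrain
and has the following parameters
name | comment | type |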
a | Accessibility (File containing accessibility features) | FILE |
m | Motif (File containing motif features) | FILE |
l | Labels (File containing the labels) | FILE |
f | FAI of genome (FastA index file of the genome) | FILE |
b | Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50) | INT |
n | Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5) | INT |
abb | Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1) | INT |
aba | Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4) | INT |
i | Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5) | INT |
t | Training chromosomes (Training chromosomes, separated by commas, OPTIONAL) | STRING |
itc | Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL) | STRING |
p | Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.99) | DOUBLE |
outdir | The output directory, defaults to the current working directory (.) | STRING |
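The parameters abb and aba select a window of bins around the current bin. The sketch below shows which bin indices such a window would cover under the defaults from the table (1 bin before, 4 bins after). It is an assumption-laden illustration: the function name is hypothetical, clipping the window at the chromosome ends is an assumption, and how Catchitt actually aggregates feature values over the window is not described on this page.

```python
from typing import List


def aggregation_window(current: int, n_bins: int,
                       before: int = 1, after: int = 4) -> List[int]:
    """Indices of the bins from `before` bins upstream to `after` bins downstream.

    Assumption: the window is clipped at the first and last bin of the chromosome.
    """
    first = max(0, current - before)
    last = min(n_bins - 1, current + after)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(aggregation_window(10, 100))  # window around bin 10 with abb=1, aba=4
    print(aggregation_window(0, 3))     # window clipped at both chromosome ends
```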
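Prediction
Prediction may be called with
java -jar Catchitt.jar predict
and has the following parameters
name | comment | type |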
c | Classifiers (The classifiers trained by iterative training) | FILE |
a | Accessibility (File containing accessibility features) | FILE |
m | Motif (File containing motif features) | FILE |
f | FAI of genome (FastA index file of the genome) | FILE |
p | Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL) | STRING |
abb | Aggregation: bins before (Number of bins before the current one considered for aggregation, OPTIONAL) | INT |
aba | Aggregation: bins after (Number of bins after the current one considered for aggregation, OPTIONAL) | INT |
n | Number of classifiers (Use only the first k classifiers for predictions, OPTIONAL) | INT |
outdir | The output directory, defaults to the current working directory (.) | STRING |