Catchitt: Difference between revisions

Revision as of 11:50, 16 May 2018

Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays. The initial implementation of this methodology was one of the winning approaches of the ENCODE-DREAM challenge (https://www.synapse.org/#!Synapse:syn6131484/wiki/402026) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011, doi: 10.1101/230011). The implementation in Catchitt has been streamlined and slightly simplified to make it more straightforward to apply. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.

Catchitt comprises five tools for the individual steps of the pipeline. The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" peaks. The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads. The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from Dimont, including Slim models. The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features. The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The features may be measured either on the training cell type (e.g., on other chromosomes; the "within cell type" case) or on a different cell type.

We provide Catchitt as a pre-compiled JAR file and also publish its sources under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 or later) and the corresponding external libraries are required.
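As a rough illustration of how a call to one of the five tools might be assembled, the following Python sketch builds an argument list for the "labels" tool. Only the tool name and the parameter names (c, r, f, b, rw, outdir) are taken from this page. The JAR file name `Catchitt.jar`, the `key=value` argument syntax, and all input file names are assumptions and should be checked against the help output of the actual JAR.

```python
from typing import Dict, List


def build_command(jar: str, tool: str, params: Dict[str, object]) -> List[str]:
    """Return an argv list for one tool call, suitable for subprocess.run.

    ASSUMPTION: the JAR accepts "<tool> name=value ..." arguments; this
    syntax is not documented on this page.
    """
    args = ["java", "-jar", jar, tool]
    for name, value in params.items():
        if isinstance(value, bool):
            # Render booleans in lower case, e.g. l=false
            value = "true" if value else "false"
        args.append(f"{name}={value}")
    return args


cmd = build_command(
    "Catchitt.jar",  # hypothetical file name of the pre-compiled JAR
    "labels",
    {
        "c": "conservative.narrowPeak",  # hypothetical input files
        "r": "relaxed.narrowPeak",
        "f": "genome.fa.fai",
        "b": 50,
        "rw": 50,
        "outdir": "labels_out",
    },
)
print(" ".join(cmd))
```

The sketch only assembles the command; running it would require passing `cmd` to `subprocess.run` with the real JAR and input files in place.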

name	comment	type

c	Conservative peaks (NarrowPeak file containing the conservative peaks)	FILE
r	Relaxed peaks (NarrowPeak file containing the relaxed peaks)	FILE
f	FAI of genome (FastA index file of the genome)	FILE
b	Bin width (The width of the genomic bins considered, valid range = [1, 10000], default = 50)	INT
rw	Region width (The width of the genomic regions considered for overlaps, valid range = [1, 10000], default = 50)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING
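The parameters above suggest that the genome is tiled into bins of width b and that a region of width rw around each bin is compared against the two peak sets. A minimal Python sketch of that idea follows. The centering of regions on bins, the "any overlap" criterion, and the class names B, A and U are assumptions made for illustration; this page does not state how the tool actually assigns labels.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end)


def read_fai(lines: Iterable[str]) -> Dict[str, int]:
    """Parse FAI lines; the first two tab-separated columns are name and length."""
    lengths = {}
    for line in lines:
        if line.strip():
            fields = line.rstrip("\n").split("\t")
            lengths[fields[0]] = int(fields[1])
    return lengths


def regions(chrom_len: int, bin_width: int, region_width: int) -> Iterator[Interval]:
    """Yield one region per bin, centered on the bin and clipped to the chromosome."""
    for bin_start in range(0, chrom_len, bin_width):
        center = bin_start + bin_width // 2
        start = max(0, center - region_width // 2)
        end = min(chrom_len, start + region_width)
        yield start, end


def overlaps(region: Interval, peaks: List[Interval]) -> bool:
    return any(s < region[1] and region[0] < e for s, e in peaks)


def label(region: Interval, conservative: List[Interval], relaxed: List[Interval]) -> str:
    # ASSUMPTION: placeholder class names; B = overlaps a conservative peak,
    # A = overlaps only a relaxed peak, U = overlaps neither.
    if overlaps(region, conservative):
        return "B"
    if overlaps(region, relaxed):
        return "A"
    return "U"


fai = read_fai(["chr1\t200\t6\t60\t61\n"])
conservative = [(60, 90)]
relaxed = [(60, 90), (140, 160)]
labels = [label(r, conservative, relaxed) for r in regions(fai["chr1"], 50, 50)]
print(labels)
```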

name	comment	type

d	Data source (The format of the input file containing the coverage information, range={BAM/SAM, Bigwig}, default = BAM/SAM)
Parameters for selection "BAM/SAM":
i	Input SAM/BAM (The input file containing the mapped DNase-seq/ATAC-seq reads)	FILE
Parameters for selection "Bigwig":
i	Input Bigwig (The input file containing the mapped DNase-seq/ATAC-seq reads)	FILE
f	FastA index (The genome index)	FILE
b	Bin width (The width of the genomic bins considered)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING
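To make the notion of per-bin accessibility features concrete, the following Python sketch summarizes a per-base coverage signal in bins of a given width. The choice of minimum, median and maximum as summary statistics is an assumption for illustration only; the features that the tool actually computes are not listed on this page.

```python
from statistics import median
from typing import List, Sequence, Tuple


def bin_features(coverage: Sequence[float], bin_width: int) -> List[Tuple[float, float, float]]:
    """Return (min, median, max) of the coverage values in each bin.

    ASSUMPTION: these three statistics are placeholders, not the
    documented feature set. The last bin may be shorter than bin_width.
    """
    feats = []
    for start in range(0, len(coverage), bin_width):
        window = coverage[start:start + bin_width]
        feats.append((min(window), median(window), max(window)))
    return feats


coverage = [0, 0, 1, 3, 5, 5, 2, 0, 4]  # toy per-base coverage
features = bin_features(coverage, bin_width=4)
print(features)
```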

name	comment	type

m	Motif model (The motif model in Dimont, HOCOMOCO, or Jaspar format, range={Dimont, HOCOMOCO, Jaspar}, default = Dimont)
Parameters for selection "Dimont":
d	Dimont motif (Dimont motif model description)	FILE
Parameters for selection "HOCOMOCO":
h	HOCOMOCO PWM (PWM from the HOCOMOCO database)	FILE
Parameters for selection "Jaspar":
j	Jaspar PFM (PFM in Jaspar format)	FILE
g	Genome (Genome as FastA file)	FILE
f	FAI of genome (FastA index file of the genome)	FILE
b	Bin width (The width of the genomic bins considered)	INT
l	Low-memory mode (Use slower mode with a smaller memory footprint, default = false)	BOOLEAN
outdir	The output directory, defaults to the current working directory (.)	STRING
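One common way to derive motif-based features per bin is to convert a position frequency matrix (PFM) into log-probabilities and record the best-scoring window that starts in each bin. The Python sketch below shows this under several assumptions that are not taken from this page: a pseudocount of 1, forward-strand scanning only, an upper-case sequence, and the maximum as the per-bin statistic.

```python
import math
from typing import Dict, List


def pfm_to_log_pwm(pfm: Dict[str, List[float]], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Convert counts per base and position into log-probabilities."""
    pwm = []
    for i in range(len(pfm["A"])):
        total = sum(pfm[b][i] + pseudocount for b in "ACGT")
        pwm.append({b: math.log((pfm[b][i] + pseudocount) / total) for b in "ACGT"})
    return pwm


def max_score_per_bin(seq: str, pwm: List[Dict[str, float]], bin_width: int) -> List[float]:
    """Best forward-strand log-score among windows starting in each bin.

    ASSUMPTION: illustrative feature only. Bins without any scorable
    window keep the value -inf.
    """
    width = len(pwm)
    best = [float("-inf")] * math.ceil(len(seq) / bin_width)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(c not in "ACGT" for c in window):
            continue  # skip windows containing N or other symbols
        score = sum(pwm[i][c] for i, c in enumerate(window))
        b = start // bin_width
        best[b] = max(best[b], score)
    return best


# Toy PFM that prefers the word "ACG"
pfm = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
pwm = pfm_to_log_pwm(pfm)
scores = max_score_per_bin("TTACGTTTTTTT", pwm, bin_width=6)
print(scores)
```

In the toy sequence the motif occurrence starts in the first bin, so that bin receives the higher score.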

name	comment	type

a	Accessibility (File containing accessibility features)	FILE
m	Motif (File containing motif features)	FILE
l	Labels (File containing the labels)	FILE
f	FAI of genome (FastA index file of the genome)	FILE
b	Bin width (The width of the genomic bins, valid range = [1, 1000], default = 50)	INT
n	Number of bins (The number of adjacent bins, valid range = [1, 20], default = 5)	INT
abb	Aggregation: bins before (The number of bins before the current one considered in the aggregation, valid range = [1, 20], default = 1)	INT
aba	Aggregation: bins after (The number of bins after the current one considered in the aggregation, valid range = [1, 20], default = 4)	INT
i	Iterations (The number of iterations of the iterative training, valid range = [1, 20], default = 5)	INT
t	Training chromosomes (Training chromosomes, separated by commas, OPTIONAL)	STRING
itc	Iterative training chromosomes (Chromosomes with predictions in iterative training, separated by commas, OPTIONAL)	STRING
p	Percentile (Percentile of the prediction scores of positives used as threshold in iterative training, valid range = [0.0, 1.0], default = 0.99)	DOUBLE
outdir	The output directory, defaults to the current working directory (.)	STRING
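The percentile parameter p can be read as defining a score threshold from the prediction scores of the positive examples. The Python sketch below computes such a threshold with a nearest-rank percentile and then picks the negative examples that score at or above it. Both the percentile definition and the use of the threshold to select additional negatives for the next iteration are assumptions; this page only states that the percentile is used as a threshold in iterative training.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile of values for p in [0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def select_high_scoring_negatives(
    pos_scores: Sequence[float], neg_scores: Sequence[float], p: float = 0.99
) -> List[int]:
    """Indices of negatives scoring at or above the p-percentile of positives.

    ASSUMPTION: hypothetical selection rule for illustration only.
    """
    threshold = percentile(pos_scores, p)
    return [i for i, s in enumerate(neg_scores) if s >= threshold]


pos_scores = [0.2, 0.4, 0.6, 0.8, 0.9]
neg_scores = [0.1, 0.7, 0.6, 0.59]
selected = select_high_scoring_negatives(pos_scores, neg_scores, p=0.5)
print(selected)
```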

name	comment	type

c	Classifiers (The classifiers trained by iterative training)	FILE
a	Accessibility (File containing accessibility features)	FILE
m	Motif (File containing motif features)	FILE
f	FAI of genome (FastA index file of the genome)	FILE
p	Prediction chromosomes (Prediction chromosomes, separated by commas, OPTIONAL)	STRING
abb	Aggregation: bins before (Number of bins before the current one considered for aggregation, OPTIONAL)	INT
aba	Aggregation: bins after (Number of bins after the current one considered for aggregation, OPTIONAL)	INT
n	Number of classifiers (Use only the first n classifiers for predictions, OPTIONAL)	INT
outdir	The output directory, defaults to the current working directory (.)	STRING
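The two aggregation parameters describe a window of bins around the current one: abb bins before it and aba bins after it. A small Python sketch of such a windowed aggregation follows. The aggregation function (here the maximum) and the clipping of the window at the ends of the sequence of bins are assumptions, since this page does not say which statistic is aggregated.

```python
from typing import Callable, List, Sequence


def aggregate(
    values: Sequence[float],
    bins_before: int,
    bins_after: int,
    func: Callable[[Sequence[float]], float] = max,
) -> List[float]:
    """Apply func to the window [i - bins_before, i + bins_after] for each bin i.

    ASSUMPTION: max is a placeholder statistic; windows are clipped at
    the first and last bin.
    """
    out = []
    for i in range(len(values)):
        lo = max(0, i - bins_before)
        hi = min(len(values), i + bins_after + 1)
        out.append(func(values[lo:hi]))
    return out


per_bin = [1, 5, 2, 0, 3, 0]  # toy per-bin feature values
aggregated = aggregate(per_bin, bins_before=1, bins_after=4)
print(aggregated)
```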

@@ Line 1: / Line 1: @@
+Catchitt is a collection of tools for predicting cell type-specific binding regions of transcription factors (TFs) based on binding motifs and chromatin accessibility assays.
+The initial implementation of this methodology was one of the winning approaches of the ENCODE-DREAM challenge ([https://www.synapse.org/#!Synapse:syn6131484/wiki/402026]) and is described in a preprint (https://www.biorxiv.org/content/early/2017/12/06/230011, doi: 10.1101/230011).
+The implementation in Catchitt has been streamlined and slightly simplified to make it more straightforward to apply. Specifically, we reduced the set of chromatin accessibility features to the most important ones, we simplified the sampling strategy of initial negative examples in the training step, and we omitted quantile normalization of chromatin accessibility features.
+Catchitt comprises five tools for the individual steps of the pipeline. The tool "labels" computes labels for genomic regions from "conservative" (i.e., IDR-thresholded) and "relaxed" peaks.
+The tool "access" computes chromatin accessibility features from DNase-seq or ATAC-seq data, either based on fold-enrichment tracks in Bigwig format (e.g., MACS output) or based on SAM/BAM files of mapped reads.
+The tool "motif" computes motif-based features from genomic sequence and PWMs in Jaspar or HOCOMOCO format, or motif models from [[Dimont]], including [[Slim]] models.
+The tool "itrain" performs iterative training of a series of classifiers based on labels, chromatin accessibility features, and motif features.
+The tool "predict" predicts binding probabilities of genomic regions based on trained classifiers and feature files. The features may be measured either on the training cell type (e.g., on other chromosomes; the "within cell type" case) or on a different cell type.
+We provide Catchitt as a pre-compiled JAR file and also publish its sources under GPL 3. For compiling Catchitt from source files, Jstacs (v. 2.3 or later) and the corresponding external libraries are required.
 <table border=0 cellpadding=10 align="center">
 <tr>