PCTLearn
by Ralf Eggeling, Ivo Grosse, and Mikko Koivisto.
Description
Parsimonious context trees, PCTs, provide a sparse parameterization of conditional probability distributions, but learning them from data is computationally hard due to the combinatorial explosion of the space of model structures as the number of predictor variables grows. Here, we propose new algorithmic ideas, which can ignificantly expedite the standard dynamic programming algorithm. Specifically, we introduce a memoization technique, which exploits regularities within the predictor variables by equating different contexts associated with the same data subset, and a bound-and-prune technique, which exploits regularities within the response variable by pruning parts of the search space based on score upper bounds. The software PCTLearn is a lightweight Java application that implements these ideas and can be used to learn a single PCT of user-specified maximal depth from data.
Download
The application is available as a single runnable .jar PCTLearn.
Input
The application requires as input a plain text file, which can consist of basic Latin characters a-z and A-Z (case sensitive) and Arabic numerical 0-9. The number of different characters in the input file determines the alphabet size for PCT optimization. Each line in the input file is chopped into overlapping k-mers, where k to the desired maximal PCT depth + 1. The PCT is learned on these resulting k-mers with the convention that the last symbol in each k-mer denotes the response variable.
Running PCTLearn
The application has one mandatory and various optional arguments. A shorter list of arguments can be provided, in which case all missing arguments are considered to assume default values. Run with
java -jar PCTLearn.jar inputFile maximalDepth scoringFunction memoization pruning fineBound memoLimit lookaheadDepth
where the arguments have the following semantics:
name | type | default | comment |
inputFile | String | -- | The location of a text file containing the input data. |
maximalDepth | Integer | 2 | The maximal depth of the learned PCT. |
scoringFunction | String | BIC | The used scoring function. Permitted values are "BIC" and "AIC". |
memoization | Boolean | TRUE | Enabling memoization. |
pruning | Boolean | TRUE | Enabling pruning. |
fineBound | Boolean | TRUE | Use fine upper bound instead of coarse. Is ignored if pruning is set to FALSE. |
memoLimit | Integer | 1 | Memoization limit that stops storing subtrees width given distance from the leaves. Is ignored if memoization is set to FALSE. |
lookaheadDepth | Integer | 1 | The used lookahead depth. Is ignored if pruning is set to FALSE. |
Output
The tool writes some statistics about the optimization, such optimal score, number of visited node, and required running time to stdout.
It addition it creates (i) a graphViz file of the learned PCT structure and (ii) a file with conditional probability parameters (MLE) for each leaf.