Second main course: Classifiers
Classifiers assign class labels to previously uncharacterized data. In Jstacs, the abstract class AbstractClassifier declares three important methods, among several others.
The first method trains a classifier, i.e., it adjusts the classifier to the training data:
The second method classifies a given Sequence:
To classify, for instance, the first sequence of a data set, we might use
The third method allows for assessing the performance. Typically this is done on test data
Here, params is a ParameterSet of performance measures (cf. subsection #Performance measures), exceptionIfNotComputeable indicates whether an exception should be thrown if a performance measure cannot be computed, and s is an array of data sets, where dimension i contains data of class i.
Similar to the classify method, for two-class problems the method
allows for computing the score differences between the foreground and the background class for all Sequences in the DataSet s. Each class score is typically the sum of the a-priori class log-score or log-probability and the score returned by getLogScoreFor of SequenceScore or getLogProbFor of StatisticalModel.
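The score difference described above can be illustrated with a small standalone sketch. It does not use the Jstacs API; the class name ScoreDifferenceSketch, the method names, and the convention that class 0 is the foreground are assumptions made for this illustration only.

```java
final class ScoreDifferenceSketch {

    /**
     * Score difference for one sequence: foreground class score minus
     * background class score, where each class score is the a-priori
     * class log-probability plus the log-score of the class model.
     */
    static double scoreDifference(double logPriorFg, double logScoreFg,
                                  double logPriorBg, double logScoreBg) {
        return (logPriorFg + logScoreFg) - (logPriorBg + logScoreBg);
    }

    /** Assumed convention: 0 = foreground, 1 = background. */
    static int classify(double scoreDifference) {
        return scoreDifference >= 0 ? 0 : 1;
    }

    public static void main(String[] args) {
        double logPrior = Math.log(0.5); // equal class priors
        double diff = scoreDifference(logPrior, -10.2, logPrior, -12.7);
        System.out.println("difference = " + diff + ", class = " + classify(diff));
    }
}
```

A non-negative difference favours the foreground class; thresholding the difference at other values trades sensitivity against specificity.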
Sometimes data is not split into test and training data, for instance because only a limited amount of data is available. In such cases, we recommend a repeated procedure that splits the data, trains on one part, and classifies the other part. In Jstacs, the abstract class ClassifierAssessment allows for implementing such procedures. In subsection #Assessment, we describe how to use ClassifierAssessment and its extensions.
First, however, we focus on classifiers. Any classifier in Jstacs is an extension of AbstractClassifier. In this section, we present two concrete implementations, namely TrainSMBasedClassifier (cf. subsection #TrainSMBasedClassifier) and GenDisMixClassifier (cf. subsection #GenDisMixClassifier).
To build a binary classifier that uses a PWM for each class, we first create a PWM that is a TrainableStatisticalModel.
Then we can use this instance to create the classifier using
We do not need to clone the PWM instance, as this is done internally for safety reasons. To build a classifier that distinguishes between N classes, we use the same constructor but provide N TrainableStatisticalModels.
If we train a TrainSMBasedClassifier, the train method of each internally used TrainableStatisticalModel is called. For classifying a sequence, the TrainSMBasedClassifier calls getLogProbFor of the internally used TrainableStatisticalModels and incorporates a class weight for each class.
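The principle behind this kind of classifier can be sketched in plain Java without Jstacs. The following illustration is an assumption-laden toy, not the Jstacs implementation: the class PwmClassifierSketch, its methods, the fixed DNA alphabet, and the pseudocount estimate are all hypothetical. It trains one PWM per class by counting and assigns a sequence to the class with the largest sum of log-probability and log class weight.

```java
import java.util.Arrays;

final class PwmClassifierSketch {
    private static final String ALPHABET = "ACGT"; // assumed: DNA only

    /** Estimates logP[position][symbol] from aligned sequences; pseudocount must be > 0. */
    static double[][] trainPwm(String[] seqs, int length, double pseudocount) {
        double[][] logP = new double[length][ALPHABET.length()];
        double total = seqs.length + pseudocount * ALPHABET.length();
        for (int pos = 0; pos < length; pos++) {
            double[] counts = new double[ALPHABET.length()];
            Arrays.fill(counts, pseudocount);
            for (String s : seqs) {
                counts[ALPHABET.indexOf(s.charAt(pos))]++;
            }
            for (int a = 0; a < counts.length; a++) {
                logP[pos][a] = Math.log(counts[a] / total);
            }
        }
        return logP;
    }

    /** Log-probability of a sequence: sum of the position-wise log-probabilities. */
    static double logProb(double[][] pwm, String seq) {
        double sum = 0;
        for (int pos = 0; pos < pwm.length; pos++) {
            sum += pwm[pos][ALPHABET.indexOf(seq.charAt(pos))];
        }
        return sum;
    }

    /** Returns the index of the class maximizing log class weight + log-probability. */
    static int classify(double[][][] models, double[] logClassWeights, String seq) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < models.length; c++) {
            double score = logClassWeights[c] + logProb(models[c], seq);
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

The same classify method works for N classes, mirroring the fact that only the number of provided models changes.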
The class GenDisMixClassifier implements a classifier that uses the unified generative-discriminative learning principle to train the internally used DifferentiableStatisticalModels. In analogy to the TrainSMBasedClassifier, the GenDisMixClassifier holds one DifferentiableStatisticalModel for each class.
To build a GenDisMixClassifier, we have to provide the parameters for this classifier:
This line of code generates a ParameterSet for a GenDisMixClassifier. It states the AlphabetContainer to be used, the sequence length, an indicator for the numerical algorithm that is used during training, an epsilon for stopping the numerical optimization, a line epsilon for stopping the line search within the numerical optimization, a start distance for the line search, a switch that indicates whether only the free parameters or all parameters should be used, an enum that indicates the kind of class parameter initialization, a switch that indicates whether normalization should be used during optimization, and the number of threads used during numerical optimization.
To build a binary classifier that uses a PWM for each class, we create a PWM that is a DifferentiableStatisticalModel.
Now, we are able to build a GenDisMixClassifier that uses the maximum likelihood learning principle.
In close analogy, we can build a GenDisMixClassifier that uses the maximum conditional likelihood learning principle, if we use
However, to use a Bayesian learning principle, we have to specify a prior that represents our prior knowledge. One of the most popular priors is the product Dirichlet prior. We can create an instance of this prior using
This class utilizes methods of DifferentiableStatisticalModel (cf. addGradientOfLogPriorTerm(double[], int)) to provide the correct prior.
Given a prior, we can build a GenDisMixClassifier using, for instance, the maximum supervised posterior learning principle:
Again in close analogy, we can build a GenDisMixClassifier that uses the maximum a-posteriori learning principle, if we use
Alternatively, we can build a GenDisMixClassifier that utilizes the unified generative-discriminative learning principle. To do so, we have to provide a weighting that sums to 1 and represents the weights for the conditional likelihood, the likelihood, and the prior.
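The role of such a weighting can be sketched as a weighted sum of three log-terms. The following standalone illustration does not call Jstacs; the class UnifiedObjectiveSketch and its members are hypothetical. The listed special-case weightings follow the published description of the unified principle as we understand it and should be checked against the Jstacs API documentation before relying on them.

```java
final class UnifiedObjectiveSketch {
    // Order of the weights as in the text: conditional likelihood, likelihood, prior.
    static final double[] ML  = {0.0, 1.0, 0.0}; // maximum likelihood
    static final double[] MCL = {1.0, 0.0, 0.0}; // maximum conditional likelihood
    static final double[] MAP = {0.0, 0.5, 0.5}; // maximum a-posteriori (assumed weighting)
    static final double[] MSP = {0.5, 0.0, 0.5}; // maximum supervised posterior (assumed weighting)

    /** Checks that there are three non-negative weights that sum to 1. */
    static void checkWeights(double[] beta) {
        if (beta.length != 3) {
            throw new IllegalArgumentException("expected exactly three weights");
        }
        double sum = 0;
        for (double b : beta) {
            if (b < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            sum += b;
        }
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
    }

    /** Weighted objective; terms with weight 0 are skipped entirely. */
    static double objective(double[] beta, double condLogLik, double logLik, double logPrior) {
        checkWeights(beta);
        double[] terms = {condLogLik, logLik, logPrior};
        double value = 0;
        for (int i = 0; i < 3; i++) {
            if (beta[i] != 0) {
                value += beta[i] * terms[i];
            }
        }
        return value;
    }
}
```

Weightings between these corner cases interpolate between generative and discriminative training, with the prior weight controlling the influence of prior knowledge.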
To assess the performance of any classifier, we use the method evaluate (see the beginning of this section). The first argument of this method is a PerformanceMeasureParameterSet that holds the performance measures to be computed. The simplest way to create an instance is
which yields an instance with all standard performance measures of Jstacs and the specified parameters. The first argument states that all performance measures should be included. If we changed this argument to true, only numerical performance measures would be included and the returned instance would be a NumericalPerformanceMeasureParameterSet. The other four arguments are parameters for some of the performance measures.
Another way of creating a PerformanceMeasureParameterSet is to directly use performance measures that extend the class AbstractPerformanceMeasure. For instance, to use the area under the curve (AUC) for the ROC and the PR curve, we create
Based on this array, we can create a PerformanceMeasureParameterSet that only contains these performance measures.
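To make clear what a measure such as the area under the ROC curve summarizes, the following standalone sketch computes it directly from classification scores. It is not the Jstacs implementation; the class AucRocSketch and its method are hypothetical. It uses the rank interpretation of the AUC: the fraction of foreground-background pairs in which the foreground sequence receives the higher score, with ties counted as one half. The area under the PR curve is not covered here.

```java
final class AucRocSketch {

    /**
     * Area under the ROC curve from the scores of foreground and background
     * sequences, computed by comparing all pairs (quadratic time, which is
     * acceptable for an illustration).
     */
    static double aucRoc(double[] fgScores, double[] bgScores) {
        if (fgScores.length == 0 || bgScores.length == 0) {
            throw new IllegalArgumentException("both classes need at least one score");
        }
        double wins = 0;
        for (double f : fgScores) {
            for (double b : bgScores) {
                if (f > b) {
                    wins += 1.0;
                } else if (f == b) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) fgScores.length * bgScores.length);
    }

    public static void main(String[] args) {
        double[] fg = {0.9, 0.8, 0.4};
        double[] bg = {0.7, 0.3, 0.2};
        System.out.println("AUC-ROC = " + aucRoc(fg, bg));
    }
}
```

A value of 1 indicates perfect separation of the two classes, whereas 0.5 corresponds to random guessing.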
To assess the performance of any classifier based on an array of data sets that is not split into test and training data, we have to use a repeated procedure. In Jstacs, ClassifierAssessment is the abstract super class of any such procedure. We have already implemented the most widely used procedures (cf. KFoldCrossValidation and RepeatedHoldOutExperiment).
Before performing a ClassifierAssessment, we have to define a set of numerical performance measures. The performance measures have to be numerical so that they can be averaged over the repeats. The simplest way to create such a set is
However, you can choose other measures as described in the previous subsection.
In this subsection, we show by example how to perform a k-fold cross validation in Jstacs. First, we have to create an instance of KFoldCrossValidation. There are several constructors to do so. Here, we use the constructor that takes AbstractClassifiers.
Second, we have to specify the parameters of the KFoldCrossValidation.
These parameters are the partition method, i.e., the way entries are counted during partitioning, the sequence length for the test data, a switch indicating whether an exception should be thrown if a performance measure cannot be computed (cf. evaluate in AbstractClassifier), and the number k of folds, which is also the number of repeats.
Now, we are able to perform a ClassifierAssessment just by calling the method
We print the result (cf. ListResult) of this assessment to standard out. To perform other ClassifierAssessments, for instance a RepeatedHoldOutExperiment, we have to use the corresponding specific ParameterSet (cf. KFoldCrossValidation and KFoldCrossValidationAssessParameterSet).
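The k-fold procedure itself can be sketched independently of Jstacs. The following illustration is a simplified assumption: the class KFoldSketch and its methods are hypothetical, it partitions by number of elements only, and it ignores class membership, whereas a real assessment partitions the data of each class. Each fold serves once as test data while the remaining folds form the training data, and a numerical measure is averaged over the k repeats.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

final class KFoldSketch {

    /** Randomly assigns the indices 0..n-1 to k folds whose sizes differ by at most one. */
    static List<List<Integer>> partition(int n, int k, long seed) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    /**
     * Runs k repeats; in each repeat one fold is the test set and the union of
     * the other folds is the training set. The evaluator receives (train, test)
     * and returns a numerical performance measure, which is averaged.
     */
    static double crossValidate(int n, int k, long seed,
                                ToDoubleBiFunction<List<Integer>, List<Integer>> evaluator) {
        List<List<Integer>> folds = partition(n, k, seed);
        double sum = 0;
        for (int f = 0; f < k; f++) {
            List<Integer> test = folds.get(f);
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < k; g++) {
                if (g != f) {
                    train.addAll(folds.get(g));
                }
            }
            sum += evaluator.applyAsDouble(train, test);
        }
        return sum / k;
    }
}
```

Averaging is only meaningful for numerical measures, which is why the assessment requires a set of numerical performance measures.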