= Second main course: Classifiers =
<span id="classifier"> </span>
Classifiers allow us to classify, i.e., label, previously uncharacterized data. In Jstacs, we provide the abstract class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractClassifier.html AbstractClassifier] that declares three important methods besides several others.
__TOC__
The first method trains a classifier, i.e., it adjusts the classifier to the training data:
<source lang="java5" enclose="div">
public void train( DataSet... s ) throws Exception {
</source>
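Neither <code>cl</code> nor <code>data</code> are defined by the snippets in this section; as a minimal sketch, assuming DNA sequences in two FastA files (the file names are hypothetical) and any concrete classifier <code>cl</code> as constructed later in this section, training reduces to a single call. This sketch also defines the <code>alphabet</code> used in later snippets:
<source lang="java5" enclose="div">
// running example: one data set per class, read from (hypothetical) FastA files
AlphabetContainer alphabet = DNAAlphabetContainer.SINGLETON;
DataSet[] data = { new DNADataSet( "foreground.fa" ), new DNADataSet( "background.fa" ) };
// adjusts the classifier to the training data of all classes
cl.train( data );
</source>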
The second method classifies a given [http://www.jstacs.de/api-2.0//de/jstacs/data/sequences/Sequence.html Sequence]:
<source lang="java5" enclose="div">
public abstract byte classify( Sequence seq ) throws Exception;
</source>
If we want to classify, for instance, the first sequence of a data set, we might use
<source lang="java5" enclose="div">
System.out.println( cl.classify( data[0].getElementAt(0) ) );
</source>
In addition to this method, another method <code>classify(DataSet)</code> exists that performs a classification for all [http://www.jstacs.de/api-2.0//de/jstacs/data/sequences/Sequence.html Sequence]s in a [http://www.jstacs.de/api-2.0//de/jstacs/data/DataSet.html DataSet].
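For instance, the predicted classes for all sequences of the foreground data set can be obtained in one call (a sketch, reusing <code>cl</code> and <code>data</code> from above and assuming the array-returning variant of <code>classify</code>):
<source lang="java5" enclose="div">
// one predicted class (as a byte) per sequence in data[0]
byte[] predictions = cl.classify( data[0] );
</source>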
The third method allows for assessing the performance of a classifier. Typically, this is done on test data:
<source lang="java5" enclose="div">
public final ResultSet evaluate( PerformanceMeasureParameterSet params, boolean exceptionIfNotComputeable, DataSet... s ) throws Exception {
</source>
where <code>params</code> is a [http://www.jstacs.de/api-2.0//de/jstacs/parameters/ParameterSet.html ParameterSet] of performance measures (cf. subsection [[#Performance measures]]), <code>exceptionIfNotComputeable</code> indicates whether an exception should be thrown if a performance measure could not be computed, and <code>s</code> is an array of data sets, where dimension <code>i</code> contains the data of class <code>i</code>.
The abstract sub-class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractScoreBasedClassifier.html AbstractScoreBasedClassifier] of [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractClassifier.html AbstractClassifier] adds a method for computing a joint score for an input [http://www.jstacs.de/api-2.0//de/jstacs/data/sequences/Sequence.html Sequence] and a given class:
<source lang="java5" enclose="div">
public double getScore( Sequence seq, int i ) throws Exception {
</source>
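A small usage sketch, assuming <code>cl</code> is declared as (or cast to) an [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractScoreBasedClassifier.html AbstractScoreBasedClassifier], such as the two classifiers constructed below:
<source lang="java5" enclose="div">
Sequence first = data[0].getElementAt( 0 );
// joint scores of the first foreground sequence and each of the two classes;
// a score-based classifier predicts the class with the maximum score
double score0 = cl.getScore( first, 0 );
double score1 = cl.getScore( first, 1 );
</source>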
In analogy to the <code>classify</code> method, this score can also be computed for complete data sets. For two-class problems, the method
<source lang="java5" enclose="div">
public double[] getScores( DataSet s ) throws Exception {
</source>
allows for computing the score differences between foreground and background class for all [http://www.jstacs.de/api-2.0//de/jstacs/data/sequences/Sequence.html Sequence]s in the [http://www.jstacs.de/api-2.0//de/jstacs/data/DataSet.html DataSet] <code>s</code>. Such scores are typically the sum of the a-priori class log-score or log-probability and the score returned by <code>getLogScore</code> of [http://www.jstacs.de/api-2.0//de/jstacs/sequenceScores/SequenceScore.html SequenceScore] or <code>getLogProb</code> of [http://www.jstacs.de/api-2.0//de/jstacs/sequenceScores/statisticalModels/StatisticalModel.html StatisticalModel].
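Under the same assumptions (two classes, a score-based classifier <code>cl</code>), computing the score differences for a complete data set is a one-liner:
<source lang="java5" enclose="div">
// score difference between foreground and background class for each sequence in data[0]
double[] scoreDifferences = cl.getScores( data[0] );
</source>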
Sometimes data is not split into test and training data, for diverse reasons such as a limited amount of data. In such cases, it is recommended to utilize some repeated procedure that splits the data, trains on one part, and classifies the other part. In Jstacs, we provide the abstract class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/ClassifierAssessment.html ClassifierAssessment] that allows us to implement such procedures. In subsection [[#Assessment]], we describe how to use [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/ClassifierAssessment.html ClassifierAssessment] and its extensions.

But first, we focus on classifiers. Any classifier in Jstacs is an extension of the [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractClassifier.html AbstractClassifier]. In this section, we present two concrete implementations, namely [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/trainSMBased/TrainSMBasedClassifier.html TrainSMBasedClassifier] (cf. subsection [[#TrainSMBasedClassifier]]) and [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/differentiableSequenceScoreBased/gendismix/GenDisMixClassifier.html GenDisMixClassifier] (cf. subsection [[#GenDisMixClassifier]]).
== TrainSMBasedClassifier ==
The class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/trainSMBased/TrainSMBasedClassifier.html TrainSMBasedClassifier] implements a classifier on TrainableStatisticalModels, i.e., for each class, the classifier holds one TrainableStatisticalModel.

If we like to build a binary classifier using PWMs for each class, we first create a PWM, which is a TrainableStatisticalModel:
<source lang="java5" enclose="div">
TrainableStatisticalModel pwm = TrainableStatisticalModelFactory.createPWM( alphabet, 10, 4.0 );
</source>
Then we can use this instance to create the classifier:
<source lang="java5" enclose="div">
AbstractClassifier cl = new TrainSMBasedClassifier( pwm, pwm );
</source>
Thereby, we do not need to clone the PWM instance, as this is done internally for safety reasons. If we like to build a classifier that distinguishes between [math]N[/math] classes, we use the same constructor but provide [math]N[/math] TrainableStatisticalModels.

If we train a TrainSMBasedClassifier, the <code>train</code> method of each internally used TrainableStatisticalModel is called. For classifying a sequence, the TrainSMBasedClassifier calls <code>getLogProbFor</code> of the internally used TrainableStatisticalModels and incorporates some class weight, as sketched below.
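A minimal end-to-end sketch for this classifier, reusing the running-example objects defined at the beginning of this section:
<source lang="java5" enclose="div">
// calls train of the internal PWM of each class on the corresponding data set
cl.train( data );
// evaluates getLogProbFor of both PWMs (plus the class weights) to pick a class
System.out.println( cl.classify( data[0].getElementAt( 0 ) ) );
</source>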
== GenDisMixClassifier ==

The class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/differentiableSequenceScoreBased/gendismix/GenDisMixClassifier.html GenDisMixClassifier] implements a classifier using the unified generative-discriminative learning principle to train the internally used DifferentiableStatisticalModels. In analogy to the [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/trainSMBased/TrainSMBasedClassifier.html TrainSMBasedClassifier], the GenDisMixClassifier holds one DifferentiableStatisticalModel for each class.

If we like to build a GenDisMixClassifier, we have to provide the parameters for this classifier:
<source lang="java5" enclose="div">
GenDisMixClassifierParameterSet ps = new GenDisMixClassifierParameterSet( alphabet, 10, (byte) 10, 1E-6, 1E-9, 1, false, KindOfParameter.PLUGIN, true, 2 );
</source>
This line of code generates a ParameterSet for a GenDisMixClassifier. It states
* the used AlphabetContainer,
* the sequence length,
* an indicator for the numerical algorithm that is used during training,
* an epsilon for stopping the numerical optimization,
* a line epsilon for stopping the line search within the numerical optimization,
* a start distance for the line search,
* a switch that indicates whether the free or all parameters should be used,
* an enum that indicates the kind of class parameter initialization,
* a switch that indicates whether normalization should be used during optimization,
* and the number of threads used during numerical optimization.

If we like to build a binary classifier using PWMs for each class, we create a PWM that is a DifferentiableStatisticalModel:
<source lang="java5" enclose="div">
DifferentiableStatisticalModel pwm2 = new BayesianNetworkDiffSM( alphabet, 10, 4.0, true, new InhomogeneousMarkov(0) );
</source>
Now, we are able to build a GenDisMixClassifier that uses the maximum likelihood learning principle:
<source lang="java5" enclose="div">
cl = new GenDisMixClassifier( ps, DoesNothingLogPrior.defaultInstance, LearningPrinciple.ML, pwm2, pwm2 );
</source>
In close analogy, we can build a GenDisMixClassifier that uses the maximum conditional likelihood learning principle if we use <code>LearningPrinciple.MCL</code>.

However, if we like to use a Bayesian learning principle, we have to specify a prior that represents our prior knowledge. One of the most popular priors is the product-Dirichlet prior. We can create an instance of this prior using
<source lang="java5" enclose="div">
LogPrior prior = new CompositeLogPrior();
</source>
This class utilizes the methods <code>getLogPriorTerm()</code> and <code>addGradientOfLogPriorTerm(double[], int)</code> of DifferentiableStatisticalModel to provide the correct prior.

Given a prior, we can build a GenDisMixClassifier using, for instance, the maximum supervised posterior learning principle:
<source lang="java5" enclose="div">
cl = new GenDisMixClassifier( ps, prior, LearningPrinciple.MSP, pwm2, pwm2 );
</source>
Again in close analogy, we can build a GenDisMixClassifier that uses the maximum a-posteriori learning principle if we use <code>LearningPrinciple.MAP</code>.

Alternatively, we can build a GenDisMixClassifier that utilizes the unified generative-discriminative learning principle. In this case, we have to provide a weighting that sums to 1 and represents the weights of the conditional likelihood, the likelihood, and the prior:
<source lang="java5" enclose="div">
cl = new GenDisMixClassifier( ps, prior, new double[]{0.4,0.1,0.5}, pwm2, pwm2 );
</source>
== Performance measures ==

If we like to assess the performance of any classifier, we have to use the method <code>evaluate</code> (see the beginning of this section). The first argument of this method is a PerformanceMeasureParameterSet that holds the performance measures to be computed. The most simple way to create an instance is
<source lang="java5" enclose="div">
PerformanceMeasureParameterSet measures = PerformanceMeasureParameterSet.createFilledParameters( false, 0.999, 0.95, 0.95, 1 );
</source>
which yields an instance with all standard performance measures of Jstacs and the specified parameters. The first argument states that all performance measures should be included. If we changed this argument to <code>true</code>, only numerical performance measures would be included and the returned instance would be a NumericalPerformanceMeasureParameterSet. The other four arguments are parameters of some of the performance measures.

Another way of creating a PerformanceMeasureParameterSet is to directly use performance measures extending the class AbstractPerformanceMeasure. For instance, if we like to use the area under the curve (AUC) for the ROC and PR curves, we create
<source lang="java5" enclose="div">
AbstractPerformanceMeasure[] m = {new AucROC(), new AucPR()};
</source>
Based on this array, we can create a PerformanceMeasureParameterSet that contains only these performance measures:
<source lang="java5" enclose="div">
measures = new PerformanceMeasureParameterSet( m );
</source>
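With such a parameter set at hand, assessing a trained classifier on dedicated test data reduces to a single call of <code>evaluate</code>; a sketch, assuming hypothetical test data sets <code>test[0]</code> and <code>test[1]</code> for foreground and background:
<source lang="java5" enclose="div">
// computes AUC-ROC and AUC-PR on the test data and returns them as a ResultSet
ResultSet rs = cl.evaluate( measures, true, test[0], test[1] );
System.out.println( rs );
</source>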
== Assessment ==

If we like to assess the performance of a classifier based on an array of data sets that is not split into test and training data, we have to use some repeated procedure. In Jstacs, we provide the class [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/ClassifierAssessment.html ClassifierAssessment], which is the abstract super class of any such procedure. We have already implemented the most widely used procedures (cf. [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/KFoldCrossValidation.html KFoldCrossValidation] and [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/RepeatedHoldOutExperiment.html RepeatedHoldOutExperiment]).

Before performing a ClassifierAssessment, we have to define a set of numerical performance measures. The performance measures have to be numerical to allow for averaging over the repetitions. The most simple way to create such a set is
<source lang="java5" enclose="div">
NumericalPerformanceMeasureParameterSet numMeasures = PerformanceMeasureParameterSet.createFilledParameters();
</source>
However, you can choose other measures as described in the previous subsection.

In this subsection, we exemplarily present how to perform a k-fold cross validation in Jstacs. First, we have to create an instance of [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/KFoldCrossValidation.html KFoldCrossValidation]. There are several constructors to do so. Here, we use the constructor that takes [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractClassifier.html AbstractClassifier]s:
<source lang="java5" enclose="div">
ClassifierAssessment assessment = new KFoldCrossValidation( cl );
</source>
Second, we have to specify the parameters of the KFoldCrossValidation:
<source lang="java5" enclose="div">
KFoldCrossValidationAssessParameterSet params = new KFoldCrossValidationAssessParameterSet( PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, cl.getLength(), true, 10 );
</source>
These parameters are
* the partition method, i.e., the way how entries are counted during partitioning,
* the sequence length for the test data,
* a switch indicating whether an exception should be thrown if a performance measure could not be computed (cf. <code>evaluate</code> in [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/AbstractClassifier.html AbstractClassifier]),
* and the number of repeats [math]k[/math].

Now, we are able to perform a ClassifierAssessment just by calling the method <code>assess</code>:
<source lang="java5" enclose="div">
System.out.println( assessment.assess( numMeasures, params, data ) );
</source>
We print the result (cf. [http://www.jstacs.de/api-2.0//de/jstacs/results/ListResult.html ListResult]) of this assessment to standard out. If we like to perform other [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/ClassifierAssessment.html ClassifierAssessment]s, as for instance a [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/RepeatedHoldOutExperiment.html RepeatedHoldOutExperiment], we have to use the specific [http://www.jstacs.de/api-2.0//de/jstacs/parameters/ParameterSet.html ParameterSet] of that procedure (cf. [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/KFoldCrossValidation.html KFoldCrossValidation] and [http://www.jstacs.de/api-2.0//de/jstacs/classifiers/assessment/KFoldCrossValidationAssessParameterSet.html KFoldCrossValidationAssessParameterSet]).