First main course: SequenceScores
SequenceScore s define scoring functions over sequences that can assign a score to each input sequence. Such scores can then be used to classify sequences by choosing the class represented by the SequenceScore yielding the maximum score.
Sub-interfaces of SequenceScore are DifferentiableSequenceScore, which is tailored to numerical optimization, and StatisticalModel, which represents statistical models defining a proper likelihood. StatisticalModel, in turn, has two sub-interfaces, TrainableStatisticalModel and DifferentiableStatisticalModel, where the latter joins the properties of StatisticalModel s and DifferentiableSequenceScore s. All of these will be explained in the sub-sections of this section.
SequenceScore extends the standard interface Cloneable and, hence, implementations must provide a clone() method, which returns a deep copy of all fields of an instance:
<source lang="java5" enclose="div">
public SequenceScore clone() throws CloneNotSupportedException;
</source>
Since the implementation of the clone method is very score-specific, it must be implemented anew for each implementation of the SequenceScore interface. Besides that, SequenceScore also extends Storable, which demands the implementation of the toXML() method and a constructor working on a StringBuffer containing an XML-representation (see section Intermediate course: XMLParser, Parameters, and Results).
SequenceScore also defines some methods that are basically getters for typical properties of a SequenceScore implementation. These are
<source lang="java5" enclose="div">
public AlphabetContainer getAlphabetContainer();
</source>
which returns the current AlphabetContainer of the model, i.e., the AlphabetContainer of Sequence s this SequenceScore can score,
<source lang="java5" enclose="div">
public String getInstanceName();
</source>
which returns a (distinguishing) name of the current SequenceScore instance, and
<source lang="java5" enclose="div">
public int getLength();
</source>
which returns the length of the SequenceScore, i.e., the possible length of input sequences. This length is defined to be zero if the score can handle Sequence s of variable length.
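For instance, these getters may be used to print a short summary of a score, as in this hedged sketch; score is an assumed, readily created SequenceScore instance:
<source lang="java5" enclose="div">
// Hedged sketch: inspect the basic properties of a SequenceScore "score".
System.out.println( score.getInstanceName()
	+ ": length " + score.getLength()
	+ ", alphabet " + score.getAlphabetContainer() );
</source>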
The methods
<source lang="java5" enclose="div">
public ResultSet getCharacteristics() throws Exception;
</source>
and
<source lang="java5" enclose="div">
public NumericalResultSet getNumericalCharacteristics() throws Exception;
</source>
can be used to return some properties of a SequenceScore like the number of parameters, the depth of some tree structure, or whatever seems useful. In the latter case, these characteristics are limited to numerical values. If a model is used, e.g., in a cross validation (KFoldCrossValidation), these numerical properties are averaged over all cross validation iterations and displayed together with the performance measures.
The scoring of input Sequence s is performed by three methods named getLogScore. The name of these methods reflects that the returned score should be on a logarithmic scale. For instance, for a statistical model with a proper likelihood over the sequences, this score would correspond to the log-likelihood.
The first method
<source lang="java5" enclose="div">
public double getLogScoreFor(Sequence seq);
</source>
returns the score for the given Sequence. For reasons of efficiency, it is not intended to check if the length or the AlphabetContainer of the Sequence match those of the SequenceScore. Hence, no exception is declared for this method. So if this method is used (instead of the more stringent methods of StatisticalModel, see #StatisticalModels), this should be checked externally.
The second method is
<source lang="java5" enclose="div">
public double getLogScoreFor(Sequence seq, int start);
</source>
which returns the score for the sub-sequence starting at position start and extending to the end of the Sequence or the end of the SequenceScore, whichever comes first.
The third method is
<source lang="java5" enclose="div">
public double getLogScoreFor(Sequence seq, int start, int end) throws Exception;
</source>
and computes the score for the sub-sequence between start and end, where the value at position end is the last value considered for the computation of the score.
Two additional methods
<source lang="java5" enclose="div">
public double[] getLogScoreFor(DataSet data) throws Exception;
</source>
and
<source lang="java5" enclose="div">
public void getLogScoreFor(DataSet data, double[] res) throws Exception;
</source>
allow for the computation of such scores for all Sequence s in a DataSet. The first method returns these scores, whereas the second stores the scores to the provided double array. This may be reasonable to save memory, for instance if we compute the log-likelihoods of a large number of sequences using different models.
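The following hedged sketch illustrates this buffer reuse; data and the array models are assumptions:
<source lang="java5" enclose="div">
// Hedged sketch: reuse a single buffer for the scores of several models.
double[] buffer = new double[ data.getNumberOfElements() ];
for( SequenceScore score : models ) {
	score.getLogScoreFor( data, buffer ); // overwrites the buffer in place
	// ... evaluate buffer before scoring with the next model ...
}
</source>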
The result of these methods (w.r.t. one Sequence) should be identical to independent calls of getLogScore. Hence, these methods are pre-implemented exactly in this way in the abstract implementations of SequenceScore.
Finally, the return value of the method
<source lang="java5" enclose="div">
public boolean isInitialized();
</source>
indicates if a SequenceScore has been initialized, i.e., all parameters have been set to some (initial or learned) value and the SequenceScore is ready to score Sequence s.
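Taken together, these methods suffice for a simple maximum-score classification, as in this hedged sketch; scoreA, scoreB, and data are assumptions, and matching lengths and AlphabetContainer s must be ensured externally (see above):
<source lang="java5" enclose="div">
// Hedged sketch: classify sequences by the larger of two log-scores.
for( int i = 0; i < data.getNumberOfElements(); i++ ) {
	Sequence seq = data.getElementAt( i );
	double logA = scoreA.getLogScoreFor( seq );
	double logB = scoreB.getLogScoreFor( seq );
	System.out.println( i + "\t" + ( logA > logB ? "class A" : "class B" ) );
}
</source>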
DifferentiableSequenceScores
The interface DifferentiableSequenceScore extends SequenceScore and adds all methods that are necessary for a numerical optimization of the parameters of a SequenceScore. This especially includes the computation of the gradient w.r.t. the parameters for a given input Sequence.
Before we can start a numerical optimization, we first need to specify initial parameters. DifferentiableSequenceScore declares two methods for that purpose. The first is
<source lang="java5" enclose="div">
public void initializeFunction(int index, boolean freeParams,
	DataSet[] data, double[][] weights) throws Exception;
</source>
which is tailored to initializing the parameters from a given data set. The parameters have the following meaning: index is the index of the class this DifferentiableSequenceScore represents.
For some parameterizations, we can either optimize (and initialize) only the free parameters of the model, i.e., excluding parameters with values that can be computed from the values of a subset of the remaining parameters, or we can optimize all parameters. For example, such parameters could be those of a DNA dice, where the probability of one nucleotide is fixed if we know the probabilities of all other nucleotides.
If freeParams is true, only the free parameters are used.
The array data holds the input data, where data[index] holds the data for the current class, and weights are the weights on each Sequence in data.
The second method for initialization is
<source lang="java5" enclose="div">
public void initializeFunctionRandomly(boolean freeParams) throws Exception;
</source>
which initializes all parameters randomly, where the exact method for drawing parameter values is implementation-specific.
Like for the method getLogScore of SequenceScore, there are three methods for computing the gradient w.r.t. the parameters. The first is
<source lang="java5" enclose="div">
public double getLogScoreAndPartialDerivation(Sequence seq, IntList indices, DoubleList partialDer);
</source>
where seq is the Sequence for which the gradient is to be computed, and indices and partialDer are lists that shall be filled with the indexes and values of the partial derivations, respectively. Assume that we implement a DifferentiableSequenceScore [math]f[/math] with [math]N[/math] parameters and [math]d_i = \frac{\partial \log f}{\partial \lambda_i}[/math] is the partial derivation of the log-score w.r.t. the [math]i[/math]-th parameter of [math]f[/math]. Then we would add the index [math]i[/math] to indices and (before subsequent additions to indices or partialDer) add [math]d_i[/math] to partialDer. Partial derivations, i.e., components of the gradient, that are zero may be omitted.
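To make this contract concrete, the following hypothetical sketch (not part of Jstacs) fills the lists for a PWM-like score with one parameter lambda[p][a] per position and symbol; the field lambda and the flat parameter indexing are assumptions for illustration:
<source lang="java5" enclose="div">
import de.jstacs.data.sequences.Sequence;
import de.jstacs.utils.DoubleList;
import de.jstacs.utils.IntList;

// Hypothetical sketch of the gradient contract for a PWM-like score.
public class PwmLikeGradientSketch {

	private double[][] lambda; // log-parameters, assumed to be initialized elsewhere

	public double getLogScoreAndPartialDerivation( Sequence seq, IntList indices, DoubleList partialDer ) {
		double logScore = 0;
		for( int p = 0; p < lambda.length; p++ ) {
			int a = seq.discreteVal( p );            // symbol at position p
			logScore += lambda[p][a];                // log-score is additive in the parameters
			indices.add( p * lambda[p].length + a ); // flat index of parameter lambda[p][a]
			partialDer.add( 1.0 );                   // d log-score / d lambda[p][a] = 1
		}
		return logScore;                             // all other partial derivations are zero and omitted
	}
}
</source>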
The other two methods
<source lang="java5" enclose="div">
public double getLogScoreAndPartialDerivation(Sequence seq, int start, IntList indices, DoubleList partialDer);
</source>
and
<source lang="java5" enclose="div">
public double getLogScoreAndPartialDerivation( Sequence seq, int start, int end, IntList indices, DoubleList partialDer ) throws Exception;
</source>
additionally allow for the specification of a start (and end) position within the Sequence seq in the same manner as specified for getLogScore of SequenceScore.
The method
<source lang="java5" enclose="div">
public int getNumberOfParameters();
</source>
returns the number of parameters ([math]N[/math] in the example above) of this DifferentiableSequenceScore. It is used, for example, to create temporary arrays for the summation of gradients during numerical optimization.
Since numerical optimization for non-convex problems may get stuck in local optima, we often need to restart numerical optimization multiple times from different initial values of the parameters. As the number of starts that is required to yield the global optimum with high probability is highly score-specific, DifferentiableSequenceScore defines a method
<source lang="java5" enclose="div">
public int getNumberOfRecommendedStarts();
</source>
that returns the number of starts that is recommended by the developer of this DifferentiableSequenceScore. If multiple DifferentiableSequenceScore s are used in a numerical optimization (for instance for the different classes), the maximum of these values is used in the optimization. If your parameterization is known to be convex, this method should return 1.
The current parameter values (set either by initialization or explicitly, see below) can be obtained by
<source lang="java5" enclose="div">
public double[] getCurrentParameterValues() throws Exception;
</source>
and new parameter values may be set by
<source lang="java5" enclose="div">
public void setParameters(double[] params, int start);
</source>
where start is the index of the value for the first parameter of this DifferentiableSequenceScore within params.
Finally, the method
<source lang="java5" enclose="div">
public double getInitialClassParam(double classProb);
</source>
returns the class parameter for a given class probability. The default implementation in the abstract class AbstractDifferentiableSequenceScore returns Math.log(classProb). This abstract class also provides standard implementations for many of the other methods.
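As a brief, hedged usage sketch of these methods, where diffScore is an assumed, readily constructed DifferentiableSequenceScore:
<source lang="java5" enclose="div">
// Hedged sketch: random initialization and a parameter round-trip.
diffScore.initializeFunctionRandomly( false );           // initialize all parameters
double[] values = diffScore.getCurrentParameterValues(); // copy of the current values
values[0] += 0.1;                                        // modify one value
diffScore.setParameters( values, 0 );                    // write back, first parameter at index 0
</source>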
StatisticalModels
StatisticalModel s extend the specification of SequenceScore s by the computation of proper (normalized) likelihoods for input sequences. To this end, the interface defines three methods for computing the log-likelihood of a Sequence given a StatisticalModel and its current parameter values.
The first method just requires the specification of the Sequence object for which the log-likelihood is to be computed:
<source lang="java5" enclose="div">
public double getLogProbFor(Sequence sequence) throws Exception;
</source>
If the StatisticalModel has not been trained prior to calling this method, it is allowed to throw an Exception. The meaning of this method is slightly different for inhomogeneous, that is position-dependent, and homogeneous models. In case of an inhomogeneous model, for instance a position weight matrix, the specified Sequence must be of the same length as the model, i.e., the number of columns in the weight matrix. Otherwise, an Exception should be thrown, since users may be tempted to misinterpret the return value as a probability of the complete, possibly longer or shorter, provided sequence.
In case of models that can handle Sequence s of varying lengths, for instance homogeneous Markov models, this method must return the likelihood of the complete sequence. Since this is not always the desired result, two other methods are specified, which allow for computing the likelihood of sub-sequences. This method should also check if the provided Sequence is defined over the same AlphabetContainer as the model.
The second method is
<source lang="java5" enclose="div">
public double getLogProbFor(Sequence sequence, int startpos) throws Exception;
</source>
which computes the log-likelihood of the sub-sequence starting at position startpos. The resulting value of the log-likelihood must be the same as if the user had called getLogProbFor(sequence.getSubSequence(startpos)).
The third method reads
<source lang="java5" enclose="div">
public double getLogProbFor(Sequence sequence, int startpos, int endpos) throws Exception;
</source>
and computes the log-likelihood of the sub-sequence starting at position startpos up to position endpos (inclusive). The resulting value must be the same as if the user had called getLogProbFor(sequence.getSubSequence(startpos, endpos-startpos+1)).
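For instance, an inhomogeneous model of fixed length can be used to scan a longer sequence window by window, as in this hedged sketch; model and seq are assumptions:
<source lang="java5" enclose="div">
// Hedged sketch: score every window of a long sequence with a fixed-length model.
int motifLength = model.getLength();
for( int start = 0; start + motifLength <= seq.getLength(); start++ ) {
	double logLikelihood = model.getLogProbFor( seq, start, start + motifLength - 1 );
	System.out.println( start + "\t" + logLikelihood );
}
</source>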
If we want to use Bayesian principles for learning the parameters of a model, we need to specify a prior distribution on the parameter values. In some cases, for instance for using MAP estimation in an expectation maximization (EM) algorithm, it is not only necessary to estimate the parameters taking the prior into account, but also to know the value of the prior (or a term proportional to it). For this reason, the StatisticalModel interface defines the method
<source lang="java5" enclose="div">
public double getLogPriorTerm() throws Exception;
</source>
which returns the logarithm of this prior term.
If the concept of a prior is not applicable for a certain model or other reasons prevent you from implementing this method, getLogPriorTerm() should return Double.NEGATIVE_INFINITY.
StatisticalModel s can also be used to create new, artificial data according to the model assumptions. For this purpose, the StatisticalModel interface specifies a method
<source lang="java5" enclose="div">
public DataSet emitDataSet(int numberOfSequences, int... seqLength)
	throws NotTrainedException, Exception;
</source>
which returns a DataSet of numberOfSequences Sequence s drawn from the model using its current parameter values. The second parameter seqLength allows for the specification of the lengths of the drawn Sequence s. For inhomogeneous models, which inherently define the length of possible sequences, this parameter should be null or an array of length 0. For variable-length models, the lengths may either be specified for all drawn sequences by a single int value or by an array of length numberOfSequences specifying the length of each drawn sequence independently.
The implementation of this method is not always possible. In its default implementation in the abstract implementations of StatisticalModel, this method just throws an Exception and must be explicitly overridden to be functional.
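Where it is implemented, it may be used as in this hedged sketch; pwm and hmm stand for trained models analogous to those created later in this section:
<source lang="java5" enclose="div">
// Hedged sketch: draw artificial data from trained models.
DataSet fromPwm = pwm.emitDataSet( 100 );      // inhomogeneous: length fixed by the model
DataSet fromHmm = hmm.emitDataSet( 100, 500 ); // homogeneous: 100 sequences of length 500
</source>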
The last method defined in StatisticalModel is
<source lang="java5" enclose="div">
public byte getMaximalMarkovOrder() throws UnsupportedOperationException;
</source>
which returns the maximum number of preceding positions that are considered for (conditional) probabilities. For instance, for a position weight matrix, this method should return 0, whereas for a homogeneous Markov model of order 2, it should return 2.
TrainableStatisticalModels
Statistical models that can be learned on a single input data set are represented by the interface TrainableStatisticalModel of Jstacs. In most cases, such models are learned by generative learning principles like maximum likelihood (ML) or maximum a-posteriori (MAP). For models that are learned from multiple data sets, commonly by discriminative learning principles, Jstacs provides another interface DifferentiableStatisticalModel, which will be presented in the next sub-section.
In the following, we briefly describe all methods that are defined in the TrainableStatisticalModel interface. For convenience, an abstract implementation AbstractTrainSM of TrainableStatisticalModel exists, which provides standard implementations for many of these methods.
The parameters of statistical models are typically learned from some training data set. For this purpose, TrainableStatisticalModel specifies a method train
<source lang="java5" enclose="div">
public void train(DataSet data) throws Exception;
</source>
that learns the parameters from the training data set data. By specification, successive calls to train must result in a model trained on the data set provided in the last call, as opposed to incremental learning on all data sets.
Besides this simple train method, TrainableStatisticalModel also declares another method
<source lang="java5" enclose="div">
public void train(DataSet data, double[] weights) throws Exception;
</source>
that allows for the specification of weights for each input sequence. Since the previous method is a special case of this one where all weights are equal to [math]1[/math], only the weighted variant must be implemented if you decide to extend AbstractTrainSM. This method should be implemented such that the specification of null weights leads to the standard variant with all weights equal to 1. The actual training method may be totally problem- and implementation-specific. However, in most cases you might want to use one of the generative learning principles ML or MAP.
After a model has been trained it can be used to compute the likelihood of a sequence given the model and its (trained) parameters as defined in StatisticalModel.
The method
<source lang="java5" enclose="div">
public String toString();
</source>
should return some String representation of the current model. Typically, this representation should include the current parameter values in some suitable format.
Besides the possibility to implement new statistical models in Jstacs, many of them are already implemented and can readily be used.
As a central facility for creating TrainableStatisticalModel instances of many standard models, Jstacs provides a TrainableStatisticalModelFactory.
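The factory examples below use a variable alphabet holding a discrete AlphabetContainer; for DNA data, it may, for instance, be obtained as in this hedged sketch:
<source lang="java5" enclose="div">
// Hedged sketch: a DNA AlphabetContainer for the factory examples below.
AlphabetContainer alphabet = DNAAlphabetContainer.SINGLETON;
</source>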
Using the TrainableStatisticalModelFactory, we can create a new position weight matrix (PWM) model by calling
<source lang="java5" enclose="div">
TrainableStatisticalModel pwm = TrainableStatisticalModelFactory.createPWM( alphabet, 10, 4.0 );
</source>
where we need to specify the (discrete) alphabet, the length of the matrix ([math]10[/math]), i.e., the number of positions modeled, and an equivalent sample size (ESS) for MAP estimation ([math]4.0[/math]). For the concept of an equivalent sample size and the BDeu prior used for most models in Jstacs, we refer the reader to (Heckerman). If the ESS is set to [math]0.0[/math], the parameters are estimated by the ML instead of the MAP principle.
The factory method for an inhomogeneous Markov model of arbitrary order (the PWM model is just an inhomogeneous Markov model of order 0) is
<source lang="java5" enclose="div">
TrainableStatisticalModel imm = TrainableStatisticalModelFactory.createInhomogeneousMarkovModel( alphabet, 12, 4.0, (byte) 2 );
</source>
where the parameters are in complete analogy to the PWM model, except the last one specifying the order of the inhomogeneous Markov model, which is 2 in the example.
We can also create permuted Markov models of order [math]1[/math] and [math]2[/math], where the positions of the sequences may be permuted before building an inhomogeneous Markov model. The permutation is chosen such that the mutual information between adjacent positions is maximized. We create a permuted Markov model of length 7 and order 1 by calling
<source lang="java5" enclose="div">
TrainableStatisticalModel pmm = TrainableStatisticalModelFactory.createPermutedMarkovModel( alphabet, 7, 4.0, (byte) 1 );
</source>
Homogeneous Markov models are created by
<source lang="java5" enclose="div">
TrainableStatisticalModel hmm = TrainableStatisticalModelFactory.createHomogeneousMarkovModel( alphabet, 400.0, (byte) 3 );
</source>
where the first parameter again specifies the alphabet, the second parameter is the equivalent sample size, and the third parameter is the order of the Markov model.
We can create a ZOOPS model with a PWM as motif model and the homogeneous Markov model created above as flanking model using
<source lang="java5" enclose="div">
TrainableStatisticalModel zoops = TrainableStatisticalModelFactory.createZOOPS( pwm, hmm, new double[]{4,4}, false );
</source>
where the third parameter contains the hyper-parameters for the two components of the ZOOPS model, and the last boolean allows us to switch training of the flanking model on or off. In the example, we choose to train the flanking model and the motif model.
After we have created a TrainableStatisticalModel, we can train it on input data. For instance, we train the PWM created above on a training DataSet ds by calling:
<source lang="java5" enclose="div">
pwm.train( ds );
</source>
Instead of using the TrainableStatisticalModelFactory, we can also create TrainableStatisticalModel s directly using their constructors. For example, we create a homogeneous Markov model of order [math]0[/math] by calling
<source lang="java5" enclose="div">
HomogeneousMM hmm2 = new HomogeneousMM( new HomMMParameterSet( alphabet, 4.0, "hmm(0)", (byte) 0 ) );
</source>
or we create a Bayesian network of (full) order [math]1[/math], i.e., a Bayesian tree, of length [math]8[/math] learned by MAP with an equivalent sample size of [math]4.0[/math] with
<source lang="java5" enclose="div">
BayesianNetworkTrainSM bnm = new BayesianNetworkTrainSM( new BayesianNetworkTrainSMParameterSet( alphabet, 8, 4.0, "Bayesian network", ModelType.BN, (byte) 1, LearningType.ML_OR_MAP ) );
</source>
Due to the complexity of hidden Markov models, these have their own factory class called HMMFactory.
To create a new hidden Markov model, we first need to define the parameters of the training procedure, which is Baum-Welch training in the example
<source lang="java5" enclose="div">
HMMTrainingParameterSet trainingPars = new BaumWelchParameterSet( 5, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), 2 );
</source>
The parameters provided to that ParameterSet are the number of restarts of the training, the termination condition for each training, and the number of threads used for the (multi-threaded) optimization.
Then we create the emissions of the HMM. Here, we choose two discrete emissions over the DNAAlphabet:
<source lang="java5" enclose="div">
Emission[] emissions = new Emission[]{ new DiscreteEmission( alphabet, 4.0 ), new DiscreteEmission( alphabet, new double[]{2.0,1.0,1.0,2.0} ) };
</source>
For the first emission, we use a BDeu prior with equivalent sample size [math]4[/math], whereas for the second emission, we explicitly specify the hyper-parameters of the BDe prior.
Finally, we use the HMMFactory to create a new ergodic, i.e., fully connected, HMM for this training procedure and these emissions:
<source lang="java5" enclose="div">
AbstractHMM myHMM = HMMFactory.createErgodicHMM( trainingPars, 1, 4.0, 0.1, 100.0, emissions );
</source>
For non-standard architectures of the HMM, we can also use a constructor of HigherOrderHMM, although we only use order [math]1[/math] in the example,
<source lang="java5" enclose="div">
HigherOrderHMM hohmm = new HigherOrderHMM( trainingPars, new String[]{"A","B"}, emissions,
	new TransitionElement( null, new int[]{0}, new double[]{4.0} ),
	new TransitionElement( new int[]{0}, new int[]{0,1}, new double[]{2.0,2.0} ),
	new TransitionElement( new int[]{1}, new int[]{0}, new double[]{4.0} ) );
</source>
The arguments are: the training parameters, the names of the (two) states, the emissions as specified above, and TransitionElement s that specify the transitions allowed in the HMM. In the example, we may start in the [math]0[/math]-th state (A) represented by the first TransitionElement, from that state we may either loop in state [math]0[/math] or proceed to state [math]1[/math], and from state [math]1[/math] we may only go back to state [math]0[/math]. For all transitions, we specify hyper-parameters for the corresponding Dirichlet prior.
Once an HMM is created, we can use a method getGraphvizRepresentation to obtain the content of a .dot file with a graphical representation (nodes and edges) of the HMM:
<source lang="java5" enclose="div">
System.out.println( hohmm.getGraphvizRepresentation( null ) );
</source>
Another important type of TrainableStatisticalModel s within Jstacs are mixture models. Such mixture models can be mixtures of any sub-class of TrainableStatisticalModel and can be learned either by expectation-maximization (EM) or by Gibbs sampling.
We create a mixture model of two PWMs learned by expectation maximization using the constructor
<source lang="java5" enclose="div">
MixtureTrainSM mixEm = new MixtureTrainSM( 10, new TrainableStatisticalModel[]{pwm,pwm}, 3, new double[]{4.0,4.0}, 1, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), Parameterization.LAMBDA );
</source>
where the first argument is the length of the model, the second are the TrainableStatisticalModel s used as mixture components, the third is the number of restarts of the EM, the fourth are the hyper-parameters for the prior on the component probabilities, i.e., mixture weights, the fifth is the termination condition, and the last is the parameterization. For mixture models, we can choose between the parameterization in terms of probabilities (called THETA) and the so-called "natural parameterization" as log-probabilities (called LAMBDA).
If we want to use Gibbs sampling instead of EM, we use the constructor
<source lang="java5" enclose="div">
MixtureTrainSM mixGibbs = new MixtureTrainSM( 10, new TrainableStatisticalModel[]{pwm,pwm}, 3, new double[]{4.0,4.0}, 100, 1000, new VarianceRatioBurnInTest( new VarianceRatioBurnInTestParameterSet( 3, 1.2 ) ) );
</source>
where the first four arguments remain unchanged, the fifth is the number of sampling steps in the initial phase, the sixth is the total number of samplings in the stationary phase, and the seventh is a burn-in test for determining the end of the burn-in phase and the start of the stationary phase.
A special case of a mixture model is the StrandTrainSM, which is a mixture over the two strands of DNA. For both strands, the same component TrainableStatisticalModel is used, but it is evaluated once for the original Sequence and once for its reverse complement. We construct a StrandTrainSM of the inhomogeneous Markov model created by the TrainableStatisticalModelFactory by calling
<source lang="java5" enclose="div">
StrandTrainSM strandModel = new StrandTrainSM( imm, 3, 0.5, 1, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), Parameterization.LAMBDA );
</source>
where the first argument is the component TrainableStatisticalModel, the second is the number of starts, the third is the probability of the forward strand (which is fixed in the example but can also be learned), the fourth is the hyper-parameter for drawing initial strands ([math]1[/math] results in a uniform distribution on the simplex), the fifth is the termination condition, and the last is the parameterization (as above).
== DifferentiableStatisticalModels ==
DifferentiableStatisticalModel s are StatisticalModel s that also inherit the properties of DifferentiableSequenceScore s. That means, DifferentiableStatisticalModel s define a proper likelihood over the input sequences and, at the same time, can compute partial derivatives with respect to their parameters.
This also means that they can compute a (not necessarily normalized) log-score using the <code>getLogScore</code> methods and normalized log-likelihoods using the <code>getLogProb</code> methods. In many cases, it is more efficient to use the <code>getLogScore</code> methods and, in addition, to compute the normalization constant required to obtain proper probabilities separately. Hence, DifferentiableStatisticalModel defines a method
<source lang="java5" enclose="div">
public double getLogNormalizationConstant();
</source>
that returns the logarithm of that normalization constant, which only depends on the current parameter values but not on the input sequence. For computing the gradients with respect to the parameters, we also need the partial derivatives of the log-normalization constant with respect to the parameters. To this end, DifferentiableStatisticalModel provides a method
<source lang="java5" enclose="div">
public double getLogPartialNormalizationConstant(int parameterIndex);
</source>
that takes the (internal) index of the parameter for which the partial derivative shall be computed as its only argument.
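For illustration, proper log-likelihoods can be computed from unnormalized log-scores by subtracting the log-normalization constant. The following minimal sketch assumes a hypothetical, initialized DifferentiableStatisticalModel <code>dsm</code> and a Sequence <code>seq</code> of matching length:
<source lang="java5" enclose="div">
// hypothetical objects: dsm is an initialized DifferentiableStatisticalModel, seq a matching Sequence
double logScore = dsm.getLogScoreFor( seq );
// subtracting the log-normalization constant yields a proper log-likelihood
double logProb = logScore - dsm.getLogNormalizationConstant();
</source>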
A similar scheme is used for the prior on the parameters. In addition to the method <code>getLogPriorTerm</code> of StatisticalModel, the method
<source lang="java5" enclose="div">
public void addGradientOfLogPriorTerm(double[] grad, int start);
</source>
adds the values of the gradient of the log-prior with respect to all parameters to the given array <code>grad</code>, starting at position <code>start</code>.
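As a sketch, this gradient can be accumulated in a freshly allocated array, where the dimension is obtained from <code>getNumberOfParameters()</code> as defined by DifferentiableSequenceScore (again for a hypothetical model <code>dsm</code>):
<source lang="java5" enclose="div">
// allocate an array with one entry per parameter and add the gradient of the log-prior to it
double[] grad = new double[ dsm.getNumberOfParameters() ];
dsm.addGradientOfLogPriorTerm( grad, 0 );
</source>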
The method
<source lang="java5" enclose="div">
public boolean isNormalized();
</source>
returns <code>true</code> if the DifferentiableStatisticalModel is already normalized, that is, if <code>getLogScore</code> already returns proper log-probabilities. If that is the case, <code>getLogNormalizationConstant</code> must always return [math]0[/math], and <code>getLogPartialNormalizationConstant</code> must return <code>Double.NEGATIVE_INFINITY</code> for all parameters.
Finally,
<source lang="java5" enclose="div">
public double getESS();
</source>
returns the equivalent sample size used for the prior of this DifferentiableStatisticalModel.
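For a normalized model, these contracts can be checked as in the following sketch (again for a hypothetical <code>dsm</code>):
<source lang="java5" enclose="div">
if( dsm.isNormalized() ){
	// for a normalized model, the log-normalization constant must be 0
	System.out.println( dsm.getLogNormalizationConstant() );
}
System.out.println( "equivalent sample size: " + dsm.getESS() );
</source>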
DifferentiableStatisticalModel implementations of standard models already exist in Jstacs.
For example, we can create a Bayesian network of length [math]10[/math] with the network structure learned by means of the discriminative "explaining away residual" with
<source lang="java5" enclose="div">
BayesianNetworkDiffSM bnDsm = new BayesianNetworkDiffSM( new BayesianNetworkDiffSMParameterSet( alphabet, 10, 4.0, true, new BTExplainingAwayResidual( new double[]{4.0,4.0} ) ) );
</source>
where the parameters of the ParameterSet are the AlphabetContainer, the length and the equivalent sample size of the DifferentiableStatisticalModel, a switch indicating whether (generative) plug-in parameters shall be used, and an object for the structure measure.
An inhomogeneous Markov model of order [math]1[/math], length [math]8[/math], and with equivalent sample size [math]4.0[/math] is created by
<source lang="java5" enclose="div">
MarkovModelDiffSM mmDsm = new MarkovModelDiffSM( alphabet, 8, 4.0, true, new InhomogeneousMarkov( 1 ) );
</source>
DifferentiableStatisticalModel s are mainly used for discriminative parameter learning in Jstacs. To discriminatively learn the parameters of two inhomogeneous Markov models as created above, we need the following three lines:
<source lang="java5" enclose="div">
GenDisMixClassifierParameterSet params = new GenDisMixClassifierParameterSet( alphabet, 8, Optimizer.QUASI_NEWTON_BFGS, 1E-6, 1E-6, 1, false, KindOfParameter.PLUGIN, true, 4 );
GenDisMixClassifier cl = new GenDisMixClassifier( params, new CompositeLogPrior(), LearningPrinciple.MSP, mmDsm, mmDsm );
cl.train( data );
</source>
The discriminative learning principle used in the example is the maximum supervised posterior (MSP) principle. For details about the GenDisMixClassifier used in the example, see section [[Second main course: Classifiers#GenDisMixClassifiers]].
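After training, the GenDisMixClassifier can be applied to new data. The following sketch assumes a hypothetical Sequence <code>seq</code> of length [math]8[/math] over the same AlphabetContainer:
<source lang="java5" enclose="div">
// classify a single sequence; the returned byte is the index of the predicted class
byte predictedClass = cl.classify( seq );
</source>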
A homogeneous Markov model of order [math]3[/math] with equivalent sample size [math]4.0[/math] is created by the constructor
<source lang="java5" enclose="div">
HomogeneousMMDiffSM hmmDsm = new HomogeneousMMDiffSM( alphabet, 3, 4.0, 100 );
</source>
The last argument of this constructor is the expected length of the input sequences. This length is used to compute consistent hyper-parameters from the equivalent sample size for this class, since the (transition) parameters of a homogeneous Markov model are used for more than one position of an input sequence.
Similar to hidden Markov models that are TrainableStatisticalModel s, we can create a DifferentiableStatisticalModel hidden Markov model with the following lines:
<source lang="java5" enclose="div">
DifferentiableEmission[] emissions = new DifferentiableEmission[]{new DiscreteEmission( alphabet, 4.0 ), new DiscreteEmission( alphabet, new double[]{2.0,1.0,1.0,2.0} )};
NumericalHMMTrainingParameterSet trainingParameterSet = new NumericalHMMTrainingParameterSet( 3, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), 2, Optimizer.QUASI_NEWTON_BFGS, 1E-6, 1 );
DifferentiableHigherOrderHMM hmm = new DifferentiableHigherOrderHMM( trainingParameterSet, new String[]{"A","B"}, new int[]{0,1}, new boolean[]{true,true}, emissions, true, 4.0,
	new TransitionElement( null, new int[]{0}, new double[]{4.0} ),
	new TransitionElement( new int[]{0}, new int[]{0,1}, new double[]{2.0,2.0} ),
	new TransitionElement( new int[]{1}, new int[]{0}, new double[]{4.0} ) );
</source>
The emissions and transitions of the hidden Markov model are the same as in the TrainableStatisticalModel case. However, we use training parameters for numerical optimization in this example.
A mixture DifferentiableStatisticalModel of two inhomogeneous Markov models can be created by
<source lang="java5" enclose="div">
MixtureDiffSM mixDsm = new MixtureDiffSM( 3, true, mmDsm, mmDsm );
</source>
where the first argument is the recommended number of restarts, and the second argument indicates if plug-in parameters shall be used.
Similar to TrainableStatisticalModel s, we can also define a mixture over the DNA strands with
<source lang="java5" enclose="div">
StrandDiffSM strandDsm = new StrandDiffSM( mmDsm, 0.5, 1, true, InitMethod.INIT_BOTH_STRANDS );
</source>
where the first argument is the component DifferentiableStatisticalModel, the second is the a-priori probability of the forward strand, the third is the recommended number of restarts, the fourth, again, indicates if plug-in parameters are used, and the fifth indicates on which strands the component DifferentiableStatisticalModel is initialized.
We can use this StrandDiffSM together with a homogeneous Markov model as the flanking model to specify a ZOOPS DifferentiableStatisticalModel using the constructor
<source lang="java5" enclose="div">
ExtendedZOOPSDiffSM zoops = new ExtendedZOOPSDiffSM( ExtendedZOOPSDiffSM.CONTAINS_SOMETIMES_A_MOTIF, 500, 4, false, hmmDsm, strandDsm, null, true );
</source>
Here, the first argument specifies that we want to use ZOOPS (as opposed to OOPS, where each sequence contains exactly one motif occurrence), the second is the length of the input sequences, the third is the number of restarts, the fourth switches plug-in parameters, the fifth and sixth are the flanking and motif model, respectively, the seventh specifies a position distribution for the motifs (where <code>null</code> defaults to a uniform distribution), and the last indicates if the flanking model is initialized (again).
Sometimes it is useful to model different parts of a sequence by different DifferentiableStatisticalModel s, for instance if we have some structural knowledge about the input sequences. To this end, Jstacs provides a class IndependentProductDiffSM, for which we can specify the models of the different parts independently. We create such an IndependentProductDiffSM by calling
<source lang="java5" enclose="div">
IndependentProductDiffSM ipsf = new IndependentProductDiffSM( 4.0, true, bnDsm, mmDsm );
</source>
where the first parameter is the equivalent sample size, the second switches plug-in parameters, and the remaining parameters are the DifferentiableStatisticalModel s to be combined. Besides this constructor, other constructors allow for specifying the lengths of the parts of a sequence to be modeled by the different DifferentiableStatisticalModel s, or for re-using the same DifferentiableStatisticalModel for different, possibly remote, parts of an input sequence.
Since DifferentiableStatisticalModel s provide methods for computing log-likelihoods, for computing normalization constants, and for determining the gradient with respect to their parameters, they can also be learned generatively by numerical optimization. For this purpose, a wrapper class exists that creates a TrainableStatisticalModel out of any DifferentiableStatisticalModel. We create such a wrapper object with
<source lang="java5" enclose="div">
DifferentiableStatisticalModelWrapperTrainSM trainSm = new DifferentiableStatisticalModelWrapperTrainSM( mmDsm, 4, Optimizer.QUASI_NEWTON_BFGS, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), 1E-6, 1 );
trainSm.train( data[0] );
</source>
where we need to specify the DifferentiableStatisticalModel that shall be learned and the parameters of the numerical optimization (see section [[Intermediate course: Optimization]] for details). By this means, we can use any DifferentiableStatisticalModel in the world of TrainableStatisticalModel s as well, for instance to build a TrainSMBasedClassifier (see [[Second main course: Classifiers]]).
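For instance, the wrapped model can directly be used to build such a classifier. This is only a sketch, which assumes the TrainSMBasedClassifier constructor taking the component TrainableStatisticalModel s (see [[Second main course: Classifiers]]):
<source lang="java5" enclose="div">
// build a classifier from two copies of the wrapped model and train it on the data sets
TrainSMBasedClassifier wrappedCl = new TrainSMBasedClassifier( trainSm, trainSm );
wrappedCl.train( data );
</source>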