# First main course: SequenceScores

### From Jstacs

SequenceScore s define scoring functions over sequences that can assign a score to each input sequence. Such scores can then be used to classify sequences by chosing the class represented by that SequenceScore yielding the maximum score.

Sub-iterfaces of SequenceScore are DifferentiableSequenceScore, which is tailored to numerical optimization, and StatisticalModel, which represents statistical models defining a proper likelihood. StatisticalModel, in turn, has two sub-interfaces TrainableStatisticalModel and DifferentiableStatisticalModel, where the latter joins the properties of StatisticalModel s and DifferentiableSequenceScore s. All of these will be explained in the sub-sections of this section.

## Contents |

SequenceScore extends the standard interface `Cloneable`

and, hence, implementations must provide a `clone()`

method, which returns a deep copy of all fields of an instance:

Since the implementation of the clone method is very score-specific, it must be implemented anew for each implementation of the SequenceScore interface. Besides that SequenceScore also extends Storable, which demands the implementation of the `toXML()`

method and a constructor working on a `StringBuffer`

containing an XML-representation (see section Intermediate course: XMLParser, Parameters, and Results).

SequenceScore also defines some methods that are basically getters for typical properties of a SequenceScore implementation. These are

which returns the current AlphabetContainer of the model, i.e., the AlphabetContainer of Sequence s this SequenceScore can score,

which returns a - distinguishing - name of the current SequenceScore instance, and

which returns the length of the SequenceScore, i.e. the possible length of input sequences. This length is defined to be zero, if the score can handle Sequence s of variable length.

The methods

and

can be used to return some properties of a SequenceScore like the number of parameters, the depth of some tree structure, or whatever seems useful. In the latter case, these characteristics are limited to numerical values. If a model is used, e.g., in a cross validation (KFoldCrossValidation), these numerical properties are averaged over all cross validation iterations and displayed together with the performance measures.

The scoring of input Sequence s is performed by three methods `getLogScore`

. The name of this method reflects that the returned score should be on a lagorithmic scale. For instance, for a stastical model with a proper likelihood over the sequences this score would correspond to the log-likelihood.
The first method

returns the score for the given Sequence. For reasons of efficiency, it is not intended to check if the length or the AlphabetContainer of the Sequence match those of the SequenceScore. Hence, no exception is declared for this method. So if this method is used (instead of the more stringent methods of StatisticalModel, see #StatisticalModels) this should be checked externally.

The second method is

which returns the score for the sub-sequence starting at position `start`

end extending to the end of the Sequence or the end of the SequenceScore, whichever comes first.

The third method is

and computes the score for the sub-sequence between `start`

and `end`

, where the value at position `end`

is the last value considered for the computation of the score.

Two additional methods

and

allow for the computation of such scores for all Sequence s in a DataSet. The first method returns these scores, whereas the second stores the scores to the provided `double`

array. This may be reasonable to save memory, for instance if we compute the log-likelihoods of a large number of sequences using different models.

The result of this method (w.r.t. one Sequence) should be identical
to independent calls of `getLogScore`

. Hence, these method are pre-implemented exactly in this way in the abstract implementations
of SequenceScore.

Finally, the return value of the method

indicates if a SequenceScore has been initialized, i.e., all parameters have been set to some (initial or learned) value and the SequenceScore is ready to score Sequence s.

## DifferentiableSequenceScores

The interface DifferentiableSequenceScore extends SequenceScore and adds all methods that are necessary for a numerical optimization of the parameters of a SequenceScore. This especially includes the computation of the gradient w.r.t. the parameters for a given input Sequence.

Before we can start a numerical optimization, we first need to specify initial parameters. DifferentiableSequenceScore declares two methods for that purpos. The first is

DataSet[] data, double[][] weights) throws Exception;

which is tailored to initializing the parameters from a given data set. The parameters have the following meaning: `index`

is the index of the class this DifferentiableSequenceScore represents.
For some parameterizations, we can either optimize (and initialize) only the free parameters of the model, i.e., excluding parameters with values that can be computed from the values of a subset of the remaining parameters, or we can optimize all parameters. For example such parameters could be those of a DNA dice, where the probability of one nucleotide is fixed if we know the probabilities of all other nucleotides.
If `freeParams`

is `true`

, only the free parameters are used.
The array `data`

holds the input data, where `data[index]`

holds the data for the current class, and `weights`

are the weights on each Sequence in `data`

.

The second method for initialization is

which initializes all parameters randomly, where the exact method for drawing parameter values is implementation-specific.

Like for the method `getLogScore`

of SequenceScore, there are three methods for computing the gradient w.r.t. the parameters.
The first is

where `seq`

is the Sequence for which the gradient is to be computed, and `indices`

and `partialDer`

are lists that shall be filled with the indexes and values of the partial derivation, respectively. Assume that we implement a DifferentiableSequenceScore *f* with *N* parameters and is the partial derivation of the log-score w.r.t. the *i*-th parameter of *f*. Then we would add the index *i* to `indices`

and (before subsequent additions to `indices`

or `partialDer`

) add *d*_{i} to `partialDer`

. Partial derivations, i.e., components of the gradient, that are zero may be omitted.

The other two methods

and

additionally allow for the specification of a start (and end) position within the Sequence `seq`

in the same manner as specified for `getLogScore`

of SequenceScore.

The method

returns the number of parameters (*N* in the example above) of parameters of this DifferentiableSequenceScore. It is used, for example, to create temporary arrays for the summation of gradients during numerical optimization.

Since numerical optimization for non-convex problems may get stuck in local optima, we often need to restart numerical optimization multiple times from different initial values of the parameters. As the number of starts that is required to yield the global optimum with high probability is highly score-specific, DifferentiableSequenceScore defines a method

that returns the number of starts that is recommended by the developer of this DifferentiableSequenceScore. If multiple DifferentiableSequenceScore are used in a numerical optimization (for instance for the different classed), the maximum of these values is used in the optimization. If your parameterization is known to be convex, this method should return 1.

The currently set parameters (either by initialization or explicitly, see below) can be obtained by

and new parameter values may be set by

where `start`

is the index of the value for the first parameter of this DifferentiableSequenceScore within `params`

.

Finally, the method

returns the class parameter for a given class probability. The default implementation in the abstract class AbstractDifferentiableSequenceScore returns `Math.log(classProb)`

. This abstract class also provides standard implementations for many of the other methods.

## StatisticalModels

StatisticalModel s extend the specification of SequenceScore s by the computation of proper (normalized) likelihoods for input sequences. To this end, it defines three methods for computing the log-likelihood of a Sequence given a StatisticalModel and its current parameter values.

The first method just requires the specification of the Sequence object for which the log-likelihood is to be computed:

If the StatisticalModel has not been trained prior to calling this method, it is allowed to throw an `Exception`

. The meaning of this method is slightly different for inhomogeneous, that is position-dependent, and homogeneous models. In case of an inhomogeneous model, for instance a position weight matrix, the specified Sequence must be of the same length as the model, i.e. the number of columns in the weight matrix. Otherwise an Exception should be thrown, since users may be tempted to misinterpret the return value as a probability of the complete, possibly longer or shorter, provided sequence.
In case of models that can handle Sequence s of varying lengths, for instance homogeneous Markov models, this method must return the likelihood of the complete sequence. Since this is not always the desired result, to other methods are specified, which allow for computing the likelihood of sub-sequences. This method should also check if the provided Sequence is defined over the same AlphabetContainer as the model.

The second method is

which computes the log-likelihood of the sub-sequence starting at position `startpos`

. The resulting value of the log-likelihood must be the same as if the user had called `getLogProbFor(sequence.getSubSequence(startpos))`

.

The third method reads

and computes the likelihood of the sub-sequence starting at position `startpos`

up to position `endpos`

(inclusive). The resulting value of the likelihood must be the same as if the user had called `getProbFor(sequence.getSubSequence(startpos,endpos-startpos+1))`

.

If we want to use Bayesian principles for learning the parameters of a model, we need to specify a prior distribution on the parameter values. In some cases, for instance for using MAP estimation in an expectation maximization (EM) algorithm, it is not only necessary to estimate the parameters taking the prior into account, but also to know the value of the prior (or a term proportional to it). For this reason, the StatisticalModel interface defines the method

which returns the logarithm of this prior term.

If the concept of a prior is not applicable for a certain model or other reasons prevent you from implementing this method, `getLogPriorTerm()`

should return `Double.NEGATIVE_INFINITY`

.

StatisticalModel s can also be used to create new, artificial data according to the model assumptions. For this purpose, the StatisticalModel interface specifies a method

throws NotTrainedException, Exception;

which returns a DataSet of `numberOfSequences`

Sequence s drawn from the model using its current parameter values. The second parameter `seqLength`

allows for the specification of the lenghts of the drawn Sequence s. For inhomogeneous model, which inherently define the length of possible sequence, this parameter should be `null`

or an array of length 0. For variable-length models, the lengths may either be specified for all drawn sequences by a single `int`

value or by an array of length `numberOfSequences`

specifying the length of each drawn sequence independently.

The implementation of this method is not always possible. In its default implementation in the abstract implementations of StatisticalModel, this method just throws an Exception and must be explicitly overridden to be functional.

The last method defined in StatisticalModel is

which returns the maximum number of preceeding positions that are considered for (conditional) probabilities. For instance, for a position weight matrix, this method should return 0, whereas for a homogeneous Markov model of order 2, it should return 2.

### TrainableStatisticalModels

Statistical models that can be learned on a single input data set are represented by the interface TrainableStatisticalModel of Jstacs. In most cases, such models are learned by generative learning principles like maximum likelihood (ML) or maximum a-posteriori (MAP). For models that are learned from multiple data sets, commonly by discriminative learning principles, Jstacs provides another interface DifferentiableStatisticalModel, which will be presented in the next sub-section.

In the following, we briefly describe all methods that are defined in the TrainableStatisticalModel interface. For convenience, an abstract implementation AbstractTrainSM of TrainableStatisticalModel exists, which provides standard implementations for many of these methods.

The parameters of statistical models are typically learned from some training data set. For this purpose, TrainableStatisticalModel specifies a method `train`

that learns the parameters from the training data set `data`

. By specification, successive calls to `train`

must result in a model trained on the data set provided in the last call, as opposed to incremental learning on all data sets.

Besides this simple `train`

method, TrainableStatisticalModel also declares another method

that allows for the specification of weights for each input sequence. Since the previous method is a special case of this one where all weights are equal to 1, only the weighted variant must be implemented, if you decide to extend AbstractTrainSM. This method should be implemented such that the specification of `null`

weights leads to the standard variant with all weights equal to 1. The actual training method may be totally problem and implementation specific. However, in most cases you might want to use one of the generative learning principles ML or MAP.

After a model has been trained it can be used to compute the likelihood of a sequence given the model and its (trained) parameters as defined in StatisticalModel.

The method

should return some `String`

representation of the current model. Typically, this representation should include the current parameter values in some suitable format.

Besides the possibility to implement new statistical models in Jstacs, many of them are already implemented and can readily be used.
As a central facility for creating TrainableStatisticalModel instances of many standard models, Jstacs provides a TrainableStatisticalModelFactory.

Using the TrainableStatisticalModelFactory, we can create a new position weight matrix (PWM) model by calling

where we need to specify the (discrete) alphabet, the length of the matrix (10), i.e. the number of positions modeled, and an equivalent sample size (ESS) for MAP estimation (4.0). For the concept of an equivalent sample size and the BDeu prior used for most models in Jstacs, we refer the reader to (Heckerman). If the ESS is set to 0.0, the parameters are estimated by the ML instead of the MAP principle.

The factory method for an inhomogeneous Markov model of arbitrary order -- the PWM model is just an inhomogeneous Markov model of order 0 -- is

where the parameters are in complete analogy to the PWM model, expect the last one specifying the order of the inhomogeneous Markov model, which is 2 in the example.

We can also create permuted Markov models of order 1 and 2, where the positions of the sequences may be permuted before building an inhomogeneous Markov model. The permutation is chosen such that the mutual information between adjacent positions is maximized. We create a permuted Markov model of length 7 and order 1 by calling

Homogeneous Markov models are created by

where the first parameter again specifies the alphabet, the second parameter is the equivalent sample size, and the third parameter is the order of the Markov model.

We can create a ZOOPS model with a PWM as motif model and the homogeneous Markov model created above as flanking model using

where the third parameter contains the hyper-parameters for the two components of the ZOOPS model, and the last boolean allows two switch training of the flanking model on or off. In the example, we choose to train the flanking model and the motif model.

After we created a TrainableStatisticalModel, we can train it on input data. For instance, we train the PWM created above on a training DataSet `ds`

by calling:

Instead of using the TrainableStatisticalModelFactory, we can also create TrainableStatisticalModel s directly using their constructors. For example, we create a homogeneous Markov model of order 0 by calling

or we create a Bayesian network of (full) order 1, i.e., a Bayesian tree, of length 8 learned by MAP with an equivalent sample size of 4.0 with

Due to the complexity of hidden Markov models, these have their own factory class called HMMFactory.
To create a new hidden Markov model, we first need to define the parameters of the training procedure, which is Baum-Welch training in the example

The parameters provided to that ParameterSet are the number of restarts of the training, the termination condition for each training, and the number of threads used for the (multi-threaded) optimization.

Then we create the emissions of the HMM. Here, we choose two discrete emissions over the DNAAlphabet:

For the first emission, we use a BDeu prior with equivalent sample size 4, whereas for the second emission, we explictly specify the hyper-parameters of the BDe prior.

Finally, we use the HMMFactory to create a new ergodic, i.e., fully connected, HMM for this training procedure and these emissions:

For non-standard architectures of the HMM, we can also use a constructor of HigherOrderHMM, although we only use order 1 in the example,

new TransitionElement( null, new int[]{0}, new double[]{4.0} ),

new TransitionElement( new int[]{0}, new int[]{0,1}, new double[]{2.0,2.0} ),

new TransitionElement( new int[]{1}, new int[]{0}, new double[]{4.0} ));

The arguments are: the training parameters, the names of the (two) states, the emissions as specified above, and TransitionElement s that specify the transitions allowed in the HMM. In the example, we may start in the 0-th state (A) represented by the first TransitionElement, from that state we may either loop in state 0 or proceeed to state 1, and from state 1 we may only go back to state 0. For all transitions, we specify hyper-parameters for the corresponding Dirichlet prior.

Once an HMM is created, we can use a method `getGraphvizRepresentation`

to obtain the content of a .dot file with a graphical representation (nodes and edges) of the HMM:

Another important type of TrainableStatisticalModel s within Jstacs are mixture models. Such mixture models can be mixtures of any sub-class of TrainableStatisticalModel and can be learned either by expectation-maximization (EM) or by Gibbs sampling.

We create a mixture model of two PWMs learned by expectation maximization using the constructor

where the first argument is the length of the model, the second are the TrainableStatisticalModel s used as mixture components, the third is the number of restarts of the EM, the fourth are the hyper-parameters for the prior on the component probabilities, i.e., mixture weights, the fifth is the termination condition, and the last is the parameterization. For mixture models we can choose between the parameterization in terms of probabilities (called `THETA`

) and the so called "natural parameterization" as log-probabilities (called `LAMBDA`

).

If we want to use Gibbs sampling instead of EM, we use the constructor

where the first 4 arguments remain unchanged, the fifth is the number of intial steps in the stationary phase, the sixth is the total number of stationary samplings, and the seventh is a burn-in test for determining the end of the burn-in phase and the start of the stationary phase.

A special case of a mixture model is the StrandTrainSM, which is a mixture over the two strands of DNA. For both strands, the same component TrainableStatisticalModel is used but it is evaluated once for the original Sequence and once for its reverse complement. We construct a StrandTrainSM of the inhomogeneous Markov model created by the TrainableStatisticalModelFactory by calling

where the first argument is the component TrainableStatisticalModel, the second is the number of starts, the third is the probability of the forward strand (which is fixed in the example but can also be learned), the fourth is the hyper-parameter for drawing initial strands (1 results in a uniform distribution on the simplex), the fifth is the termination condition, and the last is the parameterization (as above).

### DifferentiableStatisticalModels

DifferentiableStatisticalModel s are StatisticalModel s that also inherit the properties of DifferentiableSequenceScore s. That means, DifferentiableStatisticalModel s define a proper likelihood over the input sequences and at the same time can compute partial derivations w.r.t. their parameters.

This also means that they can compute a - not necessarily normalized - log-score using the `getLogScore`

methods and normalized log-likelihoods using the `getLogProb`

methods. In many cases it is more efficient to use the `getLogScore`

methods and, in addition, to compute the normalization constant to obtain proper probabilities seperately. Hence,
DifferentiableStatisticalModel defines a method

that returns the logarithm of that normalization constant, which only depends on the current parameter values but not on the input sequence. For computing the gradients with respect to the parameters, we also need the partial derivations of the log-normalization constant with respect to the parameters. To this end DifferentiableStatisticalModel provides a method

that takes the (internal) index of the parameter for which the partial derivation shall be computed as the only argument.

A similar schema is used for the prior on the parameters. In addition to the method `getLogPriorTerm`

of StatisticalModel, the method

adds the values of the gradient of the log-prior with respect to all parameters to the given array `grad`

starting at position `start`

.

The method

returns `true`

if the DifferentiableStatisticalModel is already normalized, that means if `getLogScore`

already returns proper probabilities. If that is the case, `getLogNormalizationConstant`

must always return 0, and `getLogPartialNormalizationConstant`

must return `Double.NEGATIVE_INFINITY`

for all parameters.

Finally,

returns the equivalent sample size used for the prior of this DifferentiableStatisticalModel.

The DifferentiableStatisticalModel implementations of standard models already exist in Jstacs.
For example, we can create a Bayesian network of length 10 with the network structure learned by means of the discriminative "explaining away residual" with

where the parameters of the ParameterSet are the AlphabetContainer, the length and the equivalent sample size of the DifferentiableStatisticalModel, a switch wether (generative) plug-in parameters shall be used and an object for the structure measure.

An inhomogeneous Markov model of order 1, length 8, and with equivalent sample size 4.0 is created by

DifferentiableStatisticalModel s are mainly used for discriminative parameter learning in Jstacs. To discriminatively learn the parameters of two inhomogeneous Markov models as created above, we need the following three lines.

GenDisMixClassifier cl = new GenDisMixClassifier( params, new CompositeLogPrior(), LearningPrinciple.MSP, mmDsm, mmDsm );

cl.train( data );

The discriminative learning principle used in the example is maximum supervised posterior (MSP). For details about the GenDisMixClassifier used in the example, see section Second main course: Classifiers#GenDisMixClassifiers.

A homogeneous Markov models of order 3 with equivalent sample size 4.0 is created by the constructor

The last argument of this constructor is the expected length of the input sequences. This length is used to compute consistent hyper-parameters from the equivalent sample size for this class, since the (transition) parameters of a homogeneous Markov model are used for more than one position of an input sequence.

Similar to hidden Markov models that are TrainableStatisticalModel s, we can create a DifferentiableStatisticalModel hidden Markov model with the following lines:

NumericalHMMTrainingParameterSet trainingParameterSet = new NumericalHMMTrainingParameterSet( 3, new SmallDifferenceOfFunctionEvaluationsCondition( 1E-6 ), 2, Optimizer.QUASI_NEWTON_BFGS, 1E-6, 1 );

DifferentiableHigherOrderHMM hmm = new DifferentiableHigherOrderHMM( trainingParameterSet, new String[]{"A","B"} , new int[]{0,1}, new boolean[]{true,true},emissions, true,4.0,

new TransitionElement( null, new int[]{0}, new double[]{4.0} ),

new TransitionElement( new int[]{0}, new int[]{0,1}, new double[]{2.0,2.0} ),

new TransitionElement( new int[]{1}, new int[]{0}, new double[]{4.0} ));

The emissions and transitions of the hidden Markov model are the same as in the TrainableStatisticalModel case. However, we use training parameters for numerical optimization in this example.

A mixture DifferentiableStatisticalModel of two inhomogeneous Markov models can be created by

where the first argument is the recommended number of restarts, and the second argument indicates if plug-in parameters shall be used.

Similar to TrainableStatisticalModel s, we can also define a mixture over the DNA strands with

where the first argument is the component DifferentiableStatisticalModel, the second is the a-priori probability of the forward strand, the third is the recommended number of restarts, the fourth, again, indicates if plug-in parameters are used, and the fifth indicates on which strands the component DifferentiableStatisticalModel is initialized.

We can use this StrandDiffSM together with a homogeneous Markov model as the flanking model to specify a ZOOPS DifferentiableStatisticalModel using the constructor

Here, the first argument specifies that we want to use ZOOPS (as opposed to OOPS, where each sequence contains a motif occurrence), the second is the lenght of the input sequences, the third is the number of restarts, the fourth switches plug-in parameters, the fifth and sixth are the flanking and motif model, respectively, the seventh specifies a position distribution for the motifs (where `null`

defaults to a uniform distribution), and the last indicates if the flanking model is initialized (again).

Sometimes it is useful to model different parts of a sequence by different DifferentiableStatisticalModel s, for instance if we have some structural knowledge about the input sequences. To this end, Jstacs provides a class IndependentProductDiffSM, for which we can specify the models of different parts independently. We create such a IndependentProductDiffSM by calling

where the first parameter is the equivalent sample size, the second switches plug-in parameters, and the remaining are the DifferentiableStatisticalModel s to be combined. Besides this constructor, other constructors allow for specifying the lengths of the parts of a sequence to be modeled by the different DifferentiableStatisticalModel s or the re-use of the same DifferentiableStatisticalModel for different, possibly remote, parts of an input sequence.

Since DifferentiableStatisticalModel s provide methods for computing log-likelihoods, for computing normalization constants, and for determining the gradient with respect to their parameters, they can also be learned generatively by numerical optimization. For this purpose, a wrapper class exists that creates a TrainableStatisticalModel out of any DifferentiableStatisticalModel. We create such a wrapper object with

trainSm.train( data[0] );

where we need to specify the DifferentiableStatisticalModel that shall be learned and the parameters of the numerical optimization (see section Intermediate course: Optimization for details). By this means, we can use any DifferentiableStatisticalModel in the world of TrainableStatisticalModel s as well, for instance to build a TrainSMBasedClassifier (see Second main course: Classifiers).