Quick start: Jstacs in a nutshell

This section is for the impatient who would like to start using Jstacs right away, without reading the complete cookbook. If you do not belong to this group, you can skip this section.

Here, we provide code snippets for frequently needed, simple tasks, including reading a data set and creating models and classifiers. In addition, some of the basic code examples in section Recipes may also serve as a basis for a quick start into Jstacs.

For reading a FastA file, we call the constructor of DNADataSet with the (absolute or relative) path to the FastA file. Subsequently, we can determine the alphabets used.

DNADataSet ds = new DNADataSet( args[0] );
AlphabetContainer con = ds.getAlphabetContainer();
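
For accessing the individual sequences of the data set, we can use its getNumberOfElements and getElementAt methods; a minimal sketch:

// iterate over all sequences of the data set and print them
for( int i = 0; i < ds.getNumberOfElements(); i++ ) {
	Sequence seq = ds.getElementAt( i );
	System.out.println( seq );
}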

For more detailed information about data sets, sequences, and alphabets, we refer to section Starter: Data handling.

Statistical models and classifiers using generative learning principles

In Jstacs, statistical models that use generative learning principles to infer their parameters implement the interface TrainableStatisticalModel. For convenience, Jstacs provides the TrainableStatisticalModelFactory, which allows for easily creating various simple models. Creating, for instance, a PWM model takes just one line of code.

// PWM for sequences of the length found in ds; the last parameter is the equivalent sample size (ESS)
TrainableStatisticalModel pwm = TrainableStatisticalModelFactory.createPWM( con, ds.getElementLength(), 4 );

Similarly, other models, including inhomogeneous Markov models, permuted Markov models, Bayesian networks, homogeneous Markov models, and ZOOPS models, can be created using the TrainableStatisticalModelFactory, while hidden Markov models can be created using the HMMFactory.
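
For example, a homogeneous Markov model could be created in close analogy to the PWM above; a minimal sketch, assuming a factory method createHomogeneousMarkovModel taking the alphabet, the ESS, and the model order as parameters:

// homogeneous Markov model of order 1 (factory method signature assumed as described above)
TrainableStatisticalModel hMM = TrainableStatisticalModelFactory.createHomogeneousMarkovModel( con, 4, (byte) 1 );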

Given some model pwm, we can directly infer the model parameters based on some data set ds using the train method.

pwm.train( ds );

After the model has been trained, it can be used to score sequences using the getLogProbFor methods. More information about the interface TrainableStatisticalModel can be found in section First main course: SequenceScores#TrainableStatisticalModels.
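
For instance, we can compute the log-probability of the first sequence of the data set; a minimal sketch:

// score the first sequence of the data set under the trained model
Sequence seq = ds.getElementAt( 0 );
double logProb = pwm.getLogProbFor( seq );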

Based on a set of TrainableStatisticalModels, for instance two PWM models, we can build a classifier.


AbstractClassifier cl = new TrainSMBasedClassifier( pwm, pwm );
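
Such a classifier must be trained before it can be used, where we need one data set per class. A minimal sketch, assuming two data sets dsFg and dsBg for the foreground and the background class:

// train the classifier on one data set per class
cl.train( dsFg, dsBg );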


Further statistical models and classifiers

Sometimes, we may want to learn statistical models by other learning principles that require numerical optimization. For this purpose, Jstacs provides the interface DifferentiableStatisticalModel and the factory DifferentiableStatisticalModelFactory in close analogy to TrainableStatisticalModel and TrainableStatisticalModelFactory (cf. First main course: SequenceScores#DifferentiableStatisticalModels). Creating a classifier using two PWM models and the maximum supervised posterior learning principle can be accomplished by calling


// differentiable PWM model for sequences of length 10 with ESS 4
DifferentiableStatisticalModel pwm = DifferentiableStatisticalModelFactory.createPWM( con, 10, 4 );

// parameter set of the classifier, specifying, among others, the alphabet, the sequence length, the numerical optimization algorithm and its termination thresholds, the initialization of the parameters, and the number of threads
GenDisMixClassifierParameterSet pars = new GenDisMixClassifierParameterSet( con, 10, (byte) 10, 1E-9, 1E-10, 1, false, KindOfParameter.PLUGIN, true, 1 );

// classifier using the maximum supervised posterior (MSP) learning principle
AbstractClassifier cl = new MSPClassifier( pars, pwm, pwm );
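
This classifier is trained in the same way as the generative classifier above, i.e., by calling its train method with one data set per class, which here triggers the numerical optimization of the model parameters.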


Using classifiers

Based on statistical models, we can build classifiers as we have seen in the previous subsections. The main functionality is predicting the class label of a sequence and assessing the performance of a classifier. For these tasks, Jstacs provides the methods classify and evaluate, respectively.

For classifying a sequence, we just call

byte res = cl.classify( seq );

on a trained classifier. The method returns a numerical class label, starting from [math]0[/math], where the class labels are in the same order as the data sets provided for training.
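
The sequence seq may stem from a data set or may be created from a plain String; a minimal sketch, assuming a classifier trained on sequences of length 10:

// create a sequence over the given alphabet and classify it
Sequence seq = Sequence.create( con, "ACGTACGTAC" );
byte res = cl.classify( seq );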


For evaluating the performance of a classifier, we need to compute some performance measures. For convenience, Jstacs provides a standard set of measures including point measures and areas under curves (cf. Second main course: Classifiers#Performance_measures). Based on such measures, we can directly determine the performance of the classifier.

NumericalPerformanceMeasureParameterSet params = PerformanceMeasureParameterSet.createFilledParameters();
System.out.println( cl.evaluate(params, true, data) );

Here, true indicates that an Exception should be thrown if a measure could not be computed, and data is an array of data sets, where the index within this array encodes the class.

For assessing the performance of a classifier using some repeated procedure of training and testing, Jstacs provides the class ClassifierAssessment (cf. Second main course: Classifiers#Assessment).
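
For instance, a 10-fold cross-validation of the classifier could look as follows; a minimal sketch, assuming the sub-class KFoldCrossValidation and the parameter set KFoldCrossValidationAssessParameterSet with the constructor signature used below:

// 10-fold cross-validation of the classifier (class and constructor signatures assumed as described above)
KFoldCrossValidation cv = new KFoldCrossValidation( cl );
KFoldCrossValidationAssessParameterSet cvPars = new KFoldCrossValidationAssessParameterSet( DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, ds.getElementLength(), true, 10 );
System.out.println( cv.assess( params, cvPars, data ) );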