Starter: Data handling


In Jstacs, data is organized at three levels:

  • Alphabets for defining single symbols, and AlphabetContainers for defining aggregate alphabets,
  • Sequences for defining sequences of symbols over a given alphabet,
  • DataSets for defining sets of sequences.

Sequences are implemented as numerical values. In the case of discrete sequences over some symbolic alphabet, the symbols are mapped to contiguous discrete values starting at [math]0[/math], which can be mapped back to the original symbols using the alphabet. This mapping is also used by the toString() method, e.g., for printing a sequence. The actual data type used to represent the symbols, i.e., byte, short, or int, is chosen internally depending on the size of the alphabet. Alphabets, Sequences, and DataSets are immutable for reasons of security and data consistency. That is, an instance of those classes cannot be modified once it has been created.
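The mapping between symbols and contiguous integer codes can be illustrated in plain Java. This is only a sketch of the idea, not the Jstacs implementation: the class SymbolEncodingSketch, its methods, and the symbol order A, C, G, T are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Jstacs: symbols are mapped to
// contiguous codes starting at 0 and mapped back using the same alphabet.
final class SymbolEncodingSketch {

    // Encode each character of the sequence by its index in the alphabet.
    static int[] encode(List<String> alphabet, String sequence) {
        int[] codes = new int[sequence.length()];
        for (int i = 0; i < sequence.length(); i++) {
            int code = alphabet.indexOf(String.valueOf(sequence.charAt(i)));
            if (code < 0) {
                throw new IllegalArgumentException("Symbol not in alphabet: " + sequence.charAt(i));
            }
            codes[i] = code;
        }
        return codes;
    }

    // Map the codes back to the original symbols, as a toString() method would.
    static String decode(List<String> alphabet, int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(alphabet.get(code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> dna = Arrays.asList("A", "C", "G", "T"); // assumed symbol order
        int[] codes = encode(dna, "ACGT");
        System.out.println(Arrays.toString(codes)); // prints [0, 1, 2, 3]
        System.out.println(decode(dna, codes));     // prints ACGT
    }
}
```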

Alphabets

Since Jstacs is tailored to sequence analysis in bioinformatics, the most prominent alphabet is the DNAAlphabet, which is a singleton instance that can be accessed by:

DNAAlphabet dna = DNAAlphabet.SINGLETON;

For general discrete alphabets, i.e., any kind of categorical data, you can use a DiscreteAlphabet. Such an alphabet can be constructed in case-sensitive and case-insensitive variants (first argument) from a list of symbols. In this example, we create a case-sensitive alphabet (first argument false) with symbols "W", "S", "w", and "x":

DiscreteAlphabet discrete = new DiscreteAlphabet( false, "W", "S", "w", "x" );

If you rather want to define an alphabet over contiguous discrete numerical values, you can do so by calling a constructor that takes the minimum and maximum value of the desired range, and defines the alphabet as all integer values between minimum and maximum (inclusive). For example, to create a discrete alphabet over the values from [math]3[/math] to [math]10[/math], you can call

DiscreteAlphabet numerical = new DiscreteAlphabet( 3, 10 );

Continuous alphabets are defined over all reals (minus infinity to infinity) by default (see first line in the following example). However, if you want to define the continuous alphabet over a specific interval, you can specify the minimum and the maximum value of that interval. In the example, we define a continuous alphabet spanning all reals between [math]0[/math] and [math]100[/math]:

ContinuousAlphabet continuousInf = new ContinuousAlphabet();
ContinuousAlphabet continuousPos = new ContinuousAlphabet( 0.0, 100.0 );


For the DNA alphabet, each symbol has a complementary counterpart. Since in some cases, a similar complementarity can also be defined for symbols other than DNA nucleotides (e.g., for RNA sequences containing U instead of T), Jstacs allows you to define generic complementable alphabets. These allow, for example, the generation of reverse complementary sequences from an existing sequence. Here, we define a binary alphabet of symbols "A" and "B", where "A" is the complement of "B" and vice versa.

GenericComplementableDiscreteAlphabet complementable = new GenericComplementableDiscreteAlphabet( true, new String[]{"A","B"}, new int[]{1,0} );

The first parameter again defines if this alphabet is case-insensitive (which is the case), the second parameter defines the symbols of the alphabet, and the third parameter specifies the index of the complementary symbol. For instance, the symbol at position [math]1[/math] ("B") is set as the complement of the symbol at position 0 ("A") by setting the [math]0[/math]-th value of the integer array to [math]1[/math].
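How such an index array can drive the computation of a reverse complement is sketched below in plain Java. The class ComplementSketch and its method are hypothetical and only illustrate the idea on integer codes; they are not the Jstacs implementation.

```java
// Hypothetical illustration, not part of Jstacs.
final class ComplementSketch {

    // complementIndex[i] holds the code of the complement of the symbol with code i.
    static int[] reverseComplement(int[] codes, int[] complementIndex) {
        int[] result = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            // Read the original from its end and replace each symbol by its complement.
            result[i] = complementIndex[codes[codes.length - 1 - i]];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] symbols = {"A", "B"};
        int[] complementIndex = {1, 0}; // "A" <-> "B", as in the text
        int[] codes = {0, 0, 1};        // the sequence "AAB"
        StringBuilder sb = new StringBuilder();
        for (int code : reverseComplement(codes, complementIndex)) {
            sb.append(symbols[code]);
        }
        System.out.println(sb); // prints ABB
    }
}
```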

After the definition of single alphabets, we switch to the creation of aggregate alphabets. Almost everywhere in Jstacs, we use aggregate alphabets to maintain generalizability. Since the aggregate alphabet containing only a DNAAlphabet is always the same, a singleton for such an AlphabetContainer is pre-defined:

AlphabetContainer dnaContainer = DNAAlphabetContainer.SINGLETON;

We can explicitly define an AlphabetContainer using a simple continuous alphabet by calling:

AlphabetContainer contContainer = new AlphabetContainer( continuousInf );


Aggregate alphabets become interesting if we need different symbols at different positions of a sequence, or even a mixture of discrete and continuous values. For example, we might want to represent sequences that consist of a DNA-nucleotide at the first position, some other discrete symbol at the second position, and a real number stemming from some measurement at the third position. Using the DNAAlphabet, the discrete Alphabet, and the continuous Alphabet defined above, we can define such an aggregate alphabet by calling

AlphabetContainer mixedContainer = new AlphabetContainer(dna, discrete, continuousPos);

To save memory, we can also re-use the same alphabet at different positions of the aggregate alphabet. If we want to use a DNAAlphabet at positions [math]0[/math], [math]1[/math], and [math]3[/math], and a continuous alphabet at positions [math]2[/math], [math]4[/math], and [math]5[/math], we can use a constructor that takes the alphabets as the first argument and the assignment to the positions as the second argument:

AlphabetContainer complex = new AlphabetContainer( new Alphabet[]{dna,continuousInf}, new int[]{0,0,1,0,1,1} );

The alphabets are assigned to specific positions by their index in the array of the first argument.
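The lookup implied by the assignment array can be written down in a few lines of plain Java. This is a sketch with hypothetical names (PositionAssignmentSketch, alphabetAt), in which alphabets are represented by their variable names as strings rather than by Jstacs objects.

```java
// Hypothetical illustration, not part of Jstacs.
final class PositionAssignmentSketch {

    // assignment[position] is an index into the array of alphabets.
    static String alphabetAt(String[] alphabets, int[] assignment, int position) {
        return alphabets[assignment[position]];
    }

    public static void main(String[] args) {
        String[] alphabets = {"dna", "continuousInf"};
        int[] assignment = {0, 0, 1, 0, 1, 1}; // same assignment as in the text
        for (int pos = 0; pos < assignment.length; pos++) {
            System.out.println(pos + " -> " + alphabetAt(alphabets, assignment, pos));
        }
    }
}
```

Only two alphabet objects exist in this sketch, although six positions are described, which is where the memory saving comes from.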

Sequences

Single sequences can be created from an AlphabetContainer and a string. However, in most cases, we load the data from some file, which will be explained in the next sub-section. For creating a DNA sequence, we use the AlphabetContainer for DNA defined above and a string over the DNA alphabet:

Sequence dnaSeq = Sequence.create( dnaContainer, "ACGTACGTACGT" );

In a similar manner, we define a continuous sequence. In this case, a single value is represented by more than one letter in the string. Hence, we need to define a delimiter between the values as a third argument, which is a space in the example.

Sequence contSeq = Sequence.create( contContainer, "0.5 1.32642 99.5 20.4 5 7.7" , " " );

We can also create sequences over the mixed alphabet defined above. In the example, the single values are delimited by a ";".

Sequence mixedSeq = Sequence.create( mixedContainer, "C;x;5.67" , ";" );

For very large amounts of data or very long sequences, even the representation of symbols by byte values can be too memory-consuming. Hence, Jstacs also offers a representation of DNA sequences in a sparse encoding as bits of long values. You can create such a SparseSequence from the AlphabetContainer for DNA and a string:

Sequence sparse = new SparseSequence( dnaContainer, "ACGTACGTACGT" );

However, the reduced memory footprint comes at the expense of a slightly increased runtime for accessing symbols of a SparseSequence. Hence, it is not the default representation in Jstacs.

After we learned how to create sequences, we now want to work with them. First of all, you can obtain the length of a sequence from its getLength() method:

int length = dnaSeq.getLength();

Since on the abstract level of Sequence we do not distinguish between discrete and continuous sequences (and we also may have mixed sequences), there are two alternative methods to obtain one element of a sequence regardless of its content. With the first method, we can obtain the discrete value at a certain position (2 in the example):

int value = dnaSeq.discreteVal( 2 );

If the Sequence contains a continuous value at this position, it is discretized by default by returning the distance to the minimum value of the continuous alphabet at this position, cast to an integer. If the Sequence contains a discrete value, that value is just returned in the encoding according to the AlphabetContainer. In a similar manner, we can obtain the continuous value at a position (5 in the example)

double value2 = contSeq.continuousVal( 5 );

where discrete values are just cast to doubles.
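The two conversions described above amount to simple arithmetic, as the following plain-Java sketch shows. The class DiscretizationSketch is hypothetical, and the sketch assumes a continuous alphabet with a finite minimum; it is not the Jstacs code.

```java
// Hypothetical illustration, not part of Jstacs.
final class DiscretizationSketch {

    // Distance to the minimum of the alphabet; the cast truncates toward zero.
    static int discretize(double value, double min) {
        return (int) (value - min);
    }

    // A discrete code is widened to a double without further conversion.
    static double toContinuous(int code) {
        return (double) code;
    }

    public static void main(String[] args) {
        System.out.println(discretize(5.67, 0.0)); // prints 5
        System.out.println(toContinuous(2));       // prints 2.0
    }
}
```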

We can obtain a sub-sequence of a Sequence using the method getSubSequence(int,int), where the first parameter is the start position within the sequence, counting from [math]0[/math], and the second parameter is the length of the extracted sub-sequence. So the following line of code would extract a sub-sequence of length [math]3[/math] starting at position [math]2[/math] of the original sequence or, stated differently, we skip the first two elements, extract the following three elements, and again skip everything after position [math]4[/math].

Sequence contSub = contSeq.getSubSequence( 2, 3 );

Since Sequences in Jstacs are immutable, this method returns a new instance of Sequence, which is assigned to a variable contSub in the example. Hence, in cases where you need the same sub-sequences frequently in your code, for example in a ZOOPS-model or other models using sliding windows on a Sequence, we recommend precomputing these sub-sequences and storing them in some auxiliary data structure in order to invest runtime in computations rather than garbage collection. Internally, sub-sequences only hold a reference to the original sequence together with the start position and length within that sequence, which keeps the memory overhead of sub-sequences low.
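The idea of a sub-sequence that stores only a reference, a start position, and a length can be sketched in plain Java as follows. The class SubSequenceViewSketch is hypothetical and works on an int array instead of a Jstacs Sequence.

```java
// Hypothetical illustration, not part of Jstacs: a sub-sequence as a view
// on its parent, without copying any symbols.
final class SubSequenceViewSketch {
    private final int[] parent; // shared with the original, never copied
    private final int start;
    private final int length;

    SubSequenceViewSketch(int[] parent, int start, int length) {
        if (start < 0 || length < 0 || start + length > parent.length) {
            throw new IllegalArgumentException("Sub-sequence out of range");
        }
        this.parent = parent;
        this.start = start;
        this.length = length;
    }

    int getLength() {
        return length;
    }

    // Positions are relative to the start of the sub-sequence.
    int discreteVal(int position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("Position " + position);
        }
        return parent[start + position];
    }

    public static void main(String[] args) {
        int[] codes = {0, 1, 2, 3, 0, 1};
        SubSequenceViewSketch sub = new SubSequenceViewSketch(codes, 2, 3);
        for (int i = 0; i < sub.getLength(); i++) {
            System.out.print(sub.discreteVal(i) + " "); // prints 2 3 0
        }
        System.out.println();
    }
}
```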

For Sequences defined over a ComplementableDiscreteAlphabet like the DNAAlphabet, we can also obtain the (reverse) complement of a sequence. For example, to create the reverse complementary sequence of a complete sequence, we call

Sequence revComp = dnaSeq.reverseComplement();

For the complement of a sub-sequence of length [math]6[/math] starting at position [math]3[/math] of the original sequence, we use

Sequence subComp = dnaSeq.complement( 3, 6 );


For some analyses, for instance permutation tests or for estimating false-positive rates of predictions, it is useful to create permuted variants of an original sequence. To this end, Jstacs provides a class PermutedSequence that creates a randomly permuted variant using the constructor

PermutedSequence permuted = new PermutedSequence( dnaSeq );

or a user-defined permutation by an alternative constructor. In the randomized variant, the positions of the original sequence are permuted independently of each other, which means that higher order properties of the sequence like di-nucleotide content are not preserved. If you want to create sequences with similar higher-order properties, have a look at the emitSample() method of HomogeneousModel.
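A position-wise random permutation of this kind can be sketched with a Fisher-Yates shuffle in plain Java. The class PermutationSketch is hypothetical and is not the code behind PermutedSequence; it only shows why the symbol composition is preserved while the order of neighbouring symbols is not.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration, not part of Jstacs.
final class PermutationSketch {

    // Fisher-Yates shuffle on a copy, so the original array stays untouched.
    static int[] permute(int[] codes, Random random) {
        int[] copy = codes.clone();
        for (int i = copy.length - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            int tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        int[] codes = {0, 1, 2, 3, 0, 1, 2, 3};
        System.out.println(Arrays.toString(permute(codes, new Random())));
    }
}
```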

Often, we want to add annotations to a sequence, for instance the occurrences of some binding motif, start and end positions of introns, or just the species a sequence stems from. To this end, Jstacs provides a number of SequenceAnnotations that can be added to a Sequence (or read from a FastA file, as we will see later). For instance, we can add the annotation for a binding site of a motif called "new motif" of length [math]5[/math] starting at position [math]3[/math] of the forward strand of sequence dnaSeq using the annotate method of that sequence:

Sequence annotatedDnaSeq = dnaSeq.annotate( true, new MotifAnnotation( "new motif", 3, 5, Strand.FORWARD ) );

Again, this method creates a new Sequence object because Sequences are immutable. After we have added several SequenceAnnotations to a Sequence, we can obtain all those annotations by calling

SequenceAnnotation[] allAnnotations = annotatedDnaSeq.getAnnotation();

For retrieving annotations of a specific type, we can use the method getSequenceAnnotationByType

MotifAnnotation motif = (MotifAnnotation) annotatedDnaSeq.getSequenceAnnotationByType( "Motif", 0 );

to, for instance, obtain the first (index [math]0[/math]) annotation of type "Motif".

DataSets

In most cases, we want to load Sequences from some FastA or plain text file instead of creating Sequences manually from strings. In Jstacs, collections of Sequences are represented by DataSets. The classes DataSet and DNADataSet provide constructors that work on a file or the path to a file, and parse the contents of the file to a DataSet, i.e., a collection of Sequences.

The simplest case is to create a DNADataSet from a FastA file. To do so, we call the constructor of DNADataSet with the (absolute or relative) path to the FastA file:

DNADataSet dnaDataSet = new DNADataSet( "myfile.fa" );

For other file formats and types of Sequences, DataSet provides another constructor that works on the AlphabetContainer for the data in the file, a StringExtractor that handles the extraction of the strings representing single sequences and the skipping of comment lines, and a delimiter between the elements of a sequence. The StringExtractor, a SparseStringExtractor in the example, in turn requires the path to the file and the symbol that indicates a comment line. For example, if we want to create a DataSet of continuous sequences stored in a tab-separated plain text file "myfile.tab", we use the AlphabetContainer with a continuous Alphabet from above, a StringExtractor with a hash as the comment symbol, and a tab as the delimiter:

DataSet contDataSet = new DataSet( contContainer, new SparseStringExtractor( "myfile.tab", '#' ), "\t" );

The SparseStringExtractor is tailored to files containing many sequences, and reads the file line by line, where each line is converted to a Sequence and discarded before the next line is parsed.
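The line-by-line strategy can be sketched in plain Java for the tab-separated case. The class LineParsingSketch is hypothetical and does not reproduce the SparseStringExtractor; it assumes one sequence per line and treats every line starting with the comment symbol as a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Jstacs.
final class LineParsingSketch {

    // Converts one line at a time; the line itself is not kept after conversion.
    static List<double[]> parse(Reader source, char comment, String delimiter) {
        List<double[]> sequences = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty() || line.charAt(0) == comment) {
                    continue; // skip empty lines and comment lines
                }
                String[] fields = line.split(Pattern.quote(delimiter));
                double[] values = new double[fields.length];
                for (int i = 0; i < fields.length; i++) {
                    values[i] = Double.parseDouble(fields[i]);
                }
                sequences.add(values);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sequences;
    }

    public static void main(String[] args) {
        String content = "# measurements\n0.5\t1.5\n2.0\t3.0\t4.0\n";
        List<double[]> parsed = parse(new StringReader(content), '#', "\t");
        System.out.println(parsed.size()); // prints 2
    }
}
```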

Since SparseSequences are not one of the default representations of sequences in Jstacs (see above), these are not created by the constructors of DataSet or DNADataSet. However, the class SparseSequence provides a method getDataSet that takes the same arguments as the constructor of DataSet, for example

DataSet sparseDataSet = SparseSequence.getDataSet( dnaContainer, new SparseStringExtractor( "myfile.fa", '>' ) );

for reading DNA sequences from a FastA file, and returns a DataSet containing SparseSequences.

After we have successfully created a DataSet, we want to access and use the Sequences within this DataSet. We retrieve a Sequence of a DataSet using the method getElementAt(int). Since indices start at [math]0[/math], we get the fifth Sequence of dnaDataSet by calling

Sequence fifth = dnaDataSet.getElementAt( 4 );

We can also request the number of Sequences in a DataSet by the method getNumberOfElements() and use this information, for instance, to iterate over all Sequences. In the example, we just print the retrieved Sequences to standard out

for(int i=0;i<dnaDataSet.getNumberOfElements();i++){
	System.out.println(dnaDataSet.getElementAt( i ));
}

where the Sequences are printed in their original alphabet since their toString() method is overridden accordingly.

As an alternative to the iteration by explicit calls to these methods, DataSet also implements the Iterable interface, which facilitates the Java variant of foreach-loops as in the following example:

for(Sequence seq : contDataSet){
	System.out.println(seq.getLength());
}

Here, we just print the length of each Sequence in contDataSet to standard out.

We can also apply some of the sequence-level operations to all Sequences of a DataSet, and obtain a new DataSet containing the modified sequences. For example, we get a DataSet containing the sub-sequences of length [math]10[/math] starting at position [math]3[/math] of each sequence by calling

DataSet infix = dnaDataSet.getInfixDataSet( 3, 10 );

a DataSet of all suffixes starting at position [math]7[/math] from

DataSet suffix = dnaDataSet.getSuffixDataSet( 7 );

or a DataSet containing all reverse complementary Sequences using

DataSet allRevComplements = dnaDataSet.getReverseComplementaryDataSet();


For cross-validation experiments, hold-out samplings, or similar procedures (cf. package de.jstacs.classifier.assessment), it is useful to partition a data set randomly. DataSets in Jstacs support two types of partitioning. The first is to partition a DataSet into [math]k[/math] almost equally sized parts. What "equally sized" means can be determined either by the number of sequences or by the number of symbols of all sequences in a data set. Both measures are supported by Jstacs.

The second partitioning method creates parts that each comprise a user-defined fraction of the original data set. As an example of the first method, we partition the DataSet dnaDataSet into five equally sized parts according to the number of sequences in that DataSet by calling

DataSet[] fiveParts = dnaDataSet.partition( 5, PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS );

and, as an example of the second method, we partition the same data set into parts containing [math]10[/math], [math]20[/math], and [math]70[/math] percent of the symbols of the original DataSet by calling

DataSet[] randParts = dnaDataSet.partition( PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS, 0.1, 0.2, 0.7 );

In both cases, the Sequences in the DataSet are partitioned as atomic elements. That is, a Sequence is not cut into several parts to obtain exactly equally sized parts. Instead, the size of a part may differ slightly from the specified sizes or percentages, depending on the number of sequences and their lengths.
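A random partition into [math]k[/math] parts by number of elements, with sequences kept whole, can be sketched in plain Java as follows. The class PartitionSketch is hypothetical, represents sequences as strings, and uses a simple shuffle followed by round-robin assignment; Jstacs may proceed differently internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not part of Jstacs.
final class PartitionSketch {

    // Splits the sequences into k parts whose sizes differ by at most one.
    static List<List<String>> partition(List<String> sequences, int k, Random random) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        List<String> shuffled = new ArrayList<>(sequences);
        Collections.shuffle(shuffled, random);
        List<List<String>> parts = new ArrayList<>();
        for (int p = 0; p < k; p++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < shuffled.size(); i++) {
            parts.get(i % k).add(shuffled.get(i)); // each sequence stays atomic
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("ACGT", "AC", "GGTA", "T", "CCGA", "TTAG", "AAC");
        for (List<String> part : partition(data, 3, new Random())) {
            System.out.println(part.size() + " sequences: " + part);
        }
    }
}
```

With seven sequences and three parts, the parts contain three, two, and two sequences, which illustrates why the parts are only almost equally sized.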

To create a new DataSet that contains all sub-sequences of a user-defined length of the original Sequences, we can use another constructor of DataSet. The sub-sequences are extracted in the same manner as we would do by shifting a sliding window over each sequence, extracting the sub-sequence under this window, and building a new DataSet of the extracted sub-sequences. For instance, we obtain a DataSet with all sub-sequences of length [math]8[/math] using

DataSet sliding = new DataSet( dnaDataSet, 8 );
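The sliding-window extraction can be sketched on plain strings. The class SlidingWindowSketch is hypothetical; in this sketch, sequences shorter than the window simply contribute nothing, which is an assumption and may differ from the behaviour of the DataSet constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Jstacs.
final class SlidingWindowSketch {

    // Collects every sub-sequence of the given length from every sequence.
    static List<String> windows(List<String> sequences, int length) {
        List<String> result = new ArrayList<>();
        for (String sequence : sequences) {
            for (int start = 0; start + length <= sequence.length(); start++) {
                result.add(sequence.substring(start, start + length));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A sequence of length 10 yields 10 - 8 + 1 = 3 windows of length 8.
        System.out.println(windows(Arrays.asList("ACGTACGTAC"), 8));
        // prints [ACGTACGT, CGTACGTA, GTACGTAC]
    }
}
```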


In the previous sub-section, we learned how to add SequenceAnnotations to a Sequence. Often, we want to use the annotation that is already present in an input file, for example the comment line of a FastA file. We can do so by specifying a SequenceAnnotationParser in the constructor of the DataSet. The simplest type of SequenceAnnotationParser is the SimpleSequenceAnnotationParser, which just extracts the complete comment line preceding a sequence.

DNADataSet dnaWithComments = new DNADataSet( "myfile.fa", '>', new SimpleSequenceAnnotationParser() );

Although the specification of the parser is quite simple, the extraction of the comment line as a string is a bit lengthy. We first obtain the Sequence from the DataSet, get the annotation of that sequence, obtain the first comment, called "result" in the hierarchy of Jstacs (see Intermediate course: XMLParser, Parameters, and Results for the reason), and convert the corresponding result object to a string.

String comment = dnaWithComments.getElementAt( 0 ).getAnnotation()[0].getResultAt( 0 ).getValue().toString();


If your comment line is defined in a "key-value" format with some generic delimiter between entries, you can let Jstacs parse the entries to distinct annotations. For instance, if the comment line has some format key1=value1; key2=value2;..., we can parse that comment line using the SplitSequenceAnnotationParser. This parser only requires the specification of the delimiter between key and value ("=" in the example) and the delimiter between different entries (";" in the example). Like before, we instantiate a SplitSequenceAnnotationParser as the last argument of the DNADataSet constructor:

DNADataSet dnaWithParsedComments = new DNADataSet( "myfile.fa", '>', new SplitSequenceAnnotationParser("=",";") );

We can now access all parsed annotations by the getAnnotation() method

SequenceAnnotation[] allAnnotations2 = dnaWithParsedComments.getElementAt( 0 ).getAnnotation();

or, for instance, the getSequenceAnnotationByType method introduced in the previous section, where the type corresponds to the key in the comment line, and the identifier of the retrieved SequenceAnnotation is identical to the value for that key in the comment line.
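How a comment line in key-value format decomposes into entries can be sketched in plain Java. The class CommentParsingSketch is hypothetical and returns a simple map instead of SequenceAnnotation objects; details such as trimming whitespace and skipping malformed entries are assumptions of this sketch, not documented behaviour of the SplitSequenceAnnotationParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of Jstacs.
final class CommentParsingSketch {

    // Splits the comment into entries and each entry into key and value.
    static Map<String, String> parse(String comment, String keyValueDelimiter, String entryDelimiter) {
        Map<String, String> annotations = new LinkedHashMap<>();
        for (String entry : comment.split(Pattern.quote(entryDelimiter))) {
            String trimmed = entry.trim();
            int idx = trimmed.indexOf(keyValueDelimiter);
            if (trimmed.isEmpty() || idx < 0) {
                continue; // assumption: entries without a key-value delimiter are skipped
            }
            String key = trimmed.substring(0, idx).trim();
            String value = trimmed.substring(idx + keyValueDelimiter.length()).trim();
            annotations.put(key, value);
        }
        return annotations;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse("species=human; motif=TATA;", "=", ";");
        System.out.println(parsed); // prints {species=human, motif=TATA}
    }
}
```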

Jstacs only supports FastA and plain text files directly. However, you can access other formats or even databases like Genbank using an adaptor to BioJava.

For example, we can use BioJava to load two sequences from Genbank.

GenbankRichSequenceDB db = new GenbankRichSequenceDB();

SequenceIterator dbIterator = new SimpleSequenceIterator( 
		db.getRichSequence( "NC_001284" ), 
		db.getRichSequence( "NC_000932" ) 
);

As a result, we obtain a RichSequenceIterator, which implements the SequenceIterator interface of BioJava. We can use a SequenceIterator in an adaptor method to obtain a Jstacs DataSet including converted annotations:

DataSet fromBioJava = BioJavaAdapter.sequenceIteratorToDataSet( dbIterator, null );

The second argument of the method allows for filtering for specific annotations using a BioJava FeatureFilter.

Vice versa, we can convert a Jstacs DataSet to a BioJava SequenceIterator by an analogous adaptor method:

SequenceIterator backFromJstacs = BioJavaAdapter.dataSetToSequenceIterator( fromBioJava, true );


By means of these two methods, we can use all BioJava facilities for loading and storing data from and to diverse file formats and for loading data from databases in our Jstacs applications.