Starter: Data handling
In Jstacs, data is organized at three levels:
- Alphabet s for defining single symbols, and AlphabetContainer s for defining aggregate alphabets,
- Sequence s for defining sequences of symbols over a given alphabet,
- DataSet s for defining sets of sequences.
Sequences are implemented as numerical values. In case of discrete sequences over some symbolic alphabet,
the symbols are mapped to contiguous discrete values starting at 0, which can be mapped back to the original symbols
using the alphabet. This mapping is also used for the
toString() method, e.g., for printing a sequence. The actual
data type, i.e. byte, short, or integer, used to represented the symbols is chosen internally depending on the size of
the alphabet. Alphabet s, Sequence s, and DataSet s are immutable for reasons of security and data consistency. That means, an instance of those
classes cannot be modified once it has been created.
Since Jstacs is tailored at sequence analysis in bioinformatics, the most prominent alphabet is the DNAAlphabet, which is a singleton instance that can be accessed by:
For general discrete alphabets, i.e., any kind of categorical data, you can use a DiscreteAlphabet. Such an alphabet can be constructed in case-sensitive and insensitive variants (first argument) using a list of symbols. In this example, we create a case-sensitive alphabet with symbols "W", "S", "w", and "x":
If you rather want to define an alphabet over contiguous discrete numerical values, you can do so by calling a constructor that takes the minimum and maximum value of the desired range, and defines the alphabet as all integer values between minimum and maximum (inclusive). For example, to create a discrete alphabet over the values from 3 to 10, you can call
Continuous alphabets are defined over all reals (minus infinity to infinity) by default (see first line in the following example). However, if you want to define the continuous alphabet over a specific interval, you can specify the maximum and the minimum value of that interval. In the example, we define a continuous alphabet spanning all reals between 0 and 100:
ContinuousAlphabet continuousPos = new ContinuousAlphabet( 0.0, 100.0 );
For the DNA alphabet, each symbols has a complementary counterpart. Since in some cases, a similar complementarity can also be defined for symbols other than DNA-nucleotides (e.g., for RNA sequences containing U instead of T), Jstacs allows to define generic complementable alphabets. These allow for example the generation of reverse complementary sequences out of an existing sequence. Here, we define a binary alphabet of symbols "A" and "B", where "A" is the complement of "B" and vice versa.
The first parameter again defines if this alphabet is case-insensitive (which is the case), the second parameter defines the symbols of the alphabet, and the third parameter specifies the index of the complementary symbol. For instance, the symbol at position 1 ("B") is set as the complement of the symbol at position 0 ("A") by setting the 0-th value of the integer array to 1.
After the definition of single alphabets, we switch to the creation of aggregate alphabets. Almost everywhere in Jstacs, we use aggregate alphabets to maintain generalizability. Since the aggregate alphabet containing only a DNAAlphabet is always the same, a singleton for such an AlphabetContainer is pre-defined:
We can explicitly define an AlphabetContainer using a simple continuous alphabet by calling:
Aggregate alphabets become interesting if we need different symbols at different positions of a sequence, or even a mixture of discrete and continuous values. For example, we might want to represent sequences that consist of a DNA-nucleotide at the first position, some other discrete symbol at the second position, and a real number stemming from some measurement at the third position. Using the DNAAlphabet, the discrete Alphabet, and the continuous Alphabet defined above, we can define such an aggregate alphabet by calling
To save memory, we can also re-use the same alphabet at different position of the aggregate alphabet. If we want to use a DNAAlphabet at positions 0, 1, and 3, and a continuous alphabet at positions 2, 4, and 5, we can use a constructor that takes the alphabets as the first argument and the assignment to the positions as the second argument:
The alphabets are assigned to specific positions by their index in the array of the first argument.
Single sequences can be created from an AlphabetContainer and a string. However, in most cases, we load the data from some file, which will be explained in the next sub-section. For creating a DNA sequence, we use a DNAAlphabet like the one defined above and a string over the DNA alphabet:
In a similar manner, we define a continuous sequence. In this case, a single value is represented by more than one letter in the string. Hence, we need to define a delimiter between the values as a third argument, which is a space in the example.
We can also create sequences over the mixed alphabet defined above. In the example, the single values are delimited by a ";".
For very large amounts of data or very long sequences, even the representation of symbols by byte values can be too memory-consuming. Hence, Jstacs also offers a representation of DNA sequences in a sparse encoding as bits of long values. You can create such a SparseSequence from a DNAAlphabet and a string:
However, the reduced memory footprint comes at the expense of a slightly increased runtime for accessing symbols of a SparseSequence. Hence, it is not the default representation in Jstacs.
After we learned how to create sequences, we now want to work with them. First of all, you can obtain the length of a sequence
Since on the abstract level of Sequence we do not distinguish between discrete and continuous sequences (and we also may have mixed sequences), there are two alternative methods to obtain one element of a sequence regardless of its content. With the first method, we can obtain the discrete value at a certain position (2 in the example):
If the Sequence contains a continuous value at this position, it is discretized by default by returning the distance to the minimum value of the continuous alphabet at this position casted to an integer. If the Sequence contains a discrete value, that value is just returned in the encoding according to the AlphabetContainer. In a similar manner, we can obtain the continuous value at a position (5 in the example)
where discrete values are just casted to
We can obtain a sub-sequence of a Sequence using the method
getSubSequence(int,int), where the first parameter is the start position within the sequence, counting from 0, and the second
parameter is the length of the extracted sub-sequence. So the following line of code would extract a sub-sequence of length 3 starting at position 2 of the original sequence or, stated differently,
we skip the first two elements, extract the following three elements, and again skip everything after position 4.
Since Sequence s in Jstacs are immutable, this method returns a new instance of Sequence, which is assigned to a variable
contSub in the example. Hence, in cases where you need
the same sub-sequences frequently in your code, for example in a ZOOPS-model or other models using sliding windows on a Sequence, we recommend to precompute these sub-sequences and store them in some auxiliary data structure in order to invest runtime in computations rather than garbage collection. Internally, sub-sequences only hold a reference on the original sequences and the start position and length within that sequence to keep the memory overhead of sub-sequences at a low level.
For Sequence s defined over a ComplementableDiscreteAlphabet like the DNAAlphabet, we can also obtain the (reverse) complement of a sequence. For example, to create the reverse complementary sequence of a complete sequence, we call
For the complement of a sub-sequence of length 6 starting at position 3 of the original sequence, we use
For some analyses, for instance permutation tests or for estimating false-positive rates of predictions, it is useful to create permuted variants of an original sequence. To this end, Jstacs provides a class PermutedSequence that creates a randomly permuted variant using the constructor
or a user-defined permutation by an alternative constructor. In the randomized variant, the positions of the original sequence are permuted independently of each other, which means that higher order properties of the sequence like di-nucleotide content are not preserved. If you want to create sequences with similar higher-order properties, have a look at the
emitSample() method of HomogeneousModel.
Often, we want to add additional annotations to a sequence, for instance the occurrences of some binding motif, start and end positions of introns, or just the species a sequence is stemming from. To this end, Jstacs provides a number of SequenceAnnotation s that can be added to a Sequence (or read from a FastA-file as we will see later).
For instance, we can add the annotation for binding site of a motif called "new motif" of length 5 starting at position 3 of the forward strand of sequence
dnaSeq using the
annotate method of that sequence:
For retrieving annotations of a specific type, we can use the method
to, for instance, obtain the first (index 0) annotation of type "Motif".
In most cases, we want to load Sequence s from some FastA or plain text file instead of creating Sequence s manually from strings. In Jstacs, collections of Sequence s are represented by DataSet s. The class DataSet (and DNADataSet) provide constructors that work on a file or the path to a file, and parse the contents of the file to a DataSet, i.e. a collection of Sequence s.
For other file formats and types of Sequence s, DataSet provides another constructor that works on the AlphabetContainer for the data in the file, a StringExtractor that handles the extraction of the strings representing single sequences and skipping comment lines, and a delimiter between the elements of a sequence. Hence, the StringExtractor, a SparseStringExtractor in the example, requires the specification of the path to the file and the symbol that indicates comment line. For example, if we want to create a sample of continuous sequences stored in a tab-separated plain text file "myfile.tab", we use the AlphabetContainer with a continuous Alphabet from above, a StringExtractor with a hash as the comment symbol, and a tab as the delimiter:
Since SparseSequence s are not one of the default representations of sequence in Jstacs (see above), these are not created by the constructors of DataSet or DNADataSet. However, the class SparseSequence provides a method
getSample that takes the same arguments as the constructor of DataSet, for example
After we successfully created a DataSet, we want to access and use the Sequence s within this DataSet. We retrieve a Sequence of a DataSet using the method
getElementAt(int). For instance,
we get the fifth Sequence of
dnaSample by calling
We can also request the number of Sequence s in a DataSet by the method
getNumberOfElements() and use this information, for instance, to
iterate over all Sequence s. In the example, we just print the retrieved Sequence s to standard out
System.out.println(dnaDataSet.getElementAt( i ));
where the Sequence s are printed in their original alphabet since their
toString() method is overridden accordingly.
As an alternative to the iteration by explicit calls to these methods, DataSet also implements the
Iterable interface, which facilitates the Java variant
of foreach-loops as in the following example:
Here, we just print the length of each Sequence in
contSample to standard out.
We can also apply some of the sequence-level operations to all Sequence s of a DataSet, and obtain a new DataSet containing the modified sequences. For example, we get a DataSet containing the sub-sequences of length 10 starting at position 3 of each sequence by calling
a DataSet of all suffixes starting at position 7 from
For cross-validation experiments, hold-out samplings, or similar procedures (cf. de/jstacs/classifier/assessment/package-summary.html package assessment), it is useful to partition a sample randomly. DataSet s in Jstacs support two types of partitionings. The first is to partition a DataSet into k almost equally sized parts. What is "equally sized" can either be determined by the number of sequences or by the number of symbols of all sequences in a sample. Both measures are supported by Jstacs.
The second partitioning method creates partitions of a user-defined fraction of the original sample. For example, we partition the DataSet
dnaSample into five equally sized parts according to the number of sequences in that DataSet by calling
and we partition the same sample into parts containing 10, 20, and 70 percent of the symbols of the original DataSet by calling
In both cases, the Sequence s in the DataSet are partitioned as atomic elements. That means, a Sequence is not cut into several parts to obtain exactly equally sized parts, but the size of a part may slightly (depending on the number of sequences and lengths of those sequences) differ from the specified percentages.
To create a new DataSet that contains all sub-sequences of a user-defined length of the original Sequence s, we can use another constructor of DataSet. The sub-sequences are extracted in the same manner as we would do by shifting a sliding window over each sequence, extracting the sub-sequence under this window, and build a new DataSet of the extracted sub-sequences. For instance, we obtain a DataSet with all sub-sequences of length 8 using
In the previous sub-section, we learned how to add SequenceAnnotation s to a Sequence. Often, we want to use the annotation that is already present in an input file, for example the comment line of a FastA file. We can do so by specifying a SequenceAnnotationParser in the constructor of the DataSet. The simplest type of SequenceAnnotationParser †is the SimpleSequenceAnnotationParser, which just extracts the complete comment line preceding a sequence.
Although the specification of the parser is quite simple, the extraction of the comment line as a string is a bit lengthy. We first obtain the Sequence from the DataSet, get the annotation of that sequence, obtain the first comment, called "result" in the hierarchy of Jstacs (you see in Intermediate course: XMLParser, Parameters, and Results, why), and convert the corresponding result object to a string.
If your comment line is defined in a "key-value" format with some generic delimiter between entries, you can Jstacs let parse the entries to distinct annotations. For instance, if the comment line has some format
key1=value1; key2=value2;..., we can parse that comment line using the SplitSequenceAnnotationParser. This parser only requires the specification of the delimiter between key and value ("=" in the example) and the delimiter between different entries (";" in the example). Like before, we instantiate a SplitSequenceAnnotationParser as the last argument of the DNADataSet constructor:
We can now access all parsed annotations by the
or, for instance, the
getSequenceAnnotationByType introduced in the previous section, where the type corresponds to the key in the comment line, and the identifier of the retrieved SequenceAnnotation is identical to the value for that key in the comment line.
Jstacs only supports FastA and plain text files directly. However, you can access other formats or even databases like Genbank using an adaptor to BioJava.
For example, we can use BioJava to load two sequences from Genbank.
SequenceIterator dbIterator = new SimpleSequenceIterator(
db.getRichSequence( "NC_001284" ),
db.getRichSequence( "NC_000932" )
As a result, we obtain a
RichSequenceIterator, which implements the
SequenceIterator interface of BioJava. We can use a
SequenceIterator in an adaptor method to obtain a Jstacs DataSet including converted annotations:
The second argument of the method allows for filtering for specific annotations using a BioJava
Vice versa, we can convert a Jstacs DataSet to a BioJava
SequenceIterator by an analogous adaptor method:
By means of these two methods, we can use all BioJava facilities for loading and storing data from and to diverse file formats and loading data from data bases in our Jstacs applications.