DataSet

de.jstacs.data
Class DataSet

java.lang.Object
  de.jstacs.data.DataSet

All Implemented Interfaces:: Iterable<Sequence>

Direct Known Subclasses:: DNADataSet

public class DataSet
extends Object
implements Iterable<Sequence>

This is the class for any data set of Sequences. All Sequences in a DataSet have to have the same AlphabetContainer. The Sequences may have different lengths.
For the internal representation the class Sequence is used, where the external alphabet is converted to integral numerical values. The class DataSet knows about this coding via instances of class AlphabetContainer and accordingly Alphabet.

There are different ways to access the elements of a DataSet. If one needs random access, there is the method getElementAt(int). For fast sequential access, it is recommended to use a DataSet.ElementEnumerator.

DataSet is immutable.

Author:: Jens Keilwagen, Andre Gohr, Jan Grau
See Also:: AlphabetContainer, Alphabet, Sequence

Nested Class Summary
`static class`	`DataSet.ElementEnumerator` This class can be used to have a fast sequential access to a `DataSet`.
`static class`	`DataSet.PartitionMethod` This `enum` defines different partition methods for a `DataSet`.
`static class`	`DataSet.WeightedDataSetFactory` This class enables you to eliminate `Sequence`s that occur more than once in one or more `DataSet`s.

Constructor Summary
`DataSet(AlphabetContainer abc, AbstractStringExtractor se)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, int subsequenceLength)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer` and all overlapping windows of length `subsequenceLength`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer` and a delimiter `delim`.
`DataSet(AlphabetContainer abc, AbstractStringExtractor se, String delim, int subsequenceLength)` Creates a new `DataSet` from a `StringExtractor` using the given `AlphabetContainer`, the given delimiter `delim` and all overlapping windows of length `subsequenceLength`.
`DataSet(DataSet s, int subsequenceLength)` Creates a new `DataSet` from a given `DataSet` and a given length `subsequenceLength`.
`DataSet(String annotation, Sequence... seqs)` Creates a new `DataSet` from an array of `Sequence`s and a given annotation.

Method Summary
`static DataSet`	`diff(DataSet data, DataSet... samples)` This method computes the difference between the `DataSet` `data` and the `DataSet`s `samples`.
`Sequence[]`	`getAllElements()` Returns an array of `Sequence`s containing all elements of this `DataSet`.
`AlphabetContainer`	`getAlphabetContainer()` Returns the `AlphabetContainer` of this `DataSet`.
`String`	`getAnnotation()` Returns some annotation of the `DataSet`.
`static String`	`getAnnotation(DataSet... s)` Returns the annotation for an array of `DataSet`s.
`Hashtable<String,HashSet<String>>`	`getAnnotationTypesAndIdentifier()` This method returns all `SequenceAnnotation` types and the corresponding identifier which occur in this `DataSet`.
`double`	`getAverageElementLength()` Returns the average length of all `Sequence`s in this `DataSet`.
`DataSet`	`getCompositeDataSet(int[] starts, int[] lengths)` This method enables you to use only composite `Sequence`s of all elements in the current `DataSet`.
`Sequence`	`getElementAt(int i)` This method returns the element, i.e. the `Sequence`, with index `i`.
`int`	`getElementLength()` Returns the length of the elements, i.e. the `Sequence`s, in this `DataSet`.
`DataSet`	`getInfixDataSet(int start, int length)` This method enables you to use only an infix of all elements, i.e. the `Sequence`s, in the current `DataSet`.
`int`	`getMaximalElementLength()` Returns the maximal length of an element, i.e. a `Sequence`, in this `DataSet`.
`int`	`getMinimalElementLength()` Returns the minimal length of an element, i.e. a `Sequence`, in this `DataSet`.
`int`	`getNumberOfElements()` Returns the number of elements, i.e. the `Sequence`s, in this `DataSet`.
`int`	`getNumberOfElementsWithLength(int len)` Returns the number of overlapping elements that can be extracted.
`double`	`getNumberOfElementsWithLength(int len, double[] weights)` Returns the weighted number of overlapping elements that can be extracted.
`DataSet`	`getReverseComplementaryDataSet()` Returns a `DataSet` that contains the reverse complement of all `Sequence`s in this `DataSet`.
`int[][]`	`getSequenceAnnotationIndexMatrix(String rowType, Hashtable<String,Integer> rowHash, String columnType, Hashtable<String,Integer> columnHash)` This method creates a matrix which contains the index of the `Sequence` with specific `SequenceAnnotation` combination or -1 if the `DataSet` does not contain any `Sequence` with such a combination.
`DataSet`	`getSuffixDataSet(int start)` This method enables you to use only a suffix of all elements, i.e. the `Sequence`s, in the current `DataSet`.
`static DataSet`	`intersection(DataSet... samples)` This method computes the intersection between all elements/`DataSet`s of the array, i.e. it returns a `DataSet` containing only `Sequence`s that are contained in all `DataSet`s of the array.
`boolean`	`isDiscreteDataSet()` This method indicates if all positions use discrete values.
`boolean`	`isSimpleDataSet()` This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.
`Iterator<Sequence>`	`iterator()`
`DataSet[]`	`partition(DataSet.PartitionMethod method, double... percentage)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` in distinct parts where each part holds the corresponding percentage given in the array `percentage`.
`Pair<DataSet[],double[][]>`	`partition(double[] sequenceWeights, DataSet.PartitionMethod method, double... percentage)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array `percentage`.
`Pair<DataSet[],double[][]>`	`partition(double[] sequenceWeights, int k, DataSet.PartitionMethod method)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` and the corresponding weights in `k` distinct parts.
`DataSet[]`	`partition(double p, DataSet.PartitionMethod method, int subsequenceLength)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` in two distinct parts.
`DataSet[]`	`partition(int k, DataSet.PartitionMethod method)` This method partitions the elements, i.e. the `Sequence`s, of the `DataSet` in `k` distinct parts.
`void`	`save(File f)` This method writes the `DataSet` to a file `f`.
`void`	`save(OutputStream stream, char commentChar, SequenceAnnotationParser p)` This method writes all `Sequence`s including their `SequenceAnnotation`s to an `OutputStream`.
`DataSet`	`subSampling(int number)` Randomly samples elements, i.e. `Sequence`s, from the set of all elements, i.e. the `Sequence`s, contained in this `DataSet`.
`String`	`toString()`
`static DataSet`	`union(DataSet... s)` Unites all `DataSet`s of the array `s`.
`static DataSet`	`union(DataSet[] s, boolean[] in)` This method unites all `DataSet`s of the array `s` regarding the array `in`.
`static DataSet`	`union(DataSet[] s, boolean[] in, int subsequenceLength)` This method unites all `DataSet`s of the array `s` regarding the array `in` and sets the element length in the united `DataSet` to `subsequenceLength`.
`static DataSet`	`union(DataSet[] s, int subsequenceLength)` This method unites all `DataSet`s of the array `s` and sets the element length in the united `DataSet` to `subsequenceLength`.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

DataSet

public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se)
        throws WrongAlphabetException,
               EmptyDataSetException,
               WrongLengthException

Creates a new DataSet from a StringExtractor using the given AlphabetContainer.

Parameters:: abc - the AlphabetContainer; se - the StringExtractor
Throws:: WrongAlphabetException - if the AlphabetContainer is not suitable; EmptyDataSetException - if the DataSet would be empty; WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int) )
See Also:: DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

DataSet

public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               int subsequenceLength)
        throws WrongAlphabetException,
               WrongLengthException,
               EmptyDataSetException

Creates a new DataSet from a StringExtractor using the given AlphabetContainer and all overlapping windows of length subsequenceLength.

Parameters:: abc - the AlphabetContainer; se - the StringExtractor; subsequenceLength - the length of the window sliding on the String of se, if subsequenceLength is 0 (zero) then the Sequences are used as given from the StringExtractor
Throws:: WrongAlphabetException - if the AlphabetContainer is not suitable; WrongLengthException - if the subsequence length is not supported; EmptyDataSetException - if the DataSet would be empty
See Also:: DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

DataSet

public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               String delim)
        throws WrongAlphabetException,
               EmptyDataSetException,
               WrongLengthException

Creates a new DataSet from a StringExtractor using the given AlphabetContainer and a delimiter delim.

Parameters:: abc - the AlphabetContainer; se - the StringExtractor; delim - the delimiter for parsing the Strings
Throws:: WrongAlphabetException - if the AlphabetContainer is not suitable; EmptyDataSetException - if the DataSet would be empty; WrongLengthException - never happens (forwarded from DataSet(AlphabetContainer, AbstractStringExtractor, String, int) )
See Also:: DataSet(AlphabetContainer, AbstractStringExtractor, String, int)

DataSet

public DataSet(AlphabetContainer abc,
               AbstractStringExtractor se,
               String delim,
               int subsequenceLength)
        throws EmptyDataSetException,
               WrongAlphabetException,
               WrongLengthException

Creates a new DataSet from a StringExtractor using the given AlphabetContainer, the given delimiter delim and all overlapping windows of length subsequenceLength.

Parameters:: abc - the AlphabetContainer; se - the StringExtractor; delim - the delimiter for parsing the Strings; subsequenceLength - the length of the window sliding on the String of se, if subsequenceLength is 0 (zero) then the Sequences are used as given from the StringExtractor
Throws:: WrongAlphabetException - if the AlphabetContainer is not suitable; EmptyDataSetException - if the DataSet would be empty; WrongLengthException - if the subsequence length is not supported
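
The sliding-window behaviour described for subsequenceLength can be illustrated with a minimal plain-Java sketch. It works on Strings rather than on Jstacs Sequences, and the class and method names (WindowSketch, windows) are hypothetical; it is not the library's implementation, only an illustration of "all overlapping windows" and of the documented convention that 0 means "use the sequence as given".

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of overlapping windows; not part of Jstacs. */
final class WindowSketch {

    /**
     * Returns all overlapping windows of the given length.
     * A length of 0 returns the input unchanged, mirroring the documented convention.
     */
    static List<String> windows(String sequence, int subsequenceLength) {
        if (subsequenceLength < 0 || subsequenceLength > sequence.length()) {
            throw new IllegalArgumentException("unsupported window length: " + subsequenceLength);
        }
        List<String> result = new ArrayList<>();
        if (subsequenceLength == 0) {
            result.add(sequence);
            return result;
        }
        // Slide the window one position at a time over the sequence.
        for (int start = 0; start + subsequenceLength <= sequence.length(); start++) {
            result.add(sequence.substring(start, start + subsequenceLength));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(windows("ACGTA", 3)); // prints [ACG, CGT, GTA]
    }
}
```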

DataSet

public DataSet(DataSet s,
               int subsequenceLength)
        throws WrongLengthException

Creates a new DataSet from a given DataSet and a given length subsequenceLength.
This constructor enables you to use subsequences of the elements of a DataSet.

It can also be used to ensure that all sequences that can be accessed by getElementAt(int) are real objects and do not have to be created at the invocation of the method. (The same holds for the DataSet.ElementEnumerator. In those cases both ways to access the Sequence are approximately equally fast.)

Parameters:: s - the given DataSet; subsequenceLength - the new element length
Throws:: WrongLengthException - if something is wrong with subsequenceLength

DataSet

public DataSet(String annotation,
               Sequence... seqs)
        throws EmptyDataSetException,
               WrongAlphabetException

Creates a new DataSet from an array of Sequences and a given annotation.
This constructor is specially designed for the method StatisticalModel.emitDataSet(int, int...).

Parameters:: annotation - the annotation of the DataSet; seqs - the Sequence(s)
Throws:: EmptyDataSetException - if the array seqs is null or the length is 0; WrongAlphabetException - if the AlphabetContainers do not match

Method Detail

getAnnotation

public static final String getAnnotation(DataSet... s)

Returns the annotation for an array of DataSets.

Parameters:: s - an array of DataSets
Returns:: the annotation
See Also:: getAnnotation()

diff

public static final DataSet diff(DataSet data,
                                 DataSet... samples)
                          throws EmptyDataSetException,
                                 WrongAlphabetException

This method computes the difference between the DataSet data and the DataSets samples.

Parameters:: data - the minuend; samples - the subtrahends
Returns:: the difference
Throws:: WrongAlphabetException - if the AlphabetContainers do not match, i.e., if the DataSets are from different domains; EmptyDataSetException - if the difference is empty

intersection

public static final DataSet intersection(DataSet... samples)
                                  throws IllegalArgumentException,
                                         EmptyDataSetException

This method computes the intersection between all elements/DataSets of the array, i.e. it returns a DataSet containing only Sequences that are contained in all DataSets of the array.

Parameters:: samples - the array of DataSets
Returns:: the intersection of the elements/DataSets in the array
Throws:: IllegalArgumentException - if the elements of the array are from different domains; EmptyDataSetException - if the intersection is empty
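
The set semantics of diff and intersection can be sketched in plain Java as follows. This sketch uses Strings in place of Sequences and hypothetical names (SetOpsSketch); how Jstacs treats duplicate Sequences is not stated here, so the sketch simply collapses them, and it returns an empty set where the documented methods would throw an EmptyDataSetException.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration of difference and intersection; not part of Jstacs. */
final class SetOpsSketch {

    /** Keeps only the elements that occur in every given data set. */
    static Set<String> intersection(List<List<String>> dataSets) {
        if (dataSets.isEmpty()) {
            throw new IllegalArgumentException("at least one data set is required");
        }
        Set<String> result = new LinkedHashSet<>(dataSets.get(0));
        for (List<String> dataSet : dataSets.subList(1, dataSets.size())) {
            result.retainAll(new HashSet<>(dataSet));
        }
        return result;
    }

    /** Removes from data (the minuend) every element found in any subtrahend. */
    static Set<String> diff(List<String> data, List<List<String>> subtrahends) {
        Set<String> result = new LinkedHashSet<>(data);
        for (List<String> subtrahend : subtrahends) {
            result.removeAll(new HashSet<>(subtrahend));
        }
        return result;
    }
}
```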

union

public static final DataSet union(DataSet[] s,
                                  boolean[] in)
                           throws IllegalArgumentException,
                                  EmptyDataSetException

This method unites all DataSets of the array s regarding the array in.

Parameters:: s - the array of DataSets; in - an array indicating which DataSet is used in the union, if in[i]==true the DataSet s[i] is used
Returns:: the united DataSet
Throws:: IllegalArgumentException - if s.length != in.length or the Alphabets do not match; EmptyDataSetException - if the union is empty
See Also:: union(DataSet[], boolean[], int)

union

public static final DataSet union(DataSet... s)
                           throws IllegalArgumentException

Unites all DataSets of the array s.

Parameters:: s - the array of DataSets
Returns:: the united DataSet
Throws:: IllegalArgumentException - if the Alphabets do not match
See Also:: union(DataSet[], boolean[])

union

public static final DataSet union(DataSet[] s,
                                  boolean[] in,
                                  int subsequenceLength)
                           throws IllegalArgumentException,
                                  EmptyDataSetException,
                                  WrongLengthException

This method unites all DataSets of the array s regarding the array in and sets the element length in the united DataSet to subsequenceLength.

Parameters:: s - the array of DataSets; in - an array indicating which DataSet is used in the union, if in[i]==true the DataSet s[i] is used; subsequenceLength - the length of the elements in the united DataSet
Returns:: the united DataSet
Throws:: IllegalArgumentException - if s.length != in.length or the Alphabets do not match; EmptyDataSetException - if the union is empty; WrongLengthException - if the united DataSet does not support this subsequenceLength

union

public static final DataSet union(DataSet[] s,
                                  int subsequenceLength)
                           throws IllegalArgumentException,
                                  WrongLengthException

This method unites all DataSets of the array s and sets the element length in the united DataSet to subsequenceLength.

Parameters:: s - the array of DataSets; subsequenceLength - the length of the elements in the united DataSet
Returns:: the united DataSet
Throws:: IllegalArgumentException - if the Alphabets do not match; WrongLengthException - if the united DataSet does not support this subsequenceLength
See Also:: union(DataSet[], boolean[], int)

getAllElements

public Sequence[] getAllElements()

Returns an array of Sequences containing all elements of this DataSet.

Returns:: all elements (Sequences) of this DataSet
See Also:: DataSet.ElementEnumerator

getAlphabetContainer

public final AlphabetContainer getAlphabetContainer()

Returns the AlphabetContainer of this DataSet.

Returns:: the AlphabetContainer of this DataSet

getAnnotation

public final String getAnnotation()

Returns some annotation of the DataSet.

Returns:: some annotation of the DataSet

getCompositeDataSet

public final DataSet getCompositeDataSet(int[] starts,
                                         int[] lengths)
                                  throws IllegalArgumentException

This method enables you to use only composite Sequences of all elements in the current DataSet. Each composite Sequence will be built from one corresponding Sequence in this DataSet, and all composite Sequences will be returned in a new DataSet.

Parameters:: starts - the start positions of the chunks; lengths - the lengths of the chunks
Returns:: a composite DataSet
Throws:: IllegalArgumentException - if either starts or lengths or both in combination are not suitable
See Also:: Sequence.getCompositeSequence(AlphabetContainer, int[], int[])
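
As a rough illustration of how starts and lengths select chunks, the following plain-Java sketch concatenates the chunks of a String. That a composite Sequence is the concatenation of its chunks in the given order is an assumption of this sketch, and CompositeSketch is a hypothetical name, not a Jstacs class.

```java
/** Plain-Java illustration of composing a sequence from chunks; not part of Jstacs. */
final class CompositeSketch {

    /** Concatenates the chunks [starts[i], starts[i] + lengths[i]) in order. */
    static String composite(String sequence, int[] starts, int[] lengths) {
        if (starts.length != lengths.length) {
            throw new IllegalArgumentException("starts and lengths must have the same size");
        }
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < starts.length; i++) {
            int end = starts[i] + lengths[i];
            if (starts[i] < 0 || lengths[i] < 0 || end > sequence.length()) {
                throw new IllegalArgumentException("chunk " + i + " is out of range");
            }
            result.append(sequence, starts[i], end);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Chunks "AC" (start 0, length 2) and "ACG" (start 4, length 3).
        System.out.println(composite("ACGTACGT", new int[] {0, 4}, new int[] {2, 3})); // prints ACACG
    }
}
```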

getElementAt

public Sequence getElementAt(int i)

This method returns the element, i.e. the Sequence, with index i. See also the comment on the constructor DataSet(DataSet, int).

Parameters:: i - the index of the element, i.e. the Sequence
Returns:: the element, i.e. the Sequence, with index i

getElementLength

public int getElementLength()

Returns the length of the elements, i.e. the Sequences, in this DataSet.

Returns:: the length of the elements, i.e. the Sequences, in this DataSet

getAverageElementLength

public double getAverageElementLength()

Returns the average length of all Sequences in this DataSet.

Returns:: the average length

getInfixDataSet

public final DataSet getInfixDataSet(int start,
                                     int length)
                              throws IllegalArgumentException

This method enables you to use only an infix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.

This method can also be used to create a DataSet of prefixes if the element length is not zero.

Parameters:: start - the start position of the infix; length - the length of the infix, has to be positive
Returns:: a DataSet of the specified infixes
Throws:: IllegalArgumentException - if either start or length or both in combination are not suitable

getReverseComplementaryDataSet

public DataSet getReverseComplementaryDataSet()
                                       throws OperationNotSupportedException

Returns a DataSet that contains the reverse complement of all Sequences in this DataSet.

Returns:: the reverse complements
Throws:: OperationNotSupportedException - if the AlphabetContainer of any of the Sequences in this DataSet is not complementable
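
For readers unfamiliar with the operation, a reverse complement on a plain DNA String can be sketched as below. This is an illustration only: it hard-codes the alphabet A, C, G, T and uses a hypothetical class name, whereas the actual method relies on the AlphabetContainer to decide whether a complement exists.

```java
/** Plain-Java illustration of a DNA reverse complement; not part of Jstacs. */
final class ReverseComplementSketch {

    /** Returns the complementary base, or fails for symbols without a complement. */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'C': return 'G';
            case 'G': return 'C';
            case 'T': return 'A';
            default: throw new IllegalArgumentException("not complementable: " + base);
        }
    }

    /** Reads the sequence backwards and complements every base. */
    static String reverseComplement(String dna) {
        StringBuilder result = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            result.append(complement(dna.charAt(i)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACG")); // prints CGTT
    }
}
```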

getMinimalElementLength

public int getMinimalElementLength()

Returns the minimal length of an element, i.e. a Sequence, in this DataSet.

Returns:: the minimal length of an element, i.e. a Sequence, in this DataSet

getMaximalElementLength

public int getMaximalElementLength()

Returns the maximal length of an element, i.e. a Sequence, in this DataSet.

Returns:: the maximal length of an element, i.e. a Sequence, in this DataSet

getNumberOfElements

public int getNumberOfElements()

Returns the number of elements, i.e. the Sequences, in this DataSet.

Returns:: the number of elements, i.e. the Sequences, in this DataSet

iterator

public Iterator<Sequence> iterator()

Specified by:: iterator in interface Iterable<Sequence>

getNumberOfElementsWithLength

public int getNumberOfElementsWithLength(int len)
                                  throws WrongLengthException

Returns the number of overlapping elements that can be extracted.

Parameters:: len - the length of the elements
Returns:: the number of elements with the specified length
Throws:: WrongLengthException - if the given length is bigger than the minimal element length
See Also:: getNumberOfElementsWithLength(int, double[])

getNumberOfElementsWithLength

public double getNumberOfElementsWithLength(int len,
                                            double[] weights)
                                     throws WrongLengthException,
                                            IllegalArgumentException

Returns the weighted number of overlapping elements that can be extracted.

Parameters:: len - the length of the elements; weights - the weights of each element of the DataSet (see getElementAt(int)), can be null
Returns:: the weighted number of elements with the specified length
Throws:: WrongLengthException - if the given length is bigger than the minimal element length; IllegalArgumentException - if the weights have a wrong dimension
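
The counting behind both variants can be sketched in plain Java under two assumptions that this page does not spell out: a Sequence of length L contributes L - len + 1 overlapping windows, and a weight multiplies the contribution of its Sequence (null weights meaning weight 1 for every Sequence). WindowCountSketch is a hypothetical name, and the sketch throws IllegalArgumentException where the library would throw WrongLengthException.

```java
/** Plain-Java illustration of (weighted) window counting; not part of Jstacs. */
final class WindowCountSketch {

    /**
     * Sums weight * (length - len + 1) over all sequence lengths.
     * A null weights array is treated as weight 1.0 for every sequence.
     */
    static double countWindows(int[] sequenceLengths, int len, double[] weights) {
        if (weights != null && weights.length != sequenceLengths.length) {
            throw new IllegalArgumentException("weights have the wrong dimension");
        }
        double total = 0;
        for (int i = 0; i < sequenceLengths.length; i++) {
            if (len > sequenceLengths[i]) {
                throw new IllegalArgumentException("len exceeds the minimal element length");
            }
            double weight = (weights == null) ? 1.0 : weights[i];
            total += weight * (sequenceLengths[i] - len + 1);
        }
        return total;
    }
}
```

With sequence lengths 5 and 7 and len = 3, the unweighted count is 3 + 5 = 8.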

getSuffixDataSet

public final DataSet getSuffixDataSet(int start)
                               throws IllegalArgumentException

This method enables you to use only a suffix of all elements, i.e. the Sequences, in the current DataSet. The subsequences will be returned in a new DataSet.

Parameters:: start - the start position of the suffix
Returns:: a DataSet of specified suffixes
Throws:: IllegalArgumentException - if start is not suitable

isSimpleDataSet

public final boolean isSimpleDataSet()

This method indicates whether all random variables are defined over the same range, i.e. all positions use the same (fixed) alphabet.

Returns:: true if the DataSet is simple, false otherwise
See Also:: AlphabetContainer.isSimple()

isDiscreteDataSet

public final boolean isDiscreteDataSet()

This method indicates if all positions use discrete values.

Returns:: true if the DataSet is discrete, false otherwise
See Also:: AlphabetContainer.isDiscrete()

partition

public DataSet[] partition(double p,
                           DataSet.PartitionMethod method,
                           int subsequenceLength)
                    throws WrongLengthException,
                           UnsupportedOperationException,
                           EmptyDataSetException

This method partitions the elements, i.e. the Sequences, of the DataSet into two distinct parts. The second part (test set) holds the percentage p, the first part (training set) holds the rest. The first part has the same element length as the current DataSet, the second has element length subsequenceLength, which might be necessary for testing.

Parameters:: p - the percentage for the second part, the second part holds at least this percentage of the full DataSet; method - the method how to partition the DataSet (partitioning criterion); subsequenceLength - the element length of the second part, if 0 (zero) then the sequences are used as given in this DataSet
Returns:: the array of partitioned DataSets
Throws:: WrongLengthException - if something is wrong with subsequenceLength; UnsupportedOperationException - if the DataSet is not simple; EmptyDataSetException - if at least one of the created partitions is empty
See Also:: DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS, partition(PartitionMethod, double...)

partition

public DataSet[] partition(DataSet.PartitionMethod method,
                           double... percentage)
                    throws IllegalArgumentException,
                           EmptyDataSetException

This method partitions the elements, i.e. the Sequences, of the DataSet in distinct parts where each part holds the corresponding percentage given in the array percentage.

Parameters:: method - the method how to partition the DataSet (partitioning criterion); percentage - the array of percentages for each "subsample"
Returns:: the array of partitioned DataSets
Throws:: IllegalArgumentException - if the percentages are not valid (the sum is not 1 or a value is not in [0,1]); EmptyDataSetException - if at least one of the created partitions is empty
See Also:: DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS
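
A percentage-based split by number of elements can be sketched in plain Java as follows. The shuffling step, the rounding of part boundaries and the tolerance of 1e-9 are assumptions of this sketch, not documented behaviour; PartitionSketch is a hypothetical name, and IllegalStateException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Plain-Java illustration of a percentage-based partition; not part of Jstacs. */
final class PartitionSketch {

    /** Splits the elements into disjoint parts whose sizes follow the percentages. */
    static List<List<String>> partition(List<String> elements, Random rng, double... percentage) {
        double sum = 0;
        for (double p : percentage) {
            if (!(p >= 0 && p <= 1)) {
                throw new IllegalArgumentException("percentage not in [0,1]: " + p);
            }
            sum += p;
        }
        if (!(Math.abs(sum - 1.0) <= 1e-9)) {
            throw new IllegalArgumentException("percentages must sum to 1");
        }

        // Assumption: elements are assigned to parts in random order.
        List<String> shuffled = new ArrayList<>(elements);
        Collections.shuffle(shuffled, rng);

        List<List<String>> parts = new ArrayList<>();
        int n = shuffled.size();
        int begin = 0;
        double cumulative = 0;
        for (int i = 0; i < percentage.length; i++) {
            cumulative += percentage[i];
            // The last part always ends at n so that no element is lost to rounding.
            int end = (i == percentage.length - 1) ? n : (int) Math.round(cumulative * n);
            if (end <= begin) {
                throw new IllegalStateException("partition " + i + " would be empty");
            }
            parts.add(new ArrayList<>(shuffled.subList(begin, end)));
            begin = end;
        }
        return parts;
    }
}
```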

partition

public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
                                            DataSet.PartitionMethod method,
                                            double... percentage)
                                     throws IllegalArgumentException,
                                            EmptyDataSetException

This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in distinct parts where each part holds the corresponding percentage given in the array percentage.

Parameters:: sequenceWeights - the weights for the sequences (might be null); method - the method how to partition the DataSet (partitioning criterion); percentage - the array of percentages for each "subsample"
Returns:: a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:: IllegalArgumentException - if the percentages are not valid (the sum is not 1 or a value is not in [0,1]); EmptyDataSetException - if at least one of the created partitions is empty
See Also:: DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

partition

public DataSet[] partition(int k,
                           DataSet.PartitionMethod method)
                    throws IllegalArgumentException,
                           EmptyDataSetException

This method partitions the elements, i.e. the Sequences, of the DataSet in k distinct parts.

Parameters:: k - the number of distinct parts; method - the method how to partition the DataSet (partitioning criterion)
Returns:: the array of partitioned DataSets
Throws:: IllegalArgumentException - if k is not correct; EmptyDataSetException - if at least one of the created partitions is empty
See Also:: DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

partition

public Pair<DataSet[],double[][]> partition(double[] sequenceWeights,
                                            int k,
                                            DataSet.PartitionMethod method)
                                     throws IllegalArgumentException,
                                            EmptyDataSetException

This method partitions the elements, i.e. the Sequences, of the DataSet and the corresponding weights in k distinct parts.

Parameters:: sequenceWeights - the weights for the sequences (might be null); k - the number of distinct parts; method - the method how to partition the DataSet (partitioning criterion)
Returns:: a Pair containing an array of partitioned DataSets and an array of partitioned sequence weights
Throws:: IllegalArgumentException - if k is not correct; EmptyDataSetException - if at least one of the created partitions is empty
See Also:: DataSet.PartitionMethod, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_ELEMENTS, DataSet.PartitionMethod.PARTITION_BY_NUMBER_OF_SYMBOLS

subSampling

public DataSet subSampling(int number)
                    throws EmptyDataSetException

Randomly samples elements, i.e. Sequences, from the set of all elements, i.e. the Sequences, contained in this DataSet.
Depending on whether this DataSet is chosen to contain overlapping elements (windows of length subsequenceLength) or not, those elements (overlapping windows, whole sequences) are subsampled.

Parameters:: number - the number of Sequences that should be drawn from the contained set of Sequences (with replacement)
Returns:: a new DataSet containing the drawn Sequences
Throws:: EmptyDataSetException - if number is not positive
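
Drawing with replacement, as mentioned for the number parameter, can be sketched in plain Java like this. The use of java.util.Random and the names SubSamplingSketch and subSample are assumptions made for the illustration; IllegalArgumentException stands in for EmptyDataSetException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Plain-Java illustration of sampling with replacement; not part of Jstacs. */
final class SubSamplingSketch {

    /** Draws the requested number of elements; the same element may be drawn repeatedly. */
    static List<String> subSample(List<String> elements, int number, Random rng) {
        if (number <= 0) {
            throw new IllegalArgumentException("number must be positive");
        }
        if (elements.isEmpty()) {
            throw new IllegalArgumentException("cannot sample from an empty list");
        }
        List<String> drawn = new ArrayList<>(number);
        for (int i = 0; i < number; i++) {
            drawn.add(elements.get(rng.nextInt(elements.size())));
        }
        return drawn;
    }
}
```

Because elements are drawn with replacement, number may exceed the number of available elements.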

save

public final void save(File f)
                throws IOException

This method writes the DataSet to a file f.

Parameters:: f - the File
Throws:: IOException - if something went wrong with the file
See Also:: save(OutputStream, char, SequenceAnnotationParser)

save

public final void save(OutputStream stream,
                       char commentChar,
                       SequenceAnnotationParser p)
                throws IOException

This method writes all Sequences including their SequenceAnnotations to an OutputStream. The SequenceAnnotations are parsed using the SequenceAnnotationParser.

Parameters:: stream - the stream which is used to write the DataSet; commentChar - the character that marks comment lines; p - the parser for the SequenceAnnotations of the Sequences
Throws:: IOException - if something went wrong while writing into the stream.
See Also:: SequenceAnnotationParser.parseAnnotationToComment(char, SequenceAnnotation...)

toString

public String toString()

Overrides:: toString in class Object

getAnnotationTypesAndIdentifier

public Hashtable<String,HashSet<String>> getAnnotationTypesAndIdentifier()

This method returns all SequenceAnnotation types and the corresponding identifier which occur in this DataSet.

Returns:: a Hashtable with key = SequenceAnnotation type and value = the set of SequenceAnnotation identifiers of that type
See Also:: SequenceAnnotation

getSequenceAnnotationIndexMatrix

public int[][] getSequenceAnnotationIndexMatrix(String rowType,
                                                Hashtable<String,Integer> rowHash,
                                                String columnType,
                                                Hashtable<String,Integer> columnHash)

This method creates a matrix which contains the index of the Sequence with specific SequenceAnnotation combination or -1 if the DataSet does not contain any Sequence with such a combination. The rows and columns are indexed according to the Hashtables.

Here is a short example, how to interpret the returned matrix:

 int[][] matrix = s.getSequenceAnnotationIndexMatrix( rowType, rowHash, columnType, columnHash );
 
 if( matrix[i][j] < 0 ) {
        System.out.println( "There is no Sequence in the DataSet with this SequenceAnnotation combination");
 } else {
        System.out.println( "This is the Sequence: " + s.getElementAt( matrix[i][j] ) );
 }

Parameters:: rowType - the SequenceAnnotation type for the rows; rowHash - a Hashtable of SequenceAnnotation identifier and indices for the rows; columnType - the SequenceAnnotation type for the columns; columnHash - a Hashtable of SequenceAnnotation identifier and indices for the columns
Returns:: a matrix with the indices of the Sequences with each specific combination of SequenceAnnotation for rowType and columnType and -1 if this combination does not exist in the DataSet
See Also:: getAnnotationTypesAndIdentifier(), ToolBox.parseHashSet2IndexHashtable(HashSet)
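
How such an index matrix could be filled is sketched below in plain Java. The sketch represents each Sequence only by its row and column identifier, assumes that the hashes map identifiers to indices 0..size-1, and uses a hypothetical class name; it is an illustration of the -1 convention, not the library's code.

```java
import java.util.Arrays;
import java.util.Map;

/** Plain-Java illustration of an annotation index matrix; not part of Jstacs. */
final class IndexMatrixSketch {

    /**
     * rowIds[i] and columnIds[i] are the identifiers of element i.
     * Cells without a matching element keep the value -1.
     */
    static int[][] indexMatrix(String[] rowIds, String[] columnIds,
                               Map<String, Integer> rowHash, Map<String, Integer> columnHash) {
        int[][] matrix = new int[rowHash.size()][columnHash.size()];
        for (int[] row : matrix) {
            Arrays.fill(row, -1);
        }
        for (int i = 0; i < rowIds.length; i++) {
            Integer r = rowHash.get(rowIds[i]);
            Integer c = columnHash.get(columnIds[i]);
            if (r != null && c != null) {
                matrix[r][c] = i; // remember which element carries this combination
            }
        }
        return matrix;
    }
}
```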
