Dataset

A Dataset is an abstract representation of a sequence of data. It transforms records loaded from a Loader into Jubatus Datum objects, using the data type information defined in a Schema. A Dataset is constructed from a Loader and a Schema.

from jubakit.classifier import Schema, Dataset

loader = ...
schema = Schema( ... )

dataset = Dataset(loader, schema)

Some Services provide additional ways to construct a Dataset. For example, jubakit.classifier.Dataset provides from_array and from_matrix, which are convenient when using datasets generated by scikit-learn. Unless otherwise noted, the features described in this section are implemented in jubakit.base.BaseDataset, the base Dataset class for all Services.

Static and Non-static Datasets

By default, all records are loaded from the Loader into memory when a Dataset instance is created. Such a Dataset is called a Static Dataset. If you want to load records from the Loader one by one instead of loading everything first, specify the static option to create a Non-static Dataset.

dataset = Dataset(loader, schema, static=False)

Note that some features like index-based record access cannot be used over non-static Datasets.

Datasets constructed from infinite Loaders are non-static by default. You cannot specify static=True when using infinite Loaders.
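The difference between the two modes can be pictured without Jubakit at all. The following is a conceptual sketch in plain Python, not Jubakit's implementation; load_records is a hypothetical stand-in for a Loader. The static style drains the source into a list up front, so its length is known and any index can be read. The non-static style pulls records lazily, so only the record currently being iterated is available.

```python
def load_records():
    """Hypothetical stand-in for a Loader: yields records one by one."""
    for i in range(3):
        yield {'k1': 'row-%d' % i, 'k2': i}

# Static style: every record is read into memory up front,
# so the length is known and any index can be accessed.
static_records = list(load_records())
second = static_records[1]

# Non-static style: records are pulled lazily during iteration;
# nothing is kept unless the caller stores it explicitly.
streamed = []
for record in load_records():
    streamed.append(record['k2'])
```

The same reasoning suggests why an infinite source cannot be used in the static style: draining it into a list would never finish.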

Schema Prediction

If you don’t specify a Schema when constructing a Dataset, the Schema is predicted automatically from the first record of the Dataset.

>>> dataset = Dataset(loader)
>>> print(dataset.get_schema())
{'types': {'k1': 's', 'k2': 'n'}, 'fallback_type': None, 'keys': {'k1': 'k1', 'k2': 'k2'}}

The types section shows the predicted Schema. In the example above, the k1 and k2 columns are typed as STRING and NUMBER respectively.
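To make the idea of predicting types from a single record concrete, here is a simplified sketch in plain Python. It is an illustration only: predict_types is a hypothetical helper, and the rules Jubakit actually applies (for example to missing or binary values) may differ. The type codes 's' and 'n' are the ones shown in the output above.

```python
import numbers

def predict_types(record):
    """Guess a column type for each key of a single record.

    Hypothetical helper: 'n' for numeric values, 's' for everything else.
    """
    types = {}
    for key, value in record.items():
        # bool is a subclass of int in Python; this sketch does not
        # count it as a number.
        is_number = isinstance(value, numbers.Number) and not isinstance(value, bool)
        types[key] = 'n' if is_number else 's'
    return types

first_record = {'k1': 'hello world', 'k2': 5}
predicted = predict_types(first_record)
```

Because only the first record is inspected, a column whose first value is unrepresentative would be typed incorrectly under this approach; passing an explicit Schema avoids that.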

Accessing Records

You can access a raw record (i.e., a record as loaded from the Loader) using the get method.

>>> dataset.get(1)
{'k1': 'hello world', 'k2': 5}

You can access a transformed record (i.e., a Datum) using the index operator. You don’t need this in most cases, though.

>>> dataset[1]
(None, None, <jubatus.common.datum.Datum object at 0xdeadbeef>)
>>> print(dataset[1][2])
datum{string_values: [['k1', 'hello world']], num_values: [['k2', 5.0]], binary_values: []}

You can create a subset of a Dataset by passing a slice or an array of integer indices to the index operator.

>>> dataset2 = dataset[1:3]
>>> type(dataset2)
<class 'jubakit.anomaly.Dataset'>
>>> len(dataset2)
2

This makes it easy to use the cross-validation modules of scikit-learn. The following code applies KFold to a Dataset instance named dataset. It creates two new Dataset instances, ds_train and ds_test, both of which are subsets of dataset.

>>> from sklearn.model_selection import KFold
>>> for train, test in KFold(n_splits=2).split(range(len(dataset))):
...   (ds_train, ds_test) = (dataset[train], dataset[test])

Note that non-static Datasets do not support random access. During iteration, the get method can return only the raw record at the index currently being iterated.
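Cross-validation splitters such as KFold hand back arrays of integer positions, which is why index-array subsetting matters. The sketch below shows one way a container can support both slices and index lists. RecordSubset is a hypothetical class written for illustration, not Jubakit's Dataset, and it handles only plain Python integers and sequences of them.

```python
class RecordSubset(object):
    """Hypothetical container mimicking slice and index-array subsetting."""

    def __init__(self, records):
        self._records = list(records)

    def __len__(self):
        return len(self._records)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice yields a new container over the selected range.
            return RecordSubset(self._records[index])
        if isinstance(index, int):
            # A single integer yields one record.
            return self._records[index]
        # Anything else is treated as a sequence of integer positions.
        return RecordSubset(self._records[i] for i in index)

data = RecordSubset({'k2': i} for i in range(4))
by_slice = data[1:3]        # records at positions 1 and 2
by_indices = data[[0, 3]]   # records at positions 0 and 3
```

Both forms return a new container of the same type, mirroring the behaviour described above, where a subset of a Dataset is itself a Dataset.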

Transformation

Static Datasets can be transformed in bulk by a user-defined lambda function using the convert method.

>>> import random
>>> shuffled_dataset = dataset.convert(lambda x: random.sample(x, len(x)))

The lambda function must take one argument, a list of raw records to be processed, and must not modify that list. The result becomes another Dataset instance.

For convenience, the Dataset class provides a shuffle method, which shuffles the order of records.
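The requirement that the function leave its input list untouched is easy to get wrong. For instance, random.shuffle reorders a list in place and returns None, so it is unsuitable as written, whereas random.sample returns a new list. The sketch below checks that property on a plain list of records; shuffled_copy is a hypothetical helper name, and no Jubakit objects are involved.

```python
import random

def shuffled_copy(records):
    """Return the records in random order without touching the input list."""
    return random.sample(records, len(records))

records = [{'k2': i} for i in range(5)]
snapshot = list(records)   # remember the original order

result = shuffled_copy(records)
```

Here records still holds its original order afterwards, and result is a separate list containing the same records.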

Persisting Datasets

You can use the standard pickle module to persist a Dataset instance. Note that a pickled Dataset instance may fail to unpickle under a different version of Jubakit.
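A pickle round trip follows the usual dump and load pattern. The sketch below uses a plain dictionary as a stand-in for a Dataset instance and an in-memory buffer instead of a file, so that it runs without Jubakit; with a real Dataset, the object passed to pickle.dump would be the Dataset itself.

```python
import io
import pickle

# Stand-in for a Dataset instance (an assumption made for this sketch).
stand_in = {'records': [{'k1': 'hello world', 'k2': 5}]}

buffer = io.BytesIO()            # a file opened in binary mode works the same way
pickle.dump(stand_in, buffer)    # serialize the object
buffer.seek(0)                   # rewind before reading back
restored = pickle.load(buffer)   # deserialize into a new, equal object
```

As with any pickle data, only load files that come from a trusted source.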