Dataset
=======

Dataset is an abstract representation of a sequence of data that transforms records loaded from Loader into Jubatus Datum using data type information defined in Schema.
Dataset can be constructed from Loader and Schema.

.. code-block:: python

  from jubakit.classifier import Schema, Dataset

  loader = ...
  schema = Schema( ... )

  dataset = Dataset(loader, schema)

Some Services provide additional ways to construct a Dataset.
For example, :py:class:`jubakit.classifier.Dataset` provides :py:func:`from_array <jubakit.classifier.Dataset.from_array>` and :py:func:`from_matrix <jubakit.classifier.Dataset.from_matrix>`, which are convenient when using datasets generated by scikit-learn.
Unless otherwise noted, the features described in this section are implemented in :py:class:`jubakit.base.BaseDataset`, which is the base Dataset class for all Services.

Static and Non-static Datasets
------------------------------

By default, all records are loaded from the Loader into memory when a Dataset instance is created.
Such Datasets are called *Static Datasets*.
If you want to load records one by one from the Loader instead of loading everything up front, specify ``static=False`` to create a *Non-static Dataset*.

.. code-block:: python

  dataset = Dataset(loader, schema, static=False)

Note that some features, such as index-based record access, cannot be used with non-static Datasets.

Datasets constructed from infinite Loaders are non-static by default.
You cannot specify ``static=True`` when using infinite Loaders.
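
The difference between the two modes can be illustrated without Jubakit.
The following is a minimal sketch in plain Python; ``EagerRecords``, ``LazyRecords`` and ``count_up`` are hypothetical names made up for this illustration and are not part of the Jubakit API.
It only shows why eager loading allows index access while lazy loading is the only option for an infinite source.

.. code-block:: python

  import itertools


  class EagerRecords:
      """Hypothetical stand-in for a static Dataset: loads everything up front."""

      def __init__(self, source):
          self._records = list(source)  # consumes the whole source immediately

      def __len__(self):
          return len(self._records)

      def __getitem__(self, index):
          return self._records[index]


  class LazyRecords:
      """Hypothetical stand-in for a non-static Dataset: reads records on demand."""

      def __init__(self, source):
          self._source = iter(source)  # nothing is read yet

      def __iter__(self):
          return self._source  # single pass, no index access


  def count_up():
      # An endless source, comparable to an infinite Loader.
      n = 0
      while True:
          yield {'k2': n}
          n += 1


  eager = EagerRecords([{'k2': 0}, {'k2': 1}, {'k2': 2}])
  lazy = LazyRecords(count_up())

  # Only take what is needed; list(count_up()) would never finish.
  first_three = list(itertools.islice(lazy, 3))

In this sketch ``eager[1]`` works because every record is already in memory, whereas ``lazy`` offers iteration only.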

Schema Prediction
-----------------

If you don't specify a Schema when constructing a Dataset, the Schema is automatically predicted from the first record of the Dataset.

.. code-block:: python

  >>> dataset = Dataset(loader)
  >>> print(dataset.get_schema())
  {'types': {'k1': 's', 'k2': 'n'}, 'fallback_type': None, 'keys': {'k1': 'k1', 'k2': 'k2'}}

The ``types`` section shows the predicted Schema.
In the example above, the ``k1`` and ``k2`` columns are typed as ``STRING`` and ``NUMBER``, respectively.
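
To make the idea of predicting types from the first record concrete, here is a minimal sketch in plain Python.
``predict_types`` is a hypothetical helper written for this illustration; it is not Jubakit's actual prediction logic, and the rule it uses (numbers map to ``'n'``, everything else to ``'s'``) is an assumption chosen to reproduce the output shown above.

.. code-block:: python

  import numbers


  def predict_types(record):
      """Guess a type code for each key of a single record (illustrative only)."""
      types = {}
      for key, value in record.items():
          # Assumed rule: numeric values become 'n', anything else becomes 's'.
          # Note that bool counts as a number under this simple check.
          if isinstance(value, numbers.Number):
              types[key] = 'n'
          else:
              types[key] = 's'
      return types


  first_record = {'k1': 'hello world', 'k2': 5}
  predicted = predict_types(first_record)

Because only one record is inspected, a key that is absent from that record cannot appear in the result of this sketch.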

Accessing Records
-----------------

You can access a raw record (i.e., a record as loaded from the Loader) using the ``get`` method.

.. code-block:: python

  >>> dataset.get(1)
  {'k1': 'hello world', 'k2': 5}

You can access a transformed record (i.e., a Datum) using the index operator.
You don't need this in most cases, though.

.. code-block:: python

  >>> dataset[1]
  (None, None, <jubatus.common.datum.Datum object at 0xdeadbeef>)
  >>> print(dataset[1][2])
  datum{string_values: [['k1', 'hello world']], num_values: [['k2', 5.0]], binary_values: []}

You can create a subset of a Dataset using the index operator with a slice or a numeric array.

.. code-block:: python

  >>> dataset2 = dataset[1:3]
  >>> type(dataset2)
  <class 'jubakit.anomaly.Dataset'>
  >>> len(dataset2)
  2

This allows you to use `cross-validation modules of scikit-learn <http://scikit-learn.org/stable/modules/cross_validation.html>`_ easily.
The following code shows how to apply ``KFold`` to a Dataset instance named ``dataset``.
It creates two new Dataset instances, ``ds_train`` and ``ds_test``, both of which are subsets of ``dataset``.

.. code-block:: python

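  >>> # sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
  >>> from sklearn.model_selection import KFold
  >>> for train, test in KFold(n_splits=2).split(range(len(dataset))):
  ...   (ds_train, ds_test) = (dataset[train], dataset[test])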

Note that non-static Datasets do not support random access.
During iteration, the ``get`` method can only return the raw record at the index currently being iterated.
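
The indexing behaviour described in this section (an integer yields one record, while a slice or an array of indices yields a new subset) can be sketched in plain Python.
``Records`` is a hypothetical list-backed class written for this illustration, not the Jubakit implementation, and the fold indices are written by hand to stand in for what a splitter such as ``KFold`` would yield.

.. code-block:: python

  class Records:
      """Hypothetical container that mimics index-based subsetting."""

      def __init__(self, records):
          self._records = list(records)

      def __len__(self):
          return len(self._records)

      def __getitem__(self, index):
          if isinstance(index, slice):
              # A slice produces a new container of the same type.
              return Records(self._records[index])
          if isinstance(index, int):
              # A single integer produces one record.
              return self._records[index]
          # Anything else is treated as an iterable of integer positions.
          return Records([self._records[i] for i in index])


  records = Records([{'k2': i} for i in range(4)])

  # Hand-written (train, test) index pairs for two folds over four records.
  folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
  sizes = [(len(records[train]), len(records[test])) for train, test in folds]

Returning the same type from a subset is what lets the train and test parts be used anywhere the full collection is accepted.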

Transformation
--------------

Static Datasets can be transformed in bulk by a user-defined lambda function using the ``convert`` method.

.. code-block:: python

  >>> import random
  >>> shuffled_dataset = dataset.convert(lambda x: random.sample(x, len(x)))

The lambda function must take one argument, which is the list of raw records to be processed.
It must not modify the given list.
The result becomes another Dataset instance.

For convenience, the Dataset class provides a ``shuffle`` method, which shuffles the order of records.
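
The requirement that the function leave its input untouched is easy to violate by accident, since ``random.shuffle`` and ``list.sort`` both work in place.
The sketch below shows two functions of the kind that could be passed to ``convert``; they are applied here to a plain list of records so that the example runs without Jubakit.
The helper names ``shuffled_copy`` and ``sorted_by_k2`` are hypothetical.

.. code-block:: python

  import random


  def shuffled_copy(records):
      # random.sample returns a new list and leaves the argument untouched.
      return random.sample(records, len(records))


  def sorted_by_k2(records):
      # sorted() also returns a new list, unlike records.sort().
      return sorted(records, key=lambda r: r['k2'])


  raw = [{'k1': 'a', 'k2': 3}, {'k1': 'b', 'k2': 1}, {'k1': 'c', 'k2': 2}]
  snapshot = list(raw)  # copy of the original order, for comparison

  shuffled = shuffled_copy(raw)
  ordered = sorted_by_k2(raw)

After both calls, ``raw`` still holds the records in their original order.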

Persisting Datasets
-------------------

You can use the standard ``pickle`` module to persist a Dataset instance.
Please note that a pickled Dataset instance may fail to unpickle under a different version of Jubakit.
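
One way to guard against such version mismatches is to store a version string next to the pickled object and check it on load.
The following is a minimal sketch using only the standard library; ``dump_with_version`` and ``load_with_version`` are hypothetical helpers, the version string is supplied by the caller, and a plain list stands in for the Dataset so that the example runs without Jubakit.

.. code-block:: python

  import io
  import pickle


  def dump_with_version(obj, stream, version):
      # Wrap the object in an envelope that records the version used to write it.
      pickle.dump({'version': version, 'payload': obj}, stream)


  def load_with_version(stream, expected_version):
      envelope = pickle.load(stream)
      if envelope['version'] != expected_version:
          raise ValueError('pickled with version %s, expected %s'
                           % (envelope['version'], expected_version))
      return envelope['payload']


  # '0.0.0' is a placeholder version string, not a real Jubakit version.
  buffer = io.BytesIO()
  dump_with_version([{'k1': 'hello world', 'k2': 5}], buffer, '0.0.0')

  buffer.seek(0)
  restored = load_with_version(buffer, '0.0.0')

As with any use of ``pickle``, only load data from sources you trust.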