Dataset
=======

Dataset is an abstract representation of a sequence of data. It transforms records loaded from Loader into Jubatus Datum using the data type information defined in Schema.

Dataset can be constructed from a Loader and a Schema.

.. code-block:: python

  from jubakit.classifier import Schema, Dataset

  loader = ...
  schema = Schema( ... )
  dataset = Dataset(loader, schema)

Some Services provide additional ways to construct a Dataset.
For example, :py:class:`jubakit.classifier.Dataset` provides :py:func:`from_array` and :py:func:`from_matrix`, which are convenient when using datasets generated by scikit-learn.

Unless otherwise noted, the features mentioned in this section are implemented in :py:class:`jubakit.base.BaseDataset`, which is the base Dataset class for all Services.

Static and Non-static Datasets
------------------------------

By default, all records are loaded from the Loader into memory when a Dataset instance is created.
Such a Dataset is called a *Static Dataset*.

If you want to load records one by one from the Loader instead of loading everything up front, you can set the ``static`` option to ``False`` to create a *Non-static Dataset*.

.. code-block:: python

  dataset = Dataset(loader, schema, static=False)

Note that some features, such as index-based record access, cannot be used with non-static Datasets.

Datasets constructed from infinite Loaders are non-static by default.
You cannot specify ``static=True`` when using an infinite Loader.

Schema Prediction
-----------------

If you don't specify a Schema when constructing a Dataset, the Schema is automatically predicted from the first record of the Dataset.

.. code-block:: python

  >>> dataset = Dataset(loader)
  >>> print(dataset.get_schema())
  {'types': {'k1': 's', 'k2': 'n'}, 'fallback_type': None, 'keys': {'k1': 'k1', 'k2': 'k2'}}

The ``types`` section shows the predicted Schema.
In the example above, the ``k1`` and ``k2`` columns are typed as ``STRING`` and ``NUMBER`` respectively.

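
As a rough illustration of the idea, the following sketch guesses a type code for each key of a single record.
It is not Jubakit's implementation: ``predict_types`` is a hypothetical helper, and it only distinguishes the two type codes shown above (``'s'`` for strings and ``'n'`` for numbers), treating every non-numeric value as a string.

.. code-block:: python

  def predict_types(record):
      """Guess a type code for each key of one raw record (hypothetical helper)."""
      types = {}
      for key, value in record.items():
          # bool is a subclass of int in Python, so exclude it explicitly.
          is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
          types[key] = 'n' if is_number else 's'
      return types

  # Only the first record is inspected, mirroring the behaviour described above.
  first_record = {'k1': 'hello world', 'k2': 5}
  predicted = predict_types(first_record)
  print(predicted)

Because only one record is inspected, a key whose first value is not representative would be typed incorrectly by such a scheme; passing an explicit Schema avoids relying on the guess.
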

Accessing Records
-----------------

You can access a raw record (i.e., a record as loaded from the Loader) using the ``get`` method.

.. code-block:: python

  >>> dataset.get(1)
  {'k1': 'hello world', 'k2': 5}

You can access a transformed record (i.e., a Datum) using the index operator.
You don't need this in most cases, though.

.. code-block:: python

  >>> dataset[1]
  (None, None, )
  >>> print(dataset[1][2])
  datum{string_values: [['k1', 'hello world']], num_values: [['k2', 5.0]], binary_values: []}

You can create a subset of a Dataset using the index operator with a slice or a numeric array.

.. code-block:: python

  >>> dataset2 = dataset[1:3]
  >>> type(dataset2)
  >>> len(dataset2)
  2

This allows you to use the cross-validation modules of scikit-learn easily.
The following code shows how to apply ``KFold`` to a Dataset instance ``dataset``.
It creates two new Dataset instances called ``ds_train`` and ``ds_test``, both of which are subsets of ``dataset``.

.. code-block:: python

  >>> # sklearn.cross_validation was removed in scikit-learn 0.20;
  >>> # KFold now lives in sklearn.model_selection.
  >>> from sklearn.model_selection import KFold
  >>> indices = list(range(len(dataset)))
  >>> for train, test in KFold(n_splits=2).split(indices):
  ...   (ds_train, ds_test) = (dataset[train], dataset[test])

Note that non-static Datasets cannot be accessed randomly; they only allow access to the current raw record during iteration, by passing the index currently being iterated to the ``get`` method.

Transformation
--------------

Static Datasets can be bulk transformed by a user-defined lambda function using the ``convert`` method.

.. code-block:: python

  >>> import random
  >>> shuffled_dataset = dataset.convert(lambda x: random.sample(x, len(x)))

The lambda function must take one argument, which is a list of the raw records to be processed.
The lambda function must not modify the given list.
The result is returned as another Dataset instance.

For convenience, the Dataset class provides a ``shuffle`` method, which shuffles the order of records.

Persisting Datasets
-------------------

You can use the standard ``pickle`` module to persist a Dataset instance.
Please note that a pickled Dataset instance may fail to unpickle under a different version of Jubakit.
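
One possible way to guard against such a mismatch is to write a version tag before the pickled object and to check it before loading the object itself.
The sketch below is only an illustration: ``save_dataset`` and ``load_dataset`` are hypothetical helpers that are not part of Jubakit, the version string is an arbitrary value chosen by the caller, and a plain list of records stands in for a real Dataset instance.

.. code-block:: python

  import os
  import pickle
  import tempfile

  def save_dataset(dataset, path, version):
      """Write a version tag, then the pickled object, to the same file."""
      with open(path, 'wb') as f:
          pickle.dump(version, f)
          pickle.dump(dataset, f)

  def load_dataset(path, expected_version):
      """Read the version tag first; unpickle the object only if it matches."""
      with open(path, 'rb') as f:
          version = pickle.load(f)
          if version != expected_version:
              raise ValueError('pickled with version %r, expected %r'
                               % (version, expected_version))
          return pickle.load(f)

  # Stand-in for a Dataset instance; any picklable object works the same way.
  records = [{'k1': 'hello world', 'k2': 5}]
  path = os.path.join(tempfile.mkdtemp(), 'dataset.pickle')

  save_dataset(records, path, version='0.0.0')
  restored = load_dataset(path, expected_version='0.0.0')

Writing the tag as a separate pickle means it can be read and compared without first unpickling the object that may be incompatible.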