Schema

A Schema defines the meaning of each column of the records loaded by a Loader. The basic usage is to specify the data type for each key one by one. In the following example, the name and age columns are used as features and the gender column is used as the label when training a classifier.

from jubakit.classifier import Schema

schema = Schema({
  'name': Schema.STRING,
  'age': Schema.NUMBER,
  'gender': Schema.LABEL,
})
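To make the roles of the three columns concrete, here is a minimal pure-Python sketch of what such a mapping implies for one record. It does not use jubakit, and it is not how jubakit implements Schema; the constants and the split_record helper are hypothetical names introduced only for illustration.

```python
# Toy stand-ins for the data types; these are NOT jubakit objects.
NUMBER, STRING, LABEL = 'NUMBER', 'STRING', 'LABEL'

mapping = {'name': STRING, 'age': NUMBER, 'gender': LABEL}

def split_record(mapping, record):
    """Split one record into (label, features) according to the mapping."""
    label = None
    features = {}
    for key, value in record.items():
        data_type = mapping[key]
        if data_type == LABEL:
            label = value                 # ground truth, not a feature
        elif data_type == NUMBER:
            features[key] = float(value)  # numeric feature
        else:
            features[key] = str(value)    # string feature
    return label, features

label, features = split_record(
    mapping, {'name': 'Alice', 'age': '31', 'gender': 'female'})
print(label, features)
```

The point of the sketch is only that the label column is separated from the feature columns, and that each feature is handled according to its declared type.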

Fallback Type

Data types must be defined for every column key that a Loader may supply. If your data has many columns and you are interested in only some of them, you can specify a fallback data type, which is applied to every key that is not listed explicitly. The Schema in the following example ignores all columns other than name, age and gender.

schema = Schema({
  'name': Schema.STRING,
  'age': Schema.NUMBER,
  'gender': Schema.LABEL,
}, Schema.IGNORE)

Similarly, if you know that every column in your records is a numeric feature, you can define the Schema using only a fallback data type, as follows.

schema = Schema({}, Schema.NUMBER)
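The lookup rule behind a fallback type can be sketched in a few lines of plain Python. This is a toy model, not jubakit's code: resolve_type and kept_columns are hypothetical helpers, and raising KeyError for an undeclared column without a fallback is an assumption made for the sketch (this page does not say how jubakit reports that case).

```python
# Toy stand-ins for the data types; these are NOT jubakit objects.
NUMBER, STRING, LABEL, IGNORE = 'NUMBER', 'STRING', 'LABEL', 'IGNORE'

def resolve_type(mapping, fallback, key):
    """Return the explicit type for key, else the fallback type."""
    if key in mapping:
        return mapping[key]
    if fallback is None:
        # Assumption: an undeclared column without a fallback is an error.
        raise KeyError('no data type defined for column %r' % key)
    return fallback

def kept_columns(mapping, fallback, record):
    """Drop every column whose resolved type is IGNORE."""
    return {key: value for key, value in record.items()
            if resolve_type(mapping, fallback, key) != IGNORE}

mapping = {'name': STRING, 'age': NUMBER, 'gender': LABEL}
record = {'name': 'Alice', 'age': 31, 'gender': 'female',
          'zip': '10001', 'phone': '555-0100'}

# With IGNORE as the fallback, only the three declared columns survive.
kept = kept_columns(mapping, IGNORE, record)

# With an empty mapping and NUMBER as the fallback, every key is numeric.
all_numeric = [resolve_type({}, NUMBER, key) for key in ('x', 'y')]
print(kept, all_numeric)
```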

Alias Names

By default, the column key names passed from the Loader are used as the Datum key names. However, you can assign Datum key names manually by specifying alias names. In the following example, the user_name and user_profile columns become name and profile in the Datum, respectively.

schema = Schema({
  'user_name': (Schema.STRING, 'name'),
  'user_profile': (Schema.STRING, 'profile'),
})

Alias names are convenient when training a single Service with records from multiple data sources that have different Schemas.
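The multiple-source case can be pictured with a small pure-Python sketch. It only models the renaming step and does not call jubakit; the to_datum_keys helper, the second data source and its column names (full_name, bio) are hypothetical.

```python
def to_datum_keys(aliases, record):
    """Rename record keys using aliases; keys without an alias are kept."""
    return {aliases.get(key, key): value for key, value in record.items()}

# Two hypothetical sources that name the same information differently.
source_a_aliases = {'user_name': 'name', 'user_profile': 'profile'}
source_b_aliases = {'full_name': 'name', 'bio': 'profile'}

record_a = {'user_name': 'Alice', 'user_profile': 'Likes hiking'}
record_b = {'full_name': 'Bob', 'bio': 'Plays chess'}

datum_a = to_datum_keys(source_a_aliases, record_a)
datum_b = to_datum_keys(source_b_aliases, record_b)

# Both records now share the same key names.
print(sorted(datum_a) == sorted(datum_b))
```

Because both sources end up with the same key names, the records can be treated as the same features regardless of which source they came from.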

List of Data Types

The following data types can be specified in a Schema.

Type     Description
NUMBER   Feature (numeric)
STRING   Feature (string)
BINARY   Feature (binary)
INFER    Feature (data type is inferred automatically [1])
AUTO     Feature (uses the data type loaded by Loader [2])
LABEL    Ground truth (label column) – Classifier only
TARGET   Ground truth (target column) – Regression only
FLAG     Flag indicating whether the record is an anomaly – Anomaly only
ID       Key that uniquely identifies each record – Anomaly and Recommender only
IGNORE   Discard the column
[1] Each value is tried as NUMBER, STRING and BINARY in turn, and is treated as the first type for which the cast succeeds. The type is estimated for every single record, so be aware that the inferred type for the same key may differ between records.
[2] AUTO is intended for Loaders that load records from typed data sources such as an RDBMS. Note that all data becomes STRING when using CSVLoader, because CSV files are not a typed data source.
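Both footnotes can be illustrated with the Python standard library alone. The first part shows that a CSV reader yields every value as a string, which is why AUTO has nothing but STRING to work with for CSV data. The second part is a toy model of per-record inference in the order given in footnote 1; infer_type is a hypothetical helper, not jubakit's implementation, and the exact casting rules jubakit applies may differ.

```python
import csv
import io

# Part 1: values read from CSV are always strings, even if they look numeric.
rows = list(csv.DictReader(io.StringIO('name,age\nAlice,30\n')))
age = rows[0]['age']

# Part 2: toy per-record inference, trying NUMBER, then STRING, then BINARY.
def infer_type(value):
    try:
        float(value)
        return 'NUMBER'
    except (TypeError, ValueError):
        pass
    if isinstance(value, str):
        return 'STRING'
    if isinstance(value, (bytes, bytearray)):
        return 'BINARY'
    raise ValueError('cannot infer a type for %r' % (value,))

# The same key can receive different types in different records.
records = [{'zip': '10001'}, {'zip': 'N/A'}]
inferred = [infer_type(record['zip']) for record in records]
print(type(age).__name__, inferred)
```

If a key must have one consistent type across all records, declaring it explicitly as NUMBER or STRING avoids the per-record variation shown here.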