Schema
Schema defines the meaning of each column of the records loaded by Loader.
The basic usage of Schema is to specify a data type for each key, one by one.
In the following example, the name and age columns are used as features, and the gender column is used as the label when training a classifier.
from jubakit.classifier import Schema

schema = Schema({
    'name': Schema.STRING,
    'age': Schema.NUMBER,
    'gender': Schema.LABEL,
})
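To make the roles of the columns concrete, the following is a minimal pure-Python sketch of what such a mapping expresses. It does not use jubakit: the string constants and the `split_record` helper are hypothetical stand-ins for illustration only, not jubakit's implementation.

```python
# Hypothetical stand-ins for Schema.STRING, Schema.NUMBER and Schema.LABEL.
STRING, NUMBER, LABEL = 'string', 'number', 'label'

mapping = {
    'name': STRING,
    'age': NUMBER,
    'gender': LABEL,
}

def split_record(record, mapping):
    """Separate a record into (label, features) according to the mapping."""
    label = None
    features = {}
    for key, value in record.items():
        column_type = mapping[key]
        if column_type == LABEL:
            # The label column is kept apart from the features.
            label = str(value)
        elif column_type == NUMBER:
            features[key] = float(value)
        else:
            features[key] = str(value)
    return label, features

label, features = split_record(
    {'name': 'Alice', 'age': '31', 'gender': 'female'}, mapping)
print(label, features)
```

The point of the sketch is only that the label column is separated from the feature columns, and that each feature is interpreted according to its declared type.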
Fallback Type
Data types must be defined for all column keys that may be input from Loader.
If your data has many columns and only some of them are of interest, you can specify a fallback data type, which is applied to every column that is not listed explicitly.
The Schema in the following example ignores all columns other than name, age and gender.
schema = Schema({
    'name': Schema.STRING,
    'age': Schema.NUMBER,
    'gender': Schema.LABEL,
}, Schema.IGNORE)
Similarly, if you know that all columns of your records are numeric features, you can define the Schema using only a fallback data type, as follows.
schema = Schema({}, Schema.NUMBER)
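The lookup rule behind a fallback type can be sketched in plain Python as follows. This is an illustration under assumptions, not jubakit's code: `resolve_type` and the string constants are hypothetical, and the `KeyError` raised for an undefined column is a choice made for this sketch (the exception jubakit itself raises is not stated here).

```python
# Hypothetical stand-ins for the Schema constants; not jubakit's implementation.
STRING, NUMBER, LABEL, IGNORE = 'string', 'number', 'label', 'ignore'

def resolve_type(key, mapping, fallback=None):
    """Return the type for a column key, using the fallback for unknown keys."""
    if key in mapping:
        return mapping[key]
    if fallback is None:
        # Without a fallback, every incoming key must be defined explicitly.
        raise KeyError('no data type defined for column: ' + key)
    return fallback

mapping = {'name': STRING, 'age': NUMBER, 'gender': LABEL}

explicit = resolve_type('age', mapping, IGNORE)       # listed key
unknown = resolve_type('zip_code', mapping, IGNORE)   # unlisted key uses the fallback
all_numeric = resolve_type('height', {}, NUMBER)      # empty mapping, fallback only
print(explicit, unknown, all_numeric)
```

An explicitly listed key always takes precedence, so the fallback only affects columns the mapping does not mention.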
Alias Names
By default, the column key names passed from Loader are used as the Datum key names.
However, you can assign Datum key names manually by giving alias names.
In the following example, the user_name and user_profile columns become name and profile in the Datum, respectively.
schema = Schema({
    'user_name': (Schema.STRING, 'name'),
    'user_profile': (Schema.STRING, 'profile'),
})
Alias names are convenient when records from multiple data sources with different Schemas are trained into a single Service, because differently named columns can be mapped to the same Datum key.
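The multi-source case can be sketched in plain Python as below. This does not call jubakit: the second source with its `fullname` and `bio` columns and the `rename_columns` helper are hypothetical, introduced only to show two differently named inputs ending up under the same keys.

```python
# Each entry maps a source column to (type, alias); the type is a hypothetical stand-in.
STRING = 'string'

schema_a = {'user_name': (STRING, 'name'), 'user_profile': (STRING, 'profile')}
schema_b = {'fullname': (STRING, 'name'), 'bio': (STRING, 'profile')}

def rename_columns(record, mapping):
    """Return a copy of the record keyed by alias names instead of column names."""
    return {mapping[key][1]: value for key, value in record.items()}

row_a = rename_columns({'user_name': 'Alice', 'user_profile': 'Engineer'}, schema_a)
row_b = rename_columns({'fullname': 'Bob', 'bio': 'Designer'}, schema_b)

# Both rows now share the same key names and can feed the same model.
print(sorted(row_a), sorted(row_b))
```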
List of Data Types
The following data types can be specified in a Schema.
Type | Description |
---|---|
NUMBER | Feature (numeric) |
STRING | Feature (string) |
BINARY | Feature (binary) |
INFER | Feature (infer data type automatically [1]) |
AUTO | Feature (use data type loaded by Loader [2]) |
LABEL | Ground truth (label column) – Classifier only |
TARGET | Ground truth (target column) – Regression only |
FLAG | Flag indicating whether the record is an anomaly – Anomaly only |
ID | Key that uniquely identifies each record – Anomaly and Recommender only |
IGNORE | Discard the column |
[1] A cast to NUMBER, STRING and BINARY is attempted for each value, and the value is treated as the first type for which the cast succeeds.
The type is inferred for every record individually, so be aware that the inferred type for the same key may differ between records.
[2] AUTO is intended to be used with a Loader that loads records from typed data sources such as an RDBMS.
Note that all data becomes STRING when using CSVLoader, because a CSV file is not a typed data source.
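Both notes can be illustrated with the Python standard library alone. The sketch below is not jubakit's inference code: `infer_type` is a hypothetical helper that approximates the "first successful cast" idea using `float()`, and the CSV content is made up for the example. It shows that fields read from CSV arrive as strings, and that per-record inference can give different types for the same key.

```python
import csv
import io

def infer_type(value):
    """Guess a type for one value: try NUMBER first, then STRING, then BINARY."""
    try:
        float(value)
        return 'NUMBER'
    except (TypeError, ValueError):
        pass
    if isinstance(value, str):
        return 'STRING'
    return 'BINARY'

# csv.DictReader yields every field as str, since CSV carries no type information.
data = io.StringIO('name,age\nAlice,31\nBob,unknown\n')
rows = list(csv.DictReader(data))

# Python type of the raw 'age' field in each record.
field_types = [type(row['age']).__name__ for row in rows]
# Type inferred independently for the 'age' field of each record.
inferred = [infer_type(row['age']) for row in rows]
print(field_types)
print(inferred)
```

If a key must have one consistent type across all records, declaring it explicitly as NUMBER or STRING avoids this per-record variation.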