CREATE MODEL

Syntax:

CREATE { RECOMMENDER | CLASSIFIER | ANOMALY } MODEL
model_name [ ({ label | id }: id_col) ] AS
col_spec [ WITH convert_function ] [, ... ]
CONFIG 'json_string'

where: col_spec = wildcard | col_name

Examples:

jubaql> CREATE CLASSIFIER MODEL cls (label: label) AS
        name WITH unigram
        CONFIG {"method": "AROW",
          "parameter": {"regularization_weight" : 1.0}}
CREATE MODEL (started)

jubaql> CREATE RECOMMENDER MODEL reco (id: 名前) AS *
        CONFIG '{"method": "inverted_index",
          "parameter": {}}'
CREATE MODEL (started)

jubaql> CREATE CLASSIFIER MODEL test (label: country) AS
        name WITH bigram,
        photo WITH jpeganalyze
        CONFIG '{"method": "AROW",
          "parameter": {"regularization_weight" : 1.0}}'
CREATE MODEL (started)

Explanation

CREATE MODEL defines a Jubatus model to be used for training. It is assumed that the data that will be used for training is well-typed row-column-shaped data.

  • model_name is a user-defined string that will identify this model later on.
  • label | id must be label for a CLASSIFIER model and id for a RECOMMENDER model. The clause must be omitted for an ANOMALY model.
  • id_col is the name of the column whose value will become the id parameter of the update_row(id, row) RPC method or the label of the labeled datum passed to the train(data) RPC method, depending on the model type.
  • col_spec points to one or multiple columns that will be converted with either a Jubatus built-in function (if one exists) or a previously defined FEATURE FUNCTION named convert_function. If a conversion function is not specified, it defaults to num for numeric values and str for anything else. A col_spec can have one of the following forms:
    • It can be a single column name. In that case, convert_function must be a unary function and will be called with the value of that column.
    • It can be a column wildcard of the form *, *suffix or prefix* and then means all columns with a name that matches that wildcard description and that have not been mentioned in any previous clause. In that case, convert_function must be a unary function and will be called for every matching column with the value of that column.
  • json_string is a JSON configuration string like it would normally be contained in the file passed to Jubatus at startup. However, it should not contain a "converter" part.

After a CREATE MODEL statement has been processed successfully, the user can use the specified model_name in other statements.

Notes

  • It is not specified whether the Jubatus instance will be launched right away or later. Therefore, the successful execution of this command only indicates that the syntax is correct; it does not say anything about whether startup was successful.
  • Feature functions return Map[String, Any] where actually the Any part should be a numeric type or a string. The map key will become a part of the key for the Jubatus datum. Say that a function with the name product is fed with values from the column height and returns a Map("val" -> 80), then the Jubatus datum will have an entry in num_values that looks like: "product#height#val": 80.
  • When a column that is referenced as label/id or in a conversion specification does not exist in the (inferred or explicitly declared) schema of a batch of the input stream and the batch is non-empty, UPDATE MODEL or CREATE STREAM FROM ANALYZE processing of that batch and therefore the whole process will fail after retrying spark.task.maxFailures times. An empty batch with a mismatching schema does not cause a failure, though.