Classifier

  • See IDL definition for detailed specification.
  • See Algorithms for detailed description of algorithms used in this server.

Configuration

Configuration is given as a JSON file. We show each field below:

method

Specify the classification algorithm. The following algorithms are available.

Value         Method                                                                Classifier type
"perceptron"  Use perceptron.                                                       linear classifier
"PA"          Use Passive Aggressive (PA). [Crammer06]                              linear classifier
"PA1"         Use PA-I. [Crammer06]                                                 linear classifier
"PA2"         Use PA-II. [Crammer06]                                                linear classifier
"CW"          Use Confidence Weighted Learning. [Dredze08]                          linear classifier
"AROW"        Use Adaptive Regularization of Weight vectors. [Crammer09b]           linear classifier
"NHERD"       Use Normal Herd. [Crammer10]                                          linear classifier
"NN"          Use an implementation of nearest_neighbor.                            k-Nearest Neighbor
"cosine"      Use the result of nearest neighbor search by cosine similarity. [1]   k-Nearest Neighbor
"euclidean"   Use the result of nearest neighbor search by Euclidean distance. [1]  k-Nearest Neighbor

[1] These algorithms do not support the delete_label API or the unlearner option.

parameter

Specify parameters for the algorithm. The format differs for each method. Note that the adequate value for regularization_weight differs for each algorithm.

common
unlearner: Specify the unlearner strategy, chosen from those described in Unlearner. Labels will be deleted based on the strategy specified here. When method is "NN", each data item (labeled_datum, not label) will be deleted instead. If you don’t use the unlearner function, you can omit this parameter.
unlearner_parameter:
 Specify the unlearner parameters, as described in Unlearner. You cannot omit this parameter if you specify unlearner. Labels (or data) in excess of the number given here will be deleted automatically.

note: unlearner and unlearner_parameter can be omitted.
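
As a rough illustration of the rule above (unlearner_parameter is required whenever unlearner is given), the following Python sketch checks a parameter dictionary before it is written out as JSON. The helper name check_unlearner is hypothetical, and the strategy name "lru" with its max_size field is assumed here for illustration only; see Unlearner for the strategies and parameters actually available.

```python
def check_unlearner(parameter):
    """Return True if the unlearner settings in `parameter` are consistent.

    Hypothetical helper, not part of Jubatus: it only encodes the rule
    that unlearner_parameter must accompany unlearner.
    """
    has_strategy = "unlearner" in parameter
    has_options = "unlearner_parameter" in parameter
    if has_strategy:
        return has_options
    # Omitting both is fine. Options without a strategy are treated as a
    # mistake here, which is an assumption of this sketch.
    return not has_options


# Assumed example values; consult Unlearner for the real strategy names.
with_unlearner = {
    "regularization_weight": 1.0,
    "unlearner": "lru",
    "unlearner_parameter": {"max_size": 4096},
}
without_unlearner = {"regularization_weight": 1.0}
missing_options = {"regularization_weight": 1.0, "unlearner": "lru"}
```

Here check_unlearner accepts the first two dictionaries and rejects the third, which names a strategy without its parameters.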

perceptron
None
PA
None
PA1
regularization_weight:

Sensitivity to learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer06]. (Float)

  • Range: 0.0 < regularization_weight
PA2
regularization_weight:

Sensitivity to learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer06]. (Float)

  • Range: 0.0 < regularization_weight
CW
regularization_weight:

Sensitivity to learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(\phi\) in the original paper [Dredze08]. (Float)

  • Range: 0.0 < regularization_weight
AROW
regularization_weight:

Sensitivity to learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(1/r\) in the original paper [Crammer09b]. (Float)

  • Range: 0.0 < regularization_weight
NHERD
regularization_weight:

Sensitivity to learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer10]. (Float)

  • Range: 0.0 < regularization_weight
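
The range constraint shared by PA1, PA2, CW, AROW, and NHERD can be checked before a configuration is handed to the server. The following is a minimal sketch; the function name check_regularization_weight is hypothetical, and treating the field as mandatory for these methods is an assumption, since the text above only states its range.

```python
# Methods whose parameter section documents regularization_weight above.
METHODS_WITH_WEIGHT = {"PA1", "PA2", "CW", "AROW", "NHERD"}


def check_regularization_weight(method, parameter):
    """Raise ValueError if regularization_weight is missing or not > 0.0."""
    if method not in METHODS_WITH_WEIGHT:
        return  # e.g. perceptron and PA document no regularization_weight
    weight = parameter.get("regularization_weight")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(weight, bool) or not isinstance(weight, (int, float)):
        raise ValueError("regularization_weight must be a number")
    if not weight > 0.0:
        raise ValueError("regularization_weight must be greater than 0.0")


# Passes silently: 1.0 lies inside the documented range.
check_regularization_weight("AROW", {"regularization_weight": 1.0})
```

Note that the bound is strict, so a value of exactly 0.0 is rejected.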
NN
method:

Specify algorithm for nearest neighbor. Refer to Nearest Neighbor for the list of algorithms available.

parameter:

Specify parameters for the algorithm. Refer to Nearest Neighbor for the list of parameters.

nearest_neighbor_num:

Number of nearest data points used for calculating scores. (Integer)

  • Range: 1 <= nearest_neighbor_num
local_sensitivity:

Sensitivity used for calculating scores. The larger the value, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float)

  • Range: 0.0 <= local_sensitivity
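
For orientation, a configuration for the NN method might be assembled as in the sketch below. The inner method name "euclid_lsh" and its hash_num parameter are assumptions taken to stand for whatever Nearest Neighbor documents, and all numeric values are arbitrary; only nearest_neighbor_num and local_sensitivity follow directly from the descriptions above.

```python
import json

nn_config = {
    "method": "NN",
    "parameter": {
        # Inner nearest neighbor algorithm and its own parameters.
        # Both names are assumed; check Nearest Neighbor for the real list.
        "method": "euclid_lsh",
        "parameter": {"hash_num": 64},
        # Fields documented above.
        "nearest_neighbor_num": 5,  # must be >= 1
        "local_sensitivity": 1.0,   # must be >= 0.0
    },
    "converter": {
        "string_filter_types": {},
        "string_filter_rules": [],
        "num_filter_types": {},
        "num_filter_rules": [],
        "string_types": {},
        "string_rules": [],
        "num_types": {},
        "num_rules": [{"key": "*", "type": "num"}],
    },
}

# Serialize to the JSON text that would be saved as the configuration file.
config_text = json.dumps(nn_config, indent=2)
```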
cosine
nearest_neighbor_num:

Number of nearest data points used for calculating scores. (Integer)

  • Range: 1 <= nearest_neighbor_num
local_sensitivity:

Sensitivity used for calculating scores. The larger the value, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float)

  • Range: 0.0 <= local_sensitivity
euclidean
nearest_neighbor_num:

Number of nearest data points used for calculating scores. (Integer)

  • Range: 1 <= nearest_neighbor_num
local_sensitivity:

Sensitivity used for calculating scores. The larger the value, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float)

  • Range: 0.0 <= local_sensitivity
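
The effect of local_sensitivity on the NN, cosine, and euclidean methods can be pictured with a small sketch. The text above does not give the scoring formula, so the exponential weight exp(-local_sensitivity * distance) used below is an assumption, chosen only because it matches the described behaviour: a value of 0 gives every neighbor the same weight, and larger values favor nearer neighbors. The function name score_labels is hypothetical.

```python
import math
from collections import defaultdict


def score_labels(neighbors, local_sensitivity):
    """Sum an assumed weight exp(-local_sensitivity * distance) per label.

    `neighbors` is a list of (label, distance) pairs for the
    nearest_neighbor_num closest training data.
    """
    scores = defaultdict(float)
    for label, distance in neighbors:
        scores[label] += math.exp(-local_sensitivity * distance)
    return dict(scores)


neighbors = [("spam", 0.1), ("ham", 0.9), ("ham", 1.0)]
flat = score_labels(neighbors, 0.0)   # every neighbor counts equally
sharp = score_labels(neighbors, 5.0)  # the nearby "spam" neighbor dominates
```

With local_sensitivity at 0 the two "ham" neighbors outvote the single "spam" neighbor; with a large value the much closer "spam" neighbor wins despite being outnumbered.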
converter

Specify configuration for data conversion. Its format is described in Data Conversion.

Example:
{
  "method" : "AROW",
  "parameter" : {
    "regularization_weight" : 1.0
  },
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
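
Since the configuration is given as a JSON file, one convenient approach is to build it in a script and serialize it. The sketch below uses only the Python standard library and reproduces the example above; the file name classifier_config.json is an arbitrary choice.

```python
import json
import os
import tempfile

config = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_filter_types": {},
        "string_filter_rules": [],
        "num_filter_types": {},
        "num_filter_rules": [],
        "string_types": {},
        "string_rules": [
            {"key": "*", "type": "str", "sample_weight": "bin", "global_weight": "bin"}
        ],
        "num_types": {},
        "num_rules": [{"key": "*", "type": "num"}],
    },
}

# Write into a temporary directory so the sketch leaves no files elsewhere.
config_path = os.path.join(tempfile.mkdtemp(), "classifier_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Read the file back to confirm it is valid JSON with the expected fields.
with open(config_path) as f:
    loaded = json.load(f)
```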

Data Structures

message estimate_result

Represents a result of classification.

0: string label

Represents an estimated label.

1: double score

Represents a probability value for the label. A higher score means higher confidence in the estimated label.

message estimate_result {
  0: string label
  1: double score
}
message labeled_datum

Represents a datum with its label.

0: string label

Represents a label of this datum.

1: datum data

Represents a datum.

message labeled_datum {
  0: string label
  1: datum data
}
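
To make the shape of these two messages concrete, they can be mirrored as plain Python dataclasses. These classes are illustrative stand-ins, not the types shipped with any Jubatus client library. The datum type is defined elsewhere, so it is reduced here to a dictionary of feature values, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class EstimateResult:
    label: str    # 0: string label
    score: float  # 1: double score


@dataclass
class LabeledDatum:
    label: str  # 0: string label
    # 1: datum data (simplified stand-in for the real datum type)
    data: Dict[str, Union[str, float]] = field(default_factory=dict)


# Arbitrary example values.
example = LabeledDatum("spam", {"subject": "free offer", "length": 120.0})
result = EstimateResult("spam", 0.87)
```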

Methods

service classifier
int train(0: list<labeled_datum> data)
Parameters:
  • data – list of tuples of label and datum
Returns:

Number of trained data (i.e., the length of data)

Trains and updates the model. labeled_datum is a tuple of a datum and its label. This API accepts bulk updates as a list of labeled_datum.

list<list<estimate_result>> classify(0: list<datum> data)
Parameters:
  • data – list of datum to classify
Returns:

List of lists of estimate_result, in the order of the given data

Estimates labels for the given data. This API accepts bulk classification as a list of datum.
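
Because classify returns one list of estimate_result per input datum, a caller usually reduces each inner list to its highest-scoring label. A minimal sketch follows, representing each estimate_result as a (label, score) tuple. The function name best_labels and the handling of an empty inner list (reported as None) are assumptions.

```python
def best_labels(results):
    """Pick the highest-scoring label for each datum, or None if no estimates."""
    best = []
    for estimates in results:
        if not estimates:
            best.append(None)
            continue
        label, _score = max(estimates, key=lambda pair: pair[1])
        best.append(label)
    return best


# One inner list per classified datum, in the order the data were given.
results = [
    [("spam", 1.2), ("ham", -0.4)],
    [("spam", -0.9), ("ham", 0.3)],
    [],
]
picked = best_labels(results)
```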

map<string, ulong> get_labels()
Returns: Pairs of label and the number of trained data

Returns the number of trained data for each label. If method is NN, trained data that have been deleted by the unlearner are not included in this count.

bool set_label(0: string new_label)
Parameters:
  • new_label – name of new label
Returns:

True if the new label did not exist. False if the label already exists.

Appends a new label. If the label already exists, the call fails. A new label is also added when an unseen label appears in the argument of the train method.

bool delete_label(0: string target_label)
Parameters:
  • target_label – name of the label to delete
Returns:

True if Jubatus succeeds in deleting the label. False if the label does not exist.

Deletes a label.
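
The return values of train, get_labels, set_label, and delete_label can be summarized with a toy in-memory model. This is not the server implementation and performs no learning; the class name LabelBook is hypothetical, and it mirrors only the bookkeeping described above. Reporting a label created by set_label with a count of 0 is an assumption.

```python
class LabelBook:
    """Toy stand-in that tracks labels and trained-data counts only."""

    def __init__(self):
        self._counts = {}

    def train(self, data):
        # `data` is a list of (label, datum) pairs; unseen labels are added.
        for label, _datum in data:
            self._counts[label] = self._counts.get(label, 0) + 1
        return len(data)

    def get_labels(self):
        return dict(self._counts)

    def set_label(self, new_label):
        if new_label in self._counts:
            return False  # label already exists
        self._counts[new_label] = 0
        return True

    def delete_label(self, target_label):
        # True only if the label was present and has now been removed.
        return self._counts.pop(target_label, None) is not None


book = LabelBook()
trained = book.train([("spam", {}), ("ham", {}), ("spam", {})])
added = book.set_label("other")
added_again = book.set_label("other")
deleted = book.delete_label("ham")
deleted_again = book.delete_label("ham")
labels = book.get_labels()
```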