Classifier¶

See the IDL definition for the detailed specification.
See Algorithms for a detailed description of the algorithms used in this server.

Configuration¶

Configuration is given as a JSON file. We show each field below:

method

Specify the classification algorithm. The following algorithms are available.

Value	Method	Classifier type
`"perceptron"`	Use perceptron.	linear classifier
`"PA"`	Use Passive Aggressive (PA). [Crammer06]	linear classifier
`"PA1"`	Use PA-I. [Crammer06]	linear classifier
`"PA2"`	Use PA-II. [Crammer06]	linear classifier
`"CW"`	Use Confidence Weighted Learning. [Dredze08]	linear classifier
`"AROW"`	Use Adaptive Regularization of Weight vectors. [Crammer09b]	linear classifier
`"NHERD"`	Use Normal Herd. [Crammer10]	linear classifier
`"NN"`	Use an implementation of `nearest_neighbor`	k-Nearest Neighbor
`"cosine"`	Use the result of nearest neighbor search by cosine similarity [1]	k-Nearest Neighbor
`"euclidean"`	Use the result of nearest neighbor search by euclidean distance [1]	k-Nearest Neighbor

[1]	These algorithms do not support the `delete_label` API or the `unlearner` option.

parameter

Specify parameters for the algorithm. The format differs for each method. Note that the appropriate value of regularization_weight differs for each algorithm.

common

unlearner:	Specify the unlearner strategy. If you do not use the unlearner function, you can omit this parameter. You can specify any `unlearner` strategy described in Unlearner. Labels will be deleted based on the strategy specified here. When `method` is `"NN"`, individual data (each `labeled_datum`, not labels) will be deleted.
unlearner_parameter:	Specify the unlearner parameter. You can specify any `unlearner_parameter` described in Unlearner. You cannot omit this parameter if you specify `unlearner`. Labels (or data) in excess of this number will be deleted automatically.

Note: unlearner and unlearner_parameter can be omitted.

perceptron

None

PA

None

PA1

regularization_weight:

Sensitivity of the learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer06]. (Float)

Range: 0.0 < regularization_weight

PA2

regularization_weight:

Sensitivity of the learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer06]. (Float)

Range: 0.0 < regularization_weight
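To make the role of \(C\) concrete, the following is a minimal sketch of the step-size rules from [Crammer06] for the binary case. It is not the server's implementation (which handles multiple labels); the function names `pa_step_size` and `pa_update` are hypothetical and used only for illustration.

```python
def pa_step_size(loss, sq_norm, variant="PA", c=1.0):
    """Step size tau for one Passive Aggressive update (binary case, [Crammer06]).

    `c` plays the role of regularization_weight; it is ignored for plain PA.
    """
    if sq_norm == 0.0:
        return 0.0  # a zero feature vector cannot change the weights
    if variant == "PA":
        return loss / sq_norm
    if variant == "PA1":
        return min(c, loss / sq_norm)  # C caps the step size
    if variant == "PA2":
        return loss / (sq_norm + 1.0 / (2.0 * c))  # C damps the step size
    raise ValueError("unknown variant: %s" % variant)


def pa_update(w, x, y, variant="PA", c=1.0):
    """Return updated weights for one example (x, y) with y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    tau = pa_step_size(loss, sum(xi * xi for xi in x), variant, c)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]
```

With a small `c`, each example moves the weights only slightly, so a mislabeled example does less damage; with a large `c`, PA-I and PA-II approach the plain PA step.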

CW

regularization_weight:

Sensitivity of the learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(\phi\) in the original paper [Dredze08]. (Float)

Range: 0.0 < regularization_weight

AROW

regularization_weight:

Sensitivity of the learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(1/r\) in the original paper [Crammer09b]. (Float)

Range: 0.0 < regularization_weight
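As an illustration of the \(1/r\) relationship, here is a sketch of a single AROW update for the binary case. It assumes a diagonal covariance (one variance per feature) for brevity; this document does not state how the server stores the covariance, and the name `arow_update` is hypothetical.

```python
def arow_update(w, sigma, x, y, regularization_weight=1.0):
    """One AROW update (binary case, diagonal covariance assumed).

    w: weight means, sigma: per-feature variances, y in {-1, +1}.
    Returns the new (w, sigma) without modifying the inputs.
    """
    r = 1.0 / regularization_weight  # regularization_weight corresponds to 1/r
    margin = sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - y * margin)
    if loss == 0.0:
        return list(w), list(sigma)  # margin is large enough: no update
    confidence = sum(si * xi * xi for si, xi in zip(sigma, x))
    beta = 1.0 / (confidence + r)
    alpha = loss * beta
    new_w = [wi + alpha * y * si * xi for wi, si, xi in zip(w, sigma, x)]
    new_sigma = [si - beta * si * si * xi * xi for si, xi in zip(sigma, x)]
    return new_w, new_sigma
```

A larger `regularization_weight` means a smaller `r`, hence a larger `beta` and a bigger step per example.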

NHERD

regularization_weight:

Sensitivity of the learning rate. The larger the value, the faster the model trains, but the more sensitive it becomes to noise. It corresponds to \(C\) in the original paper [Crammer10]. (Float)

Range: 0.0 < regularization_weight

NN

method:	Specify algorithm for nearest neighbor. Refer to Nearest Neighbor for the list of algorithms available.
parameter:	Specify parameters for the algorithm. Refer to Nearest Neighbor for the list of parameters.
nearest_neighbor_num:	Number of data points used for calculating scores. (Integer) Range: 1 <= `nearest_neighbor_num`
local_sensitivity:	Sensitivity used for calculating scores. The larger it is, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float) Range: 0.0 <= `local_sensitivity`

cosine

nearest_neighbor_num:

Number of data points used for calculating scores. (Integer)

Range: 1 <= nearest_neighbor_num

local_sensitivity:

Sensitivity used for calculating scores. The larger it is, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float)

Range: 0.0 <= local_sensitivity

euclidean

nearest_neighbor_num:

Number of data points used for calculating scores. (Integer)

Range: 1 <= nearest_neighbor_num

local_sensitivity:

Sensitivity used for calculating scores. The larger it is, the more heavily nearby data are weighted. When it is 0, all data are given the same weight. (Float)

Range: 0.0 <= local_sensitivity
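The interaction between nearest_neighbor_num and local_sensitivity can be sketched as a weighted vote. The exact weighting formula is not given in this document; the exponential decay `exp(-local_sensitivity * distance)` below is an assumption chosen because it matches the described behavior (weight 1 for every neighbor when the sensitivity is 0, stronger emphasis on near data as it grows). The function name `knn_scores` is hypothetical.

```python
import math
from collections import defaultdict


def knn_scores(neighbors, nearest_neighbor_num, local_sensitivity):
    """Score labels from (label, distance) pairs by a distance-weighted vote.

    ASSUMPTION: the weight exp(-local_sensitivity * distance) is illustrative,
    not taken from the server's source.
    """
    if nearest_neighbor_num < 1:
        raise ValueError("nearest_neighbor_num must be >= 1")
    if local_sensitivity < 0.0:
        raise ValueError("local_sensitivity must be >= 0.0")
    # Keep only the closest nearest_neighbor_num entries.
    nearest = sorted(neighbors, key=lambda pair: pair[1])[:nearest_neighbor_num]
    scores = defaultdict(float)
    for label, distance in nearest:
        scores[label] += math.exp(-local_sensitivity * distance)
    return dict(scores)
```

For example, with neighbors `[("a", 0.1), ("b", 0.2), ("b", 0.3)]` a sensitivity of 0 lets the two `"b"` neighbors outvote `"a"`, while a large sensitivity lets the single closest neighbor dominate.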

converter: Specify configuration for data conversion. Its format is described in Data Conversion.

Example:

{
  "method" : "AROW",
  "parameter" : {
    "regularization_weight" : 1.0
  },
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
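Because the valid ranges differ per method, it can help to check a configuration before handing it to the server. The following is a minimal sketch that applies only the ranges listed in this section; the helper `check_config` is hypothetical and is not part of any client library, and the server itself remains the authority on what it accepts.

```python
import json

# Methods grouped by the parameters this section documents for them.
NO_PARAMETER = {"perceptron", "PA"}
WITH_REGULARIZATION = {"PA1", "PA2", "CW", "AROW", "NHERD"}
NEIGHBOR_BASED = {"NN", "cosine", "euclidean"}


def check_config(config):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    method = config.get("method")
    parameter = config.get("parameter") or {}
    if method in WITH_REGULARIZATION:
        weight = parameter.get("regularization_weight")
        if not isinstance(weight, (int, float)) or weight <= 0.0:
            problems.append("regularization_weight must be > 0.0")
    elif method in NEIGHBOR_BASED:
        num = parameter.get("nearest_neighbor_num")
        if not isinstance(num, int) or num < 1:
            problems.append("nearest_neighbor_num must be an integer >= 1")
        sensitivity = parameter.get("local_sensitivity")
        if not isinstance(sensitivity, (int, float)) or sensitivity < 0.0:
            problems.append("local_sensitivity must be >= 0.0")
    elif method not in NO_PARAMETER:
        problems.append("unknown method: %r" % (method,))
    return problems


config = json.loads('{"method": "AROW", "parameter": {"regularization_weight": 1.0}}')
print(check_config(config))
```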

Data Structures¶

message estimate_result¶

Represents a result of classification.

0: string label¶: Represents an estimated label.

1: double score¶: Represents a probability value for the label. A higher score means higher confidence in the estimated label.

message estimate_result {
  0: string label
  1: double score
}

message labeled_datum¶

Represents a datum with its label.

0: string label¶: Represents a label of this datum.

1: datum data¶: Represents a datum.

message labeled_datum {
  0: string label
  1: datum data
}

Methods¶

service classifier

int train(0: list<labeled_datum> data)¶

Parameters:	data – list of tuples of label and `datum`
Returns:	Number of trained data (i.e., the length of `data`)

Trains and updates the model. labeled_datum is a tuple of a datum and its label. This API is designed to accept bulk updates with a list of labeled_datum.

list<list<estimate_result>> classify(0: list<datum> data)¶

Parameters:	data – list of datum to classify
Returns:	List of lists of `estimate_result`, in the order of the given `datum`

Estimates labels from the given data. This API is designed to accept bulk classification with a list of datum.
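Since classify returns one list of `estimate_result` per input datum, a caller typically reduces each inner list to its highest-scoring label. The sketch below uses a stand-in `EstimateResult` type that mirrors the message definition; it is not the class shipped with any client library, and `best_labels` is a hypothetical helper.

```python
from typing import List, NamedTuple, Optional


class EstimateResult(NamedTuple):
    """Stand-in for the estimate_result message (label, score)."""
    label: str
    score: float


def best_labels(results: List[List[EstimateResult]]) -> List[Optional[str]]:
    """Pick the highest-scoring label for each datum, in input order.

    An empty inner list (no candidate labels) maps to None.
    """
    return [
        max(candidates, key=lambda r: r.score).label if candidates else None
        for candidates in results
    ]


results = [
    [EstimateResult("spam", 0.2), EstimateResult("ham", 0.9)],
    [],
]
print(best_labels(results))
```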

map<string, ulong> get_labels()¶

Returns:	Pairs of label and the number of trained data

Returns the number of trained data for each label. If method is NN, trained data that have been deleted by the unlearner are not included in this count.

bool set_label(0: string new_label)¶

Parameters:	new_label – name of new label
Returns:	True if the new label did not exist. False if the label already exists.

Appends a new label. If the label already exists, the call fails. A new label is also added when it is found in the argument of the train method.

bool delete_label(0: string target_label)¶

Parameters:	target_label – name of the label to delete
Returns:	True if Jubatus succeeds in deleting the label. False if the label does not exist.

Deletes a label. Returns True if Jubatus succeeds in deleting it, and False if the label does not exist.
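The return values of the label-management methods can be summarized with a small in-memory stand-in. This is not the server or a client; `LabelRegistry` is a hypothetical class that only mimics the bookkeeping described above. Reporting a label added by set_label with a count of 0 is an assumption made here.

```python
class LabelRegistry:
    """In-memory imitation of the label bookkeeping described above."""

    def __init__(self):
        self._counts = {}

    def train(self, data):
        """data: list of (label, datum) pairs. Returns the number trained."""
        for label, _datum in data:
            # An unseen label is registered implicitly by training.
            self._counts[label] = self._counts.get(label, 0) + 1
        return len(data)

    def get_labels(self):
        """Return a copy of the label -> trained-count mapping."""
        return dict(self._counts)

    def set_label(self, new_label):
        """True if the label was newly added, False if it already existed."""
        if new_label in self._counts:
            return False
        self._counts[new_label] = 0  # ASSUMPTION: new labels start at 0
        return True

    def delete_label(self, target_label):
        """True if the label existed and was removed, False otherwise."""
        return self._counts.pop(target_label, None) is not None


registry = LabelRegistry()
registry.set_label("a")
registry.train([("a", {}), ("b", {})])
print(registry.get_labels())
```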