Config defines the machine learning parameters and feature extraction rules of a Service.

Data Structure

Config classes inherit from the dict class. Here are the default Config contents for the Classifier Service.

>>> from jubakit.classifier import Config
>>> cfg = Config()
>>> print(cfg)
{'converter': {'string_filter_rules': [], 'num_filter_types': {}, 'num_types': {}, 'num_filter_rules': [], 'string_rules': [{'global_weight': 'idf', 'sample_weight': 'tf', 'key': '*', 'type': 'unigram'}], 'string_filter_types': {}, 'num_rules': [{'key': '*', 'type': 'num'}], 'binary_types': {}, 'binary_rules': [], 'string_types': {'bigram': {'method': 'ngram', 'char_num': '2'}, 'trigram': {'method': 'ngram', 'char_num': '3'}, 'unigram': {'method': 'ngram', 'char_num': '1'}}}, 'method': 'AROW', 'parameter': {'regularization_weight': 1.0}}

The data structure is the same as that of the Jubatus servers’ JSON configuration file. See the Jubatus API Reference for details.
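Since Config is a dict laid out like the server configuration file, it can presumably be serialized with the standard json module. The following is a minimal sketch of that idea; it uses a plain dict as a stand-in for a Config instance (an assumption, so that it runs without jubakit installed), and the converter is abbreviated to a few of the default keys shown above.

```python
import json

# Plain dict standing in for a Config instance (an assumption for this
# sketch); the keys and values are a subset of the default output above.
config = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_types": {"unigram": {"method": "ngram", "char_num": "1"}},
        "string_rules": [
            {"key": "*", "type": "unigram",
             "sample_weight": "tf", "global_weight": "idf"}
        ],
        "num_rules": [{"key": "*", "type": "num"}],
    },
}

# Serialize the dict to JSON text.
config_json = json.dumps(config, indent=2, sort_keys=True)
print(config_json)

# Parsing the text again yields an equal dict.
restored = json.loads(config_json)
```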

Machine Learning Parameters

Machine learning parameters consist of Methods and Hyper Parameters. Parameters that work well in most cases are set in the Config class by default, so you can start using machine learning features without configuring them.

You can also create a Config instance with these parameters specified explicitly.

>>> from jubakit.classifier import Config
>>> cfg = Config(method='PA', parameter={'regularization_weight': 1.0})

If you specify only the method, the default parameters for that method are set automatically.

>>> cfg = Config(method='NN')
>>> cfg['parameter']
{'local_sensitivity': 1.0, 'nearest_neighbor_num': 128, 'parameter': {'threads': -1, 'hash_num': 64}, 'method': 'euclid_lsh'}
>>> cfg = Config(method='NHERD')
>>> cfg['parameter']
{'regularization_weight': 1.0}
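The behavior above can be pictured as a lookup in a table of per-method defaults. The sketch below is not jubakit's implementation: SketchConfig and DEFAULT_PARAMETERS are hypothetical names, and the table holds only values that appear in the outputs shown in this section.

```python
import copy

# Hypothetical table of per-method defaults; the values are copied from the
# outputs shown in this section, the table itself is an assumption.
DEFAULT_PARAMETERS = {
    "AROW": {"regularization_weight": 1.0},
    "NHERD": {"regularization_weight": 1.0},
}


class SketchConfig(dict):
    """Hypothetical dict subclass that fills in defaults for a method."""

    def __init__(self, method="AROW", parameter=None):
        super().__init__()
        self["method"] = method
        if parameter is None:
            # Deep-copy so later edits do not leak into the shared table.
            parameter = copy.deepcopy(DEFAULT_PARAMETERS[method])
        self["parameter"] = parameter


# Only the method is given, so the defaults are looked up.
sketch = SketchConfig(method="NHERD")
print(sketch["parameter"])

# An explicit parameter dict takes precedence over the defaults.
custom = SketchConfig(method="AROW", parameter={"regularization_weight": 0.1})
print(custom["parameter"])
```

Copying the defaults keeps one instance's later modifications from affecting other instances created with the same method.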

You can also modify parameters after creating a Config instance, just as you would with a dict object.

>>> print(cfg['method'])
>>> print(cfg['parameter']['regularization_weight'])
>>> cfg['method'] = 'NHERD'
>>> cfg['parameter']['regularization_weight'] = 0.1

Feature Extraction Rules

The default feature extraction rules are as follows:

  • String features are processed as unigrams with TF-IDF weighting. For convenience, bigram and trigram are also defined in string_types by default.
  • Numeric features are processed as is (using num type).
  • Binary features are not processed.
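To make the unigram, bigram and trigram types more concrete: assuming the ngram method splits a string into overlapping runs of char_num characters, the splitting can be sketched in plain Python. The helper char_ngrams is hypothetical and not part of jubakit, and the TF-IDF weighting is not shown.

```python
def char_ngrams(text, char_num):
    """Return the overlapping character n-grams of text (hypothetical helper)."""
    return [text[i:i + char_num] for i in range(len(text) - char_num + 1)]


print(char_ngrams("juba", 1))  # ['j', 'u', 'b', 'a']
print(char_ngrams("juba", 2))  # ['ju', 'ub', 'ba']
print(char_ngrams("juba", 3))  # ['jub', 'uba']
```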

You can clear these default rules by calling the clear_converter method. This is convenient when writing rules from scratch.

>>> cfg = Config()
>>> cfg.clear_converter()
>>> cfg
{'converter': {'string_filter_rules': [], 'num_filter_types': {}, 'num_types': {}, 'num_filter_rules': [], 'string_rules': [], 'string_filter_types': {}, 'num_rules': [], 'binary_types': {}, 'binary_rules': [], 'string_types': {}}, 'method': 'AROW', 'parameter': {'regularization_weight': 1.0}}
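Starting from an emptied converter, rules can then be added one by one. The sketch below uses a plain dict as a stand-in for the cleared Config (an assumption, so that it runs without jubakit) and keeps only the keys it touches; the rule values are taken from the defaults shown earlier. The same assignments would presumably work on a real Config instance, since it behaves like a dict.

```python
# Plain dict standing in for a Config after clear_converter() (an assumption
# for this sketch); only the converter keys used below are included.
cleared = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_types": {},
        "string_rules": [],
        "num_rules": [],
    },
}

converter = cleared["converter"]

# Define a bigram type, then apply it to every string feature ('*').
converter["string_types"]["bigram"] = {"method": "ngram", "char_num": "2"}
converter["string_rules"].append(
    {"key": "*", "type": "bigram", "sample_weight": "tf", "global_weight": "idf"}
)

# Pass every numeric feature through as is.
converter["num_rules"].append({"key": "*", "type": "num"})

print(cleared["converter"])
```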