Config
Config defines the machine learning parameters and feature extraction rules of a Service.
Data Structure
Config classes inherit from the dict class.
Here are the default Config contents for the Classifier Service.
>>> from jubakit.classifier import Config
>>> cfg = Config()
>>> print(cfg)
{'converter': {'string_filter_rules': [], 'num_filter_types': {}, 'num_types': {}, 'num_filter_rules': [], 'string_rules': [{'global_weight': 'idf', 'sample_weight': 'tf', 'key': '*', 'type': 'unigram'}], 'string_filter_types': {}, 'num_rules': [{'key': '*', 'type': 'num'}], 'binary_types': {}, 'binary_rules': [], 'string_types': {'bigram': {'method': 'ngram', 'char_num': '2'}, 'trigram': {'method': 'ngram', 'char_num': '3'}, 'unigram': {'method': 'ngram', 'char_num': '1'}}}, 'method': 'AROW', 'parameter': {'regularization_weight': 1.0}}
The data structure is the same as that of the Jubatus servers’ JSON configuration file. See the Jubatus API Reference for details.
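Because the structure mirrors the server's JSON configuration, a config can be written out with the standard json module. The sketch below uses a plain dict with a trimmed version of the default shape shown above as a stand-in, so that it runs without jubakit installed. It assumes that a Config instance, being a dict subclass, serializes the same way.

```python
import json

# Plain dict standing in for a Config instance (assumption: a trimmed copy
# of the default Classifier config printed above).
cfg = {
    "method": "AROW",
    "parameter": {"regularization_weight": 1.0},
    "converter": {
        "string_types": {"unigram": {"method": "ngram", "char_num": "1"}},
        "string_rules": [
            {"key": "*", "type": "unigram",
             "sample_weight": "tf", "global_weight": "idf"}
        ],
        "num_rules": [{"key": "*", "type": "num"}],
    },
}

# Serialize the nested dict to JSON text.
json_text = json.dumps(cfg, indent=2, sort_keys=True)

# Parsing the text back yields an equal structure.
restored = json.loads(json_text)
print(restored == cfg)
```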
Machine Learning Parameters
Machine learning parameters consist of Methods and Hyper Parameters. Parameters that work well in most cases are set in the Config class by default, so you can start using machine learning features without configuring them.
You can also create a Config instance with these parameters specified explicitly.
>>> from jubakit.classifier import Config
>>> cfg = Config(method='PA', parameter={'regularization_weight': 1.0})
If you specify only method, the default parameter for that method is set automatically.
>>> cfg = Config(method='NN')
>>> cfg['parameter']
{'local_sensitivity': 1.0, 'nearest_neighbor_num': 128, 'parameter': {'threads': -1, 'hash_num': 64}, 'method': 'euclid_lsh'}
>>> cfg = Config(method='NHERD')
>>> cfg['parameter']
{'regularization_weight': 1.0}
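The behavior above can be pictured as a lookup of per-method defaults at construction time. The following is a minimal sketch of that idea, not jubakit's actual implementation: SketchConfig and DEFAULT_PARAMETERS are hypothetical names, and the table holds only the AROW and NHERD values that appear in the outputs above.

```python
import copy

# Hypothetical defaults table; the values come from the outputs shown above,
# but the table itself is illustrative, not jubakit's internals.
DEFAULT_PARAMETERS = {
    "AROW": {"regularization_weight": 1.0},
    "NHERD": {"regularization_weight": 1.0},
}


class SketchConfig(dict):
    """Illustrative dict subclass that fills in per-method defaults."""

    def __init__(self, method="AROW", parameter=None):
        super().__init__()
        self["method"] = method
        if parameter is None:
            # Copy so that later edits do not mutate the shared defaults.
            parameter = copy.deepcopy(DEFAULT_PARAMETERS[method])
        self["parameter"] = parameter


cfg = SketchConfig(method="NHERD")
print(cfg["parameter"])

# Editing the instance leaves the shared defaults table untouched.
cfg["parameter"]["regularization_weight"] = 0.1
print(DEFAULT_PARAMETERS["NHERD"])
```

Copying the defaults matters here because the instance is meant to be edited in place; handing out the shared dict directly would let one instance's changes leak into the next.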
You can also modify parameters after creating a Config instance, just as you would with a dict object.
>>> cfg = Config()
>>> print(cfg['method'])
AROW
>>> print(cfg['parameter']['regularization_weight'])
1.0
>>> cfg['method'] = 'NHERD'
>>> cfg['parameter']['regularization_weight'] = 0.1
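Dict-style access also makes it easy to derive several variants from one base configuration, for example to compare hyperparameter values. The following is a minimal sketch that uses a plain dict as a stand-in for a Config instance; the list of candidate weights is hypothetical.

```python
import copy

# Plain dict standing in for a Config instance (assumption).
base = {"method": "AROW", "parameter": {"regularization_weight": 1.0}}

# Hypothetical candidate values to compare.
candidates = [0.01, 0.1, 1.0]

variants = []
for weight in candidates:
    # deepcopy so that each variant owns its nested 'parameter' dict.
    variant = copy.deepcopy(base)
    variant["parameter"]["regularization_weight"] = weight
    variants.append(variant)

for variant in variants:
    print(variant["method"], variant["parameter"])
```

A shallow copy would share the nested parameter dict between all variants, so every variant would end up with the last weight assigned.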
Feature Extraction Rules
The default feature extraction rules are as follows:
- String features are processed with unigram with TF-IDF weighting. For convenience, bigram and trigram are also defined in string_types by default.
- Numeric features are processed as is (using num type).
- Binary features are not processed.
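To make the string rule concrete, the sketch below shows what character n-gram extraction with term-frequency (tf) counts looks like in principle. It is a plain-Python illustration, not the Jubatus implementation. The function name char_ngrams_tf is hypothetical, and the idf part of the weighting, which depends on statistics over all training data, is omitted.

```python
from collections import Counter


def char_ngrams_tf(text, char_num):
    """Count each character n-gram of length char_num in text."""
    grams = (text[i:i + char_num] for i in range(len(text) - char_num + 1))
    return Counter(grams)


# char_num=1 corresponds to unigram, char_num=2 to bigram.
print(char_ngrams_tf("banana", 1))
print(char_ngrams_tf("banana", 2))
```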
You can clear these default rules by calling the clear_converter method.
This is convenient when writing rules from scratch.
>>> cfg = Config()
>>> cfg.clear_converter()
>>> cfg
{'converter': {'string_filter_rules': [], 'num_filter_types': {}, 'num_types': {}, 'num_filter_rules': [], 'string_rules': [], 'string_filter_types': {}, 'num_rules': [], 'binary_types': {}, 'binary_rules': [], 'string_types': {}}, 'method': 'AROW', 'parameter': {'regularization_weight': 1.0}}
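Starting from a cleared converter, rules can be written back with ordinary dict and list operations. The sketch below works on a plain dict with the same shape as part of the cleared output above, as a stand-in for the converter of a Config instance. The key name 'title' is a hypothetical feature name.

```python
# Plain dict standing in for the converter of a cleared Config (assumption:
# only the keys used in this example are shown).
converter = {
    "string_types": {},
    "string_rules": [],
    "num_types": {},
    "num_rules": [],
}

# Define a bigram type, using the same definition as the default config.
converter["string_types"]["bigram"] = {"method": "ngram", "char_num": "2"}

# Apply bigram with TF-IDF weighting only to the hypothetical 'title' key.
converter["string_rules"].append({
    "key": "title",
    "type": "bigram",
    "sample_weight": "tf",
    "global_weight": "idf",
})

# Keep all numeric features as is.
converter["num_rules"].append({"key": "*", "type": "num"})

print(converter)
```

With a real Config instance the same operations would be applied to cfg['converter'], assuming it behaves like the nested dict printed above.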