Clustering
- See IDL definition for detailed specification.
- See Algorithms for detailed description of algorithms used in this server.
Configuration
The configuration is given as a JSON file. Each field is described below:
-
method Specify the algorithm used for clustering. The following algorithms are available.
Value       Method
"kmeans"    Use k-means
"gmm"       Use Gaussian Mixture Model
"dbscan"    Use DBSCAN
-
parameter Specify parameters for the algorithm. Its format differs for each method.
- kmeans, gmm
k: Number of clusters. (Integer)
- Range: 1 <= k
seed: Specify the seed used to generate random numbers. (Integer)
- Range: 0 <= seed <= \(2^{32} - 1\)
- dbscan
eps: Specify the distance that defines neighbor points. The larger it is, the more points are regarded as neighbors. (Float)
- Range: 0 < eps
min_core_point: Specify the minimum density (number of neighbor points) required to form a cluster. The larger it is, the fewer areas are regarded as clusters. (Integer)
- Range: 1 <= min_core_point
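To make the roles of eps and min_core_point concrete, here is a minimal Python sketch. It is not the server's implementation. It assumes Euclidean distance, plain coordinate tuples instead of datum, an inclusive eps boundary, and that a point counts itself among its own neighbors (a common DBSCAN convention; the server may count differently). The helper name core_points is hypothetical.

```python
import math


def core_points(points, eps, min_core_point):
    """Return indices of points with at least min_core_point neighbors within eps.

    Assumption: a point counts itself as one of its own neighbors.
    """
    if eps <= 0:
        raise ValueError("eps must be > 0")
    if min_core_point < 1:
        raise ValueError("min_core_point must be >= 1")
    result = []
    for i, p in enumerate(points):
        # Count every point (including p itself) that lies within eps of p.
        neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbors >= min_core_point:
            result.append(i)
    return result


# Three points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]

dense = core_points(points, eps=0.5, min_core_point=3)
stricter = core_points(points, eps=0.5, min_core_point=4)
wider = core_points(points, eps=10.0, min_core_point=4)
```

With these inputs, raising min_core_point from 3 to 4 leaves no point dense enough, while raising eps to 10.0 makes every point a neighbor of every other point.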
-
compressor_method Specify the algorithm used for compressing points. The following algorithms are available.
Value            Method
"simple"         No compression
"compressive"    Use coreset compression (only kmeans, gmm)
-
compressor_parameter Specify parameters for the compressor. Its format differs for each compressor_method.
- simple
bucket_size: Number of data points that triggers a mini batch. Clustering runs each time bucket_size data points are pushed. Note that the initial clustering will not run until k data points are pushed when method is kmeans or gmm. (Integer)
- Range: 2 <= bucket_size
- compressive
bucket_size: Number of data points that triggers a mini batch and compression. Clustering runs each time bucket_size data points are pushed. Note that the initial clustering will not run until k data points are pushed when method is kmeans or gmm. (Integer)
- Range: 2 <= bucket_size
bucket_length: Size of mini batch clustering. (Integer)
- Range: 2 <= bucket_length
compressed_bucket_size: Number of data points that remain after bucket_size data points are compressed. Compression ratio = (compressed_bucket_size / bucket_size) (Integer)
- Range: bicriteria_base_size <= compressed_bucket_size <= bucket_size
bicriteria_base_size: Specify the roughness of compression. (Integer)
- Range: 1 <= bicriteria_base_size < compressed_bucket_size
forgetting_factor: Forgetting factor. (Float)
- Range: 0.0 <= forgetting_factor
forgetting_threshold: When the sum of forgetting factors exceeds this value, no further compression is performed. (Float)
- Range: 0.0 <= forgetting_threshold <= 1.0
seed: Specify the seed used to generate random numbers. (Integer)
- Range: 0 <= seed <= \(2^{32} - 1\)
-
distance (optional) Specify the distance function. The values in the table below are available. If distance is omitted, euclidean is used. This option is effective when method is kmeans or dbscan.
Value          Method
"euclidean"    Use Euclidean distance
"cosine"       Use cosine distance
-
converter Specify configuration for data conversion. Its format is described in Data Conversion.
- Example:
{
  "method" : "kmeans",
  "parameter" : {
    "k" : 3,
    "seed" : 0
  },
  "compressor_method" : "compressive",
  "compressor_parameter" : {
    "bucket_size" : 1000,
    "compressed_bucket_size" : 100,
    "bicriteria_base_size" : 10,
    "bucket_length" : 2,
    "forgetting_factor" : 0.0,
    "forgetting_threshold" : 0.5,
    "seed" : 0
  },
  "distance": "euclidean",
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
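The ranges listed for each field can be checked before a configuration file is handed to the server. The following Python sketch is one way to do that. The function name validate_clustering_config and its error messages are hypothetical and not part of the server. The source gives both "<=" and "<" for the relation between bicriteria_base_size and compressed_bucket_size; this sketch assumes the stricter "<". The converter field is omitted from the sample for brevity.

```python
MAX_SEED = 2**32 - 1


def validate_clustering_config(config):
    """Return a list of range violations (empty when none were found)."""
    errors = []
    method = config["method"]
    param = config["parameter"]
    if method in ("kmeans", "gmm"):
        if param["k"] < 1:
            errors.append("k must be >= 1")
        if not 0 <= param["seed"] <= MAX_SEED:
            errors.append("parameter.seed must be in [0, 2^32 - 1]")
    elif method == "dbscan":
        if param["eps"] <= 0:
            errors.append("eps must be > 0")
        if param["min_core_point"] < 1:
            errors.append("min_core_point must be >= 1")
    else:
        errors.append("unknown method: %s" % method)

    comp = config["compressor_parameter"]
    if comp["bucket_size"] < 2:
        errors.append("bucket_size must be >= 2")
    if config["compressor_method"] == "compressive":
        if method == "dbscan":
            errors.append("compressive supports only kmeans and gmm")
        if comp["bucket_length"] < 2:
            errors.append("bucket_length must be >= 2")
        # Assumption: strict "<" between base size and compressed size.
        if not (1 <= comp["bicriteria_base_size"]
                < comp["compressed_bucket_size"] <= comp["bucket_size"]):
            errors.append("need 1 <= bicriteria_base_size"
                          " < compressed_bucket_size <= bucket_size")
        if comp["forgetting_factor"] < 0.0:
            errors.append("forgetting_factor must be >= 0.0")
        if not 0.0 <= comp["forgetting_threshold"] <= 1.0:
            errors.append("forgetting_threshold must be in [0.0, 1.0]")
        if not 0 <= comp["seed"] <= MAX_SEED:
            errors.append("compressor_parameter.seed must be in [0, 2^32 - 1]")

    # distance is optional and defaults to euclidean.
    if config.get("distance", "euclidean") not in ("euclidean", "cosine"):
        errors.append("distance must be euclidean or cosine")
    return errors


config = {
    "method": "kmeans",
    "parameter": {"k": 3, "seed": 0},
    "compressor_method": "compressive",
    "compressor_parameter": {
        "bucket_size": 1000,
        "compressed_bucket_size": 100,
        "bicriteria_base_size": 10,
        "bucket_length": 2,
        "forgetting_factor": 0.0,
        "forgetting_threshold": 0.5,
        "seed": 0,
    },
    "distance": "euclidean",
}

errors = validate_clustering_config(config)
comp = config["compressor_parameter"]
compression_ratio = comp["compressed_bucket_size"] / comp["bucket_size"]
```

For the sample values, compressed_bucket_size / bucket_size is 100 / 1000, so each bucket of 1000 points is summarized by 100 points.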
Data Structures
Methods
-
service clustering
-
bool push(0: list<indexed_point> points)
Parameters:
- points – list of indexed_point for the points. indexed_point is a tuple of string id and datum
Returns: True when the points were added successfully
Adds points.
-
uint get_revision()
Returns: revision of cluster
Returns the revision of the cluster.
-
list<list<weighted_datum>> get_core_members()
Returns: coreset of cluster
Returns the coreset of each cluster as datum. This method is not supported in dbscan.
-
list<list<weighted_index>> get_core_members_light()
Returns: coreset of cluster
Returns the coreset of each cluster as index. This method is not supported in dbscan.
-
datum get_nearest_center(0: datum point)
Parameters:
- point – datum
Returns: nearest cluster center
Returns the nearest cluster center without adding point to the cluster. This method is not supported in dbscan.
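The choice of distance can change which center counts as nearest. The following Python sketch illustrates this. It is not the server's code. It assumes that a converted datum can be treated as a mapping from feature name to float, and that cosine distance means 1 minus cosine similarity. All function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors given as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse vectors given as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_center(point, centers, distance=euclidean_distance):
    """Return the center closest to point; the point itself is not stored."""
    return min(centers, key=lambda center: distance(point, center))


centers = [{"x": 1.0, "y": 0.0}, {"x": 10.0, "y": 10.0}]
point = {"x": 3.0, "y": 3.0}

by_euclidean = nearest_center(point, centers)
by_cosine = nearest_center(point, centers, cosine_distance)
```

Here the point is closer to the first center in Euclidean terms, but it points in the same direction as the second center, so cosine distance selects the second one.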
-
list<weighted_datum> get_nearest_members(0: datum point)
Parameters:
- point – datum
Returns: coreset
Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and datum. This method is not supported in dbscan.
-
list<weighted_index> get_nearest_members_light(0: datum point)
Parameters:
- point – datum
Returns: coreset
Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and id. This method is not supported in dbscan.
-
bool