Clustering

  • See IDL definition for detailed specification.
  • See Algorithms for detailed description of algorithms used in this server.

Configuration

The configuration is given as a JSON file. Each field is described below:

method

Specify the algorithm used for clustering. The following values are available.

Value       Method
"kmeans"    Use k-means
"gmm"       Use Gaussian Mixture Model
"dbscan"    Use DBSCAN
parameter

Specify parameters for the algorithm. Its format differs for each method.

kmeans, gmm
k:

Number of clusters. (Integer)

  • Range: 1 <= k
seed:

Specify the seed used for random number generation. (Integer)

  • Range: 0 <= seed <= \(2^{32} - 1\)
dbscan
eps:

Specify the distance within which points are considered neighbors. The larger the value, the more points are regarded as neighbors. (Float)

  • Range: 0 < eps
min_core_point:

Specify the minimum density (number of neighbor points) required to form a cluster. The larger the value, the fewer areas are regarded as clusters. (Integer)

  • Range: 1 <= min_core_point
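
The ranges listed above can be checked before a configuration is handed to the server. The following is a minimal sketch, not part of the server or any client library: the function name `validate_method_parameter` is hypothetical, and it assumes every listed field is required.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_method_parameter(method, parameter):
    """Return a list of violations of the documented ranges (empty if none).

    Hypothetical helper: it only mirrors the ranges in this document and
    assumes that every listed field must be present.
    """
    errors = []
    if method in ("kmeans", "gmm"):
        k = parameter.get("k")
        seed = parameter.get("seed")
        if not _is_int(k) or k < 1:
            errors.append("k must be an integer with 1 <= k")
        if not _is_int(seed) or not 0 <= seed <= 2**32 - 1:
            errors.append("seed must be an integer with 0 <= seed <= 2^32 - 1")
    elif method == "dbscan":
        eps = parameter.get("eps")
        min_core_point = parameter.get("min_core_point")
        if _is_int(eps) or isinstance(eps, float):
            if eps <= 0:
                errors.append("eps must satisfy 0 < eps")
        else:
            errors.append("eps must be a number")
        if not _is_int(min_core_point) or min_core_point < 1:
            errors.append("min_core_point must be an integer with 1 <= min_core_point")
    else:
        errors.append("unknown method: %r" % (method,))
    return errors
```

An empty list means that no documented range is violated; it does not guarantee that the server accepts the configuration.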
compressor_method

Specify the algorithm used for compressing points. The following values are available.

Value           Method
"simple"        No compression
"compressive"   Use coreset compression (kmeans and gmm only)
compressor_parameter

Specify parameters for the compressor. Its format differs for each compressor_method.

simple
bucket_size:

Number of data points that triggers a mini batch. Clustering runs each time bucket_size data points are pushed. Note that when method is kmeans or gmm, the initial clustering does not run until k data points have been pushed. (Integer)

  • Range: 2 <= bucket_size
compressive
bucket_size:

Number of data points that triggers a mini batch and compression. Clustering runs each time bucket_size data points are pushed. Note that when method is kmeans or gmm, the initial clustering does not run until k data points have been pushed. (Integer)

  • Range: 2 <= bucket_size
bucket_length:

Size of mini batch clustering. (Integer)

  • Range: 2 <= bucket_length
compressed_bucket_size:

Number of data points in a bucket after compression. Compression ratio = compressed_bucket_size / bucket_size. (Integer)

  • Range: bicriteria_base_size <= compressed_bucket_size <= bucket_size
bicriteria_base_size:

Specify the roughness of compression. (Integer)

  • Range: 1 <= bicriteria_base_size < compressed_bucket_size
forgetting_factor:

Forgetting factor. (Float)

  • Range: 0.0 <= forgetting_factor
forgetting_threshold:

When the sum of forgetting factors exceeds this value, no further compression is performed. (Float)

  • Range: 0.0 <= forgetting_threshold <= 1.0
seed:

Specify the seed used for random number generation. (Integer)

  • Range: 0 <= seed <= \(2^{32} - 1\)
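
The constraints on the compressive parameters depend on each other, so it can help to check them together. The following is a minimal sketch under stated assumptions: the function name `check_compressive_parameter` is hypothetical, all fields are assumed to be present, and where the two stated ranges for bicriteria_base_size differ (`<=` versus `<`), the stricter one is used.

```python
def check_compressive_parameter(p):
    """Check the documented ranges for the "compressive" compressor.

    Returns (errors, compression_ratio). Hypothetical helper that only
    mirrors the ranges listed in this document.
    """
    errors = []
    if p["bucket_size"] < 2:
        errors.append("2 <= bucket_size")
    if p["bucket_length"] < 2:
        errors.append("2 <= bucket_length")
    if p["compressed_bucket_size"] > p["bucket_size"]:
        errors.append("compressed_bucket_size <= bucket_size")
    # Assumption: the stricter of the two stated bounds is applied here.
    if not 1 <= p["bicriteria_base_size"] < p["compressed_bucket_size"]:
        errors.append("1 <= bicriteria_base_size < compressed_bucket_size")
    if p["forgetting_factor"] < 0.0:
        errors.append("0.0 <= forgetting_factor")
    if not 0.0 <= p["forgetting_threshold"] <= 1.0:
        errors.append("0.0 <= forgetting_threshold <= 1.0")
    if not 0 <= p["seed"] <= 2**32 - 1:
        errors.append("0 <= seed <= 2^32 - 1")

    # Compression ratio = compressed_bucket_size / bucket_size.
    ratio = None
    if p["bucket_size"] > 0:
        ratio = p["compressed_bucket_size"] / p["bucket_size"]
    return errors, ratio


# Illustrative values: 1000 points compressed to 100 gives a ratio of 0.1.
errors, ratio = check_compressive_parameter({
    "bucket_size": 1000,
    "compressed_bucket_size": 100,
    "bicriteria_base_size": 10,
    "bucket_length": 2,
    "forgetting_factor": 0.0,
    "forgetting_threshold": 0.5,
    "seed": 0,
})
```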
converter

Specify configuration for data conversion. Its format is described in Data Conversion.

Example:
{
  "method" : "kmeans",
  "parameter" : {
    "k" : 3,
    "seed" : 0
  },
  "compressor_method" : "compressive",
  "compressor_parameter" : {
    "bucket_size" : 1000,
    "compressed_bucket_size" : 100,
    "bicriteria_base_size" : 10,
    "bucket_length" : 2,
    "forgetting_factor" : 0.0,
    "forgetting_threshold" : 0.5,
    "seed" : 0
  },
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
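
For comparison, a dbscan configuration uses the "simple" compressor, since coreset compression is available only for kmeans and gmm. The following sketch builds such a configuration and serializes it to JSON. The values of eps, min_core_point and bucket_size are illustrative assumptions, not recommended settings, and the converter block is copied from the example above.

```python
import json

# Illustrative dbscan configuration; the numeric values are assumptions.
config = {
    "method": "dbscan",
    "parameter": {
        "eps": 2.0,            # 0 < eps
        "min_core_point": 3,   # 1 <= min_core_point
    },
    "compressor_method": "simple",  # no compression
    "compressor_parameter": {
        "bucket_size": 100,    # 2 <= bucket_size
    },
    "converter": {
        "string_filter_types": {},
        "string_filter_rules": [],
        "num_filter_types": {},
        "num_filter_rules": [],
        "string_types": {},
        "string_rules": [
            {"key": "*", "type": "str",
             "sample_weight": "bin", "global_weight": "bin"}
        ],
        "num_types": {},
        "num_rules": [
            {"key": "*", "type": "num"}
        ],
    },
}

# Serialize to the JSON text that would be saved as the configuration file.
config_text = json.dumps(config, indent=2)
```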

Data Structures

message weighted_datum
0: double weight
1: datum point
message indexed_point
0: string id
1: datum point
message weighted_index
0: double weight
1: string id
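
To make the shape of these messages concrete, they can be mirrored as plain Python classes. This is only an illustrative sketch and not the generated client types: the class names are hypothetical, and datum is represented here by a simple dictionary of key-value pairs, which is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Assumption: a datum is stood in for by a dict of string or numeric values.
Datum = Dict[str, Union[str, float]]


@dataclass
class WeightedDatum:
    weight: float   # field 0
    point: Datum    # field 1


@dataclass
class IndexedPoint:
    id: str         # field 0
    point: Datum    # field 1


@dataclass
class WeightedIndex:
    weight: float   # field 0
    id: str         # field 1


# An indexed_point pairs a caller-chosen id with the data itself.
sample_point = IndexedPoint(id="user-1", point={"x": 1.0, "y": 2.0})
```

The two "light" variants of the methods below return weighted_index instead of weighted_datum, so only ids and weights are transferred rather than the full data.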

Methods

service clustering
bool push(0: list<indexed_point> points)
Parameters:
  • points – list of indexed_point to add. Each indexed_point is a tuple of a string id and a datum.
Returns:

True when the points were added successfully

Adds points.

uint get_revision()
Returns: revision of cluster

Returns the revision of the cluster.

list<list<weighted_datum>> get_core_members()
Returns: coreset of cluster

Returns the coreset of each cluster as a list of weighted datum. This method is not supported in dbscan.

list<list<weighted_index>> get_core_members_light()
Returns: coreset of cluster

Returns the coreset of each cluster as a list of weighted ids. This method is not supported in dbscan.

list<datum> get_k_center()
Returns: cluster centers

Returns k cluster centers.

datum get_nearest_center(0: datum point)
Parameters:
  • point – datum for which the nearest cluster center is looked up
Returns:

nearest cluster center

Returns nearest cluster center without adding point to cluster. This method is not supported in dbscan.

list<weighted_datum> get_nearest_members(0: datum point)
Parameters:
  • point – datum for which the nearest cluster is looked up
Returns:

coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and datum. This method is not supported in dbscan.

list<weighted_index> get_nearest_members_light(0: datum point)
Parameters:
  • point – datum for which the nearest cluster is looked up
Returns:

coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and id. This method is not supported in dbscan.
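
The "light" results are lists of (weight, id) tuples, so they can be summarized without transferring any datum. The following sketch does not call the server. It works on hand-written sample data in the shapes described above, and the helper names `cluster_weights` and `top_members` are hypothetical.

```python
def cluster_weights(core_members_light):
    """Total weight per cluster for a list<list<weighted_index>> result."""
    return [sum(weight for weight, _id in cluster)
            for cluster in core_members_light]


def top_members(nearest_members_light, n=3):
    """The n heaviest (weight, id) entries of a list<weighted_index> result."""
    return sorted(nearest_members_light,
                  key=lambda member: member[0], reverse=True)[:n]


# Made-up sample in the shape returned by get_core_members_light().
sample_core = [
    [(1.0, "a"), (2.0, "b")],   # cluster 0
    [(0.5, "c")],               # cluster 1
]

# Made-up sample in the shape returned by get_nearest_members_light().
sample_nearest = [(1.0, "a"), (2.0, "b"), (0.5, "c")]

weights = cluster_weights(sample_core)     # [3.0, 0.5]
heaviest = top_members(sample_nearest, 2)  # [(2.0, "b"), (1.0, "a")]
```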