Clustering

  • See IDL definition for detailed specification.
  • See Algorithms for detailed description of algorithms used in this server.

Configuration

The configuration is given as a JSON file. Each field is described below:

method

Specify the algorithm used for clustering. The following algorithms are available.

Value      Method
"kmeans"   Use k-means
"gmm"      Use Gaussian Mixture Model
"dbscan"   Use DBSCAN
parameter

Specify parameters for the algorithm. Its format differs for each method.

kmeans, gmm
k:

Number of clusters. (Integer)

  • Range: 1 <= k
seed:

Specify the seed used for random number generation. (Integer)

  • Range: 0 <= seed <= \(2^{32} - 1\)
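The ranges above can be checked before the configuration is handed to the server. The following is a minimal sketch, not part of the server itself: the function name `check_centroid_parameter` is hypothetical, and treating both fields as required is an assumption.

```python
import json

UINT32_MAX = 2**32 - 1  # upper bound for seed stated above


def check_centroid_parameter(parameter):
    """Return a list of range violations for a kmeans/gmm "parameter" object.

    Hypothetical helper: both fields are assumed to be required.
    """
    errors = []
    k = parameter.get("k")
    seed = parameter.get("seed")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if not isinstance(k, int) or isinstance(k, bool) or k < 1:
        errors.append("k must be an integer with 1 <= k")
    if (not isinstance(seed, int) or isinstance(seed, bool)
            or not 0 <= seed <= UINT32_MAX):
        errors.append("seed must be an integer with 0 <= seed <= 2^32 - 1")
    return errors


parameter = json.loads('{"k": 3, "seed": 0}')
print(check_centroid_parameter(parameter))
```

An empty list means that both values fall inside the documented ranges.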
dbscan
eps:

Specify the distance threshold that defines neighbor points. The larger it is, the more points are regarded as neighbors. (Float)

  • Range: 0 < eps
min_core_point:

Specify the minimum density (number of neighbor points) required to form a cluster. The larger it is, the fewer regions are regarded as clusters. (Integer)

  • Range: 1 <= min_core_point
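To illustrate how eps and min_core_point interact, the sketch below counts the neighbors of a point and tests the density condition. It is an illustration of the general DBSCAN idea, not the server's implementation: the helper names are hypothetical, and whether the point itself is counted and whether the comparison is `<=` or `<` are assumptions.

```python
import math


def neighbors_within_eps(points, index, eps):
    """Indices of the other points lying within distance eps of points[index]."""
    center = points[index]
    return [i for i, p in enumerate(points)
            if i != index and math.dist(center, p) <= eps]


def is_core_point(points, index, eps, min_core_point):
    """True if points[index] has at least min_core_point neighbors (assumed rule)."""
    return len(neighbors_within_eps(points, index, eps)) >= min_core_point


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core_point(points, 0, eps=0.5, min_core_point=2))
print(is_core_point(points, 3, eps=0.5, min_core_point=1))
```

Raising eps lets more points count as neighbors, while raising min_core_point makes the density condition harder to satisfy.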
compressor_method

Specify the algorithm used for compressing points. The following algorithms are available.

Value           Method
"simple"        No compression
"compressive"   Use coreset compression (kmeans and gmm only)
compressor_parameter

Specify parameters for the compressor. Its format differs for each compressor_method.

simple
bucket_size:

Number of data points that triggers a mini-batch. Clustering runs each time bucket_size data points are pushed. Note that, when method is kmeans or gmm, the initial clustering does not run until k data points have been pushed. (Integer)

  • Range: 2 <= bucket_size
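The trigger rule described for bucket_size can be pictured with a small simulation. This is only a sketch of one reading of the text: it assumes clustering runs whenever the cumulative number of pushed points reaches a multiple of bucket_size, and that runs are skipped while fewer than k points have been pushed. The function name is hypothetical.

```python
def clustering_triggers(num_pushed, bucket_size, k):
    """Cumulative point counts at which clustering would run (assumed rule)."""
    triggers = []
    for count in range(1, num_pushed + 1):
        # A full bucket has arrived, and at least k points exist in total.
        if count % bucket_size == 0 and count >= k:
            triggers.append(count)
    return triggers


print(clustering_triggers(num_pushed=10, bucket_size=3, k=5))
```

Under these assumptions the run at 3 points is skipped because fewer than k = 5 points are available, so clustering first runs at 6 points.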
compressive
bucket_size:

Number of data points that triggers a mini-batch and compression. Clustering runs each time bucket_size data points are pushed. Note that, when method is kmeans or gmm, the initial clustering does not run until k data points have been pushed. (Integer)

  • Range: 2 <= bucket_size
bucket_length:

Size of mini batch clustering. (Integer)

  • Range: 2 <= bucket_length
compressed_bucket_size:

Number of data points in a bucket after compression. Compression ratio = (compressed_bucket_size / bucket_size). (Integer)

  • Range: bicriteria_base_size <= compressed_bucket_size <= bucket_size
bicriteria_base_size:

Specify the roughness of compression. (Integer)

  • Range: 1 <= bicriteria_base_size < compressed_bucket_size
forgetting_factor:

Forgetting factor. (Float)

  • Range: 0.0 <= forgetting_factor
forgetting_threshold:

When the sum of forgetting factors exceeds this value, no further compression is performed. (Float)

  • Range: 0.0 <= forgetting_threshold <= 1.0
seed:

Specify the seed used for random number generation. (Integer)

  • Range: 0 <= seed <= \(2^{32} - 1\)
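The ranges for the compressive compressor depend on one another, so it can help to check them together. The sketch below is a hypothetical validator, not part of the server. The two documented ranges differ on whether bicriteria_base_size may equal compressed_bucket_size; the stricter form (`<`) is used here so that an accepted configuration satisfies both.

```python
UINT32_MAX = 2**32 - 1


def check_compressive_parameter(p):
    """Return a list of range violations for a "compressive" compressor_parameter."""
    errors = []
    if p["bucket_size"] < 2:
        errors.append("2 <= bucket_size")
    if p["bucket_length"] < 2:
        errors.append("2 <= bucket_length")
    if p["compressed_bucket_size"] > p["bucket_size"]:
        errors.append("compressed_bucket_size <= bucket_size")
    # Stricter of the two documented forms (see the note above).
    if not 1 <= p["bicriteria_base_size"] < p["compressed_bucket_size"]:
        errors.append("1 <= bicriteria_base_size < compressed_bucket_size")
    if p["forgetting_factor"] < 0.0:
        errors.append("0.0 <= forgetting_factor")
    if not 0.0 <= p["forgetting_threshold"] <= 1.0:
        errors.append("0.0 <= forgetting_threshold <= 1.0")
    if not 0 <= p["seed"] <= UINT32_MAX:
        errors.append("0 <= seed <= 2^32 - 1")
    return errors


def compression_ratio(p):
    """Compression ratio = compressed_bucket_size / bucket_size."""
    return p["compressed_bucket_size"] / p["bucket_size"]


params = {
    "bucket_size": 1000, "compressed_bucket_size": 100,
    "bicriteria_base_size": 10, "bucket_length": 2,
    "forgetting_factor": 0.0, "forgetting_threshold": 0.5, "seed": 0,
}
print(check_compressive_parameter(params), compression_ratio(params))
```

With 1000 points per bucket compressed to 100, the compression ratio is 0.1.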
distance (optional)

Specify the distance function. The available values are listed in the table below. If distance is omitted, euclidean is used. This option takes effect only when method is kmeans or dbscan.

Value         Method
"euclidean"   Use Euclidean distance
"cosine"      Use cosine distance
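For reference, the two distance functions can be sketched on plain numeric vectors as follows. This is an illustration under assumptions: cosine distance is taken here as 1 minus cosine similarity, which is a common definition, and the server's exact formula and its handling of zero vectors are not stated in this document.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity (assumed definition); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Euclidean distance depends on vector magnitude, whereas cosine distance depends only on direction, so two vectors pointing the same way have cosine distance 0 regardless of their lengths.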
converter

Specify configuration for data conversion. Its format is described in Data Conversion.

Example:
{
  "method" : "kmeans",
  "parameter" : {
    "k" : 3,
    "seed" : 0
  },
  "compressor_method" : "compressive",
  "compressor_parameter" : {
    "bucket_size" : 1000,
    "compressed_bucket_size" : 100,
    "bicriteria_base_size" : 10,
    "bucket_length" : 2,
    "forgetting_factor" : 0.0,
    "forgetting_threshold" : 0.5,
    "seed" : 0
  },
  "distance": "euclidean",
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
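Some fields constrain one another: compressive is available only for kmeans and gmm, and distance takes effect only for kmeans and dbscan. The following sketch checks these cross-field rules on a configuration shaped like the example above. The function name is hypothetical, and the parameter, compressor_parameter, and converter sections are omitted for brevity.

```python
def cross_field_problems(config):
    """Return a list of cross-field problems found in a clustering config."""
    problems = []
    method = config.get("method")
    compressor = config.get("compressor_method")
    if method not in ("kmeans", "gmm", "dbscan"):
        problems.append("unknown method: %r" % (method,))
    if compressor not in ("simple", "compressive"):
        problems.append("unknown compressor_method: %r" % (compressor,))
    if compressor == "compressive" and method not in ("kmeans", "gmm"):
        problems.append("compressive is only available for kmeans and gmm")
    if "distance" in config and method not in ("kmeans", "dbscan"):
        problems.append("distance is only effective for kmeans and dbscan")
    return problems


config = {
    "method": "kmeans",
    "compressor_method": "compressive",
    "distance": "euclidean",
}
print(cross_field_problems(config))
```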

Data Structures

message weighted_datum
0: double weight
1: datum point
message indexed_point
0: string id
1: datum point
message weighted_index
0: double weight
1: string id
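The three messages can be mirrored as plain Python data classes to make their shapes concrete. This is only a sketch: the class names are hypothetical, and `Datum` is a simplified stand-in holding two dictionaries, since the actual datum type is defined elsewhere (see Data Conversion).

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Datum:
    # Hypothetical stand-in for datum: feature name -> value.
    num_values: Dict[str, float] = field(default_factory=dict)
    string_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class WeightedDatum:
    weight: float  # field 0
    point: Datum   # field 1


@dataclass
class IndexedPoint:
    id: str        # field 0
    point: Datum   # field 1


@dataclass
class WeightedIndex:
    weight: float  # field 0
    id: str        # field 1


p = IndexedPoint(id="user-1", point=Datum(num_values={"x": 1.0, "y": 2.0}))
print(p)
```

weighted_index carries the same weight as weighted_datum but refers to the point by id instead of embedding the datum.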

Methods

service clustering
bool push(0: list<indexed_point> points)
Parameters:
  • points – list of indexed_point to add. Each indexed_point is a tuple of a string id and a datum.
Returns:

True when the points were added successfully

Adds points.

uint get_revision()
Returns: revision of cluster

Returns the revision of the cluster.

list<list<weighted_datum>> get_core_members()
Returns: coreset of cluster

Returns the coreset of each cluster as a list of weighted datum. This method is not supported in dbscan.

list<list<weighted_index>> get_core_members_light()
Returns: coreset of cluster

Returns the coreset of each cluster as a list of weighted ids. This method is not supported in dbscan.

list<datum> get_k_center()
Returns: cluster centers

Returns k cluster centers.

datum get_nearest_center(0: datum point)
Parameters:
  • point – datum of the query point
Returns:

nearest cluster center

Returns the cluster center nearest to point, without adding point to the cluster. This method is not supported in dbscan.

list<weighted_datum> get_nearest_members(0: datum point)
Parameters:
  • point – datum of the query point
Returns:

coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and datum. This method is not supported in dbscan.

list<weighted_index> get_nearest_members_light(0: datum point)
Parameters:
  • point – datum of the query point
Returns:

coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and id. This method is not supported in dbscan.