Clustering¶

See the IDL definition for the detailed specification.
See Algorithms for a detailed description of the algorithms used in this server.

Configuration¶

Configuration is given as a JSON file. Each field is described below:

method

Specify the algorithm for clustering. The following algorithms are available.

Value	Method
`"kmeans"`	Use k-means
`"gmm"`	Use Gaussian Mixture Model
`"dbscan"`	Use dbscan

parameter

Specify parameters for the algorithm. Its format differs for each method.

kmeans, gmm

k:
Number of clusters. (Integer)

Range: 1 <= k

seed:

Specify the seed used for random number generation. (Integer)

Range: 0 <= seed <= \(2^{32} - 1\)

dbscan

eps:

Specify the distance within which points are regarded as neighbors. The larger it is, the more points are regarded as neighbors. (Float)

Range: 0 < eps

min_core_point:

Specify the minimum density (number of neighbor points) required to form a cluster. The larger it is, the fewer areas are regarded as clusters. (Integer)

Range: 1 <= min_core_point
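The ranges listed for parameter can be checked before a configuration is handed to the server. The following is a minimal sketch under that reading: `validate_parameter` is a hypothetical helper written for illustration and is not part of the server or its client libraries.

```python
def validate_parameter(method, parameter):
    """Return a list of range violations for the "parameter" object.

    Hypothetical helper: it only mirrors the ranges documented above.
    """
    errors = []
    if method in ("kmeans", "gmm"):
        if parameter["k"] < 1:
            errors.append("k must satisfy 1 <= k")
        if not 0 <= parameter["seed"] <= 2**32 - 1:
            errors.append("seed must satisfy 0 <= seed <= 2^32 - 1")
    elif method == "dbscan":
        if not parameter["eps"] > 0:
            errors.append("eps must satisfy 0 < eps")
        if parameter["min_core_point"] < 1:
            errors.append("min_core_point must satisfy 1 <= min_core_point")
    else:
        errors.append("unknown method: " + repr(method))
    return errors
```

An empty list means every documented range holds, for example for `validate_parameter("kmeans", {"k": 3, "seed": 0})`.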

compressor_method

Specify the algorithm for compressing points. The following algorithms are available.

Value	Method
`"simple"`	No compression
`"compressive"`	Use coreset compression (kmeans and gmm only)

compressor_parameter

Specify parameters for the compressor. Its format differs for each compressor_method.

simple

bucket_size:

Number of data points that triggers a mini batch. Clustering runs each time bucket_size data points are pushed. Note that when method is kmeans or gmm, the initial clustering does not run until k data points have been pushed. (Integer)

Range: 2 <= bucket_size

compressive

bucket_size:

Number of data points that triggers a mini batch and compression. Clustering runs each time `bucket_size` data points are pushed. Note that when `method` is `kmeans` or `gmm`, the initial clustering does not run until `k` data points have been pushed. (Integer)

Range: 2 <= `bucket_size`

bucket_length:

Size of mini batch clustering. (Integer)

Range: 2 <= `bucket_length`

compressed_bucket_size:

Number of data points that remain after a bucket of `bucket_size` points is compressed. Compression ratio = (`compressed_bucket_size` / `bucket_size`). (Integer)

Range: `bicriteria_base_size` <= `compressed_bucket_size` <= `bucket_size`

bicriteria_base_size:

Specify roughness of compression. (Integer)

Range: 1 <= `bicriteria_base_size` < `compressed_bucket_size`

forgetting_factor:

Forgetting factor. (Float)

Range: 0.0 <= `forgetting_factor`

forgetting_threshold:

When the summation of forgetting factors exceeds this value, it will not compress any more. (Float)

Range: 0.0 <= `forgetting_threshold` <= 1.0

seed:

Specify the seed used for random number generation. (Integer)

Range: 0 <= `seed` <= \(2^{32} - 1\)
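The compressive parameters constrain one another, so it can help to check them together and compute the resulting compression ratio. The sketch below assumes the parameters are held in a plain dictionary; `compression_ratio` is a hypothetical helper, not an API of the server.

```python
def compression_ratio(p):
    """Check the documented ranges, then return compressed_bucket_size / bucket_size.

    Hypothetical helper: p is the "compressor_parameter" object as a dict.
    """
    checks = [
        (2 <= p["bucket_size"], "2 <= bucket_size"),
        (2 <= p["bucket_length"], "2 <= bucket_length"),
        (p["compressed_bucket_size"] <= p["bucket_size"],
         "compressed_bucket_size <= bucket_size"),
        (1 <= p["bicriteria_base_size"] < p["compressed_bucket_size"],
         "1 <= bicriteria_base_size < compressed_bucket_size"),
        (0.0 <= p["forgetting_factor"], "0.0 <= forgetting_factor"),
        (0.0 <= p["forgetting_threshold"] <= 1.0,
         "0.0 <= forgetting_threshold <= 1.0"),
        (0 <= p["seed"] <= 2**32 - 1, "0 <= seed <= 2^32 - 1"),
    ]
    for ok, rule in checks:
        if not ok:
            raise ValueError("violated range: " + rule)
    return p["compressed_bucket_size"] / p["bucket_size"]
```

With `bucket_size` 1000 and `compressed_bucket_size` 100, the ratio is 100 / 1000 = 0.1, provided the remaining fields are within range.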

distance (optional)

Specify the distance function, using one of the values in the table below. If distance is omitted, euclidean is used. This option takes effect only when method is kmeans or dbscan.

Value	Method
`"euclidean"`	Use Euclidean distance
`"cosine"`	Use cosine distance
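For reference, the two distance functions can be sketched as follows. This is an illustration only and rests on two assumptions not stated in this document: a point is represented as a dictionary from feature name to numeric value (missing features count as 0), and cosine distance is taken as one minus cosine similarity, which is the common definition. The server's exact computation may differ.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two sparse points (dict: feature -> value)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity of the two sparse points."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, cosine distance ignores vector length: points that lie in the same direction are at distance 0 regardless of magnitude, whereas Euclidean distance grows with the difference in magnitude.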

converter

Specify the configuration for data conversion. Its format is described in Data Conversion.

Example:

{
  "method" : "kmeans",
  "parameter" : {
    "k" : 3,
    "seed" : 0
  },
  "compressor_method" : "compressive",
  "compressor_parameter" : {
    "bucket_size" : 1000,
    "compressed_bucket_size" : 100,
    "bicriteria_base_size" : 10,
    "bucket_length" : 2,
    "forgetting_factor" : 0.0,
    "forgetting_threshold" : 0.5,
    "seed" : 0
  },
  "distance": "euclidean",
  "converter" : {
    "string_filter_types" : {},
    "string_filter_rules" : [],
    "num_filter_types" : {},
    "num_filter_rules" : [],
    "string_types" : {},
    "string_rules" : [
      { "key" : "*", "type" : "str", "sample_weight" : "bin", "global_weight" : "bin" }
    ],
    "num_types" : {},
    "num_rules" : [
      { "key" : "*", "type" : "num" }
    ]
  }
}
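A dbscan configuration follows the same layout. The sketch below assembles one in Python and serializes it to JSON; the values of `eps`, `min_core_point` and `bucket_size` are illustrative assumptions, not recommended settings. It uses the simple compressor because compressive is documented for kmeans and gmm only.

```python
import json

# Illustrative dbscan configuration; numeric values are assumptions.
config = {
    "method": "dbscan",
    "parameter": {
        "eps": 2.0,            # must satisfy 0 < eps
        "min_core_point": 3,   # must satisfy 1 <= min_core_point
    },
    "compressor_method": "simple",  # "compressive" is for kmeans and gmm only
    "compressor_parameter": {
        "bucket_size": 100,    # must satisfy 2 <= bucket_size
    },
    "distance": "euclidean",
    "converter": {
        "string_filter_types": {},
        "string_filter_rules": [],
        "num_filter_types": {},
        "num_filter_rules": [],
        "string_types": {},
        "string_rules": [
            {"key": "*", "type": "str", "sample_weight": "bin", "global_weight": "bin"}
        ],
        "num_types": {},
        "num_rules": [
            {"key": "*", "type": "num"}
        ],
    },
}

# Serialize to the JSON text that would be stored in the configuration file.
config_json = json.dumps(config, indent=2)
```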

Data Structures¶

message weighted_datum¶

0: double weight¶

1: datum point¶

message indexed_point¶

0: string id¶

1: datum point¶

message weighted_index¶

0: double weight¶

1: string id¶
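As a reading aid, the three messages can be mirrored by plain Python data classes. This is a sketch only: the class names are hypothetical and are not the generated client types, and `Datum` is a placeholder alias because the datum type is defined elsewhere. Field order follows the indices in the messages above.

```python
from dataclasses import dataclass
from typing import Dict

# Placeholder for the datum type, which is defined outside this section.
Datum = Dict[str, float]


@dataclass
class WeightedDatum:
    weight: float  # 0: double weight
    point: Datum   # 1: datum point


@dataclass
class IndexedPoint:
    id: str        # 0: string id
    point: Datum   # 1: datum point


@dataclass
class WeightedIndex:
    weight: float  # 0: double weight
    id: str        # 1: string id
```

`WeightedIndex` carries only the id of a point, while `WeightedDatum` carries the point itself.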

Methods¶

service clustering

bool push(0: list<indexed_point> points)¶

Parameters:	points – list of `indexed_point` for the points. `indexed_point` is a tuple of string id and datum
Returns:	True when the points were added successfully

Adds points.

uint get_revision()¶

Returns:	revision of cluster

Returns the revision of the cluster.

list<list<weighted_datum>> get_core_members()¶

Returns:	coreset of cluster

Returns the coreset of each cluster as a list of weighted datum. This method is not supported in dbscan.

list<list<weighted_index>> get_core_members_light()¶

Returns:	coreset of cluster

Returns the coreset of each cluster as a list of weighted ids. This method is not supported in dbscan.

list<datum> get_k_center()¶

Returns:	cluster centers

Returns k cluster centers.

datum get_nearest_center(0: datum point)¶

Parameters:	point – `datum`
Returns:	nearest cluster center

Returns the nearest cluster center without adding point to the cluster. This method is not supported in dbscan.
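The lookup can be pictured with a small in-memory sketch. It is not the server implementation: `nearest_center` is a hypothetical function, points are assumed to be dictionaries from feature name to value, and squared Euclidean distance is assumed as the metric. As in the method description, the query point is only compared against the centers and nothing is added to any cluster.

```python
def nearest_center(centers, point):
    """Return the center closest to point; centers is left unmodified.

    Hypothetical illustration using squared Euclidean distance.
    """
    if not centers:
        raise ValueError("no cluster centers available")

    def squared_distance(center):
        keys = set(center) | set(point)
        return sum((center.get(k, 0.0) - point.get(k, 0.0)) ** 2 for k in keys)

    return min(centers, key=squared_distance)
```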

list<weighted_datum> get_nearest_members(0: datum point)¶

Parameters:	point – `datum`
Returns:	coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and datum. This method is not supported in dbscan.

list<weighted_index> get_nearest_members_light(0: datum point)¶

Parameters:	point – `datum`
Returns:	coreset

Returns the summary (coreset) of the cluster nearest to point. Its format is a list of tuples of weight and id. This method is not supported in dbscan.