Nearest Neighbor¶

See IDL definition for detailed specification.
See Algorithms for detailed description of algorithms used in this server.

Configuration¶

Configuration is given as a JSON file. We show each field below:

method

Specify algorithm for nearest neighbor. You can use these algorithms.

Value	Method
`"lsh"`	Use Locality Sensitive Hashing based on cosine similarity.
`"minhash"`	Use MinHash. [Ping2010]
`"euclid_lsh"`	Use LSH based on cosine similarity for nearest neighbor search with Euclidean distance.

parameter

Specify parameters for the algorithm. Its format differs for each method.

common

threads:

The number of threads execute random_projection and search. The bigger it is, query latency becomes smaller because data is divided into several parts and processed by multiple threads in parallel. This option has been added since 0.9.0. Single thread is used in before versions. (Integer)

The behavior of this option varies as below:

threads < 0
- threads is set to the number of logical CPU cores
threads = 0
- The same behavior as threads is set to 1.
1 <= threads <= The number of logical cores of CPU
- The number of threads is set to threads .
The number of logical cores of CPU < threads .
- The number of threads is set to the number of logical CPU cores. In addtion, data points are divided into threads parts.

note: threads can be omitted (It works with 1 thread) .

lsh

hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

Range: 1 <= hash_num

minhash

hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

Range: 1 <= hash_num

euclid_lsh

hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

Range: 1 <= hash_num

converter: Specify configuration for data conversion. Its format is described in Data Conversion.

Example:

{
  "method": "lsh",
  "parameter" : {
    "hash_num" : 64
  },
  "converter" : {
    "string_filter_types": {},
    "string_filter_rules":[],
    "num_filter_types": {},
    "num_filter_rules": [],
    "string_types": {},
    "string_rules":[
      {"key" : "*", "type" : "str", "sample_weight":"bin", "global_weight" : "bin"}
    ],
    "num_types": {},
    "num_rules": [
      {"key" : "*", "type" : "num"}
    ]
  }
}

Data Structures¶

message id_with_score¶

Represents ID with its score.

0: string id¶: Data ID.

1: double score¶: Score.

message id_with_score {
  0: string id
  1: double score
}

Methods¶

service nearest_neighbor¶

bool set_row(0: string id, 1: datum d)¶

Parameters:	id – row ID row – `datum` for the row
Returns:	True if this function updates models successfully

Updates the row whose id is id with given row. If the row with the same id already exists, the row is overwritten with row (note that this behavior is different from that of recommender). Otherwise, new row entry will be created. If the server that manages the row and the server that received this RPC request are same, this operation is reflected instantly. If not, update operation is reflected after mix.

list<id_with_score> neighbor_row_from_id(0: string id, 1: uint size)¶

Parameters:	id – row ID in the nearest neighbor search table size – number of rows to be returned
Returns:	row IDs that are the nearest to the row `id` and their distance values

Returns size rows (at maximum) that have most similar datum to id and their distance values.

list<id_with_score> neighbor_row_from_datum(0: datum query, 1: uint size)¶

Parameters:	query – `datum` for nearest neighbor search size – number of rows to be returned
Returns:	row IDs that are the nearest to `query` and their distance values

Returns size rows (at maximum) of which datum are most similar to query and their distance values.

list<id_with_score> similar_row_from_id(0: string id, 1: uint ret_num)¶

Parameters:	id – row ID in the nearest neighbor search table ret_num – number of rows to be returned
Returns:	row IDs that are the nearest to the row `id` and their similarity values

Returns ret_num rows (at maximum) that have most similar datum to id and their similarity values.

list<id_with_score> similar_row_from_datum(0: datum query, 1: uint ret_num)¶

Parameters:	query – `datum` for nearest neighbor search ret_num – number of rows to be returned
Returns:	row IDs that are the nearest to `query` and their similarity values

Returns ret_num rows (at maximum) of which datum are most similar to query and their similarity values.

list<string> get_all_rows()¶

Returns:	list of all row IDs

Returns the list of all row IDs.