Nearest Neighbor

  • See IDL definition for detailed specification.
  • See Algorithms for detailed description of algorithms used in this server.

Configuration

Configuration is given as a JSON file. We show each field below:

method

Specify algorithm for nearest neighbor. You can use these algorithms.

Value Method
"lsh" Use Locality Sensitive Hashing based on cosine similarity.
"minhash" Use MinHash. [Ping2010]
"euclid_lsh" Use LSH based on cosine similarity for nearest neighbor search with Euclidean distance.
parameter

Specify parameters for the algorithm. Its format differs for each method.

common
threads:

The number of threads execute random_projection and search. The bigger it is, query latency becomes smaller because data is divided into several parts and processed by multiple threads in parallel. This option has been added since 0.9.0. Single thread is used in before versions. (Integer)

The behavior of this option varies as below:

  • threads < 0
    • threads is set to the number of logical CPU cores
  • threads = 0
    • The same behavior as threads is set to 1.
  • 1 <= threads <= The number of logical cores of CPU
    • The number of threads is set to threads .
  • The number of logical cores of CPU < threads .
    • The number of threads is set to the number of logical CPU cores. In addtion, data points are divided into threads parts.

note: threads can be omitted (It works with 1 thread) .

lsh
hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

  • Range: 1 <= hash_num
minhash
hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

  • Range: 1 <= hash_num
euclid_lsh
hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

  • Range: 1 <= hash_num
converter

Specify configuration for data conversion. Its format is described in Data Conversion.

Example:
{
  "method": "lsh",
  "parameter" : {
    "hash_num" : 64
  },
  "converter" : {
    "string_filter_types": {},
    "string_filter_rules":[],
    "num_filter_types": {},
    "num_filter_rules": [],
    "string_types": {},
    "string_rules":[
      {"key" : "*", "type" : "str", "sample_weight":"bin", "global_weight" : "bin"}
    ],
    "num_types": {},
    "num_rules": [
      {"key" : "*", "type" : "num"}
    ]
  }
}

Data Structures

message id_with_score

Represents ID with its score.

0: string id

Data ID.

1: double score

Score.

message id_with_score {
  0: string id
  1: double score
}

Methods

service nearest_neighbor
bool set_row(0: string id, 1: datum d)
Parameters:
  • id – row ID
  • rowdatum for the row
Returns:

True if this function updates models successfully

Updates the row whose id is id with given row. If the row with the same id already exists, the row is overwritten with row (note that this behavior is different from that of recommender). Otherwise, new row entry will be created. If the server that manages the row and the server that received this RPC request are same, this operation is reflected instantly. If not, update operation is reflected after mix.

list<id_with_score> neighbor_row_from_id(0: string id, 1: uint size)
Parameters:
  • id – row ID in the nearest neighbor search table
  • size – number of rows to be returned
Returns:

row IDs that are the nearest to the row id and their distance values

Returns size rows (at maximum) that have most similar datum to id and their distance values.

list<id_with_score> neighbor_row_from_datum(0: datum query, 1: uint size)
Parameters:
  • querydatum for nearest neighbor search
  • size – number of rows to be returned
Returns:

row IDs that are the nearest to query and their distance values

Returns size rows (at maximum) of which datum are most similar to query and their distance values.

list<id_with_score> similar_row_from_id(0: string id, 1: uint ret_num)
Parameters:
  • id – row ID in the nearest neighbor search table
  • ret_num – number of rows to be returned
Returns:

row IDs that are the nearest to the row id and their similarity values

Returns ret_num rows (at maximum) that have most similar datum to id and their similarity values.

list<id_with_score> similar_row_from_datum(0: datum query, 1: uint ret_num)
Parameters:
  • querydatum for nearest neighbor search
  • ret_num – number of rows to be returned
Returns:

row IDs that are the nearest to query and their similarity values

Returns ret_num rows (at maximum) of which datum are most similar to query and their similarity values.

list<string> get_all_rows()
Returns:list of all row IDs

Returns the list of all row IDs.