Recommender

  • See IDL definition for detailed specification.
  • See Algorithms for detailed description of algorithms used in this server.

Configuration

Configuration is given as a JSON file. Each field is described below:

method

Specify the algorithm used by the recommender. The following algorithms are available.

Value                            Method
"inverted_index"                 Use Inverted Index (with cosine similarity)
"inverted_index_euclid"          Use Inverted Index (with Euclidean distance)
"minhash"                        Use MinHash [Ping2010]
"lsh"                            Use Locality Sensitive Hashing
"euclid_lsh"                     Use Euclidean-distance LSH [Andoni2005]
"nearest_neighbor_recommender"   Use an implementation of nearest_neighbor
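As an illustrative sketch (not part of Jubatus itself), the `method` field can be checked against this list before deploying a configuration; `VALID_METHODS` and `validate_method` below are hypothetical names introduced for the example:

```python
# Hypothetical helper: check that the "method" field of a recommender
# configuration names one of the algorithms listed above.
VALID_METHODS = {
    "inverted_index",
    "inverted_index_euclid",
    "minhash",
    "lsh",
    "euclid_lsh",
    "nearest_neighbor_recommender",
}

def validate_method(config: dict) -> str:
    method = config.get("method")
    if method not in VALID_METHODS:
        raise ValueError(f"unknown recommender method: {method!r}")
    return method

print(validate_method({"method": "lsh"}))  # -> lsh
```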
parameter

Specify parameters for the algorithm. Its format differs for each method.

common
unlearner(optional):
 Specify the unlearner strategy. Omit this parameter if you do not use an unlearner. You can specify any of the strategies described in Unlearner. Data will be deleted by ID according to the strategy specified here.
unlearner_parameter(optional):
 Specify parameters for the unlearner, as described in Unlearner. This parameter cannot be omitted when unlearner is specified. Data in excess of the configured limit will be deleted automatically.
inverted_index
None
inverted_index_euclid
ignore_orthogonal(optional):
 When searching for neighbors, ignore points that share no keys with the query. If this option is enabled, the result includes only points that have non-zero inverted index similarity with the query. This can also speed up the search in certain cases (e.g., when most points share no keys with the query). This parameter is optional and defaults to false (disabled). (Boolean)
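The pruning performed by ignore_orthogonal can be sketched in plain Python (an illustration of the idea, not the server implementation):

```python
# Sketch of candidate pruning with an inverted index.
# points: id -> {key: value}; the inverted index maps key -> ids.
from collections import defaultdict

points = {
    "a": {"x": 1.0, "y": 2.0},
    "b": {"y": 3.0},
    "c": {"z": 1.0},          # shares no key with the query below
}

index = defaultdict(set)
for pid, vec in points.items():
    for key in vec:
        index[key].add(pid)

def candidates(query, ignore_orthogonal=True):
    if not ignore_orthogonal:
        return set(points)
    # Only ids that share at least one key with the query are scored.
    found = set()
    for key in query:
        found.update(index.get(key, ()))
    return found

print(sorted(candidates({"x": 1.0, "y": 1.0})))   # ['a', 'b']
print(sorted(candidates({"x": 1.0}, False)))      # ['a', 'b', 'c']
```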
minhash
hash_num:

Number of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

  • Range: 1 <= hash_num
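The accuracy/memory trade-off can be illustrated with a self-contained MinHash sketch (an approximation of the idea, not the server's implementation; `minhash_signature` and `estimate_jaccard` are hypothetical helpers):

```python
# Sketch: estimating Jaccard similarity with MinHash. More hash
# functions (hash_num) -> lower estimation variance, more memory.
import random

def minhash_signature(items, hash_num, seed=0):
    rng = random.Random(seed)
    # One (a, b) pair per hash function: h(x) = (a*x + b) % PRIME
    PRIME = 2_147_483_647
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(hash_num)]
    return [min((a * hash(x) + b) % PRIME for x in items)
            for a, b in params]

def estimate_jaccard(sig1, sig2):
    matches = sum(1 for v, w in zip(sig1, sig2) if v == w)
    return matches / len(sig1)

s1, s2 = set(range(0, 100)), set(range(50, 150))  # true Jaccard = 1/3
sig1 = minhash_signature(s1, 256)
sig2 = minhash_signature(s2, 256)
print(estimate_jaccard(sig1, sig2))  # estimate near the true 1/3
```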
lsh
hash_num:

Bit length of hash values. The bigger it is, the more accurate results you can get, but the more memory is required. (Integer)

  • Range: 1 <= hash_num
threads(optional):

Specify the number of threads used for random_projection and search. If threads is omitted, they are executed with a single thread. Larger values reduce query latency because the data is divided into several parts and processed by multiple threads in parallel. This option was added in 0.9.1; earlier versions always use a single thread. (Integer)

The behavior of this option is as follows:

  • threads < 0
    • The number of threads is set to the number of logical CPU cores.
  • threads = 0
    • Same behavior as threads = 1.
  • 1 <= threads <= the number of logical CPU cores
    • The number of threads is set to threads.
  • threads > the number of logical CPU cores
    • The number of threads is set to the number of logical CPU cores. In addition, data points are divided into threads parts.
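The rules above can be summarized as a small helper (a hypothetical sketch using `os.cpu_count()` as the logical core count, not actual server code):

```python
# Sketch of the thread-count rules for the "threads" option.
import os

def effective_threads(threads: int) -> int:
    cores = os.cpu_count() or 1
    if threads < 0:
        return cores            # negative -> all logical cores
    if threads == 0:
        return 1                # same behavior as threads = 1
    return min(threads, cores)  # capped at the number of logical cores

print(effective_threads(0))  # 1
```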
cache_size(optional):

Specify the number of projection vectors to cache for hashing. If cache_size is omitted, projection vectors are generated on each hash calculation. Larger values can reduce response time, though more memory is required. (Integer)

  • Range: 0 <= cache_size
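The bit-hashing idea behind the lsh method can be sketched as follows (illustrative only, not server code; identical vectors always produce identical bit strings, and longer hashes approximate cosine similarity more closely):

```python
# Sketch: bit-hashing by random projection. Each random hyperplane
# contributes one bit of the hash; hash_num is the bit length.
import random

def lsh_bits(vec, planes):
    return [1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
            for plane in planes]

def hamming_ratio(b1, b2):
    return sum(x != y for x, y in zip(b1, b2)) / len(b1)

rng = random.Random(42)
dim, hash_num = 8, 128
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(hash_num)]

u = [1.0] * dim
v = [1.0] * dim          # identical vector -> identical bits
print(hamming_ratio(lsh_bits(u, planes), lsh_bits(v, planes)))  # 0.0
```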
euclid_lsh
hash_num:

Number of hash values. The bigger it is, the more accurate results you can get, but the fewer results you can find and the more memory is required. (Integer)

  • Range: 1 <= hash_num
table_num:

Number of tables. The bigger it is, the more results you can find, but the more memory is required and the longer the response time. (Integer)

  • Range: 1 <= table_num
bin_width:

Quantization step size. The bigger it is, the more results you can find, but the longer response time is required. (Float)

  • Range: 0.0 < bin_width
probe_num:

Number of bins to find. The bigger it is, the more results you can find, but the longer response time is required. (Integer)

  • Range: 0 <= probe_num
seed:

Seed of random number generator. (Integer)

  • Range: 0 <= seed <= \(2^{32} - 1\)
threads(optional):

Specify the number of threads used for random_projection and search. If threads is omitted, they are executed with a single thread. Larger values reduce query latency because the data is divided into several parts and processed by multiple threads in parallel. This option was added in 0.9.1; earlier versions always use a single thread. (Integer)

The behavior of this option is as follows:

  • threads < 0
    • The number of threads is set to the number of logical CPU cores.
  • threads = 0
    • Same behavior as threads = 1.
  • 1 <= threads <= the number of logical CPU cores
    • The number of threads is set to threads.
  • threads > the number of logical CPU cores
    • The number of threads is set to the number of logical CPU cores. In addition, data points are divided into threads parts.
cache_size(optional):

Specify the number of projection vectors to cache for hashing. If cache_size is omitted, projection vectors are generated on each hash calculation. Larger values can reduce response time, though more memory is required. (Integer)

  • Range: 0 <= cache_size
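The quantization step behind euclid_lsh can be sketched like this (an illustration of the [Andoni2005] idea with a fixed projection direction for reproducibility, not the server implementation):

```python
# Sketch: project onto a direction, shift by an offset, and divide by
# bin_width. Nearby points tend to fall into the same bin; a larger
# bin_width makes collisions (and thus candidate results) more likely.
import math

def euclid_lsh_bin(vec, direction, offset, bin_width):
    projection = sum(d * v for d, v in zip(direction, vec))
    return math.floor((projection + offset) / bin_width)

direction = [1.0, 0.0, 0.0]   # fixed direction for a reproducible example
offset = 0.5

print(euclid_lsh_bin([2.0, 5.0, 1.0], direction, offset, 1.0))  # 2
print(euclid_lsh_bin([2.2, 5.0, 1.0], direction, offset, 1.0))  # 2 (same bin)
```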
nearest_neighbor_recommender
method:
 Specify the algorithm for nearest neighbor. Refer to Nearest Neighbor for the list of available algorithms.
parameter:
 Specify parameters for the algorithm. Refer to Nearest Neighbor for the list of parameters.
converter

Specify configuration for data conversion. Its format is described in Data Conversion.

Example:
{
  "method": "lsh",
  "parameter" : {
    "hash_num" : 64
  },
  "converter" : {
    "string_filter_types": {},
    "string_filter_rules":[],
    "num_filter_types": {},
    "num_filter_rules": [],
    "string_types": {},
    "string_rules":[
      {"key" : "*", "type" : "str", "sample_weight":"bin", "global_weight" : "bin"}
    ],
    "num_types": {},
    "num_rules": [
      {"key" : "*", "type" : "num"}
    ]
  }
}
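Since the configuration is plain JSON, it can be parsed with any JSON library; for example, reading the fields of a trimmed version of the example above with Python's standard json module:

```python
# Parse a (trimmed) recommender configuration and read its fields.
import json

config_text = """
{
  "method": "lsh",
  "parameter": {"hash_num": 64},
  "converter": {
    "string_rules": [
      {"key": "*", "type": "str",
       "sample_weight": "bin", "global_weight": "bin"}
    ],
    "num_rules": [{"key": "*", "type": "num"}]
  }
}
"""

config = json.loads(config_text)
print(config["method"])                 # lsh
print(config["parameter"]["hash_num"])  # 64
```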

Data Structures

message id_with_score

Represents ID with its score.

0: string id

Data ID.

1: double score

Score. The range of scores is 0 <= score <= 1. When using euclid_lsh, the score is the negated Euclidean distance, so score <= 0.

message id_with_score {
  0: string id
  1: double score
}

Methods

service recommender
bool clear_row(0: string id)
Parameters:
  • id – row ID to be removed
Returns:

True when the row was cleared successfully

Removes the given row id from the recommendation table.

bool update_row(0: string id, 1: datum row)
Parameters:
  • id – row ID
  • row – datum for the row
Returns:

True if this function updates models successfully

Updates the row whose id is id with the given row. If a row with the same id already exists, it is differentially updated with row; otherwise, a new row entry is created. If the server that manages the row and the server that received this RPC request are the same, the operation is reflected instantly; if not, it is reflected after mix.
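The create-or-differentially-update semantics can be sketched with a plain dict standing in for the row table (an illustration only; the actual server stores converted feature vectors, not raw data):

```python
# Sketch of update_row / clear_row semantics over an in-memory table.
rows = {}

def update_row(row_id, datum):
    if row_id in rows:
        rows[row_id].update(datum)   # differential update of existing row
    else:
        rows[row_id] = dict(datum)   # create a new row
    return True

def clear_row(row_id):
    return rows.pop(row_id, None) is not None

update_row("user1", {"age": 30.0})
update_row("user1", {"height": 170.0})   # merged into the existing row
print(rows["user1"])       # {'age': 30.0, 'height': 170.0}
print(clear_row("user1"))  # True
```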

datum complete_row_from_id(0: string id)
Parameters:
  • id – row ID
Returns:

datum stored in id row with missing value completed by predicted value

Returns the datum for the row id, with missing value completed by predicted value.

datum complete_row_from_datum(0: datum row)
Parameters:
  • row – original datum to be completed (possibly some values are missing)
Returns:

datum constructed from the given datum with missing value completed by predicted value

Returns the datum constructed from row, with missing value completed by predicted value.

list<id_with_score> similar_row_from_id(0: string id, 1: uint size)
Parameters:
  • id – row ID
  • size – number of rows to be returned
Returns:

row IDs that are most similar to the row id

Returns size rows (at maximum) which are most similar to the row id.

list<id_with_score> similar_row_from_id_and_score(0: string id, 1: double score)
Parameters:
  • id – row ID
  • score – threshold of similarity score
Returns:

row IDs that are most similar to the row id

Returns rows which are most similar to the row id and have a greater similarity score than score.

list<id_with_score> similar_row_from_id_and_rate(0: string id, 1: float rate)
Parameters:
  • id – row ID
  • rate – rate of all the rows to be returned (Range 0 < rate <= 1)
Returns:

row IDs that are most similar to the row id

Returns the top rate of all the rows which are most similar to the row id. For example, the top 40% of all the rows are returned when 0.4 is specified as rate.
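The rate semantics can be illustrated as follows (a hypothetical helper; the server's exact rounding behavior for fractional counts may differ):

```python
# Sketch: return the top `rate` fraction of rows ranked by score.
def top_rate(scored_rows, rate):
    assert 0.0 < rate <= 1.0
    ranked = sorted(scored_rows, key=lambda r: r[1], reverse=True)
    count = int(len(ranked) * rate)
    return ranked[:count]

scored = [("a", 0.9), ("b", 0.5), ("c", 0.8), ("d", 0.1), ("e", 0.3)]
print(top_rate(scored, 0.4))  # [('a', 0.9), ('c', 0.8)]
```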

list<id_with_score> similar_row_from_datum(0: datum row, 1: uint size)
Parameters:
  • row – datum to find similar rows for
  • size – number of rows to be returned
Returns:

rows whose data are most similar to row

Returns size rows (at maximum) whose data are most similar to row.

list<id_with_score> similar_row_from_datum_and_score(0: datum row, 1: double score)
Parameters:
  • row – datum to find similar rows for
  • score – threshold of similarity score
Returns:

rows whose data are most similar to row

Returns rows which are most similar to row and have a greater similarity score than score.

list<id_with_score> similar_row_from_datum_and_rate(0: datum row, 1: float rate)
Parameters:
  • row – datum to find similar rows for
  • rate – rate of all the rows to be returned (Range 0 < rate <= 1)
Returns:

rows whose data are most similar to row

Returns the top rate of all the rows which are most similar to row. For example, the top 40% of all the rows are returned when 0.4 is specified as rate.

datum decode_row(0: string id)
Parameters:
  • id – row ID
Returns:

datum for the given row id

Returns the datum in the row id. Note that irreversibly converted datum (processed by fv_converter) will not be decoded.

list<string> get_all_rows()
Returns:

list of all row IDs

Returns the list of all row IDs.

double calc_similarity(0: datum lhs, 1: datum rhs)
Parameters:
  • lhs – datum
  • rhs – datum
Returns:

similarity between lhs and rhs

Returns the similarity score (see score member of id_with_score) between two datum.
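For the cosine-similarity-based methods, the score corresponds to the cosine similarity of the two feature vectors. A self-contained sketch over sparse data (illustrative, not server code):

```python
# Sketch: cosine similarity between two sparse data represented as
# {key: value} dicts, as used by the inverted_index method.
import math

def cosine_similarity(lhs, rhs):
    keys = set(lhs) | set(rhs)
    dot = sum(lhs.get(k, 0.0) * rhs.get(k, 0.0) for k in keys)
    norm = math.sqrt(sum(v * v for v in lhs.values())) * \
           math.sqrt(sum(v * v for v in rhs.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity({"x": 1.0, "y": 0.0}, {"x": 1.0, "y": 0.0}))  # 1.0
```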

double calc_l2norm(0: datum row)
Parameters:
  • row – datum
Returns:

L2 norm for the given row

Returns the value of L2 norm for the row.
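The L2 norm of a datum's numeric values can be computed as follows (illustrative sketch over a {key: value} dict):

```python
# Sketch: L2 norm of a sparse datum.
import math

def l2norm(row):
    return math.sqrt(sum(v * v for v in row.values()))

print(l2norm({"x": 3.0, "y": 4.0}))  # 5.0
```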