Recommender
- See IDL definition for detailed specification.
- See Algorithms for detailed description of algorithms used in this server.
Configuration
Configuration is given as a JSON file. We show each field below:
- method
  Specify the algorithm for the recommender. The following algorithms are available:

  | Value                          | Method                                      |
  |--------------------------------|---------------------------------------------|
  | "inverted_index"               | Use Inverted Index (with cosine similarity) |
  | "inverted_index_euclid"        | Use Inverted Index (with Euclidean distance)|
  | "minhash"                      | Use MinHash [Ping2010]                      |
  | "lsh"                          | Use Locality Sensitive Hashing              |
  | "euclid_lsh"                   | Use Euclid-distance LSH [Andoni2005]        |
  | "nearest_neighbor_recommender" | Use an implementation of nearest_neighbor   |
- parameter
  Specify parameters for the algorithm. Its format differs for each method.

  - common
    - unlearner (optional): Specify the unlearner strategy. If you don't use an unlearner, omit this parameter. You can specify any unlearner strategy described in Unlearner. Data will be deleted by ID based on the strategy specified here.
    - unlearner_parameter (optional): Specify the unlearner parameters. You can specify the unlearner_parameter described in Unlearner. You cannot omit this parameter when you specify unlearner. Data in excess of the configured limit will be deleted automatically.
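    For example, a parameter fragment that enables unlearning might look like the following sketch; the "lru" strategy and its max_size option are shown only for illustration, so check Unlearner for the exact strategy names and their parameters:

    ```json
    "parameter": {
      "unlearner": "lru",
      "unlearner_parameter": {
        "max_size": 1000
      }
    }
    ```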
  - inverted_index
    - None
  - inverted_index_euclid
    - ignore_orthogonal (optional): Ignore points that do not share any key with the query when searching for neighbors. If this option is enabled, the result includes only points that have an inverted index similarity with the query. In addition, this option can accelerate the calculation in specific cases (e.g. when most points do not share any key with the query). This parameter is optional and is false (disabled) by default. (Boolean)
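    For example, a configuration that enables this option could look like the following sketch (converter omitted for brevity):

    ```json
    {
      "method": "inverted_index_euclid",
      "parameter": {
        "ignore_orthogonal": true
      }
    }
    ```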
  - minhash
    - hash_num: Number of hash values. The bigger it is, the more accurate the results, but the more memory is required. (Integer)
      - Range: 1 <= hash_num
  - lsh
    - hash_num: Bit length of hash values. The bigger it is, the more accurate the results, but the more memory is required. (Integer)
      - Range: 1 <= hash_num
    - threads (optional): Specify the number of threads that execute random_projection and search. If threads is omitted, they are executed with a single thread. The bigger it is, the smaller the query latency becomes, because the data is divided into several parts and processed by multiple threads in parallel. This option was added in 0.9.1; earlier versions always use a single thread. (Integer)
      The behavior of this option is as follows:
      - threads < 0: the number of threads is set to the number of logical CPU cores.
      - threads = 0: the same behavior as threads = 1.
      - 1 <= threads <= the number of logical CPU cores: the number of threads is set to threads.
      - the number of logical CPU cores < threads: the number of threads is set to the number of logical CPU cores. In addition, the data points are divided into threads parts.
    - cache_size (optional): Specify the number of projection vectors to cache for hashing. If cache_size is omitted, projection vectors are generated at each hash calculation. The bigger it is, the lower the response time can be, though more memory is required. (Integer)
      - Range: 0 <= cache_size
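    For example, an lsh configuration that sets all three parameters might look like the following sketch; the values are illustrative only and the converter section is omitted for brevity:

    ```json
    {
      "method": "lsh",
      "parameter": {
        "hash_num": 512,
        "threads": 4,
        "cache_size": 100000
      }
    }
    ```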
  - euclid_lsh
    - hash_num: Number of hash values. The bigger it is, the more accurate the results, but the fewer results you can find and the more memory is required. (Integer)
      - Range: 1 <= hash_num
    - table_num: Number of tables. The bigger it is, the more results you can find, but the more memory is required and the longer the response time becomes. (Integer)
      - Range: 1 <= table_num
    - bin_width: Quantization step size. The bigger it is, the more results you can find, but the longer the response time becomes. (Float)
      - Range: 0.0 < bin_width
    - probe_num: Number of bins to find. The bigger it is, the more results you can find, but the longer the response time becomes. (Integer)
      - Range: 0 <= probe_num
    - seed: Seed of the random number generator. (Integer)
      - Range: 0 <= seed <= \(2^{32} - 1\)
    - threads (optional): Specify the number of threads that execute random_projection and search. If threads is omitted, they are executed with a single thread. The bigger it is, the smaller the query latency becomes, because the data is divided into several parts and processed by multiple threads in parallel. This option was added in 0.9.1; earlier versions always use a single thread. (Integer)
      The behavior of this option is as follows:
      - threads < 0: the number of threads is set to the number of logical CPU cores.
      - threads = 0: the same behavior as threads = 1.
      - 1 <= threads <= the number of logical CPU cores: the number of threads is set to threads.
      - the number of logical CPU cores < threads: the number of threads is set to the number of logical CPU cores. In addition, the data points are divided into threads parts.
    - cache_size (optional): Specify the number of projection vectors to cache for hashing. If cache_size is omitted, projection vectors are generated at each hash calculation. The bigger it is, the lower the response time can be, though more memory is required. (Integer)
      - Range: 0 <= cache_size
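    For example, a full euclid_lsh parameter block might look like the following sketch; the values are illustrative only and the converter section is omitted for brevity:

    ```json
    {
      "method": "euclid_lsh",
      "parameter": {
        "hash_num": 64,
        "table_num": 4,
        "bin_width": 100,
        "probe_num": 64,
        "seed": 1091,
        "threads": 4,
        "cache_size": 100000
      }
    }
    ```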
  - nearest_neighbor_recommender
    - method: Specify the algorithm for nearest neighbor search. Refer to Nearest Neighbor for the list of available algorithms.
    - parameter: Specify parameters for that algorithm. Refer to Nearest Neighbor for the list of parameters.
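    A sketch of a nearest_neighbor_recommender configuration is shown below; the nested "lsh" method and its hash_num value are assumptions for illustration, so refer to Nearest Neighbor for the authoritative lists (converter omitted for brevity):

    ```json
    {
      "method": "nearest_neighbor_recommender",
      "parameter": {
        "method": "lsh",
        "parameter": {
          "hash_num": 128
        }
      }
    }
    ```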
- converter
  Specify configuration for data conversion. Its format is described in Data Conversion.
- Example:

      {
        "method": "lsh",
        "parameter": {
          "hash_num": 64
        },
        "converter": {
          "string_filter_types": {},
          "string_filter_rules": [],
          "num_filter_types": {},
          "num_filter_rules": [],
          "string_types": {},
          "string_rules": [
            { "key": "*", "type": "str", "sample_weight": "bin", "global_weight": "bin" }
          ],
          "num_types": {},
          "num_rules": [
            { "key": "*", "type": "num" }
          ]
        }
      }
Data Structures
Methods
- service recommender

  - bool clear_row(0: string id)
    Parameters:
      - id – row ID to be removed
    Returns:
      True when the row was cleared successfully

    Removes the given row id from the recommendation table.
  - bool update_row(0: string id, 1: datum row)
    Parameters:
      - id – row ID
      - row – datum for the row
    Returns:
      True if this function updates models successfully

    Updates the row whose ID is id with the given row. If a row with the same id already exists, it is differentially updated with row; otherwise, a new row entry is created. If the server that manages the row and the server that received this RPC request are the same, this operation is reflected instantly. If not, the update is reflected after mix.
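    The following sketch shows clear_row and update_row using the Jubatus Python client (the jubatus package); the host, port, and instance name are placeholders for an already running jubarecommender, and the keys put into the datum are made up for illustration:

    ```python
    from jubatus.recommender.client import Recommender
    from jubatus.common import Datum

    # Placeholder connection settings for a running jubarecommender.
    client = Recommender("127.0.0.1", 9199, "example_cluster")

    # Build a datum for the row and create or differentially update it.
    row = Datum()
    row.add_string("movie", "title_0001")
    row.add_number("rating", 4.5)
    client.update_row("user_0001", row)

    # Remove the row when it is no longer needed.
    client.clear_row("user_0001")
    ```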
  - datum complete_row_from_id(0: string id)
    Parameters:
      - id – row ID
    Returns:
      datum stored in the row id, with missing values completed by predicted values

    Returns the datum for the row id, with missing values completed by predicted values.
  - datum complete_row_from_datum(0: datum row)
    Parameters:
      - row – original datum to be completed (possibly with some values missing)
    Returns:
      datum constructed from the given datum, with missing values completed by predicted values

    Returns the datum constructed from row, with missing values completed by predicted values.
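    A sketch of both completion calls, again assuming the Jubatus Python client and placeholder connection settings:

    ```python
    from jubatus.recommender.client import Recommender
    from jubatus.common import Datum

    client = Recommender("127.0.0.1", 9199, "example_cluster")

    # Complete the stored row "user_0001"; missing values are predicted.
    completed = client.complete_row_from_id("user_0001")

    # Complete an ad-hoc datum that carries only partial information.
    partial = Datum()
    partial.add_number("rating", 4.0)
    predicted = client.complete_row_from_datum(partial)
    print(predicted.num_values)  # predicted numeric key/value pairs
    ```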
  - list<id_with_score> similar_row_from_id(0: string id, 1: uint size)
    Parameters:
      - id – row ID
      - size – number of rows to be returned
    Returns:
      row IDs that are most similar to the row id

    Returns size rows (at maximum) that are most similar to the row id.
  - list<id_with_score> similar_row_from_id_and_score(0: string id, 1: double score)
    Parameters:
      - id – row ID
      - score – threshold of similarity score
    Returns:
      row IDs that are most similar to the row id

    Returns the rows that are most similar to the row id and have a similarity score greater than score.
  - list<id_with_score> similar_row_from_id_and_rate(0: string id, 1: float rate)
    Parameters:
      - id – row ID
      - rate – rate of all the rows to be returned (Range: 0 < rate <= 1)
    Returns:
      row IDs that are most similar to the row id

    Returns the top rate of all the rows that are most similar to the row id. For example, the top 40% of all the rows are returned when 0.4 is specified as rate.
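    A sketch of the three ID-based similarity queries, assuming the Jubatus Python client; each call returns id_with_score values (a row ID paired with its similarity score):

    ```python
    from jubatus.recommender.client import Recommender

    client = Recommender("127.0.0.1", 9199, "example_cluster")

    # At most 10 rows that are most similar to the stored row "user_0001".
    for r in client.similar_row_from_id("user_0001", 10):
        print(r.id, r.score)

    # Only rows whose similarity score is greater than 0.8.
    scored = client.similar_row_from_id_and_score("user_0001", 0.8)

    # The most similar 40% of all rows.
    top_rate = client.similar_row_from_id_and_rate("user_0001", 0.4)
    ```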
  - list<id_with_score> similar_row_from_datum(0: datum row, 1: uint size)
    Parameters:
      - row – datum to find similar rows for
      - size – number of rows to be returned
    Returns:
      rows whose datum is most similar to row

    Returns size rows (at maximum) whose datum is most similar to row.
  - list<id_with_score> similar_row_from_datum_and_score(0: datum row, 1: double score)
    Parameters:
      - row – datum to find similar rows for
      - score – threshold of similarity score
    Returns:
      rows whose datum is most similar to row

    Returns the rows that are most similar to row and have a similarity score greater than score.
  - list<id_with_score> similar_row_from_datum_and_rate(0: datum row, 1: float rate)
    Parameters:
      - row – datum to find similar rows for
      - rate – rate of all the rows to be returned (Range: 0 < rate <= 1)
    Returns:
      rows whose datum is most similar to row

    Returns the top rate of all the rows that are most similar to row. For example, the top 40% of all the rows are returned when 0.4 is specified as rate.
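    The datum-based variants work the same way but take a datum instead of a stored row ID; a sketch assuming the Jubatus Python client:

    ```python
    from jubatus.recommender.client import Recommender
    from jubatus.common import Datum

    client = Recommender("127.0.0.1", 9199, "example_cluster")

    # Query with an ad-hoc datum instead of a stored row ID.
    query = Datum()
    query.add_string("movie", "title_0001")

    top10 = client.similar_row_from_datum(query, 10)              # at most 10 rows
    scored = client.similar_row_from_datum_and_score(query, 0.8)  # score > 0.8
    top40 = client.similar_row_from_datum_and_rate(query, 0.4)    # top 40% of rows
    for r in top10:
        print(r.id, r.score)
    ```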
  - datum decode_row(0: string id)
    Parameters:
      - id – row ID
    Returns:
      datum for the given row id

    Returns the datum stored in the row id. Note that a datum that was irreversibly converted (processed by fv_converter) will not be decoded.
  - list<string> get_all_rows()
    Returns:
      list of all row IDs

    Returns the list of all row IDs.
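    A sketch that combines get_all_rows and decode_row to dump the stored rows, assuming the Jubatus Python client; string_values and num_values are the key/value lists carried by a datum:

    ```python
    from jubatus.recommender.client import Recommender

    client = Recommender("127.0.0.1", 9199, "example_cluster")

    # Enumerate every stored row ID and inspect the original datum of each.
    for row_id in client.get_all_rows():
        original = client.decode_row(row_id)
        print(row_id, original.string_values, original.num_values)
    ```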
  - bool