Bandit

See the IDL definition for the detailed specification.

Configuration

Configuration is given as a JSON file. We show each field below:

method

Specify the bandit algorithm. The following algorithms are available:

Value	Method
`"epsilon_greedy"`	Use epsilon-greedy.
`"epsilon_decreasing"`	Use Greedy Mix (epsilon-decreasing).
`"ucb1"`	Use UCB1.
`"softmax"`	Use softmax.
`"exp3"`	Use Exp3.
`"ts"`	Use Thompson sampling [1].

[1]	When the Thompson sampling method is used, the `reward` passed to the `register_reward` API must be 0 or 1.

parameter

Specify parameters for the algorithm. The format differs for each method.

common

assume_unrewarded:
	Specify whether calls to `register_reward` may be omitted when the reward is zero. When True, `register_reward` need not be called for zero rewards, but every call to `register_reward` must correspond to an arm returned by `select_arm`. When False, `register_reward` must be called even when the reward is zero, but it may be called independently of `select_arm`. (Boolean)
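To make the two modes concrete, here is a minimal Python sketch of per-arm bookkeeping. It rests on an assumption the text above does not spell out: with `assume_unrewarded` set to true the trial is counted when an arm is selected, and with it set to false the trial is counted when a reward is registered. `ArmStats`, `ArmBookkeeping`, `on_select`, and `on_reward` are hypothetical names, not part of the bandit API, and treating `weight` as a cumulative reward is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ArmStats:
    trial_count: int = 0
    weight: float = 0.0  # ASSUMPTION: cumulative reward of the arm


class ArmBookkeeping:
    """Hypothetical per-player bookkeeping contrasting the two assume_unrewarded modes."""

    def __init__(self, assume_unrewarded: bool) -> None:
        self.assume_unrewarded = assume_unrewarded
        self.arms: Dict[str, ArmStats] = {}

    def on_select(self, arm_id: str) -> None:
        stats = self.arms.setdefault(arm_id, ArmStats())
        if self.assume_unrewarded:
            # ASSUMPTION: the trial is counted at selection time with an implicit
            # reward of zero, so zero-reward register_reward calls can be skipped.
            stats.trial_count += 1

    def on_reward(self, arm_id: str, reward: float) -> None:
        stats = self.arms.setdefault(arm_id, ArmStats())
        if not self.assume_unrewarded:
            # ASSUMPTION: each register_reward call is one trial, so zero rewards
            # must be reported, but no prior select_arm call is required.
            stats.trial_count += 1
        stats.weight += reward


# assume_unrewarded = true: two selections, only the non-zero reward is reported.
lazy = ArmBookkeeping(assume_unrewarded=True)
lazy.on_select("a")
lazy.on_select("a")
lazy.on_reward("a", 1.0)

# assume_unrewarded = false: both outcomes are reported, no selection needed.
strict = ArmBookkeeping(assume_unrewarded=False)
strict.on_reward("a", 0.0)
strict.on_reward("a", 1.0)

print(lazy.arms["a"], strict.arms["a"])  # both: ArmStats(trial_count=2, weight=1.0)
```

Both call patterns end in the same state, which is why the two modes differ only in which calls the client is obliged to make.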

epsilon_greedy

epsilon:

The probability of choosing an arm at random. With probability epsilon, an arm is chosen from the uniform distribution over all arms; with probability 1 - epsilon, the arm with the highest expected reward is chosen. (Float)

Range: 0.0 <= epsilon <= 1.0

seed (optional):

Specify the seed used to generate random numbers. If it is not specified, the system clock is used as the seed, so results differ from one experiment to the next.

Range: 0 <= seed <= \(2^{32} - 1\)
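For reference, textbook epsilon-greedy selection can be sketched in Python as follows. This illustrates the technique rather than the service's implementation: `epsilon_greedy_select` is a hypothetical helper, the per-arm mean rewards are assumed to be available, and a seeded `random.Random` instance plays the role of the `seed` parameter.

```python
import random
from typing import Dict


def epsilon_greedy_select(means: Dict[str, float], epsilon: float,
                          rng: random.Random) -> str:
    """Textbook epsilon-greedy selection over estimated mean rewards."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must satisfy 0.0 <= epsilon <= 1.0")
    arms = sorted(means)  # fixed order so a fixed seed reproduces the choice
    if rng.random() < epsilon:
        return rng.choice(arms)  # explore: uniform over all arms
    return max(arms, key=lambda arm: means[arm])  # exploit: best estimate


rng = random.Random(12345)  # plays the role of the optional `seed`
means = {"a": 0.2, "b": 0.7, "c": 0.4}
choices = [epsilon_greedy_select(means, 0.1, rng) for _ in range(1000)]
print(choices.count("b") / len(choices))  # roughly 0.9 + 0.1 / 3
```

The best arm is still picked during exploration one time in three here, which is why its share sits slightly above 1 - epsilon.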

epsilon_decreasing

decreasing_rate:

Decreasing rate of the probability of selecting an arm at random. The larger this parameter is, the more slowly the probability decreases. (Float)

Range: 0 < decreasing_rate < 1

seed (optional):

Specify the seed used to generate random numbers. If it is not specified, the system clock is used as the seed, so results differ from one experiment to the next.

Range: 0 <= seed <= \(2^{32} - 1\)
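The exact schedule by which `decreasing_rate` shrinks the exploration probability is not given here, so the sketch below uses a hypothetical geometric schedule, epsilon_t = decreasing_rate^t, chosen only because it shows the documented behavior (a larger rate decays more slowly). `hypothetical_epsilon` and `epsilon_decreasing_select` are illustrative names; do not read the formula as the one Greedy Mix or the service actually uses.

```python
import random
from typing import Dict


def hypothetical_epsilon(decreasing_rate: float, total_trials: int) -> float:
    """HYPOTHETICAL schedule eps_t = decreasing_rate ** t.

    Picked only because it matches the documented behavior (a larger rate
    decays more slowly); it is not the formula used by the bandit service.
    """
    if not 0.0 < decreasing_rate < 1.0:
        raise ValueError("decreasing_rate must satisfy 0 < decreasing_rate < 1")
    return decreasing_rate ** total_trials


def epsilon_decreasing_select(means: Dict[str, float], decreasing_rate: float,
                              total_trials: int, rng: random.Random) -> str:
    """Epsilon-greedy whose exploration probability shrinks as trials accumulate."""
    epsilon = hypothetical_epsilon(decreasing_rate, total_trials)
    arms = sorted(means)
    if rng.random() < epsilon:
        return rng.choice(arms)  # explore
    return max(arms, key=lambda arm: means[arm])  # exploit


for rate in (0.5, 0.9):
    print(rate, [round(hypothetical_epsilon(rate, t), 3) for t in (0, 5, 20)])
```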

ucb1

None (no method-specific parameters; only the common `assume_unrewarded` applies).
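UCB1 needs no tuning because its exploration bonus is derived from the trial counts themselves. A textbook version, assuming rewards in [0, 1], is sketched below; `ucb1_select` is a hypothetical helper, not the service's code.

```python
import math
from typing import Dict


def ucb1_select(counts: Dict[str, int], means: Dict[str, float]) -> str:
    """Textbook UCB1; assumes rewards lie in [0, 1]. There is nothing to tune."""
    arms = sorted(counts)
    # Initialization: play every arm once before confidence bounds are defined.
    for arm in arms:
        if counts[arm] == 0:
            return arm
    total = sum(counts.values())

    def upper_confidence_bound(arm: str) -> float:
        # Estimated mean plus an exploration bonus that shrinks as the arm is tried.
        return means[arm] + math.sqrt(2.0 * math.log(total) / counts[arm])

    return max(arms, key=upper_confidence_bound)


print(ucb1_select({"a": 10, "b": 1}, {"a": 0.6, "b": 0.5}))    # b: rarely tried
print(ucb1_select({"a": 100, "b": 100}, {"a": 0.6, "b": 0.5}))  # a: equal bonuses
```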

softmax

tau:

Temperature parameter. At a high temperature, all arms are selected with nearly equal probability. At a low temperature, arms with higher expected rewards are selected more frequently. (Float)

Range: 0.0 < tau

seed (optional):

Specify the seed used to generate random numbers. If it is not specified, the system clock is used as the seed, so results differ from one experiment to the next.

Range: 0 <= seed <= \(2^{32} - 1\)
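The effect of `tau` can be seen in a textbook softmax (Boltzmann) selector, in which the probability of an arm is proportional to exp(mean / tau). This is a sketch under that standard formulation; the helper names are hypothetical, and the exact form used by the service is not stated above.

```python
import math
import random
from typing import Dict


def softmax_probabilities(means: Dict[str, float], tau: float) -> Dict[str, float]:
    """Boltzmann distribution over arms: P(arm) is proportional to exp(mean / tau)."""
    if not tau > 0.0:
        raise ValueError("tau must satisfy 0.0 < tau")
    top = max(means.values())  # subtracting the max avoids overflow and cancels out
    scores = {arm: math.exp((mean - top) / tau) for arm, mean in means.items()}
    total = sum(scores.values())
    return {arm: score / total for arm, score in scores.items()}


def softmax_select(means: Dict[str, float], tau: float, rng: random.Random) -> str:
    probs = softmax_probabilities(means, tau)
    arms = sorted(probs)
    return rng.choices(arms, weights=[probs[arm] for arm in arms], k=1)[0]


means = {"a": 1.0, "b": 0.0}
print(softmax_probabilities(means, 1000.0))  # high tau: close to 50/50
print(softmax_probabilities(means, 0.01))    # low tau: almost always "a"
```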

exp3

gamma:

Mixing rate between a constant weight, shared equally by all arms, and each arm's own weight. The higher gamma is, the larger the share of the constant weight; the lower gamma is, the larger the share of each arm's weight. (Float)

Range: 0.0 < gamma <= 1.0

seed (optional):

Specify the seed used to generate random numbers. If it is not specified, the system clock is used as the seed, so results differ from one experiment to the next.

Range: 0 <= seed <= \(2^{32} - 1\)
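The following textbook Exp3 sketch shows how gamma mixes a constant weight of 1/K (for K arms) with each arm's normalized weight, matching the description above. It assumes rewards in [0, 1]; the `Exp3` class and its methods are hypothetical, not the service's implementation.

```python
import math
import random
from typing import Dict, Iterable


class Exp3:
    """Textbook Exp3 for rewards in [0, 1] (a sketch, not the service's code)."""

    def __init__(self, arms: Iterable[str], gamma: float, rng: random.Random) -> None:
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must satisfy 0.0 < gamma <= 1.0")
        self.gamma = gamma
        self.rng = rng
        self.weights: Dict[str, float] = {arm: 1.0 for arm in arms}

    def probabilities(self) -> Dict[str, float]:
        # Mix the normalized arm weights with a constant (uniform) weight 1/K.
        k = len(self.weights)
        total = sum(self.weights.values())
        return {arm: (1.0 - self.gamma) * w / total + self.gamma / k
                for arm, w in self.weights.items()}

    def select(self) -> str:
        probs = self.probabilities()
        arms = sorted(probs)
        return self.rng.choices(arms, weights=[probs[arm] for arm in arms], k=1)[0]

    def update(self, arm: str, reward: float) -> None:
        k = len(self.weights)
        p = self.probabilities()[arm]
        estimate = reward / p  # importance-weighted (unbiased) reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimate / k)


bandit = Exp3(["a", "b"], gamma=0.1, rng=random.Random(7))
arm = bandit.select()
bandit.update(arm, 1.0)
print(bandit.probabilities())  # the rewarded arm's probability rises above 0.5
```

With gamma = 1.0 the arm weights drop out entirely and every arm is chosen with probability 1/K. Over long runs the weights grow without bound, so a production version would typically rescale them or work in log space.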

ts

seed (optional):

Specify the seed used to generate random numbers. If it is not specified, the system clock is used as the seed, so results differ from one experiment to the next.

Range: 0 <= seed <= \(2^{32} - 1\)
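The 0-or-1 restriction on rewards fits the Beta-Bernoulli model commonly used for Thompson sampling, sketched below with a uniform Beta(1, 1) prior. That the service uses this particular model is an assumption; `thompson_select` and `thompson_update` are hypothetical helpers, and `true_rates` is a made-up environment for the demo.

```python
import random
from typing import Dict


def thompson_select(successes: Dict[str, int], failures: Dict[str, int],
                    rng: random.Random) -> str:
    """Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior."""
    arms = sorted(successes)
    # Draw a plausible success rate for each arm from its posterior, pick the best.
    draws = {arm: rng.betavariate(1 + successes[arm], 1 + failures[arm]) for arm in arms}
    return max(arms, key=lambda arm: draws[arm])


def thompson_update(successes: Dict[str, int], failures: Dict[str, int],
                    arm: str, reward: float) -> None:
    # The Bernoulli likelihood only makes sense for binary rewards.
    if reward not in (0, 1):
        raise ValueError("reward must be 0 or 1 for Thompson sampling")
    if reward == 1:
        successes[arm] += 1
    else:
        failures[arm] += 1


rng = random.Random(2024)
successes = {"a": 0, "b": 0}
failures = {"a": 0, "b": 0}
true_rates = {"a": 0.3, "b": 0.6}  # hypothetical environment for the demo
for _ in range(500):
    arm = thompson_select(successes, failures, rng)
    thompson_update(successes, failures, arm, 1 if rng.random() < true_rates[arm] else 0)
print(successes, failures)  # "b" is expected to receive most of the pulls
```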

Example:

{
  "method" : "epsilon_greedy",
  "parameter" : {
    "assume_unrewarded" : false,
    "epsilon" : 0.1
  }
}
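A configuration like the one above can be checked against the documented ranges before it is deployed. The sketch below is a hypothetical client-side validator (`validate_bandit_config` is not part of the bandit API); it treats `assume_unrewarded` as required for every method because it is listed under common, which is an assumption.

```python
import json
from typing import Any, Callable, Dict

SEED_MAX = 2 ** 32 - 1

# Documented range of each method-specific parameter.
RANGE_CHECKS: Dict[str, Dict[str, Callable[[float], bool]]] = {
    "epsilon_greedy": {"epsilon": lambda v: 0.0 <= v <= 1.0},
    "epsilon_decreasing": {"decreasing_rate": lambda v: 0.0 < v < 1.0},
    "ucb1": {},
    "softmax": {"tau": lambda v: v > 0.0},
    "exp3": {"gamma": lambda v: 0.0 < v <= 1.0},
    "ts": {},
}
# Methods documented with an optional seed.
SEEDED = {"epsilon_greedy", "epsilon_decreasing", "softmax", "exp3", "ts"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_bandit_config(config: Dict[str, Any]) -> None:
    """Raise ValueError if the configuration violates a documented constraint."""
    method = config.get("method")
    if method not in RANGE_CHECKS:
        raise ValueError(f"unknown method: {method!r}")
    param = config.get("parameter", {})
    if not isinstance(param.get("assume_unrewarded"), bool):
        raise ValueError("assume_unrewarded must be a Boolean")
    for name, in_range in RANGE_CHECKS[method].items():
        value = param.get(name)
        if not _is_number(value) or not in_range(value):
            raise ValueError(f"{name} is missing or out of range for {method}")
    if method in SEEDED and "seed" in param:
        seed = param["seed"]
        if not (_is_number(seed) and isinstance(seed, int) and 0 <= seed <= SEED_MAX):
            raise ValueError("seed must be an integer with 0 <= seed <= 2**32 - 1")


example = json.loads("""
{
  "method" : "epsilon_greedy",
  "parameter" : {
    "assume_unrewarded" : false,
    "epsilon" : 0.1
  }
}
""")
validate_bandit_config(example)  # passes silently
```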

Data Structures

message arm_info

The state of an arm.

0: int trial_count: Number of times the arm has been selected.

1: double weight: The weight of the arm.
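On the client side, `arm_info` and the `map<string, arm_info>` returned by `get_arm_info` (one entry per arm ID, as we read the IDL) might be mirrored with a dataclass. `ArmInfo`, `ArmInfoMap`, and `most_selected_arm` are hypothetical names, and the sample values are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ArmInfo:
    """Hypothetical Python mirror of the IDL message arm_info."""
    trial_count: int  # field 0: number of times the arm has been selected
    weight: float     # field 1: weight of the arm


# get_arm_info returns map<string, arm_info>, i.e. one entry per arm ID.
ArmInfoMap = Dict[str, ArmInfo]


def most_selected_arm(info: ArmInfoMap) -> str:
    """Return the arm ID with the largest trial_count."""
    return max(info, key=lambda arm_id: info[arm_id].trial_count)


sample: ArmInfoMap = {"a": ArmInfo(trial_count=12, weight=3.0),
                      "b": ArmInfo(trial_count=30, weight=9.5)}
print(most_selected_arm(sample))  # b
```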

Methods

service bandit

bool register_arm(0: string arm_id)

Parameters:	arm_id – ID of the new arm to be registered
Returns:	True if the arm was registered successfully; False otherwise.

Register a new arm with the ID arm_id.

bool delete_arm(0: string arm_id)

Parameters:	arm_id – ID of the arm to be deleted
Returns:	True if the arm was deleted successfully; False otherwise.

Delete the arm with the ID arm_id.

string select_arm(0: string player_id)

Parameters:	player_id – ID of the player for whom an arm is to be selected
Returns:	The `arm_id` selected by the bandit algorithm.

Select an arm for the player according to the current state.

bool register_reward(0: string player_id, 1: string arm_id, 2: double reward)

Parameters:	player_id – ID of the player whose arm receives the reward
	arm_id – ID of the arm with which the reward is registered
	reward – amount of the reward
Returns:	True if the reward was registered successfully; False otherwise.

Register a reward for the specified arm of the specified player.

map<string, arm_info> get_arm_info(0: string player_id)

Parameters:	player_id – ID of the player
Returns:	Information on all arms of the specified player, keyed by arm ID

Get information on all arms of the specified player.

bool reset(0: string player_id)

Parameters:	player_id – ID of the player whose arms are to be reset
Returns:	True if the arms were reset successfully; False otherwise.

Reset information on all arms of the specified player.
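To see how the six methods fit together, here is an in-process stand-in for the `bandit` service. It is a hypothetical sketch, not the real client library: it uses epsilon-greedy with `assume_unrewarded` false semantics, and it assumes that `weight` is the cumulative reward and that arms registered with `register_arm` (which takes no player ID) are shared by all players.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ArmInfo:
    trial_count: int = 0   # times the arm was tried
    weight: float = 0.0    # ASSUMPTION: cumulative reward


class InMemoryBandit:
    """Hypothetical in-process stand-in for the bandit service (not a real client)."""

    def __init__(self, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.arm_ids: List[str] = []                      # arms shared by all players
        self.players: Dict[str, Dict[str, ArmInfo]] = {}  # player_id -> arm_id -> info

    def _table(self, player_id: str) -> Dict[str, ArmInfo]:
        # Lazily create per-player state for every registered arm.
        table = self.players.setdefault(player_id, {})
        for arm_id in self.arm_ids:
            table.setdefault(arm_id, ArmInfo())
        return table

    def register_arm(self, arm_id: str) -> bool:
        if arm_id in self.arm_ids:
            return False
        self.arm_ids.append(arm_id)
        return True

    def delete_arm(self, arm_id: str) -> bool:
        if arm_id not in self.arm_ids:
            return False
        self.arm_ids.remove(arm_id)
        for table in self.players.values():
            table.pop(arm_id, None)
        return True

    def select_arm(self, player_id: str) -> str:
        table = self._table(player_id)
        if not table:
            raise RuntimeError("no arms are registered")
        arms = sorted(table)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)  # explore

        def mean(arm_id: str) -> float:
            info = table[arm_id]
            return info.weight / info.trial_count if info.trial_count else 0.0

        return max(arms, key=mean)  # exploit

    def register_reward(self, player_id: str, arm_id: str, reward: float) -> bool:
        if arm_id not in self.arm_ids:
            return False
        info = self._table(player_id)[arm_id]
        info.trial_count += 1  # assume_unrewarded = false: each call is one trial
        info.weight += reward
        return True

    def get_arm_info(self, player_id: str) -> Dict[str, ArmInfo]:
        return dict(self._table(player_id))

    def reset(self, player_id: str) -> bool:
        self.players.pop(player_id, None)  # state is rebuilt lazily on next access
        return True


service = InMemoryBandit(epsilon=0.0, seed=1)
service.register_arm("a")
service.register_arm("b")
service.register_reward("alice", "a", 1.0)
print(service.select_arm("alice"))  # a
print(service.get_arm_info("alice"))
```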