Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?

Ochi, Kento; Kamiura, Moto

doi:10.1016/j.biosystems.2015.06.009

SAFETYLIT WEEKLY UPDATE

We compile citations and summaries of about 400 new articles every week.

RSS Feed

HELP: Tutorials | FAQ

CONTACT US: Contact info

Search Results

Journal Article

Overtaking method based on sand-sifter mechanism: Why do optimistic value functions find optimal solutions in multi-armed bandit problems?
Citation	Ochi K, Kamiura M. Biosystems 2015; 135: 55-65.
Affiliation	Graduate School of Science and Engineering, Tokyo Denki University, Japan; School of Science and Engineering, Tokyo Denki University, Japan; Research Institute of Electrical Communication, Tohoku University, Japan. Electronic address: kamiura@mail.dendai.ac.jp.
Copyright	(Copyright © 2015, Elsevier Publishing)
DOI	10.1016/j.biosystems.2015.06.009
PMID	26166266
Abstract	A multi-armed bandit problem is a search problem on which a learning agent must select the optimal arm among multiple slot machines generating random rewards. UCB algorithm is one of the most popular methods to solve multi-armed bandit problems. It achieves logarithmic regret performance by coordinating balance between exploration and exploitation. Since UCB algorithms, researchers have empirically known that optimistic value functions exhibit good performance in multi-armed bandit problems. The terms optimistic or optimism might suggest that the value function is sufficiently larger than the sample mean of rewards. The first definition of UCB algorithm is focused on the optimization of regret, and it is not directly based on the optimism of a value function. We need to think the reason why the optimism derives good performance in multi-armed bandit problems. In the present article, we propose a new method, which is called Overtaking method, to solve multi-armed bandit problems. The value function of the proposed method is defined as an upper bound of a confidence interval with respect to an estimator of expected value of reward: the value function asymptotically approaches to the expected value of reward from the upper bound. If the value function is larger than the expected value under the asymptote, then the learning agent is almost sure to be able to obtain the optimal arm. This structure is called sand-sifter mechanism, which has no regrowth of value function of suboptimal arms. It means that the learning agent can play only the current best arm in each time step. Consequently the proposed method achieves high accuracy rate and low regret and some value functions of it can outperform UCB algorithms. This study suggests the advantage of optimism of agents in uncertain environment by one of the simplest frameworks. Language: en