### Exploration, Exploitation and Rationality

A little introduction to what I consider to be the Mother of All Problems: the exploration-exploitation trade-off.

Let's firt draw a distinction between first- and second-order uncertainty. Knowing that a source of reward (or money, or food, etc.) will be rewarding in 70% of the occasions is uncertain knowledge because one does not know for sure what will be the next outcome (one can only know that there is a 70% probability that it is a reward). In some situations however, uncertainty can be radical, or second-order uncertainty: even the probabilities are unknown. Under radical uncertainty, cognitive agents must learn reward probabilities. Learners must, at the same time, explore their environment in order to gather information about its payoff structure and exploit this information to obtain reward. They face a deep problem—known as the exploration/exploitation tradeoff—because they cannot do both at the same time: you cannot explore all the time, you cannot exploit all the time, you must reduce exploration but cannot eliminate it. This tradeoff is usually modeled with the K-armed bandit problem.

Suppose an agent has n coins to spend in a slot machine with K arms (here K=2 and we will suppose that one arm is high-paying and the other low-paying, although the agent does not know that). The only way the agent has access to the arms’ rate of payment – and obtains reward – is by pulling them. Hence she must find an optimal tradeoff when spending its coins: trying another arm just to see how it pays or staying with the one who already paid? The goal is not only to maximize reward, but also to maximize reward while obtaining information about the arm’s rate. The process can be erroneous in two different ways: either the player can be victim of a false negative (a low-paying sequence of the high-paying arm) or false positive (a high-paying sequence paying of the low-paying paying arm).

To solve this problem, the optimal solution is to compute an index for every arm, updating this index according to the arm’s payoff and choosing the arm that has the greater index (Gittins, 1989). In the long run, this strategies amount to following decision theory after a learning phase. But as soon as switching from one arm to another has a cost, as Banks & Sundaram (1994) showed, the index strategies cannot converge towards an optimal solution. A huge literature in optimization theory, economics, management and machine learning addresses this problem (Kaelbling et al., 1996; Sundaram, 2003; Tackseung, 2004). Studies of humans or animals explicitly submitted to bandit problems, however, show that subjects tend to rely on the matching strategy (Estes, 1954). They match the probability of action with the probability of reward. In one study, for instance, (Meyer & Shi, 1995), subjects were required to select between two icons displayed on a computer screen; after each selection, a slider bar indicated the actual amount of reward obtained. The matching strategy predicted the subject’s behavior, and the same results hold for monkeys in a similar task (Bayer & Glimcher, 2005; Morris et al., 2006).

The important thing with this trade-off, is its lack of a priori solutions. Decision theory works well when we know the probabilities and the utilities, but what can we do when we don’t have them? We learn. This is the heart of natural rationality: crafting solutions—under radical uncertainty and non-stationary environments—for problems that may not have an optimal solution. Going from second- to first-order uncertainty.

See also:

References

Let's firt draw a distinction between first- and second-order uncertainty. Knowing that a source of reward (or money, or food, etc.) will be rewarding in 70% of the occasions is uncertain knowledge because one does not know for sure what will be the next outcome (one can only know that there is a 70% probability that it is a reward). In some situations however, uncertainty can be radical, or second-order uncertainty: even the probabilities are unknown. Under radical uncertainty, cognitive agents must learn reward probabilities. Learners must, at the same time, explore their environment in order to gather information about its payoff structure and exploit this information to obtain reward. They face a deep problem—known as the exploration/exploitation tradeoff—because they cannot do both at the same time: you cannot explore all the time, you cannot exploit all the time, you must reduce exploration but cannot eliminate it. This tradeoff is usually modeled with the K-armed bandit problem.

Suppose an agent has n coins to spend in a slot machine with K arms (here K=2 and we will suppose that one arm is high-paying and the other low-paying, although the agent does not know that). The only way the agent has access to the arms’ rate of payment – and obtains reward – is by pulling them. Hence she must find an optimal tradeoff when spending its coins: trying another arm just to see how it pays or staying with the one who already paid? The goal is not only to maximize reward, but also to maximize reward while obtaining information about the arm’s rate. The process can be erroneous in two different ways: either the player can be victim of a false negative (a low-paying sequence of the high-paying arm) or false positive (a high-paying sequence paying of the low-paying paying arm).

To solve this problem, the optimal solution is to compute an index for every arm, updating this index according to the arm’s payoff and choosing the arm that has the greater index (Gittins, 1989). In the long run, this strategies amount to following decision theory after a learning phase. But as soon as switching from one arm to another has a cost, as Banks & Sundaram (1994) showed, the index strategies cannot converge towards an optimal solution. A huge literature in optimization theory, economics, management and machine learning addresses this problem (Kaelbling et al., 1996; Sundaram, 2003; Tackseung, 2004). Studies of humans or animals explicitly submitted to bandit problems, however, show that subjects tend to rely on the matching strategy (Estes, 1954). They match the probability of action with the probability of reward. In one study, for instance, (Meyer & Shi, 1995), subjects were required to select between two icons displayed on a computer screen; after each selection, a slider bar indicated the actual amount of reward obtained. The matching strategy predicted the subject’s behavior, and the same results hold for monkeys in a similar task (Bayer & Glimcher, 2005; Morris et al., 2006).

The important thing with this trade-off, is its lack of a priori solutions. Decision theory works well when we know the probabilities and the utilities, but what can we do when we don’t have them? We learn. This is the heart of natural rationality: crafting solutions—under radical uncertainty and non-stationary environments—for problems that may not have an optimal solution. Going from second- to first-order uncertainty.

See also:

- Hardy-Vallée, B. (2007). Artificial Life, Natural Rationality and Probability Matching. IEEE Symposium on Artificial Life, 2007, 123-129.
- Cohen, J. D., McClure, S. M., & Yu, A. J. (2007). Should I Stay or Should I Go? How the Human Brain Manages the Trade-Off between Exploitation and Exploration. Philosophical Transactions of the Royal Society B: Biological Sciences, 362(1481), 933-942.[Great paper on the subject]

References

- Banks, J. S., & Sundaram, R. K. (1994). Switching Costs and the Gittins Index. Econometrica: Journal of the Econometric Society, 62(3), 687-694.
- Bayer, H. M., & Glimcher, P. W. (2005). Midbrain Dopamine Neurons Encode a Quantitative Reward Prediction Error Signal. Neuron, 47(1), 129.
- Estes, W. K. (1954). Individual Behavior in Uncertain Situations: An Interpretation in Terms of Statistical Association Theory. In R. M. Thrall, C. H. Coombs & R. L. Davies (Eds.), Decision Processes (pp. 127-137). New York: Wiley.
- Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices. New York: Wiley.
- Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4, 237-285.
- Meyer, R. J., & Shi, Y. (1995). Sequential Choice under Ambiguity: Intuitive Solutions to the Armed-Bandit Problem. Management Science, 41(5), 817-834.
- Morris, G., Nevet, A., Arkadir, D., Vaadia, E., & Bergman, H. (2006). Midbrain Dopamine Neurons Encode Decisions for Future Action. Nat Neurosci, 9(8), 1057-1063.
- Sundaram, R. K. (2003). Generalized Bandit Problems: Working Paper, Stern School of Business.
- Tackseung, J. (2004). A Survey on the Bandit Problem with Switching Costs. De Economist, V152(4), 513-541.
- Yen, G., Yang, F., & Hickey, T. (2002). Coordination of Exploration and Exploitation in a Dynamic Environment. International Journal of Smart Engineering System Design, 4(3), 177-182.

## 1 Comments:

Thanks for this interesting post!

I wonder if the exploration-exploitation trade-off could also be used to explain some aspects of internal perspective and self-modelling, because if you internally represent the area you have to explore, the cost of the exploration is likely to be reduced. I wonder if this explanation would be a feasible one for the internal self-modeling seen in various experiments in robotics (such as Floreano and Mondada 1996, Tani & Yamamoto 2002, and especially the self-mapping robots of Bongard et al. 2006: http://ccsl.mae.cornell.edu/research/selfmodels/) as well as in the body and peripersonal space maps in humans. (Blakeslee & Blakeslee 2007, http://sandrablakeslee.com/articles/peripersonal_space_jul04.php).

References:

Blakeslee, Sandra and Matthew Blakeslee. 2007. “The Body Has A Mind of its Own: How Body Maps in Your Brain Help You Do (Most) Everything Better.” New York: Random House.

Bongard, J. V. Zykov, & Lipson, H. 2006. “Resilient Machines Through Continuous Self-Modeling” Science 314

Floreano, Dario, and Francesco Mondada. 1996. “Evolution of homing navigation in a real mobile robot”, IEEE Transactions on Systems, Man, and Cybernetics – Part B: Cybernetics 26:396–407.

Tani, J. and J. Yamamoto. 2002. "On the dynamics of robot exploration learning", Cognitive Systems Research 3.3, 459-470.

[P.S. last time I tried to post this my PC crashed, so sorry if I sent this comment twice]

Post a Comment