Natural Rationality | decision-making in the economy of nature

8/1/07

A basic mode of behavior: a review of reinforcement learning, from a computational and biological point of view.

The journal Frontiers of Interdisciplinary Research in the Life Sciences (HFSP Publishing) has made its first issue freely available online. The journal specializes in "innovative interdisciplinary research at the interface between biology and the physical sciences." An excellent paper by Kenji Doya (complete, clear, exhaustive) presents a state-of-the-art review of reinforcement learning, both as a computational theory (the procedures) and as a biological mechanism (neural activity). It is exactly what the title announces: Reinforcement learning: Computational theory and biological mechanisms. The paper covers research in neuroscience, AI, computer science, robotics, neuroeconomics, and psychology. See this nice schema of reinforcement learning in the brain:



(From the paper:) A schematic model of implementation of reinforcement learning in the cortico-basal ganglia circuit (Doya, 1999, 2000). Based on the state representation in the cortex, the striatum learns state and action value functions. The state value coding striatal neurons project to dopamine neurons, which send the TD signal back to the striatum. The outputs of action value coding striatal neurons channel through the pallidum and the thalamus, where stochastic action selection may be realized.
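
For the computationally inclined, here is a minimal toy sketch (my own illustrative code, not from the paper) of the actor-critic scheme the figure depicts: a critic learning state values and an actor learning action values, both trained by the same TD error that the dopamine signal is thought to carry, with softmax sampling standing in for the stochastic pallido-thalamic action selection. All names, sizes and parameters here are assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 2
V = np.zeros(n_states)               # critic: state values (the "striatal" value neurons)
Q = np.zeros((n_states, n_actions))  # actor: action values
alpha, gamma, beta = 0.1, 0.9, 2.0   # learning rate, discount factor, softmax inverse temperature

def select_action(state):
    # Stochastic action selection, standing in for the pallido-thalamic channel.
    p = np.exp(beta * Q[state])
    p /= p.sum()
    return rng.choice(n_actions, p=p)

def update(state, action, reward, next_state):
    # TD error: the putative dopaminergic teaching signal.
    delta = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * delta            # critic update
    Q[state, action] += alpha * delta    # actor update
    return delta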

This is exactly what a theory of natural rationality (and economics tout court) needs: a plausible, tractable, and real computational mechanism grounded in neurobiology. As Selten once said, speaking of reinforcement learning:

a theory of bounded rationality cannot avoid this basic mode of behavior (Selten, 2001, p. 16)


References

Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7-8), 961-974.
Doya, K. (2000). Complementary roles of basal ganglia and cerebellum in learning and motor control. Current Opinion in Neurobiology, 10(6), 732-739.
Doya, K. (2007). Reinforcement learning: Computational theory and biological mechanisms. HFSP Journal, 1(1), 30-40.
Selten, R. (2001). What is bounded rationality? In G. Gigerenzer & R. Selten (Eds.), Bounded rationality: The adaptive toolbox (pp. 13-36). Cambridge, MA: MIT Press.


3/22/07

A short primer on dopamine and TD learning

Much research indicates that dopaminergic systems play an important role in decision-making, and that their activity can be precisely formulated by TD algorithms. Here is a brief description, from a forthcoming paper of mine:
--

According to many findings, utility computation is realized by dopaminergic systems, a network of structures in ‘older’ brain areas highly involved in motivation and valuation (Berridge, 2003; Montague & Berns, 2002). Neuroscience has revealed their role in working memory, motivation, learning, decision-making, planning and motor control (Morris et al., 2006). Dopaminergic neurons are activated by stimuli related to primary rewards (juice, food) and by stimuli that recruit attention (new or intense ones). It is important to note that they do not encode hedonic experiences, but predictions of expected reward. Dopaminergic neurons respond selectively to prediction errors: the presence of an unexpected reward or the absence of an expected reward. In other words, they detect the discrepancies between predicted and experienced utility. Moreover, dopaminergic neurons learn from their mistakes: from these prediction errors they learn to predict future rewarding events and can then bias action choice. Computational neuroscience has identified a class of reinforcement learning algorithms that mirror dopaminergic activity (Niv et al., 2005; Suri & Schultz, 2001). It is suggested that dopaminergic neurons broadcast to different brain areas a reward-prediction error signal similar to the one computed by the temporal difference (TD) algorithms developed by computer scientists (Sutton & Barto, 1987, 1998). These dopaminergic mechanisms use sensory inputs to predict future rewards. The difference between successive value predictions is computed and constitutes an error signal. The model then updates a value function (the function that maps state-action pairs to numerical values) according to the prediction error. TD-learning algorithms are thus neural mechanisms of decision-making under uncertainty implemented in dopaminergic systems. They are involved not only in the prediction of basic rewards, such as food, but also of abstract stimuli like art, branded goods, love or trust (Montague et al., 2006, p. 420). From the mouthwatering vision of a filet mignon in a red wine sauce to the intellectual contemplation of Andy Warhol’s Pop Art Brillo Boxes, the valuation mechanisms are essentially the same.
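
To see the error signal at work, here is a small TD(0) simulation (an illustrative sketch of mine, not taken from any of the papers cited; all names and parameters are assumptions). A cue is reliably followed, two time steps later, by a reward. Early in learning the prediction error occurs at reward delivery; after learning it shifts to the cue; and an omitted reward produces a negative error. This is the signature pattern recorded in dopaminergic neurons.

import numpy as np

alpha, gamma = 0.2, 1.0
T = 3                # within-trial time steps: cue, delay, reward
V = np.zeros(T + 1)  # learned values of the within-trial states; V[T] is terminal

def run_trial(reward_delivered=True):
    # The cue itself arrives unpredictably, so the pre-cue baseline value is 0:
    # the first prediction error is the jump from baseline into the cue state.
    deltas = [gamma * V[0] - 0.0]
    for t in range(T):
        r = 1.0 if (t == T - 1 and reward_delivered) else 0.0
        delta = r + gamma * V[t + 1] - V[t]  # TD prediction error
        V[t] += alpha * delta                # value-function update
        deltas.append(delta)
    return np.round(deltas, 2)

print("first trial:   ", run_trial())                        # error at reward time
for _ in range(300):
    run_trial()
print("after learning:", run_trial())                        # error at cue time
print("omitted reward:", run_trial(reward_delivered=False))  # negative error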

Berridge, K. C. (2003). Irrational pursuits: Hyper-incentives from a visceral brain. In I. Brocas & J. Carrillo (Eds.), The psychology of economic decisions (pp. 17-40). Oxford: Oxford University Press.
Montague, P. R., & Berns, G. S. (2002). Neural economics and the biological substrates of valuation. Neuron, 36(2), 265-284.
Montague, P. R., King-Casas, B., & Cohen, J. D. (2006). Imaging valuation models in human choice. Annu Rev Neurosci, 29, 417-448.
Morris, G., Nevet, A., Arkadir, D., Vaadia, E., & Bergman, H. (2006). Midbrain dopamine neurons encode decisions for future action. Nat Neurosci, 9(8), 1057-1063.
Niv, Y., Duff, M. O., & Dayan, P. (2005). Dopamine, uncertainty and TD learning. Behav Brain Funct, 1, 6.
Suri, R. E., & Schultz, W. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comput, 13(4), 841-862.
Sutton, R. S., & Barto, A. G. (1987). A temporal-difference model of classical conditioning. Paper presented at the Ninth Annual Conference of the Cognitive Science Society.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, Mass.: MIT Press.