|Norvig I 831
Reinforcement Learning/AI Research/Norvig/Russell: In many complex domains, reinforcement learning [by reward and punishment] is the only feasible way to train a program to perform at high levels. For example, in game playing, it is very hard for a human to provide accurate and consistent evaluations of large numbers of positions, which would be needed to train an evaluation function directly from examples. Instead, the program can be told when it has won or lost, and it can use this information to learn an evaluation function that gives reasonably accurate estimates of the probability of winning from any given position. Similarly, it is extremely difficult to program an agent to fly a helicopter; yet given appropriate negative rewards for crashing, wobbling, or deviating from a set course, an agent can learn to fly by itself.
A. Passive reinforcement learning
Situation: an agent is placed in an environment and must learn to behave successfully therein.
A utility-based agent learns a utility function on states and uses it to select actions that maximize the expected outcome utility.
A Q-learning agent learns an action-utility function, or Q-function, giving the expected utility of taking a given action in a given state.
A reflex agent learns a policy that maps directly from states to actions.
Exploration: an agent must experience as much as possible of its environment in order to learn how to behave in it. >Markov decision processes/Norvig.
Norvig I 833
Passive reinforcement learning: A simple method for direct utility estimation was invented in the late 1950s in the area of adaptive control theory by Widrow and Hoff (1960)(1). The idea is that the utility of a state is the expected total reward from that state onward (called the expected reward-to-go), and each trial provides a sample of this quantity for each state visited.
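Direct utility estimation can be sketched in a few lines. The following is an illustrative reconstruction, not the book's code; the function name, the trial data layout (a list of (state, reward) pairs), and the discount parameter `gamma` are assumptions made here:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) as the average observed reward-to-go from s.
    `trials` is a list of [(state, reward), ...] sequences; names and
    the discount `gamma` are illustrative assumptions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk each trial backwards, accumulating discounted reward-to-go,
        # so every visited state contributes one sample of its utility.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}
```

Each visited state simply accumulates one Monte Carlo sample of its reward-to-go per trial, which is why the method ignores the relations between neighboring states.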
Utility: the utilities of states are not independent! The utility of each state equals its own reward plus the expected utility of its successor states. That is, the utility values obey the Bellman equations for a fixed policy. (>Values/AI Research).
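Written out in the book's standard notation (discount factor γ, fixed policy π), the fixed-policy Bellman equation the passage refers to is:

```latex
U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')
```

That is, the utility of s is its own reward plus the discounted expected utility of the successor reached by following π.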
Problem: By ignoring the connections between states, direct utility estimation misses opportunities for learning.
Norvig I 834
Adaptive dynamic programming/ADP: An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method. Alternatively, we can adopt the approach of modified policy iteration (…), using a simplified value iteration process to update the utility estimates after each change to the learned model.
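A minimal sketch of one ADP step, under assumptions made here (the data layout `counts[(s, a)][s2]` for observed transitions, and a few simplified value-iteration sweeps rather than a full MDP solve):

```python
def adp_update(U, counts, R, policy, gamma=0.9, sweeps=20):
    """One ADP step (a hedged sketch, not Russell/Norvig's code):
    turn transition counts into estimated probabilities, then run a few
    simplified value-iteration sweeps for the *fixed* policy."""
    # counts[(s, a)][s2] = number of observed transitions s --a--> s2
    P = {}
    for (s, a), outcomes in counts.items():
        n = sum(outcomes.values())
        P[(s, a)] = {s2: k / n for s2, k in outcomes.items()}
    for _ in range(sweeps):
        for s in U:
            trans = P.get((s, policy[s]), {})
            U[s] = R[s] + gamma * sum(p * U.get(s2, 0.0)
                                      for s2, p in trans.items())
    return U
```

Because the learned model ties each state's utility to its successors', a single observed transition can propagate information to many states at once — exactly the opportunity direct utility estimation misses.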
Norvig I 836
Temporal difference learning/TD: All temporal-difference methods work by adjusting the utility estimates towards the ideal equilibrium that holds locally when the utility estimates are correct.
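The core TD(0) update for a fixed policy can be sketched as follows (parameter names are illustrative; α is the learning rate, γ the discount factor):

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """Temporal-difference update for passive learning:
    nudge U(s) toward the one-step sample r + gamma * U(s_next).
    The bracketed term is the 'TD error'."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U
```

When the estimates satisfy the fixed-policy Bellman equation, the expected TD error is zero, which is the local equilibrium the text describes.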
Norvig I 839
B. Active reinforcement learning:
A passive learning agent has a fixed policy that determines its behavior. An active agent must decide what actions to take. First, the agent will need to learn a complete model with outcome probabilities for all actions, (…). Next, we need to take into account the fact that the agent has a choice of actions. The utilities it needs to learn are those defined by the optimal policy; they obey the >Bellman equations (…). The final issue is what to do at each step. Having obtained a utility function U that is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead to maximize the expected utility; alternatively, if it uses policy iteration, the optimal policy is already available, so it should simply execute the action the optimal policy recommends.
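The one-step look-ahead can be sketched as below; the layout `P[(s, a)]` mapping successor states to probabilities is an assumption of this sketch. Note that R(s) and the discount factor drop out of the comparison, since they do not depend on which action is chosen:

```python
def best_action(s, actions, P, U):
    """One-step look-ahead: choose the action that maximizes the
    expected utility of the successor state under the learned model."""
    def expected_utility(a):
        return sum(p * U.get(s2, 0.0)
                   for s2, p in P.get((s, a), {}).items())
    return max(actions, key=expected_utility)
```

A purely greedy agent of this kind tends to under-explore, which is why exploration strategies (mentioned above) matter for active learning.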
Norvig I 843
Q-Learning: There is an alternative TD method, called Q-learning, which learns an action-utility representation instead of learning utilities. [A] TD [temporal difference] agent that learns a Q-function does not need a model of the form P(s’| s, a), either for learning or for action selection. For this reason, Q-learning is called a model-free method.
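The standard tabular Q-learning update, written as a sketch (function and variable names are choices made here, not the book's):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Model-free Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    No transition model P(s'|s,a) appears anywhere."""
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions),
                    default=0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```

Action selection is equally model-free: the agent can simply compare Q(s, a) across actions, with no look-ahead through a transition model.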
Norvig I 845
Function approximation: simply means using any sort of representation for the Q-function other than a lookup table. The representation is viewed as approximate because it might not be the case that the true utility function or Q-function can be represented in the chosen form.
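The simplest non-tabular representation is a linear combination of features of (s, a). This is a generic sketch — the feature design and learning rate are the modeler's choices, not prescribed by the text:

```python
def q_hat(theta, features):
    """Linear approximation: Q_hat(s, a) = sum_i theta_i * f_i(s, a),
    where `features` is the feature vector f(s, a)."""
    return sum(t * f for t, f in zip(theta, features))

def linear_fa_update(theta, features, target, alpha=0.1):
    """Move theta so q_hat approaches a training target (e.g. a TD
    target). For a linear form, the gradient with respect to theta
    is just the feature vector itself."""
    error = target - q_hat(theta, features)
    return [t + alpha * error * f for t, f in zip(theta, features)]
```

If the true Q-function is not representable in the chosen linear family, the learner can only reach the best approximation within that family — hence "approximate".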
Norvig I 846
Generalization: The compression achieved by a function approximator allows the learning agent to generalize from states it has visited to states it has not visited. That is, the most important aspect of function approximation is not that it requires less space, but that it allows for inductive generalization over input states.
Norvig I 848
Policies: a policy π is a function that maps states to actions. (…) we could represent π by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value (…). If the policy is represented by Q-functions, then policy search results in a process that learns Q-functions. This process is not the same as Q-learning! In Q-learning with function approximation, the algorithm finds a value of θ such that Q̂θ is “close” to Q*, the optimal Q-function.
Policy search: Policy search, on the other hand, finds a value of θ that results in good performance; (…).
VsPolicy search: Problem: One problem with policy representations of this kind (…) is that the policy is a discontinuous function of the parameters when the actions are discrete. This means that the value of the policy may also change discontinuously, which makes gradient-based search difficult.
Solution: For this reason, policy search methods often use a stochastic policy representation πθ(s, a), which specifies the probability of selecting action a in state s.
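A common stochastic representation is the softmax (Gibbs) policy, sketched here under assumed names; `q_hat` stands for any parameterized action-value estimate:

```python
import math

def softmax_policy(theta, s, actions, q_hat):
    """Stochastic policy pi_theta(s, a) with probabilities proportional
    to exp(q_hat(theta, s, a)). Because the probabilities vary smoothly
    with theta, gradient-based policy search becomes feasible even
    though the actions themselves are discrete."""
    scores = [math.exp(q_hat(theta, s, a)) for a in actions]
    z = sum(scores)
    return {a: score / z for a, score in zip(actions, scores)}
```

When all action values are equal the policy is uniform; as one value grows, probability mass shifts continuously toward that action rather than jumping.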
Norvig I 854
History of reinforcement learning: Turing (1948(2), 1950(3)) proposed the reinforcement-learning approach, although he was not convinced of its effectiveness, writing, “the use of punishments and rewards can at best be a part of the teaching process.” Arthur Samuel’s work (1959)(4) was probably the earliest successful machine learning research. Around the same time, researchers in adaptive control theory (Widrow and Hoff, 1960)(1), building on work by Hebb (1949)(5), were training simple networks using the delta rule. The cart–pole work of Michie and Chambers (1968)(6) can also be seen as a reinforcement learning method with a function approximator. The psychological literature on reinforcement learning is much older; Hilgard and Bower (1975)(7) provide a good survey.
Neuroscience: The neuroscience text by Dayan and Abbott (2001)(8) describes possible neural implementations of temporal-difference learning, while Dayan and Niv (2008)(9) survey the latest evidence from neuroscientific and behavioral experiments.
Markov decision process: The connection between reinforcement learning and Markov decision processes was first made by Werbos (1977)(10), but the development of reinforcement learning in AI stems from work at the University of Massachusetts in the early 1980s (Barto et al., 1981)(11). The paper by Sutton (1988) provides a good historical overview.
Temporal difference learning: The combination of temporal-difference learning with the model-based generation of simulated experiences was proposed in Sutton’s DYNA architecture (Sutton, 1990)(12). The idea of prioritized sweeping was introduced independently by Moore and Atkeson (1993)(13) and
Norvig I 855
Peng and Williams (1993)(14).
Q-learning: was developed in Watkins’s Ph.D. thesis (1989)(15), while SARSA appeared in a technical report by Rummery and Niranjan (1994)(16).
Function approximation: Function approximation in reinforcement learning goes back to the work of Samuel, who used both linear and nonlinear evaluation functions and also used feature-selection methods to reduce the feature space. Later methods include the CMAC (Cerebellar Model Articulation Controller) (Albus, 1975)(17), which is essentially a sum of overlapping local kernel functions, and the associative neural networks of Barto et al. (1983)(18). Neural networks are currently the most popular form of function approximator. The best-known application is TD-Gammon (Tesauro, 1992(19), 1995(20)), (…).
Policy search: Policy search methods were brought to the fore by Williams (1992)(21), who developed the REINFORCE family of algorithms. Later work by Marbach and Tsitsiklis (1998)(22), Sutton et al. (2000)(23), and Baxter and Bartlett (2000)(24) strengthened and generalized the convergence results for policy search. The method of correlated sampling for comparing different configurations of a system was described formally by Kahn and Marshall (1953)(25), but seems to have been known long before that. Its use in reinforcement learning is due to Van Roy (1998)(26) and Ng and Jordan (2000)(27); the latter paper also introduced the PEGASUS algorithm and proved its formal properties.
Norvig I 857
Inverse reinforcement learning: Russell (1998)(28) describes the task of inverse reinforcement learning - figuring out what the reward function must be from an example path through that state space. This is useful as a part of apprenticeship learning, or as a part of doing science—we can understand an animal or robot by working backwards from what it does to what its reward function must be.
1. Widrow, B. and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record, pp. 96–104.
2. Turing, A. (1948). Intelligent machinery. Tech. rep. National Physical Laboratory. reprinted in (Ince,
3. Turing, A. (1950). Computing machinery and intelligence. Mind, 59, 433–460.
4. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.
5. Hebb, D. O. (1949). The Organization of Behavior. Wiley.
6. Michie, D. and Chambers, R. A. (1968). BOXES: An experiment in adaptive control. In Dale, E. and Michie, D. (Eds.), Machine Intelligence 2, pp. 125–133. Elsevier/North-Holland.
7. Hilgard, E. R. and Bower, G. H. (1975). Theories of Learning (4th edition). Prentice-Hall.
8. Dayan, P. and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
9. Dayan, P. and Niv, Y. (2008). Reinforcement learning and the brain: The good, the bad and the ugly. Current Opinion in Neurobiology, 18(2), 185–196.
10. Werbos, P. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, 25–38.
11. Barto, A. G., Sutton, R. S., and Brouwer, P. S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40(3), 201–211.
12. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML-90, pp. 216–224.
13. Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping—Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
14. Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behavior, 2, 437–454.
15. Watkins, C. J. (1989). Models of Delayed Reinforcement Learning. Ph.D. thesis, Psychology Department, Cambridge University.
16. Rummery, G. A. and Niranjan, M. (1994). Online Q-learning using connectionist systems. Tech. rep. CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
17. Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). J. Dynamic Systems, Measurement, and Control, 97, 270–277.
18. Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 834–
19. Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3–4), 257–
20. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. CACM, 38(3), 58–68.
21. Williams, R. J. (1992). Simple statistical gradient following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
22. Marbach, P. and Tsitsiklis, J. N. (1998). Simulation based optimization of Markov reward processes. Technical report LIDS-P-2411, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology.
23. Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Solla, S. A., Leen, T. K., and Müller, K.-R. (Eds.), NIPS 12, pp. 1057–1063. MIT Press.
24. Baxter, J. and Bartlett, P. (2000). Reinforcement learning in POMDP’s via direct gradient ascent. In ICML-00, pp. 41–48.
25. Kahn, H. and Marshall, A. W. (1953). Methods of reducing sample size in Monte Carlo computations. Operations Research, 1(5), 263–278.
26. Van Roy, B. (1998). Learning and value function approximation in complex decision processes. Ph.D. thesis, Laboratory for Information and Decision Systems, MIT.
27. Ng, A. Y. and Jordan, M. I. (2000). PEGASUS: A policy search method for large MDPs and POMDPs. In UAI-00, pp. 406–415.
28. Russell, S. J. (1998). Learning agents for uncertain environments (extended abstract). In COLT-98, pp. 101–103.

Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ 2010.