|Norvig I 645
Sequential Decision Making/AI research/Norvig/Russell: [this is about] the computational issues involved in making decisions in a stochastic environment. Sequential decision problems incorporate utilities, uncertainty, and sensing, and include search and planning problems as special cases. >Planning/Norvig, >Decision networks/Norvig, >Decision theory/AI Research, >Utility/AI Research, >Utility theory/Norvig, >Environment/AI research, >Multi-attribute utility theory/AI research.
Norvig I 649
Optimal policy: the optimal policy for a finite horizon is non-stationary. With no fixed time limit, on the other hand, there is no reason to behave differently in the same state at different times. Hence, the optimal action depends only on the current state, and the optimal policy is stationary.
States: In the terminology of multi-attribute utility theory, each state si can be viewed as an attribute of the state sequence [s0, s1, s2 . . .]. >Values/AI research.
Norvig I 684
Sequential decision problems in uncertain environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
Norvig I 685
Richard Bellman developed the ideas underlying the modern approach to sequential decision problems while working at the RAND Corporation beginning in 1949. (…) Bellman’s book, Dynamic Programming (1957)(1), gave the new field a solid foundation and introduced the basic algorithmic approaches. Ron Howard’s Ph.D. thesis (1960)(2) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced by Bellman and Dreyfus (1962)(3). Modified policy iteration is due to van Nunen (1976)(4) and Puterman and Shin (1978)(5). Asynchronous policy iteration was analyzed by Williams and Baird (1993)(6) (…).
The analysis of discounting in terms of stationary preferences is due to Koopmans (1972)(7). The texts by Bertsekas (1987)(8), Puterman (1994)(9), and Bertsekas and Tsitsiklis (1996)(10) provide a rigorous introduction to sequential decision problems. Papadimitriou and Tsitsiklis (1987)(11) describe results on the computational complexity of MDPs. Seminal work by Sutton (1988)(12) and Watkins (1989)(13) on reinforcement learning methods for solving MDPs played a significant role in introducing MDPs into the AI community, as did the later survey by Barto et al. (1995)(14). >Markov Decision Processes/Norvig.
1. Bellman, R. E. (1957). Dynamic Programming. Princeton University Press
2. Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press.
3. Bellman, R. E. and Dreyfus, S. E. (1962). Applied Dynamic Programming. Princeton University Press.
4. van Nunen, J. A. E. E. (1976). A set of successive approximation methods for discounted Markovian decision problems. Zeitschrift fur Operations Research, Serie A, 20(5), 203–208.
5. Puterman, M. L. and Shin, M. C. (1978). Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11), 1127-1137.
6. Williams, R. J. and Baird, L. C. I. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS-93-14, College of Computer Science, Northeastern University.
7. Koopmans, T. C. (1972). Representation of preference orderings over time. In McGuire, C. B. and Radner, R. (Eds.), Decision and Organization. Elsevier/North-Holland.
8. Bertsekas, D. (1987). Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.
9. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley
10. Bertsekas, D. and Tsitsiklis, J. N. (1996). Neurodynamic programming. Athena Scientific.
11. Papadimitriou, C. H. and Tsitsiklis, J. N. (1987). The complexity of Markov decision processes.
Mathematics of Operations Research, 12(3), 441-450.
12. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning,
13. Watkins, C. J. (1989). Models of Delayed Reinforcement Learning. Ph.D. thesis, Psychology Department, Cambridge University.
14. Barto, A. G., Bradtke, S. J., and Singh, S. P. (1995). Learning to act using real-time dynamic programming. AIJ, 73(1), 81-138._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010