|Norvig I 645
Values/utility/decision theory/AI research/Norvig/Russell: in making decisions in a stochastic environment. Sequential decision problems incorporate utilities, uncertainty, and sensing, and include search and planning problems as special cases.
Norvig I 652
Bellman equation for utilities: (…)there is a direct relationship between the utility of a state and the utility of its neighbors: the utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action. Richard Bellman (1957)(1).
The Bellman equation is the basis of the value iteration algorithm for solving MDPs (Markov decision processes). If there are n possible states, then there are n Bellman equations, one for each state. The n equations contain n unknowns - the utilities of the states.
Problem: the equations are nonlinear, because the “max” operator is not a linear operator. Whereas systems of linear equations can be solved quickly using linear algebra techniques, systems of nonlinear equations are more problematic.
Norvig I 654
Value iteration: (…) value iteration eventually converges to a unique set of solutions of the Bellman equations.
Contraction: a contraction is a function of one argument that, when applied to two different inputs in turn, produces two output values that are “closer together,” by at least some constant factor, than the original inputs. For example, the function “divide by two” is a contraction, because, after we divide any two numbers by two, their difference is halved. Notice that the “divide by two” function has a fixed point, namely zero that is unchanged by the application of the function.
Norvig I 656
Policy iteration: (…) it is possible to get an optimal policy even when the utility function estimate is inaccurate. If one action is clearly better than all others, then the exact magnitude of the utilities on the states involved need not be precise. The policy iteration algorithm alternates (…) two steps, policy evaluation and policy improvement. The algorithm terminates when the policy improvement step yields no change in the utilities. >Game theory/AI research.
1. Bellman, R. E. (1957). Dynamic Programming. Princeton University Press._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010