Norvig I 694
Supervised Learning/AI Research/Norvig/Russell: In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common unsupervised learning task is clustering: detecting
Norvig I 695
potentially useful clusters of input examples.
Def Supervised learning: In supervised learning the agent observes some example input–output pairs and learns a function that maps from input to output.
In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. Even the labels themselves may not be the oracular truths that we hope for. Imagine that you are trying to build a system to guess a person’s age from a photo. You gather some labeled examples by snapping pictures of people and asking their age. That’s supervised learning. But in reality some of the people lied about their age. It’s not just that there is random noise in the data; rather the inaccuracies are systematic, and to uncover them is an unsupervised learning problem involving images, self-reported ages, and true (unknown) ages. Thus, both noise and lack of labels create a continuum between supervised and unsupervised learning.
The task of supervised learning is this:
Given a training set of N example input–output pairs
(x1, y1), (x2, y2), ..., (xN, yN),
where each yj was generated by an unknown function y = f(x),
discover a function h that approximates the true function f.
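The task statement above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "unknown" function f is made up here so that training pairs can be generated, the hypothesis space is restricted to straight lines, and fit_line is a hypothetical helper using ordinary least squares, not a method prescribed by the source.

```python
def fit_line(examples):
    """Return a hypothesis h(x) = w1*x + w0 that minimizes squared error on the examples."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return lambda x: w1 * x + w0


def f(x):
    """Stand-in for the unknown true function (an assumption made for this sketch)."""
    return 2.0 * x + 1.0


# Training set of N = 5 input-output pairs (xj, yj) with yj = f(xj).
training_set = [(x, f(x)) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

# The learner sees only the pairs, never f itself.
h = fit_line(training_set)

# Compare h and f on an input that was not in the training set.
print(h(10.0), f(10.0))
```

Because the made-up f happens to be a line, the learned h agrees with it on new inputs; with a different f the same learner would return only the best linear approximation.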
Norvig I 696
Classification: When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification, and is called Boolean or binary classification if there are only two values.
Regression: When y is a number (such as tomorrow’s temperature), the learning problem is called regression.
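The difference between the two problem types lies only in the kind of output. As a rough sketch, the same nearest-neighbor rule is applied below to a classification data set and a regression data set. Both data sets (humidity readings paired with weather labels or temperatures) and the helper nearest_neighbor are invented for illustration.

```python
def nearest_neighbor(examples, x):
    """Return the output of the training example whose input is closest to x."""
    return min(examples, key=lambda pair: abs(pair[0] - x))[1]


# Classification: the output is one of a finite set of values (hypothetical data).
weather_examples = [(30.0, "sunny"), (60.0, "cloudy"), (90.0, "rainy")]

# Regression: the output is a number (hypothetical data).
temperature_examples = [(30.0, 25.0), (60.0, 20.0), (90.0, 15.0)]

predicted_label = nearest_neighbor(weather_examples, 85.0)
predicted_number = nearest_neighbor(temperature_examples, 40.0)
print(predicted_label, predicted_number)
```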
Hypotheses: In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better. >Representation/Norvig, >Knowledge/AI Research, >Learning/AI Research.
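The tradeoff between complex and simple hypotheses can be made concrete with a small sketch. It assumes a made-up true function f(x) = x, training outputs disturbed by a fixed alternating error of +1 and -1, and two hypothetical learners: a memorizer that returns the stored output of the nearest training input (a complex hypothesis that fits the training data perfectly) and a least-squares line (a simple one). None of these choices come from the source.

```python
def fit_line(examples):
    """Simple hypothesis: least-squares line h(x) = w1*x + w0."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return lambda x: w1 * x + w0


def memorize(examples):
    """Complex hypothesis: return the stored output of the nearest training input."""
    return lambda x: min(examples, key=lambda pair: abs(pair[0] - x))[1]


def mean_squared_error(h, examples):
    """Average squared difference between h(x) and y over the examples."""
    return sum((h(x) - y) ** 2 for x, y in examples) / len(examples)


# Training outputs are f(x) = x plus a fixed error of +1 or -1 (assumed data).
training_set = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 2.0), (4.0, 5.0), (5.0, 4.0)]
# Held-out examples follow f(x) = x exactly.
held_out_set = [(x, x) for x in [0.4, 1.4, 2.4, 3.4, 4.4]]

simple_h = fit_line(training_set)
complex_h = memorize(training_set)

for name, h in [("line", simple_h), ("memorizer", complex_h)]:
    print(name, mean_squared_error(h, training_set), mean_squared_error(h, held_out_set))
```

On this data the memorizer has zero error on the training set but a larger error on the held-out examples than the line, which is the pattern the tradeoff describes.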
Norvig I 697
Realization: We say that a learning problem is realizable if the hypothesis space contains the true function. Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true function is not known.
There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space.
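Realizability can be illustrated with a toy sketch. It assumes a four-element input domain and a hypothetical hypothesis space of threshold functions, and it checks realizability by comparing each hypothesis with a known target. That check is possible only because the target is made up here; as noted above, in a real learning problem the true function is unknown.

```python
DOMAIN = [0, 1, 2, 3]

# Hypothetical hypothesis space: threshold functions h_t(x) = (x >= t) for t = 0..4.
HYPOTHESIS_SPACE = {t: (lambda x, t=t: x >= t) for t in range(5)}


def is_realizable(f):
    """True if some hypothesis in the space agrees with f on every input in DOMAIN."""
    return any(
        all(h(x) == f(x) for x in DOMAIN)
        for h in HYPOTHESIS_SPACE.values()
    )


def step(x):
    """A target that is itself a threshold function."""
    return x >= 2


def is_even(x):
    """A target that no single threshold can reproduce."""
    return x % 2 == 0


print(is_realizable(step), is_realizable(is_even))
```

A more expressive space, for example all sixteen Boolean functions on this domain, would make both targets realizable, but the learner would then have more hypotheses to search through.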
Norvig I 759
History: Cross-validation was first introduced by Larson (1931)(1), and in a form close to what we show by Stone (1974)(2) and Golub et al. (1979)(3). The regularization procedure is due to Tikhonov (1963)(4). Guyon and Elisseeff (2003)(5) introduce a journal issue devoted to the problem of feature selection. Banko and Brill (2001)(6) and Halevy et al. (2009)(7) discuss the advantages of using large amounts of data. It was Robert Mercer, a speech researcher, who said in 1985, “There is no data like more data.”
Lyman and Varian (2003)(8) estimate that about 5 exabytes (5 × 10^18 bytes) of data was produced in 2002, and that the rate of production is doubling every 3 years. Theoretical analysis of learning algorithms began with the work of Gold (1967)(9) on identification in the limit. This approach was motivated in part by models of scientific discovery from the philosophy of science (Popper, 1962)(10), but has been applied mainly to the problem of learning grammars from example sentences (Osherson et al., 1986)(11).
1. Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. J. Educational Psychology, 22, 45-55.
2. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Royal Statistical Society, 36, 111-133.
3. Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21 (2).
4. Tikhonov, A. N. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 5, 1035-1038.
5. Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. JMLR, pp. 1157-
6. Banko, M. and Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In ACL-01, pp. 26-33.
7. Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April, 8-12.
8. Lyman, P. and Varian, H. R. (2003). How much information? www.sims.berkeley.edu/how-much-info-2003.
9. Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.
10. Popper, K. R. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. Basic Books.
11. Osherson, D. N., Stob, M., and Weinstein, S. (1986). Systems That Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press.
_____________
Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010