Peter Norvig on Learning Theory - Dictionary of Arguments
Norvig I 713
Theory of Learning/Norvig/Russell: [main problem:] how can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs? In formal terms, how do we know that the hypothesis h is close to the target function f if we don’t know what f is? These questions have been pondered for several centuries.
In more recent decades, other questions have emerged: how many examples do we need to get a good h? What hypothesis space should we use? If the hypothesis space is very complex, can we even find the best h, or do we have to settle for a local maximum in the
Norvig I 714
space of hypotheses? How complex should h be? How do we avoid overfitting? (>Decision tree/Norvig, >Complexity/Norvig, >Learning/AI Research).
Computational learning theory: (…) lies at the intersection of AI, statistics, and theoretical computer science. The underlying principle is that any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.
PAC: Any learning algorithm that returns hypotheses that are probably approximately correct is called a PAC learning algorithm; we can use this approach to provide bounds on the performance of various learning algorithms.
Stationarity assumption: future examples are going to be drawn from the same fixed distribution P(E)=P(X, Y ) as past examples.
Correctness: A hypothesis h is called approximately correct if error (h) ≤ , where is a small constant. We will show that we can find an N such that, after seeing N examples, with high probability, all consistent hypotheses will be approximately correct. >Learning/AI Research, >Artificial Neural Networks.
Norvig I 757
Computational learning theory analyzes the sample complexity and computational complexity of inductive learning. There is a tradeoff between the expressiveness of the hypothesis language and the ease of learning.
Linear regression is a widely used model. The optimal parameters of a linear regression model can be found by gradient descent search, or computed exactly.
A linear classifier with a hard threshold - also known as a perceptron - can be trained by a simple weight update rule to fit data that are linearly separable. In other cases, the rule fails to converge.
Norvig I 759
History: The theory of PAC-learning was inaugurated by Leslie Valiant (1984)(1). His work stressed the importance of computational and sample complexity. With Michael Kearns (1990)(2), Valiant showed that several concept classes cannot be PAC-learned tractably, even though sufficient information is available in the examples. Some positive results were obtained for classes such as decision lists (Rivest, 1987)(3).
An independent tradition of sample-complexity analysis has existed in statistics, beginning with the work on uniform convergence theory (Vapnik and Chervonenkis, 1971)(4).
The so-called VC dimension provides a measure roughly analogous to, but more general than, the ln |H| measure obtained from PAC analysis. The VC dimension can be applied to continuous function classes, to which standard PAC analysis does not apply. PAC-learning theory and C theory were first connected by the “four Germans” (none of whom actually is German): Blumer, Ehrenfeucht, Haussler, and Warmuth (1989)(5).
Linear regression with squared error loss goes back to Legendre (1805)(6) and Gauss (1809)(7), who were both working on predicting orbits around the sun. The modern use of multivariate regression for machine learning is covered in texts such as Bishop (2007)(8). Ng 2004)(9) analyzed the differences between L1 and L2 regularization.
1. Valiant, L. (1984). A theory of the learnable. CACM, 27, 1134-1142.
2. Kearns, M. (1990). The Computational Complexity of Machine Learning. MIT Press.
3. Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3), 229-246.
4. Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264-280.
5. Blumer, A., Ehrenfeucht, A., Haussler, D., andWarmuth, M. (1989). Learnability and the Vapnik-
Chervonenkis dimension. JACM, 36(4), 929–965.
6. Legendre, A. M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes.
7. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium.
Sumtibus F. Perthes et I. H. Besser, Hamburg._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010