Peter Norvig on Prior Knowledge - Dictionary of Arguments
Norvig I 777
Prior knowledge/AI Research/Norvig/Russell: To understand the role of prior knowledge, we need to talk about the logical relationships among hypotheses, example descriptions, and classifications. Let Descriptions denote the conjunction of all the example descriptions in the training set, and let Classifications denote the conjunction of all the example classifications. Then a Hypothesis that “explains the observations” must satisfy the following property (recall that |= means “logically entails”):
Hypothesis ∧ Descriptions |= Classifications.
Entailment constraint: We call this kind of relationship an entailment constraint, in which Hypothesis is the “un-known.” Pure inductive learning means solving this constraint, where Hypothesis is drawn from some predefined hypothesis space. >Hypotheses/AI Research.
Software agents/knowledge/learning/Norvig: The modern approach is to design agents that already know something and are trying to learn some more. An autonomous learning agent that uses background knowledge must somehow obtain the background knowledge in the first place (…). This method must itself be a learning process. The agent’s life history will therefore be characterized by cumulative, or incremental, development.
Norvig I 778
Learning with background knowledge: allows much faster learning than one might expect from a pure induction program.
Explanation based learning/EBL: the entailment constraints satisfied by EBL are the following:
Hypothesis ∧ Descriptions |= Classifications
Background |= Hypothesis.
Norvig I 779
(…) it was initially thought to be a way to learn from examples. But because it requires that the background knowledge be sufficient to explain the hypothesis, which in turn explains the observations, the agent does not actually learn anything factually new from the example. The agent could have derived the example from what it already knew, although that might have required an unreasonable amount of computation. EBL is now viewed as a method for converting first-principles theories into useful, special purpose knowledge.
Relevance/observations/RBL: the prior knowledge background concerns the relevance of a set of features to the goal predicate. This knowledge, together with the observations, allows the agent to infer a new, general rule that explains the observations:
Hypothesis ∧ Descriptions |= Classifications ,
Background ∧ Descriptions ∧ Classifications |= Hypothesis.
We call this kind of generalization relevance-based learning, or RBL. (…) whereas RBL does make use of the content of the observations, it does not produce hypotheses that go beyond the logical content of the background knowledge and the observations. It is a deductive form of learning and cannot by itself account for the creation of new knowledge starting from scratch.
Background ∧ Hypothesis ∧ Descriptions |= Classifications.
That is, the background knowledge and the new hypothesis combine to explain the examples.
Knowledge-based inductive learning/KBIL algorithms: Algorithms that satisfy [the entailment] constraint are called knowledge-based inductive learning, or KBIL, algorithms. KBIL algorithms, (…) have been studied mainly in the field of inductive logic programming, or ILP.
Norvig I 780
Explanation-based learning: The basic idea of memo functions is to accumulate a database of input–output pairs; when the function is called, it first checks the database to see whether it can avoid solving the problem from scratch. Explanation-based learning takes this a good deal further, by creating general rules that cover an entire class of cases.
Norvig I 781
General rules: The basic idea behind EBL is first to construct an explanation of the observation using prior knowledge, and then to establish a definition of the class of cases for which the same explanation structure can be used. This definition provides the basis for a rule covering all of the cases in the class.
Explanation: The “explanation” can be a logical proof, but more generally it can be any reasoning or problem-solving process whose steps are well defined. The key is to be able to identify the necessary conditions for those same steps to apply to another case.
Norvig I 782
EBL: 1. Given an example, construct a proof that the goal predicate applies to the example using the available background knowledge.
Norvig I 783
2. In parallel, construct a generalized proof tree for the variabilized goal using the same inference steps as in the original proof.
3. Construct a new rule whose left-hand side consists of the leaves of the proof tree and whose right-hand side is the variabilized goal (after applying the necessary bindings from the generalized proof).
4. Drop any conditions from the left-hand side that are true regardless of the values of the variables in the goal.
Norvig I 794
Inverse resolution: Inverse resolution is based on the observation that if the example Classifications follow from Background ∧ Hypothesis ∧ Descriptions, then one must be able to prove this fact by resolution (because resolution is complete). If we can “run the proof backward,” then we can find a Hypothesis such that the proof goes through.
Norvig I 795
Inverse entailment: The idea is to change the entailment constraint
Background ∧ Hypothesis ∧ Descriptions |= Classifications
to the logically equivalent form
Background ∧ Descriptions ∧ ¬Classifications |= ¬Hypothesis.
Norvig I 796
An inverse resolution procedure that inverts a complete resolution strategy is, in principle, a complete algorithm for learning first-order theories. That is, if some unknown Hypothesis generates a set of examples, then an inverse resolution procedure can generate Hypothesis from the examples. This observation suggests an interesting possibility: Suppose that the available examples include a variety of trajectories of falling bodies. Would an inverse resolution program be theoretically capable of inferring the law of gravity? The answer is clearly yes, because the law of gravity allows one to explain the examples, given suitable background mathematics.
Norvig I 798
Literature: The current-best-hypothesis approach is an old idea in philosophy (Mill, 1843)(1). Early work in cognitive psychology also suggested that it is a natural form of concept learning in humans (Bruner et al., 1957)(2). In AI, the approach is most closely associated with the work of Patrick Winston, whose Ph.D. thesis (Winston, 1970)(3) addressed the problem of learning descriptions of complex objects.
Version space: The version space method (Mitchell, 1977(4), 1982(5)) takes a different approach, maintaining the set of all consistent hypotheses and eliminating thosefound to be inconsistent with new examples. The approach was used in the Meta-DENDRAL
Norvig I 799
expert system for chemistry (Buchanan and Mitchell, 1978)(6), and later in Mitchell’s (1983)(7) LEX system, which learns to solve calculus problems. A third influential thread was formed by the work of Michalski and colleagues on the AQ series of algorithms, which learned sets of logical rules (Michalski, 1969(8); Michalski et al., 1986(9)).
EBL: EBL had its roots in the techniques used by the STRIPS planner (Fikes et al., 1972)(10). When a plan was constructed, a generalized version of it was saved in a plan library and used in later planning as a macro-operator. Similar ideas appeared in Anderson’s ACT* architecture, under the heading of knowledge compilation (Anderson, 1983)(11), and in the SOAR architecture, as chunking (Laird et al., 1986)(12). Schema acquisition (DeJong, 1981)(13), analytical generalization (Mitchell, 1982)(5), and constraint-based generalization (Minton, 1984)(14) were immediate precursors of the rapid growth of interest in EBL stimulated by the papers of Mitchell et al. (1986)(15) and DeJong and Mooney (1986)(16). Hirsh (1987) introduced the EBL algorithm described in the text, showing how it could be incorporated directly into a logic programming system. Van Harmelen and Bundy (1988)(18) explain EBL as a variant of the partial evaluation method used in program analysis systems (Jones et al., 1993)(19).
VsEBL: Initial enthusiasm for EBL was tempered by Minton’s finding (1988)(20) that, without extensive
extra work, EBL could easily slow down a program significantly. Formal probabilistic analysis of the expected payoff of EBL can be found in Greiner (1989)(21) and Subramanian and Feldman (1990)(22). An excellent survey of early work on EBL appears in Dietterich (1990)(23).
Relevance: Relevance information in the form of functional dependencies was first developed in the database community, where it is used to structure large sets of attributes into manageable subsets. Functional dependencies were used for analogical reasoning by Carbonell and Collins (1973)(24) and rediscovered and given a full logical analysis by Davies and Russell (Davies, 1985(25); Davies and Russell, 1987(26)).
Prior knowledge: Their role as prior knowledge in inductive learning was explored by Russell and Grosof (1987)(27). The equivalence of determinations to a restricted-vocabulary hypothesis space was proved in Russell (1988)(28).
Learning: Learning algorithms for determinations and the improved performance obtained by RBDTL were first shown in the FOCUS algorithm, due to Almuallim and Dietterich (1991)(29). Tadepalli (1993)(30) describes a very ingenious algorithm for learning with determinations that shows large improvements in earning speed.
Inverse deduction: The idea that inductive learning can be performed by inverse deduction can be traced to W. S. Jevons (1874)(31) (…).
Computational investigations began with the remarkable Ph.D. thesis by
Norvig I 800
Gordon Plotkin (1971)(32) at Edinburgh. Although Plotkin developed many of the theorems and methods that are in current use in ILP, he was discouraged by some undecidability results for certain subproblems in induction. MIS (Shapiro, 1981)(33) reintroduced the problem of learning logic programs, but was seen mainly as a contribution to the theory of automated debugging.
Induction/rules: Work on rule induction, such as the ID3 (Quinlan, 1986)(34) and CN2 (Clark and Niblett, 1989)(35) systems, led to FOIL (Quinlan, 1990)(36), which for the first time allowed practical induction of relational rules.
Relational Learning: The field of relational learning was reinvigorated by Muggleton and Buntine (1988)(37), whose CIGOL program incorporated a slightly incomplete version of inverse resolution and was capable of generating new predicates. The inverse resolution method also appears in (Russell, 1986)(38), with a simple algorithm given in a footnote. The next major system was GOLEM (Muggleton and Feng, 1990)(39), which uses a covering algorithm based on Plotkin’s concept of relative least general generalization. ITOU (Rouveirol and Puget, 1989)(40) and CLINT (De Raedt, 1992)(41) were other systems of that era.
Natural language: More recently, PROGOL (Muggleton, 1995)(42) has taken a hybrid (top-down and bottom-up) approach to inverse entailment and has been applied to a number of practical problems, particularly in biology and natural language processing.
Uncertainty: Muggleton (2000)(43) describes an extension of PROGOL to handle uncertainty in the form of stochastic logic programs.
Inductive logic programming /ILP: A formal analysis of ILP methods appears in Muggleton (1991)(44), a large collection of papers in Muggleton (1992)(45), and a collection of techniques and applications in the book by Lavrauc and Duzeroski (1994)(46). Page and Srinivasan (2002)(47) give a more recent overview of the field’s history and challenges for the future. Early complexity results by Haussler (1989) suggested that learning first-order sentences was intractible. However, with better understanding of the importance of syntactic restrictions on clauses, positive results have been obtained even for clauses with recursion (Duzeroski et al., 1992)(48). Learnability results for ILP are surveyed by Kietz and Duzeroski (1994)(49) and Cohen and Page (1995)(50).
Discovery systems/VsILP: Although ILP now seems to be the dominant approach to constructive induction, it has not been the only approach taken. So-called discovery systems aim to model the process of scientific discovery of new concepts, usually by a direct search in the space of concept definitions. Doug Lenat’s Automated Mathematician, or AM (Davis and Lenat, 1982)(51), used discovery heuristics expressed as expert system rules to guide its search for concepts and conjectures in elementary number theory. Unlike most systems designed for mathematical reasoning, AM lacked a concept of proof and could only make conjectures. It rediscovered Goldbach’s conjecture and the Unique Prime Factorization theorem.
AM’s architecture was generalized in the EURISKO system (Lenat, 1983)(52) by adding a mechanism capable of rewriting the system’s own discovery heuristics. EURISKO was applied in a number of areas other than mathematical discovery, although with less success than AM. The methodology of AM and EURISKO has been controversial (Ritchie and Hanna, 1984; Lenat and Brown, 1984).
1. Mill, J. S. (1843). A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence, and Methods of Scientific Investigation. J.W. Parker, London.
2. Bruner, J. S., Goodnow, J. J., and Austin, G. A. (1957). A Study of Thinking. Wiley.
3. Winston, P. H. (1970). Learning structural descriptions from examples. Technical report MAC-TR-76,
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology.
4. Mitchell, T.M. (1977). Version spaces: A candidate elimination approach to rule learning. In IJCAI-77,
5. Mitchell, T. M. (1982). Generalization as search. AIJ, 18(2), 203–226.
6. Buchanan, B. G.,Mitchell, T.M., Smith, R. G., and Johnson, C. R. (1978). Models of learning systems.
In Encyclopedia of Computer Science and Technology, Vol. 11. Dekker.
7. Mitchell, T. M., Utgoff, P. E., and Banerji, R. (1983). Learning by experimentation: Acquiring and refining problem-solving heuristics. In Michalski, R. S., Carbonell, J. G., and Mitchell, T. M. (Eds.),
Machine Learning: An Artificial Intelligence Approach, pp. 163–190. Morgan Kaufmann.
8. Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proc.
First International Symposium on Information Processing, pp. 125–128.
9. Michalski, R. S.,Mozetic, I., Hong, J., and Lavrauc, N. (1986). The multi-purpose incremental learning
system AQ15 and its testing application to three medical domains. In AAAI-86, pp. 1041–1045.
10. Fikes, R. E., Hart, P. E., and Nilsson, N. J. (1972). Learning and executing generalized robot plans. AIJ, 3(4), 251–288.
11. Anderson, J. R. (1983). The Architecture of Cognition. Harvard University Press.
12. Laird, J., Rosenbloom, P. S., and Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1, 11–46.
13. DeJong, G. (1981). Generalizations based on explanations. In IJCAI-81, pp. 67–69.
14. Minton, S. (1984). Constraint-based generalization: Learning game-playing plans from single examples. In AAAI-84, pp. 251–254.
15. Mitchell, T. M., Keller, R., and Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, 1, 47–80.
16. DeJong, G. and Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, 1, 145–176.
17. Hirsh, H. (1987). Explanation-based generalization in a logic programming environment. In IJCAI-87.
18. van Harmelen, F. and Bundy, A. (1988). Explanation-based generalisation = partial evaluation.
AIJ, 36(3), 401–412.
19. Jones, N. D., Gomard, C. K., and Sestoft, P. (1993). Partial Evaluation and Automatic Program Generation. Prentice-Hall.
20. Minton, S. (1988). Quantitative results concerning the utility of explanation-based learning. In AAAI-88, pp. 564–569.
21. Greiner, R. (1989). Towards a formal analysis of EBL. In ICML-89, pp. 450–453.
22. Subramanian, D. and Feldman, R. (1990). The utility of EBL in recursive domain theories. In AAAI-90, Vol. 2, pp. 942–949.
23. Dietterich, T. (1990). Machine learning. Annual Review of Computer Science, 4, 255–306.
24. Carbonell, J. R. and Collins, A. M. (1973). Natural semantics in artificial intelligence. In IJCAI-73, pp.
25. Davies, T. R. (1985). Analogy. Informal note INCSLI- 85-4, Center for the Study of Language and
26. Davies, T. R. and Russell, S. J. (1987). A logical approach to reasoning by analogy. In IJCAI-87, Vol. 1,
27. Russell, S. J. and Grosof, B. (1987). A declarative approach to bias in concept learning. In AAAI-87.
28. Russell, S. J. (1988). Tree-structured bias. In AAAI-88, Vol. 2, pp. 641–645.
29. Almuallim, H. and Dietterich, T. (1991). Learning with many irrelevant features. In AAAI-91, Vol. 2,
30. Tadepalli, P. (1993). Learning from queries and examples with tree-structured bias. In ICML-93, pp.
31. Jevons, W. S. (1874). The Principles of Science. Routledge/Thoemmes Press, London.
32. Plotkin, G. (1971). Automatic Methods of Inductive Inference. Ph.D. thesis, Edinburgh University.
33. Shapiro, E. (1981). An algorithm that infers theories from facts. In IJCAI-81, p. 1064.
34. Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
35. Clark, P. and Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, 3, 261–283.
36. Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5(3), 239–266.
37. Muggleton, S. H. and Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In ICML-88, pp. 339–352.
38. Russell, S. J. (1986). A quantitative analysis of analogy by similarity. In AAAI-86, pp. 284–288.
39. Muggleton, S. H. and Feng, C. (1990). Efficient induction of logic programs. In Proc. Workshop on
Algorithmic Learning Theory, pp. 368–381.
40. Rouveirol, C. and Puget, J.-F. (1989). A simple and general solution for inverting resolution. In Proc.
European Working Session on Learning, pp. 201–210.
41. De Raedt, L. (1992). Interactive Theory Revision: An Inductive Logic Programming Approach. Academic Press.
42. Muggleton, S. H. (1995). Inverse entailment and Progol. New Generation Computing, 13(3-4), 245-
43. Muggleton, S. H. (2000). Learning stochastic logic programs. Proc. AAAI 2000 Workshop on Learning
Statistical Models from Relational Data.
44. Muggleton, S. H. (1991). Inductive logic programming. New Generation Computing, 8, 295–318.
45. Muggleton, S. H. (1992). Inductive Logic Programming. Academic Press.
46. Lavrauc, N. and Duzeroski, S. (1994). Inductive Logic Programming: Techniques and Applications.
47. Page, C. D. and Srinivasan, A. (2002). ILP: A short look back and a longer look forward. Submitted to
Journal of Machine Learning Research.
48. Duzeroski, S., Muggleton, S. H., and Russell, S. J. (1992). PAC-learnability of determinate logic programs. In COLT-92, pp. 128–135.
49. Kietz, J.-U. and Duzeroski, S. (1994). Inductive logic programming and learnability. SIGART Bulletin,
50. Cohen, W. W. and Page, C. D. (1995). Learnability in inductive logic programming: Methods and results. New Generation Computing, 13(3–4), 369-409.
51. Davis, R. and Lenat, D. B. (1982). Knowledge-Based Systems in Artificial Intelligence. McGraw-
52. Lenat, D. B. (1983). EURISKO: A program that learns new heuristics and domain concepts: The nature of heuristics, III: Program design and results. AIJ, 21(1–2), 61–98.
53. Ritchie, G. D. and Hanna, F. K. (1984). AM: A case study in AI methodology. AIJ, 23(3), 249–268.
54. Lenat, D. B. and Brown, J. S. (1984). Why AM and EURISKO appear to work. AIJ, 23(3), 269–294._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010