|Complex: a complex is composed of components that can be distinguished from each other and are relatively autonomous. Complex behavior refers to systems that consist of several components. The relative independence of the components is manifested in their behavior. Relative autonomy of the components is determined by the description of the complex as a whole._____________Annotation: The above characterizations of concepts are neither definitions nor exhausting presentations of problems related to them. Instead, they are intended to give a short introduction to the contributions below. – Lexicon of Arguments. |
Peter Norvig on Complexes/Complexity - Dictionary of Arguments
Norvig I 712
Complexity/AI Research/Norvig/Russell: [one way of reducing complexity is] model selection with cross-validation on model size. An alternative approach is to search for a hypothesis that directly minimizes the weighted sum of
Norvig I 713
empirical loss and the complexity of the hypothesis, which we will call the total cost:
Cost (h) = EmpLoss(h) + λ Complexity (h)
ˆh ∗ = argmin Cost (h)/h∈H.
Here λ is a parameter, a positive number that serves as a conversion rate between loss and hypothesis complexity (which after all are not measured on the same scale). This approach combines loss and complexity into one metric, allowing us to find the best hypothesis all at once.
Regularization: This process of explicitly penalizing complex hypotheses is called regularization (because it looks for a function that is more regular, or less complex). Note that the cost function requires us to make two choices: the loss function and the complexity measure, which is called a regularization function. The choice of regularization function depends on the hypothesis space.
Another way to simplify models is to reduce the dimensions that the models work with. A process of feature selection can be performed to discard attributes that appear to be irrelevant. Χ2 pruning is a kind of feature selection.
MDL: The minimum description length or MDL hypothesis minimizes the total number of bits required. VsMDL: This works well in the limit, but for smaller problems there is a difficulty in that the choice of encoding for the program - for example, how best to encode a decision tree as a bit string - affects the outcome. >Learning theory/Norvig, >Learning/AI Research.
Norvig I 759
History: Whereas the identification-in-the-limit approach concentrates on eventual convergence, the study of Kolmogorov complexity or algorithmic complexity, developed independently by Solomonoff (1964(1), 2009(2)) and Kolmogorov (1965)(3), attempts to provide a formal definition for the notion of simplicity used in Ockham’s razor. To escape the problem that simplicity depends on the way in which information is represented, it is proposed that simplicity be measured by the length of the shortest program for a universal Turing machine that correctly reproduces the observed data.
Although there are many possible universal Turing machines, and hence many possible “shortest” programs, these programs differ in length by at most a constant that is independent of the amount of data. This beautiful insight, which essentially shows that any initial representation bias will eventually be overcome by the data itself, is marred only by the undecidability of computing the length of the shortest program. Approximate measures such as the minimum description length, or MDL (Rissanen, 1984(4), 2007(5)) can be used instead and have produced excellent results in practice. The text by Li and Vitanyi (1993)(6) is the best source for Kolmogorov complexity.
Norvig I 762
The complexity of neural network learning has been investigated by researchers in computational learning theory. Early computational results were obtained by Judd (1990)(7), who showed that the general problem of finding a set of weights consistent with a set of examples is NP-complete, even under very restrictive assumptions. Some of the first sample complexity results were obtained by Baum and Haussler (1989)(8), who showed that the number of examples required for effective learning grows as roughly W logW, where W is the number of weights. Since then, a much more sophisticated theory has been developed (Anthony and Bartlett, 1999)(9), including the important result that the representational capacity of a network depends on the size of the weights as well as on their number, a result that should not be surprising in the light of our discussion of regularization.
1. Solomonoff, R. J. (1964). A formal theory of inductive inference. Information and Control, 7, 1–22,
2. Solomonoff, R. J. (2009). Algorithmic probability-theory and applications. In Emmert-Streib, F. and
Dehmer, M. (Eds.), Information Theory and Statistical Learning. Springer.
3. Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems in Information Transmission, 1(1), 1–7.
4. Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, IT-30(4), 629-636.
5. Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer.
6. Li, M. and Vitanyi, P. M. B. (1993). An Introduction to Kolmogorov Complexity and Its Applications.
7. Judd, J. S. (1990). Neural Network Design and the Complexity of Learning. MIT Press.
8. Baum, E. and Haussler, D. (1989). What size net gives valid generalization? Neural Computation,
9. Anthony, M. and Bartlett, P. (1999). Neural Network Learning: Theoretical Foundations. Cambridge
University Press._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010