Peter Norvig on Language Models - Dictionary of Arguments
Norvig I 860
Language models/Norvig/Russell: language models [are] models that predict the probability distribution of language expressions.
Norvig I 861
Ultimately, a written text is composed of characters - letters, digits, punctuation, and spaces in English (and more exotic characters in some other languages). Thus, one of the simplest language models is a probability distribution over sequences of characters.
N-gram character models: A sequence of written symbols of length n is called an n-gram (from the Greek root for writing or letters), with special case “unigram” for 1-gram, “bigram” for 2-gram, and “trigram” for 3-gram. A model of the probability distribution of n-letter sequences is thus called an n-gram model. (But be careful: we can have n-gram models over sequences of words, syllables, or other units; not just over characters.) An n-gram model is defined as a Markov chain of order n − 1. (…)in a Markov chain the probability of character ci depends only on the immediately preceding characters, not on any other characters.
Norvig I 863
We can evaluate a model with cross-validation. Split the corpus into a training corpus and a validation corpus.
Determine the parameters of the model from the training data. Then evaluate the model on the validation corpus. The evaluation can be a task-specific metric, such as measuring accuracy on language identification. Alternatively we can have a task-independent model of language quality: calculate the probability assigned to the validation corpus by the model; the higher the probability the better.
Norvig I 864
N-gram models over words: The main difference is that the vocabulary - the set of symbols that make up the corpus and the model—is larger. There are only about 100 characters in most languages, and sometimes we build character models that are even more restrictive, for example by treating “A” and “a” as the same symbol or by treating all punctuation as the same symbol. But with word models we have at least tens of thousands of symbols, and sometimes millions. The wide range is because it is not clear what constitutes a word. In English a sequence of letters surrounded by spaces is a word, but in some languages, like Chinese, words are not separated by spaces, (…).Word n-gram models need to deal with out of vocabulary words. With character models, we didn’t have to worry about someone inventing a new letter of the alphabet. But with word models there is always the chance of a new word that was not seen in the training corpus, so we need to model that explicitly in our language model. ((s) Cf. >Vocabulary/philosophical theories.) >Data compression.
Norvig I 883
History: N-gram letter models for language modeling were proposed by Markov (1913)(1). Claude Shannon (Shannon and Weaver, 1949)(2) was the first to generate n-gram word models of English.
Chomsky (1956(3), 1957(4)) pointed out the limitations of finite-state models compared with context-free models, concluding, “Probabilistic models give no particular insight into some of the basic problems of syntactic structure.” This is true, but probabilistic models do provide insight into some other basic problems—problems that context-free models ignore. Chomsky’s remarks had the unfortunate effect of scaring many people away from statistical models for two decades, until these models reemerged for use in speech recognition (Jelinek, 1976)(5). Kessler et al. (1997)(6) show how to apply character n-gram models to genre classification, and Klein et al. (2003)(7) describe named-entity recognition with character models. Franz and Brants (2006)(8) describe the Google n-gram corpus of 13 million unique words from a trillion words of Web text; it is now publicly available. The bag of words model gets its name from a passage from linguist Zellig Harris (1954)(9), “language is not merely a bag of words but a tool with particular properties.” Norvig (2009)(10) gives some examples of tasks that can be accomplished with n-gram models.
Simple n-gram letter and word models are not the only possible probabilistic models. Blei et al. (2001)(11) describe a probabilistic text model called latent Dirichlet allocation that views a document as a mixture of topics, each with its own distribution of words. This model can be seen as an extension and rationalization of the latent semantic indexing model of (Deerwester et al., 1990)(12) (see also Papadimitriou et al. (1998)(13)) and is also related to the multiple-cause mixture model of (Sahami et al., 1996)(14).
1. Markov, A. A. (1913). An example of statistical investigation in the text of “Eugene Onegin” illustrating coupling of “tests” in chains. Proc. Academy of Sciences of St. Petersburg, 7.
2. Shannon, C. E. and Weaver, W. (1949). The Mathematical Theory of Communication. University of
3. Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information
Theory, 2(3), 113–124.
4. Chomsky, N. (1957). Syntactic Structures. Mouton.
5. Jelinek, F. (1976). Continuous speech recognition by statistical methods. Proc. IEEE, 64(4), 532–556.
6. Kessler, B., Nunberg, G., and Schütze, H. (1997). Automatic detection of text genre. CoRR, cmplg/
7. Klein, D., Smarr, J., Nguyen, H., and Manning, C. (2003). Named entity recognition with character level models. In Conference on Natural Language Learning (CoNLL).
8. Franz, A. and Brants, T. (2006). All our n-gram are belong to you. Blog posting.
9. Harris, Z. (1954). Distributional structure. Word, 10(2/3).
10. Norvig, P. (2009). Natural language corpus data. In Segaran, T. and Hammerbacher, J. (Eds.), Beautiful Data. O’Reilly.
11. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2001). Latent Dirichlet Allocation. In Neural Information
Processing Systems, Vol. 14.
12. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. J. American Society for Information Science, 41(6), 391–407.
13. Papadimitriou, C. H., Tamaki, H., Raghavan, P., and Vempala, S. (1998). Latent semantic indexing:
A probabilistic analysis. In PODS-98, pp. 159–168.
14. Sahami, M., Hearst, M. A., and Saund, E. (1996). Applying the multiple cause mixture model to text
categorization. In ICML-96, pp. 435–443._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010