Peter Norvig on Information Retrieval - Dictionary of Arguments
Norvig I 867
Information Retrieval/Norvig/Russell: Information retrieval is the task of finding documents that are relevant to a user’s need for information. An information retrieval (IR) system can be characterized by
1. A corpus of documents. Each system must decide what it wants to treat as a document: a paragraph, a page, or a multipage text.
2. Queries posed in a query language. A query specifies what the user wants to know. The query language can be just a list of words, such as [AI book]; or it can specify a phrase of words that must be adjacent, as in [“AI book”]; it can contain Boolean operators as in [AI AND book]; it can include non-Boolean operators such as [AI NEAR book] or [AI book site: www.aaai.org].
3. A result set. This is the subset of documents that the IR system judges to be relevant to the query. By relevant, we mean likely to be of use to the person who posed the query, for the particular information need expressed in the query.
4. A presentation of the result set. This can be as simple as a ranked list of document titles or as complex as a rotating color map of the result set projected onto a three dimensional space, rendered as a two-dimensional display.
The earliest IR systems worked on a Boolean keyword model. Each word in the document collection is treated as a Boolean feature that is true of a document if the word occurs in the document and false if it does not.
Norvig I 868
VsBoolean model: Problems: First, the degree of relevance of a document is a single bit, so there is no guidance as to how to order the relevant documents for presentation. Second, Boolean expressions are unfamiliar to users who are not programmers or logicians. Third, it can be hard to formulate an appropriate query, even for a skilled user.
Solution: Most IR systems have abandoned the Boolean model and use models based on the statistics of word counts. A scoring function takes a document and a query and returns a numeric score; the most relevant documents have the highest scores.
Relevance/weight: Three factors affect the weight of a query term: First, the frequency with which a query term appears in a document (also known as TF for term frequency). For the query [farming in Kansas], documents that mention “farming” frequently will have higher scores. Second, the inverse document frequency of the term, or IDF. The word “in” appears in almost every document, so it has a high document frequency, and thus a low inverse document frequency, and thus it is not as important to the query as “farming” or “Kansas.” Third, the length of the document. A million-word document will probably mention all the query words, but may not actually be about the query. A short document that mentions all the words is a much better candidate.
Norvig I 870
Refinements of IR: One common refinement is a better model of the effect of document length on relevance. Singhal et al. (1996)(1) observed that simple document length normalization schemes tend to favor short documents too much and long documents not enough. They propose a pivoted document length normalization scheme; the idea is that the pivot is the document length at which the old-style normalization is correct; documents shorter than that get a boost and longer ones get a penalty.
Stemming algorithms: Most IR systems do case folding of “COUCH” to “couch,” and some use a stemming algorithm to reduce “couches” to the stem form “couch,” both in the query and the documents. This typically yields a small increase in recall (on the order of 2% for English). However, it can harm precision. For example, stemming “stocking” to “stock” will tend to decrease precision for queries about either foot coverings or financial instruments, although it could improve recall for queries about warehousing. Stemming algorithms based on rules (e.g., remove “-ing”) cannot avoid this problem, but algorithms based on dictionaries (don’t remove “-ing” if the word is already listed in the dictionary) can. While stemming has a small effect in English, it is more important in other languages.
Synomyms: The next step is to recognize synonyms, such as “sofa” for “couch.” As with stemming, this has the potential for small gains in recall, but can hurt precision. The problem is that “languages abhor absolute synonyms just as nature abhors a vacuum” (Cruse, 1986)(2). ((s) Cf. >Synonymy/Philosophical theories).
Metadata: As a final refinement, IR can be improved by considering metadata—data outside of the text of the document. Examples include human-supplied keywords and publication data. On the Web, hypertext links between documents are a crucial source of information.
Norvig I 884
History: The field of information retrieval is experiencing a regrowth in interest, sparked by the wide usage of Internet searching. Robertson (1977)(3) gives an early overview and introduces the probability ranking principle. Croft et al. (2009)(4) and Manning et al. (2008)(5) are the first textbooks to cover Web-based search as well as traditional IR. Hearst (2009)(6) covers user interfaces for Web search. The TREC conference, organized by the U.S. government’s National Institute of Standards and Technology (NIST), hosts an annual competition for IR systems and publishes proceedings with results. In the first seven years of the competition, performance roughly doubled. The most popular model for IR is the vector space model (Salton et al., 1975)(7). Salton’s work dominated the early years of the field. There are two alternative probabilistic models, one due to Ponte and Croft (1998)(8) and one by Maron and Kuhns (1960)(9) and Robertson and Sparck Jones (1976)(10). Lafferty and Zhai (2001)(11) show that the models are based on the same joint probability distribution, but that the choice of model has implications for training the parameters. Craswell et al. (2005)(12) describe the BM25 scoring function and Svore and Burges (2009)(13) describe how BM25 can be improved with a machine learning approach that incorporates click data—examples of past search queries and the results that were clicked on. Brin and Page (1998)(14) describe the PageRank algorithm and the implementation of a Web search engine. Kleinberg (1999)(15) describes the HITS algorithm. Silverstein et al. (1998)(16) investigate a log of a billion Web searches. The journal Information Retrieval and the proceedings of the annual SIGIR conference cover recent developments in the field.
1. Singhal, A., Buckley, C., and Mitra, M. (1996). Pivoted document length normalization. In SIGIR-96,
2. Cruse, D. A. (1986). Lexical Semantics. Cambridge University Press.
3. Robertson, S. E. (1977). The probability ranking principle in IR. J. Documentation, 33, 294–304.
4. Croft, B., Metzler, D., and Stroham, T. (2009). Search Engines: Information retrieval in Practice.
5. Manning, C., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
6. Hearst,M. A. (2009). Search User Interfaces. Cambridge University Press.
7. Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. CACM,
8. Ponte, J. and Croft, W. B. (1998). A language modeling approach to information retrieval. In SIGIR-98, pp. 275–281.
9. Maron, M. E. and Kuhns, J.-L. (1960). On relevance, probabilistic indexing and information retrieval.
CACM, 7, 219–244.
10. Robertson, S. E. and Sparck Jones, K. (1976). Relevance weighting of search terms. J. American Society for Information Science, 27, 129–146.
11. Lafferty, J. and Zhai, C. (2001). Probabilistic relevance models based on document and query generation. In Proc. Workshop on Language Modeling and Information Retrieval.
12. Craswell, N., Zaragoza, H., and Robertson, S. E. (2005). Microsoft Cambridge at trec-14: Enterprise track. In Proc. Fourteenth Text Retrieval Conference.
13. Svore, K. and Burges, C. (2009). A machine learning approach for improved bm25 retrieval. In
Proc. Conference on Information Knowledge Management.
14. Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc.
Seventh World Wide Web Conference.
15. Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. JACM, 46(5), 604–632.
16. Silverstein, C., Henzinger, M., Marais, H., and Moricz,M. (1998). Analysis of a very large altavista
query log. Tech. rep. 1998-014, Digital Systems Research Center._____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010