Peter Norvig on PageRank Algorithm - Dictionary of Arguments
Norvig I 870
PageRank Algorithm/Norvig/Russell: PageRank was invented to solve the problem of the tyranny of TF scores: if the query is [IBM], how do we make sure that IBM’s home page, ibm.com, is the first result, even if another page mentions the term “IBM” more frequently? The idea is that ibm.com has many in-links (links to the page), so it should be ranked higher: each in-link is a vote for the quality of the linked-to page. But if we only counted in-links, then it would be possible for a Web spammer to create a network of pages and have them all point to a page of his choosing, increasing the score of that page. Therefore, the PageRank algorithm is designed to weight links from high-quality sites more heavily. What is a high quality site? One that is linked to by other high-quality sites. The definition is recursive, but (…) the recursion bottoms out properly. The PageRank for a page p is defined as:
PR(p) = 1 – d/N + d ∑i PR(ini)/ C ini)
PR(p): the PageRank of page p
N: the toal number of pages in the corpus,
Ini: are the pages that link in to p, and
C(ini): is the count of the total number of out-links on page in i.
D: The constant d is a damping factor.
Random surfer: It can be understood through the random surfer model: imagine a Web surfer who starts at some random page and begins exploring. With probability d (we’ll assume d=0.85) the surfer clicks on one of the links on the page (choosing uniformly among them), and with probability 1 − d she gets bored with the page and restarts on a random page anywhere on the Web. The PageRank of page p is then the probability that the random surfer will be at page p at any point in time. PageRank can be computed by an iterative procedure: start with all pages having PR(p)=1, and iterate the algorithm, updating ranks until they converge.
Norvig I 972
HITS Algorithm: The Hyperlink-Induced Topic Search algorithm, also known as “Hubs and Authorities” or HITS, is another influential link-analysis algorithm (…). HITS differs from PageRank in several ways. First, it is a query-dependent measure: it rates pages with respect to a query. That means that it must be computed anew for each query - a computational burden that most search engines have elected not to take on. Given a query, HITS first finds a set of pages that are relevant to the query. It does that by intersecting hit lists of query words, and then adding pages in the link neighborhood of these pages—pages that link to or are linked from one of the pages in the original relevant set.
Each page in this set is considered AUTHORITY an authority on the query to the degree that other
HUB pages in the relevant set point to it. A page is considered a hub to the degree that it points to other authoritative pages in the relevant set. Just as with PageRank, we don’t want to merely count the number of links; we want to give more value to the high-quality hubs and authorities. Thus, as with PageRank, we iterate a process that updates the authority score of a page to be the sum of the hub scores of the pages that point to it, and the hub score to be the sum of the authority scores of the pages it points to. If we then normalize the scores and repeat k times, the process will converge.
Both PageRank and HITS played important roles in developing our understanding of Web >Information retrieval, >Question answering.
Norvig I 884
Literature: Brin and Page (1998)(1) describe the PageRank algorithm and the implementation of a
Web search engine. Kleinberg (1999)(2) describes the HITS algorithm. Silverstein et al. (1998)(3) investigate a log of a billion Web searches. The journal Information Retrieval and the proceedings of the annual SIGIR conference cover recent developments in the field.
1. Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc.
Seventh World Wide Web Conference.
2. Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. JACM, 46(5), 604–632.
3. Silverstein, C., Henzinger, M., Marais, H., and Moricz,M. (1998). Analysis of a very large altavista query log. Tech. rep. 1998-014, Digital Systems Research Center_____________Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. Translations: Dictionary of Arguments The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell
Artificial Intelligence: A Modern Approach Upper Saddle River, NJ 2010