|Norvig I 873
Information extraction/AI Research/Norvig/Russell: Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of a particular class of object and for relationships among objects. A typical task is to extract instances of addresses from Web pages, with database fields for street, city, state, and zip code; (…). In a limited domain, this can be done with high accuracy. As the domain gets more general, more complex linguistic models and more complex learning techniques are necessary.
Norvig I 874
A. Finite-state template-based information extraction:
Attribute-based extraction system: (…) assumes that the entire text refers to a single object and that the task is to extract attributes of that object, e.g., manufacturer, product, and price.
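A minimal sketch of attribute-based extraction with regular-expression templates. The templates, the product text, and all names here are invented for illustration; a real system would use many templates per attribute and a richer matching machinery.

```python
import re

# Hypothetical attribute templates for a single product listing (illustration
# only; not taken from the source text).
TEMPLATES = {
    "manufacturer": re.compile(r"by ([A-Z][A-Za-z]+)"),
    "product": re.compile(r"the ([A-Z][A-Za-z0-9 ]+?) laptop"),
    "price": re.compile(r"\$([0-9][0-9,.]*)"),
}

def extract_attributes(text):
    """Assume the entire text describes one object; pull one value per attribute."""
    result = {}
    for attr, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            result[attr] = m.group(1)
    return result

print(extract_attributes("Buy the Airbook 3000 laptop by Exemplar for $1,299.00 today."))
```

In a limited domain such as a single retailer's product pages, this kind of template matching can already reach high accuracy, which is the point made above.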
Relational extraction systems: deal with multiple objects and the relations among them.
Norvig I 875
A relational extraction system can be built as a series of cascaded finite-state transducers. E.g., FASTUS consists of five stages: 1. Tokenization, 2. Complex-word handling, 3. Basic-group handling,
4. Complex-phrase handling, 5. Structure merging.
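The cascade idea can be sketched as a chain of passes, each consuming the previous stage's output. The stage logic below is invented for illustration (real FASTUS stages are far richer finite-state transducers), and only the first three stages are mimicked; the "Bridgestone Sports" example echoes the one used in the chapter.

```python
# Toy sketch of a cascaded pipeline in the spirit of FASTUS (stages 4-5 omitted).

def tokenize(text):
    """Stage 1: split raw text into tokens."""
    return text.split()

def complex_words(tokens):
    """Stage 2: merge fixed multiword names (here, one hard-coded lexicon entry)."""
    merged, skip = [], False
    for a, b in zip(tokens, tokens[1:] + [""]):
        if skip:
            skip = False
            continue
        if (a, b) == ("Bridgestone", "Sports"):   # assumed lexicon entry
            merged.append("Bridgestone_Sports")
            skip = True
        else:
            merged.append(a)
    return merged

def basic_groups(tokens):
    """Stage 3: tag noun groups (NG) vs. verb groups (VG), crudely approximated
    here by a tiny closed verb list."""
    verbs = {"announced", "formed"}
    return [("VG" if t in verbs else "NG", t) for t in tokens]

def run_cascade(text, stages):
    """Feed each stage's output into the next."""
    data = text
    for stage in stages:
        data = stage(data)
    return data

print(run_cascade("Bridgestone Sports announced a venture",
                  [tokenize, complex_words, basic_groups]))
```

The point of the cascade is that each stage is a simple finite-state pass; complexity comes from composition, not from any single stage.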
Norvig I 876
B. Probabilistic models for information extraction:
When information extraction must be attempted from noisy or varied input, (…) it is better to use a probabilistic model rather than a rule-based model. The simplest probabilistic model for sequences with hidden state is the hidden Markov model, or HMM. (Cf. >Bayesian networks, >Statistical learning.) (…) an HMM models a progression through a sequence of hidden states, x_t, with an observation e_t at each step. To apply HMMs to information extraction, [one] can either build one big HMM for all the attributes or build a separate HMM for each attribute. HMMs are probabilistic, and thus tolerant to noise. (…) with HMMs there is graceful degradation with missing characters/words, and [one] get[s] a probability indicating the degree of match, not just a Boolean match/fail.
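A minimal sketch of HMM decoding for extraction: two hidden states (background text vs. a "speaker" field) and Viterbi decoding of the most likely state sequence. All probabilities are hand-set toy numbers (not normalized, not learned), and the tiny vocabulary is invented; the point is only the mechanics.

```python
import math

# Hidden states: background (BG) vs. speaker-field (SPK) text.
states = ["BG", "SPK"]
start = {"BG": 0.8, "SPK": 0.2}
trans = {"BG": {"BG": 0.7, "SPK": 0.3}, "SPK": {"BG": 0.4, "SPK": 0.6}}
# Toy emission probabilities with a small default for unseen words -- this
# default is what gives the graceful degradation mentioned above.
emit = {"BG": {"talk": 0.2, "by": 0.2, "today": 0.2},
        "SPK": {"Dr.": 0.3, "Smith": 0.3}}
DEFAULT = 0.01

def viterbi(tokens):
    """Most likely hidden-state sequence (log space to avoid underflow)."""
    V = [{s: math.log(start[s]) + math.log(emit[s].get(tokens[0], DEFAULT))
          for s in states}]
    back = []
    for t in tokens[1:]:
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = (V[-1][best_prev] + math.log(trans[best_prev][s])
                      + math.log(emit[s].get(t, DEFAULT)))
            ptr[s] = best_prev
        V.append(row)
        back.append(ptr)
    # Trace the best path backwards through the stored pointers.
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["talk", "by", "Dr.", "Smith", "today"]))
```

The decoder tags "Dr. Smith" as the speaker field; the per-sequence log score in `V` is what yields a degree of match rather than a Boolean match/fail.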
Norvig I 877
(…) HMMs can be trained from data; they don’t require laborious engineering of templates, and thus they can more easily be kept up to date as text changes over time.
Norvig I 878
VsHMMs: Problem: One issue with HMMs for the information extraction task is that they model a lot of probabilities that we don’t really need. An HMM is a generative model; it models the full joint probability of observations and hidden states, and thus can be used to generate samples. That is, we can use the HMM model not only to parse a text and recover the speaker and date, but also to generate a random instance of a text containing a speaker and a date.
Solution: All we need in order to understand a text is a discriminative model, one that models the conditional probability of the hidden attributes given the observations (the text).
Conditional random field: We don’t need the independence assumptions of the Markov model - we can have an x_t that depends on x_1. A framework for this type of model is the conditional random field, or CRF, which models a conditional probability distribution of a set of target variables given a set of observed variables. Like Bayesian networks, CRFs can represent many different structures of dependencies among the variables.
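A tiny linear-chain CRF, evaluated by brute-force enumeration, illustrates the discriminative idea: only P(labels | tokens) is modeled, and feature functions may inspect the whole observation sequence. The feature functions and weights below are hand-set for illustration; a real CRF learns the weights from labeled data and normalizes with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ("BG", "SPK")  # background vs. speaker-field label

def features(labels, tokens, t):
    """Feature functions at position t; they can look at any token."""
    return [
        1.0 if labels[t] == "SPK" and tokens[t][0].isupper() else 0.0,
        1.0 if labels[t] == "SPK" and t > 0 and tokens[t - 1] == "by" else 0.0,
        1.0 if labels[t] == "BG" and tokens[t][0].islower() else 0.0,
        1.0 if t > 0 and labels[t - 1] == "SPK" and labels[t] == "SPK" else 0.0,
    ]

WEIGHTS = [1.5, 2.0, 2.0, 1.0]  # hand-set; normally learned

def score(labels, tokens):
    """Unnormalized log-linear score: weighted feature sum over positions."""
    return sum(w * f for t in range(len(tokens))
               for w, f in zip(WEIGHTS, features(labels, tokens, t)))

def conditional_prob(labels, tokens):
    """P(labels | tokens): normalize over every possible label sequence."""
    Z = sum(math.exp(score(ls, tokens))
            for ls in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(labels, tokens)) / Z

tokens = ["talk", "by", "Ada", "Lovelace"]
best = max(itertools.product(LABELS, repeat=len(tokens)),
           key=lambda ls: score(ls, tokens))
print(best, conditional_prob(best, tokens))
```

Note that nothing here models P(tokens): the model cannot generate text, only label it, which is exactly the generative/discriminative contrast drawn above.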
Norvig I 879
Ontology extraction: [different from] information extraction as finding a specific set of relations (e.g., speaker, time, location) in a specific text (e.g., a talk announcement) (…) [ontology extraction] is building a large knowledge base or ontology of facts from a corpus. This is different in three ways: First, it is open-ended - we want to acquire facts about all types of domains, not just one specific domain. Second, with a large corpus, this task is dominated by precision, not recall - just as with >question answering on the Web (…). Third, the results can be statistical aggregates gathered from multiple sources, rather than being extracted from one specific text. E.g., Hearst (1992)(1) looked at the problem of learning an ontology of concept categories and subcategories from a large corpus. The work concentrated on templates that are very general (not tied to a specific domain) and have high precision (are
Norvig I 880
almost always correct when they match) but low recall (do not always match). Here is one of the most productive templates: NP such as NP (, NP)* (,)? ((and | or) NP)? .
Here [“such as”, “and”, “or”] and commas must appear literally in the text, but the parentheses are for grouping, the asterisk means repetition of zero or more, and the question mark means optional.
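The template can be rendered directly as a regular expression. Here NP is crudely approximated by a run of capitalized words (an assumption made for this sketch; a real system would use a parser's noun-phrase chunks), and the example sentence is invented.

```python
import re

# "NP such as NP (, NP)* (,)? ((and | or) NP)?" as a regex.
NP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"   # crude stand-in for a noun phrase
HEARST = re.compile(rf"({NP}) such as ({NP}(?:, {NP})*,?(?: (?:and|or) {NP})?)")

def hyponyms(text):
    """Return (category, [members]) pairs found by the template."""
    pairs = []
    for m in HEARST.finditer(text):
        members = re.split(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", m.group(2))
        pairs.append((m.group(1), [x for x in members if x]))
    return pairs

print(hyponyms("Composers such as Bach, Mozart, and Beethoven wrote symphonies."))
```

Each match asserts that the members are instances of the category, which is how such templates populate an ontology of categories and subcategories.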
Problems: The biggest weakness in this approach is the sensitivity to noise. If one of the first few templates is incorrect, errors can propagate quickly. One way to limit this problem is to not accept a new example unless it is verified by multiple templates, and not accept a new template unless it discovers multiple examples that are also found by other templates.
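The cross-verification idea can be sketched with two filters: keep a candidate fact only if several distinct templates found it, and keep a template only if it rediscovers facts that other templates also found. The template names, facts, and thresholds below are invented for illustration.

```python
def verified_facts(matches, min_templates=2):
    """Keep facts supported by at least min_templates distinct templates.
    matches: iterable of (template_id, fact) pairs."""
    support = {}
    for template, fact in matches:
        support.setdefault(fact, set()).add(template)
    return {f for f, ts in support.items() if len(ts) >= min_templates}

def verified_templates(matches, good_facts, min_overlap=2):
    """Accept a template only if it found several independently verified facts."""
    found = {}
    for template, fact in matches:
        found.setdefault(template, set()).add(fact)
    return {t for t, fs in found.items() if len(fs & good_facts) >= min_overlap}

matches = [
    ("such_as", ("Composer", "Bach")),
    ("including", ("Composer", "Bach")),
    ("such_as", ("Composer", "Mozart")),
    ("including", ("Composer", "Mozart")),
    ("noisy_tpl", ("Composer", "Pluto")),   # spurious match from a bad template
]
facts = verified_facts(matches)
print(sorted(facts))
print(sorted(verified_templates(matches, facts)))
```

The spurious fact and the noisy template are both rejected because neither is corroborated, which is how this scheme keeps early errors from propagating.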
Machine reading: (…) a system that could read on its own and build up its own database. Such a system would be relation-independent; it would work for any relation. In practice, these systems work on all relations in parallel, because of the I/O demands of large corpora. They behave less like a traditional information extraction system that is targeted at a few relations and more like a human reader who learns from the text itself; because of this, the field has been called machine reading.
A representative machine-reading system is TEXTRUNNER (Banko and Etzioni, 2008)(2).
TEXTRUNNER uses cotraining to boost its performance, but it needs something to bootstrap from. In the case of Hearst (1992)(1), specific patterns (e.g., “such as”) provided the bootstrap, and for Brin (1998)(3), it was a set of five author-title pairs.
Norvig I 884
Early information extraction programs include GUS (Bobrow et al., 1977)(4) and FRUMP (DeJong, 1982)(5). Recent information extraction has been pushed forward by the annual Message Understanding Conferences (MUC), sponsored by the U.S. government. The FASTUS finite-state system was done by Hobbs et al. (1997)(6). It was based in part on the idea from Pereira and Wright (1991)(7) of using FSAs as approximations to phrase-structure grammars. Surveys of template-based systems are given by Roche and Schabes (1997)(8) and Appelt (1999)(9).
Norvig I 885
Freitag and McCallum (2000)(10) discuss HMMs for information extraction. CRFs were introduced by Lafferty et al. (2001)(11); an example of their use for information extraction is described by McCallum (2003)(12), and a tutorial with practical guidance is given by Sutton and McCallum (2007)(13). Sarawagi (2007)(14) gives a comprehensive survey.
1. Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In COLING-92.
2. Banko, M. and Etzioni, O. (2008). The tradeoffs between open and traditional relation extraction. In ACL-08, pp. 28–36.
3. Brin, S. (1998). Extracting patterns and relations from the World Wide Web. In WebDB Workshop at EDBT ’98.
4. Bobrow, D. G., Kaplan, R., Kay, M., Norman, D. A., Thompson, H., and Winograd, T. (1977). GUS, a frame-driven dialog system. AIJ, 8, 155–173.
5. DeJong, G. (1982). An overview of the FRUMP system. In Lehnert,W. and Ringle,M. (Eds.), Strategies for Natural Language Processing, pp. 149–176. Lawrence Erlbaum.
6. Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyama, M., Stickel, M. E., and Tyson, M. (1997). FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Roche, E. and Schabes, Y. (Eds.), Finite-State Devices for Natural Language Processing, pp. 383–406. MIT Press.
7. Pereira, F. and Wright, R. N. (1991). Finite-state approximation of phrase structure grammars. In ACL-91, pp. 246–255.
8. Roche, E. and Schabes, Y. (1997). Finite-State Language Processing (Language, Speech and Communication). Bradford Books.
9. Appelt, D. (1999). Introduction to information extraction. AI Communications, 12(3), 161–172.
10. Freitag, D. and McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. In AAAI-00.
11. Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML-01.
12. McCallum, A. (2003). Efficiently inducing features of conditional random fields. In UAI-03.
13. Sutton, C. and McCallum, A. (2007). An introduction to conditional random fields for relational learning. In Getoor, L. and Taskar, B. (Eds.), Introduction to Statistical Relational Learning. MIT Press.
14. Sarawagi, S. (2007). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
_____________
Explanation of symbols: Roman numerals indicate the source, arabic numerals indicate the page number. The corresponding books are indicated on the right hand side. ((s)…): Comment by the sender of the contribution. The note [Author1]Vs[Author2] or [Author]Vs[term] is an addition from the Dictionary of Arguments. If a German edition is specified, the page numbers refer to this edition.
Stuart J. Russell & Peter Norvig
Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ 2010