A Trainable Approach to Coreference Resolution for Information Extraction

Joseph F. McCarthy
Ph.D. Dissertation, 1996, University of Massachusetts
[Paper (PDF)]

Abstract:

This dissertation presents a new approach to solving the coreference resolution problem for a natural language processing (NLP) task known as information extraction. It describes a new system, named RESOLVE, that uses machine learning techniques to determine when two phrases in a text co-refer, i.e., refer to the same thing. RESOLVE can be used as a component within an information extraction system - a system that extracts information automatically from a corpus of texts that all focus on the same topic area - or it can be used as a stand-alone system to evaluate the relative contribution of different types of knowledge to the coreference resolution process.

RESOLVE represents an improvement over previous approaches to the coreference resolution problem, in that it uses a machine learning algorithm to handle some of the work that had previously been performed manually by a knowledge engineer. RESOLVE can achieve performance as good as that of a system manually constructed for the same task, when both systems are given access to the same knowledge and tested against the same data.

The machine learning algorithm used by RESOLVE can be given access to different types of knowledge, some portions of which are very specific to a particular topic area or domain, while other portions are more general or domain-independent. An ablation experiment shows that domain-specific knowledge is very important to coreference resolution: performance degrades significantly more when the domain-specific features are disabled than when a similarly-sized set of domain-independent features is disabled.
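
The shape of such an ablation experiment can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the dissertation: the feature names, the feature groups, the weights, and the `toy_evaluate` scoring function are all hypothetical stand-ins, and the numbers are toy values rather than reported results. In a real experiment, the evaluation function would retrain and test the learner on each reduced feature set.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def ablation_report(
    all_features: Iterable[str],
    groups: Dict[str, Iterable[str]],
    evaluate: Callable[[FrozenSet[str]], int],
) -> Dict[str, int]:
    """Return, for each feature group, the score lost when it is disabled."""
    full = frozenset(all_features)
    baseline = evaluate(full)
    report = {}
    for name, group in groups.items():
        remaining = full - frozenset(group)  # disable one group at a time
        report[name] = baseline - evaluate(remaining)
    return report


# Hypothetical feature names and weights (assumptions, not from the source).
TOY_WEIGHTS = {
    "ds_feature_a": 6,  # stand-ins for domain-specific features
    "ds_feature_b": 5,
    "di_feature_a": 3,  # stand-ins for domain-independent features
    "di_feature_b": 2,
}


def toy_evaluate(enabled: FrozenSet[str]) -> int:
    """Toy score: sum of weights of enabled features (stands in for train + test)."""
    return sum(TOY_WEIGHTS[f] for f in enabled)


# Two similarly-sized groups, so the degradations are comparable.
groups = {
    "domain_specific": ["ds_feature_a", "ds_feature_b"],
    "domain_independent": ["di_feature_a", "di_feature_b"],
}

report = ablation_report(TOY_WEIGHTS.keys(), groups, toy_evaluate)
for name, drop in report.items():
    print(f"disabling {name}: score drops by {drop}")
```

Comparing the drops for equally-sized groups is what makes the comparison fair: a larger drop indicates that the disabled group carried more of the performance.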

However, even though domain-specific knowledge is important for coreference resolution, domain-independent features alone enable RESOLVE to achieve 80% of the performance it achieves when domain-specific features are available. One explanation for why domain-independent knowledge can be used so effectively is illustrated in another domain, where the machine learning algorithm discovers domain-specific knowledge by assembling domain-independent features into domain-specific patterns. This ability of RESOLVE to compensate for missing or insufficient domain-specific knowledge is a significant advantage for redeploying the system in new domains.