Abstract:
This dissertation presents a new
approach to solving the coreference resolution problem for
a natural language processing (NLP) task known as
information extraction. It describes a new system, named
RESOLVE, that uses machine learning techniques to
determine when two phrases in a text co-refer, i.e., refer
to the same thing. RESOLVE can be used as a component
within an information extraction system -- a system that
extracts information automatically from a corpus of texts
that all focus on the same topic area -- or it can be used
as a stand-alone system to evaluate the relative
contribution of different types of knowledge to the
coreference resolution process.
RESOLVE represents an improvement over
previous approaches to the coreference resolution problem,
in that it uses a machine learning algorithm to handle
some of the work that had previously been performed
manually by a knowledge engineer. RESOLVE can achieve
performance that is as good as that of a system that was manually
constructed for the same task, when both systems are given
access to the same knowledge and tested against the same
data.
The machine learning algorithm used
by RESOLVE can be given access to different types of
knowledge, some portions of which are very specific to a
particular topic area or domain, and other portions of which are
more general or domain-independent. An ablation experiment
shows that domain-specific knowledge is very important to
coreference resolution -- performance degrades significantly
more when the domain-specific features are disabled than
when a similarly-sized set of domain-independent features
is disabled.
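The shape of such an ablation experiment can be sketched as follows. This is a minimal illustration, not RESOLVE's actual code: the feature names (same_string, same_role), the feature groups, the toy data, and the stand-in scorer toy_score are all hypothetical, and a real experiment would retrain and test the learning algorithm where toy_score is called.

```python
from typing import Callable, Dict, List, Sequence

FeatureVector = Dict[str, int]
Scorer = Callable[[List[FeatureVector], List[bool]], float]


def drop_features(instances: List[FeatureVector],
                  disabled: Sequence[str]) -> List[FeatureVector]:
    """Return copies of the instances with the disabled features removed."""
    disabled_set = set(disabled)
    return [{k: v for k, v in inst.items() if k not in disabled_set}
            for inst in instances]


def ablation_deltas(instances: List[FeatureVector],
                    labels: List[bool],
                    groups: Dict[str, Sequence[str]],
                    score: Scorer) -> Dict[str, float]:
    """Score with all features, then with each group disabled in turn.

    Returns, per group, how far the score drops when that group is removed.
    """
    baseline = score(instances, labels)
    return {name: baseline - score(drop_features(instances, feats), labels)
            for name, feats in groups.items()}


def toy_score(instances: List[FeatureVector], labels: List[bool]) -> float:
    """Hypothetical stand-in for train-and-test: predict 'coreferent' when
    any remaining feature fires, and report accuracy."""
    predictions = [any(inst.values()) for inst in instances]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy phrase pairs (hypothetical): one feature vector per pair of phrases.
pairs = [
    {"same_string": 1, "same_role": 1},
    {"same_string": 0, "same_role": 1},
    {"same_string": 0, "same_role": 1},
    {"same_string": 0, "same_role": 0},
]
coreferent = [True, True, True, False]

feature_groups = {
    "domain_specific": ["same_role"],
    "domain_independent": ["same_string"],
}

deltas = ablation_deltas(pairs, coreferent, feature_groups, toy_score)
for group, drop in deltas.items():
    print(f"{group}: score drops by {drop:.2f}")
```

Comparing the per-group drops is what lets the two kinds of knowledge be ranked. The groups should be of similar size, as in the experiment described above, so that the comparison is not skewed by the number of features removed.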
However, even though domain-specific
knowledge is important for coreference resolution,
domain-independent features alone enable RESOLVE to
achieve 80% of the performance it achieves when
domain-specific features are available. One explanation
for why domain-independent knowledge can be used so
effectively is illustrated in another domain, where the
machine learning algorithm discovers domain-specific
knowledge by assembling domain-independent features into
domain-specific patterns. This ability of
RESOLVE to compensate for missing or insufficient
domain-specific knowledge is a significant advantage for
redeploying the system in new domains.
Download the full report (PDF, 265 KB).