CHAPTER 4
Co-Training
4.1 TWO VIEWS OF AN INSTANCE
Consider the supervised learning task of named entity classification in natural language processing.
A named entity is a proper name such as “Washington State” or “Mr. Washington.” Each named
entity has a class label depending on what it is referring to. For simplicity, we assume there are
only two classes: Person or Location. The goal of named entity classification is to assign the
correct label to each entity, for example, Location to “Washington State” and Person to “Mr.
Washington.” Named entity classification is a classification problem: the task is to predict the class
y from the features x. Our focus is not on the details of training supervised classifiers that work
on strings. (Basically, it involves some form of partial string matching. The details can be found in
the bibliographical notes.) Instead, we focus on named entity classification as one example task that
involves instances with a special structure that lends itself well to semi-supervised learning.
An instance of a named entity can be represented by two distinct sets of features. The first is
the set of words that make up the named entity itself. The second is the set of words in the context
in which the named entity occurs. In the following examples the named entity is in parentheses, and
the context is in square brackets:
instance 1: ... [headquartered in] (Washington State) ...
instance 2: ... (Mr. Washington), the [vice president] of ...
Formally, each named entity instance is represented by two views (sets of features): the words in
the entity itself, $x^{(1)}$, and the words in its context, $x^{(2)}$. We write $x = [x^{(1)}, x^{(2)}]$.
As another example of views, consider Web page classification into Student or Faculty
Web pages. In this task, the first view $x^{(1)}$ can be the words on the Web page in question. The second
view $x^{(2)}$ can be the words in all the hyperlinks that point to the Web page.
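The two-view representation of a named entity instance can be sketched in code. The following is only an illustrative sketch, not the feature extraction used in the text: the names `TwoViewInstance` and `split_views` are hypothetical, the entity is assumed to be marked with parentheses as in the examples above, and the context view is assumed to be every word surrounding the entity (the text itself uses only a few selected context words).

```python
import re
from typing import NamedTuple, Tuple


class TwoViewInstance(NamedTuple):
    """One instance x = [x^(1), x^(2)] (hypothetical container)."""
    entity_words: Tuple[str, ...]   # view 1: words in the named entity itself
    context_words: Tuple[str, ...]  # view 2: words in the surrounding context


def split_views(snippet: str) -> TwoViewInstance:
    """Split a snippet such as '... flew to (China) ...' into two views.

    Assumption: the named entity is the first parenthesized span.
    """
    match = re.search(r"\(([^)]*)\)", snippet)
    if match is None:
        raise ValueError("no parenthesized named entity found")
    # View 1: whitespace-separated words inside the parentheses.
    entity_words = tuple(match.group(1).split())
    # View 2: alphabetic words before and after the entity (ellipses dropped).
    context = snippet[:match.start()] + " " + snippet[match.end():]
    context_words = tuple(w.lower() for w in re.findall(r"[A-Za-z]+", context))
    return TwoViewInstance(entity_words, context_words)


instance_1 = split_views("... headquartered in (Washington State) ...")
instance_2 = split_views("... (Mr. Washington), the vice president of ...")
```

Keeping the two views in separate fields, rather than concatenating them into one feature vector, makes it straightforward to train or query a classifier on either view alone.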
Going back to the named entity classification task, let us assume we only have these two
labeled instances in our training data:
instance   $x^{(1)}$           $x^{(2)}$           y
1.         Washington State    headquartered in    Location
2.         Mr. Washington      vice president      Person
This labeled training sample seems woefully inadequate: we know that there are many other ways
to express a location or person. For example,
... (Robert Jordan), a partner at ...
... flew to (China) ...
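To make the inadequacy concrete, here is a minimal sketch that checks whether the two labeled instances say anything about the two new examples. The function `lookup_label` is a deliberately naive, hypothetical stand-in that requires an exact match on one of the views; it is not the partial string matching classifier mentioned earlier. The context strings used for the new instances are an assumption based on the words shown around each entity.

```python
# Labeled training sample: ((x1, x2), y), with x1 the entity words
# and x2 the context words.
LABELED = [
    (("Washington State", "headquartered in"), "Location"),
    (("Mr. Washington", "vice president"), "Person"),
]


def lookup_label(x1, x2, labeled=LABELED):
    """Return the label of a training instance that matches either view exactly.

    Returns None when neither view has been seen in the labeled data.
    """
    for (train_x1, train_x2), y in labeled:
        if x1 == train_x1 or x2 == train_x2:
            return y
    return None


# New instances from the text; their contexts are assumed for illustration.
new_instances = [
    ("Robert Jordan", "a partner at"),
    ("China", "flew to"),
]
predictions = [lookup_label(x1, x2) for x1, x2 in new_instances]
```

Neither new instance shares an entity string or a context string with the labeled sample, so this lookup has no basis for assigning either one a label.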
 