In other words, if we know the true label y, then knowing one view (e.g., x^(2)) does not affect what we will observe for the other view (it will simply be P(x^(1) | y)). To illustrate the second assumption, consider our named entity classification task again. Let us collect all instances with true label y = Location. View 1 of these instances will be Location named entity strings, i.e., x^(1) ∈ {Washington State, Kazakhstan, China, ...}. The frequency of observing these named entities, given y = Location, is described by P(x^(1) | y). These named entities are associated with various contexts. Now let us select any particular context, say x^(2) = "headquartered in," and consider the instances with this context and y = Location. If conditional independence holds, in these instances we will again find all those named entities {Washington State, Kazakhstan, China, ...} with the same frequencies as indicated by P(x^(1) | y). In other words, the context "headquartered in" does not favor any particular location.
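Concretely, the assumption says that further conditioning on view 2 leaves the view-1 distribution unchanged: P(x^(1) | y, x^(2)) = P(x^(1) | y). The following small Python sketch illustrates this with made-up frequencies (all numbers below are invented for illustration, not taken from any data set):

    # Toy illustration of conditional independence: P(x1 | y, x2) = P(x1 | y).
    # All probabilities are made-up numbers for the class y = Location.
    p_x1 = {"Washington State": 0.5, "Kazakhstan": 0.3, "China": 0.2}  # P(x1 | y)
    p_x2 = {"headquartered in": 0.6, "flew to": 0.4}                   # P(x2 | y)

    # Under conditional independence the joint distribution factorizes:
    p_joint = {(e, c): p_x1[e] * p_x2[c] for e in p_x1 for c in p_x2}

    # Condition on the context "headquartered in" and renormalize:
    ctx = "headquartered in"
    mass = sum(p for (e, c), p in p_joint.items() if c == ctx)
    p_x1_given_ctx = {e: p / mass for (e, c), p in p_joint.items() if c == ctx}

    print(p_x1_given_ctx)  # identical to p_x1: the context favors no entity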
Why is the conditional independence assumption important for Co-Training? If the view-2 classifier f^(2) decides that the context "headquartered in" indicates Location with high confidence, Co-Training will add unlabeled instances with that context as view-1 training examples. These new training examples for f^(1) will include all representative Location named entities x^(1), thanks to the conditional independence assumption. If the assumption didn't hold, the new examples could all be highly similar and thus be less informative for the view-1 classifier. It can be shown that if the two assumptions hold, Co-Training can learn successfully from labeled and unlabeled data. However, it is actually difficult to find tasks in practice that completely satisfy the conditional independence assumption. After all, the context "Prime Minister of" practically rules out most locations except countries. When the conditional independence assumption is violated, Co-Training may not perform well.
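A minimal sketch of this loop may help fix the idea. The interface below is hypothetical (classifiers with fit and predict_proba, integer-coded labels); it is an illustration of the scheme just described, not a reference implementation:

    import numpy as np

    def co_train(f1, f2, L1, L2, U, k=5, rounds=10):
        """Sketch of the Co-Training loop.

        f1, f2 -- classifiers exposing fit(X, y) and predict_proba(X)
                  (labels assumed integer-coded 0..C-1)
        L1, L2 -- per-view labeled data as ([features], [labels]) pairs
        U      -- unlabeled pool: a list of (x1, x2) view pairs
        """
        for _ in range(rounds):
            if not U:
                break
            f1.fit(*L1)
            f2.fit(*L2)
            # each view's classifier teaches the *other* view
            for f, view, other_L in ((f1, 0, L2), (f2, 1, L1)):
                if not U:
                    break
                probs = np.asarray(f.predict_proba([x[view] for x in U]))
                # the k unlabeled instances this view is most confident about
                chosen = set(np.argsort(probs.max(axis=1))[-k:])
                for i in chosen:
                    other_L[0].append(U[i][1 - view])          # other view's features
                    other_L[1].append(int(probs[i].argmax()))  # predicted label
                # remove the newly labeled instances from the pool
                U = [u for j, u in enumerate(U) if j not in chosen]
        return f1, f2

Note how a confident view-2 prediction contributes the instance's view-1 features (and vice versa): this is exactly where conditional independence matters, since it guarantees the transferred examples are diverse rather than near-duplicates.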
There are several variants of Co-Training. The original Co-Training algorithm picks the
top k most confident unlabeled instances in each view, and augments them with predicted labels.
In contrast, the so-called Co-EM algorithm is less categorical. Co-EM maintains a probabilistic model P(y | x^(v); θ^(v)) for views v = 1, 2. For each unlabeled instance x = [x^(1), x^(2)], view 1 virtually splits it into two copies with opposite labels and fractional weights: (x, y = 1) with weight P(y = 1 | x^(1); θ^(1)) and (x, y = −1) with weight 1 − P(y = 1 | x^(1); θ^(1)). View 1 then adds all augmented unlabeled instances to L2. This is equivalent to the E-step in the EM algorithm. The same is true for view 2. Each view's parameter θ^(v) is then updated, which corresponds to the M-step, except that the expectations are from the other view. For certain tasks, Co-EM empirically performs better than Co-Training.
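The fractional-weight E-step is straightforward to express directly. The sketch below assumes a hypothetical function predict_proba_pos returning P(y = 1 | x^(1); θ^(1)) and binary labels y ∈ {+1, −1}:

    # Co-EM style E-step (sketch): view 1 converts each unlabeled instance
    # into two fractionally weighted copies for view 2's training set.
    # predict_proba_pos is an assumed interface: it returns P(y=1 | x1; theta1).

    def co_em_e_step(predict_proba_pos, U):
        """U: list of (x1, x2) view pairs. Returns (x2, y, weight) triples."""
        augmented = []
        for x1, x2 in U:
            p = predict_proba_pos(x1)
            augmented.append((x2, +1, p))        # copy labeled +1, weight p
            augmented.append((x2, -1, 1.0 - p))  # copy labeled -1, weight 1 - p
        return augmented

The M-step would then refit θ^(2) on L2 plus these weighted copies (e.g., with any classifier that accepts per-example sample weights), and the roles of the two views swap.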
4.4 MULTIVIEW LEARNING
The Co-Training algorithm is a means to an end: making the two classifiers f^(1) and f^(2) agree (i.e., predict the same label) on the unlabeled data. Such agreement is justified by learning theory, which is beyond the scope of this topic, but the intuition is simple: there are not many candidate predictors that can agree on unlabeled data in two views, so the so-called hypothesis space is small. If a candidate predictor in this small hypothesis space also fits the labeled data well, it is less likely to be overfitting, and more likely to generalize well to new data.
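One way to make the agreement criterion concrete is to score a pair of view classifiers by how often they predict the same label on the unlabeled pool. A rough sketch, assuming classifiers with a per-instance predict method (names are illustrative):

    # Agreement rate of two view classifiers on the unlabeled pool.
    # Multiview learning prefers hypothesis pairs with high agreement
    # (a small effective hypothesis space) that also fit the labeled data.

    def agreement_rate(f1, f2, U):
        """U: list of (x1, x2) view pairs; f1, f2 expose predict(x) -> label."""
        agree = sum(f1.predict(x1) == f2.predict(x2) for x1, x2 in U)
        return agree / len(U)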