In other words, if we know the true label y, then knowing one view (e.g., x^(2)) does not affect what we will observe for the other view (it will simply be P(x^(1) | y)). To illustrate the second assumption, consider our named entity classification task again. Let us collect all instances with true label y = Location. View 1 of these instances will be Location named entity strings, i.e., x^(1) ∈ {Washington State, Kazakhstan, China, ...}. The frequency of observing these named entities, given y = Location, is described by P(x^(1) | y). These named entities are associated with various contexts. Now let us select any particular context, say x^(2) = "headquartered in," and consider the instances with this context and y = Location. If conditional independence holds, in these instances we will again find all those named entities {Washington State, Kazakhstan, China, ...} with the same frequencies as indicated by P(x^(1) | y). In other words, the context "headquartered in" does not favor any particular location.
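Concretely, the assumption says that further conditioning on view 2 leaves the view-1 distribution unchanged: P(x^(1) | y, x^(2)) = P(x^(1) | y). The following small Python sketch illustrates this with made-up frequencies (all numbers below are invented for illustration, not taken from any data set):

    # Toy illustration of conditional independence: P(x1 | y, x2) = P(x1 | y).
    # All probabilities are made-up numbers for the class y = Location.
    p_x1 = {"Washington State": 0.5, "Kazakhstan": 0.3, "China": 0.2}  # P(x1 | y)
    p_x2 = {"headquartered in": 0.6, "flew to": 0.4}                   # P(x2 | y)

    # Under conditional independence the joint distribution factorizes:
    p_joint = {(e, c): p_x1[e] * p_x2[c] for e in p_x1 for c in p_x2}

    # Condition on the context "headquartered in" and renormalize:
    ctx = "headquartered in"
    mass = sum(p for (e, c), p in p_joint.items() if c == ctx)
    p_x1_given_ctx = {e: p / mass for (e, c), p in p_joint.items() if c == ctx}

    print(p_x1_given_ctx)  # identical to p_x1: the context favors no entity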
Why is the conditional independence assumption important for Co-Training? If the view-2 classifier f^(2) decides that the context "headquartered in" indicates Location with high confidence, Co-Training will add unlabeled instances with that context as view-1 training examples. These new training examples for f^(1) will include all representative Location named entities x^(1), thanks to the conditional independence assumption. If the assumption didn't hold, the new examples could all be highly similar and thus be less informative for the view-1 classifier. It can be shown that if the two assumptions hold, Co-Training can learn successfully from labeled and unlabeled data. However, it is actually difficult to find tasks in practice that completely satisfy the conditional independence assumption. After all, the context "Prime Minister of" practically rules out most locations except countries. When the conditional independence assumption is violated, Co-Training may not perform well.
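A minimal sketch of this loop may help fix the idea. The interface below is hypothetical (classifiers with fit and predict_proba, integer-coded labels); it is an illustration of the scheme just described, not a reference implementation:

    import numpy as np

    def co_train(f1, f2, L1, L2, U, k=5, rounds=10):
        """Sketch of the Co-Training loop.

        f1, f2 -- classifiers exposing fit(X, y) and predict_proba(X)
                  (labels assumed integer-coded 0..C-1)
        L1, L2 -- per-view labeled data as ([features], [labels]) pairs
        U      -- unlabeled pool: a list of (x1, x2) view pairs
        """
        for _ in range(rounds):
            if not U:
                break
            f1.fit(*L1)
            f2.fit(*L2)
            # each view's classifier teaches the *other* view
            for f, view, other_L in ((f1, 0, L2), (f2, 1, L1)):
                if not U:
                    break
                probs = np.asarray(f.predict_proba([x[view] for x in U]))
                # the k unlabeled instances this view is most confident about
                chosen = set(np.argsort(probs.max(axis=1))[-k:])
                for i in chosen:
                    other_L[0].append(U[i][1 - view])          # other view's features
                    other_L[1].append(int(probs[i].argmax()))  # predicted label
                # remove the newly labeled instances from the pool
                U = [u for j, u in enumerate(U) if j not in chosen]
        return f1, f2

Note how a confident view-2 prediction contributes the instance's view-1 features (and vice versa): this is exactly where conditional independence matters, since it guarantees the transferred examples are diverse rather than near-duplicates.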
There are several variants of Co-Training. The original Co-Training algorithm picks the
top k most confident unlabeled instances in each view, and augments them with predicted labels.
In contrast, the so-called Co-EM algorithm is less categorical. Co-EM maintains a probabilistic model P(y | x^(v); θ^(v)) for views v = 1, 2. For each unlabeled instance x = [x^(1), x^(2)], view 1 virtually splits it into two copies with opposite labels and fractional weights: (x, y = 1) with weight P(y = 1 | x^(1); θ^(1)) and (x, y = −1) with weight 1 − P(y = 1 | x^(1); θ^(1)). View 1 then adds all augmented unlabeled instances to L2. This is equivalent to the E-step in the EM algorithm. The same is true for view 2. Each view's parameter θ^(v) is then updated, which corresponds to the M-step, except that the expectations are from the other view. For certain tasks, Co-EM empirically performs better than Co-Training.
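The fractional-weight E-step is straightforward to express directly. The sketch below assumes a hypothetical function predict_proba_pos returning P(y = 1 | x^(1); θ^(1)) and binary labels y ∈ {+1, −1}:

    # Co-EM style E-step (sketch): view 1 converts each unlabeled instance
    # into two fractionally weighted copies for view 2's training set.
    # predict_proba_pos is an assumed interface: it returns P(y=1 | x1; theta1).

    def co_em_e_step(predict_proba_pos, U):
        """U: list of (x1, x2) view pairs. Returns (x2, y, weight) triples."""
        augmented = []
        for x1, x2 in U:
            p = predict_proba_pos(x1)
            augmented.append((x2, +1, p))        # copy labeled +1, weight p
            augmented.append((x2, -1, 1.0 - p))  # copy labeled -1, weight 1 - p
        return augmented

The M-step would then refit θ^(2) on L2 plus these weighted copies (e.g., with any classifier that accepts per-example sample weights), and the roles of the two views swap.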
4.4 MULTIVIEW LEARNING
The Co-Training algorithm is a means to an end: making the two classifiers f^(1) and f^(2) agree (i.e., predict the same label) on the unlabeled data. Such agreement is justified by learning theory, which is beyond the scope of this topic, but the intuition is simple: there are not many candidate predictors that can agree on unlabeled data in two views, so the so-called hypothesis space is small. If a candidate predictor in this small hypothesis space also fits the labeled data well, it is less likely to be overfitting, and more likely to generalize well to new data.
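One way to make the agreement criterion concrete is to score a pair of view classifiers by how often they predict the same label on the unlabeled pool. A rough sketch, assuming classifiers with a per-instance predict method (names are illustrative):

    # Agreement rate of two view classifiers on the unlabeled pool.
    # Multiview learning prefers hypothesis pairs with high agreement
    # (a small effective hypothesis space) that also fit the labeled data.

    def agreement_rate(f1, f2, U):
        """U: list of (x1, x2) view pairs; f1, f2 expose predict(x) -> label."""
        agree = sum(f1.predict(x1) == f2.predict(x2) for x1, x2 in U)
        return agree / len(U)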