to be overfitting, and can be expected to be a good predictor. In this section we discuss some other
algorithms which explicitly enforce hypothesis agreement, without requiring explicit feature splits
or the iterative mutual-teaching procedure. To understand these algorithms, we need to introduce
the regularized risk minimization framework for machine learning.
Recall that, in general, we can define a loss function to specify the cost of mistakes in prediction:
Definition 4.3. Loss Function. Let x ∈ X be an instance, y ∈ Y its true label, and f(x) our prediction. A loss function c(x, y, f(x)) ∈ [0, ∞) measures the amount of loss, or cost, of this prediction.
For example, in regression we can define the squared loss c(x, y, f(x)) = (y − f(x))². In classification we can define the 0/1 loss as c(x, y, f(x)) = 1 if y ≠ f(x), and 0 otherwise.
The loss function can be different for different types of misclassification. In medical diagnosis we might use c(x, y = healthy, f(x) = diseased) = 1 and c(x, y = diseased, f(x) = healthy) = 100. The loss function can also depend on the instance x: the same amount of medical prediction error might incur a higher loss on an infant than on an adult.
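To make these definitions concrete, here is a minimal Python sketch of the losses just described; the cost table encodes the hypothetical diagnosis costs above, and the function names are ours, not the book's:

```python
def squared_loss(y, y_hat):
    """Squared loss for regression: c(x, y, f(x)) = (y - f(x))^2."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 1.0 if y != y_hat else 0.0

# Asymmetric diagnosis loss: missing a disease costs 100, a false alarm costs 1.
DIAGNOSIS_COST = {
    ("healthy", "healthy"): 0.0,
    ("healthy", "diseased"): 1.0,    # keys are (true label, prediction)
    ("diseased", "diseased"): 0.0,
    ("diseased", "healthy"): 100.0,
}

def diagnosis_loss(y, y_hat):
    return DIAGNOSIS_COST[(y, y_hat)]
```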
Definition 4.4. Empirical Risk. The empirical risk of f is the average loss incurred by f on a labeled training sample:

R(f) = \frac{1}{l} \sum_{i=1}^{l} c(x_i, y_i, f(x_i)).
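In code, the empirical risk is just an average of per-instance losses; a minimal sketch, where the predictor f, the sample format, and the loss signature are our assumptions:

```python
def empirical_risk(f, sample, loss):
    """Average loss of predictor f on a labeled sample [(x_1, y_1), ..., (x_l, y_l)]."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

# e.g., with the squared loss from the earlier sketch:
# empirical_risk(lambda x: 2 * x, [(1.0, 2.1), (2.0, 3.8)], squared_loss)
```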
Applying the principle of empirical risk minimization (ERM), i.e., finding the f that minimizes the empirical risk, may seem like a natural thing to do:

f_{ERM} = \arg\min_{f \in F} R(f),    (4.2)

where F is the set of all hypotheses we consider. For classification with 0/1 loss, ERM amounts to minimizing the training sample error.
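For a finite hypothesis set, Eq. (4.2) can be carried out by exhaustive search; a toy self-contained sketch, with a hypothetical F of one-dimensional threshold classifiers and made-up data:

```python
# Hypothetical finite hypothesis set F: threshold classifiers on a scalar x.
thresholds = [0.0, 0.5, 1.0, 1.5, 2.0]
F = [lambda x, t=t: 1 if x > t else 0 for t in thresholds]

# Toy labeled sample (made up for illustration).
sample = [(0.2, 0), (0.8, 0), (1.2, 1), (1.9, 1)]

def risk(f):
    # Empirical risk under 0/1 loss: fraction of training mistakes.
    return sum(1 for x, y in sample if f(x) != y) / len(sample)

# ERM (Eq. 4.2): pick the hypothesis with the smallest empirical risk.
f_erm = min(F, key=risk)
```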
However, f_{ERM} can overfit the particular training sample. As a consequence, f_{ERM} is not necessarily the classifier in F with the smallest true risk on future data. One remedy is to regularize the empirical risk by a regularizer Ω(f). The regularizer Ω(f) is a non-negative functional, i.e., it takes a function f as input and outputs a non-negative real value. The value is such that if f is "smooth" or "simple" in some sense, Ω(f) will be close to zero; if f is too zigzagged (i.e., it overfits and attempts to pass through all labeled training instances), Ω(f) is large.
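For linear predictors f(x) = w · x, one common regularizer (our choice of example, not the only one) is the squared weight norm, which is near zero for flat, simple functions and large for steep ones:

```python
import numpy as np

def omega(w):
    """Squared-norm regularizer Omega(f) = ||w||^2 for a linear f(x) = w . x."""
    return float(np.dot(w, w))

# omega(np.array([0.1, 0.0])) is tiny, while omega(np.array([50.0, -30.0])) is
# large, penalizing steep, "zigzagged" linear fits.
```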
Definition 4.5. Regularized Risk. The regularized risk is the weighted sum of the empirical risk R(f) and the regularizer Ω(f), with weight λ ≥ 0: R(f) + λΩ(f). The principle of regularized risk minimization is to find the f that minimizes the regularized risk:

f^* = \arg\min_{f \in F} R(f) + \lambda \Omega(f).    (4.3)
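A familiar instance of Eq. (4.3) is ridge regression, where R(f) is the mean squared loss of a linear f(x) = w · x and Ω(f) = ||w||²; a minimal numpy sketch with made-up data:

```python
import numpy as np

# Toy data: l = 4 instances, d = 1 feature (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])
lam = 0.1  # regularization weight lambda >= 0

# Minimize (1/l) * ||y - X w||^2 + lam * ||w||^2 over w.
# Setting the gradient to zero gives the closed form
#   w* = (X^T X + l * lam * I)^{-1} X^T y.
l, d = X.shape
w_star = np.linalg.solve(X.T @ X + l * lam * np.eye(d), X.T @ y)
```

As λ grows, the penalty shrinks the weights toward zero, trading a higher empirical risk for a "simpler" f.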