where $(f(\mathbf{x}) \neq y)$ is 1 if $f$ predicts a different class than $y$ on $\mathbf{x}$, and 0 otherwise. For regression, one
commonly used loss function is the squared loss $c(\mathbf{x}, y, f(\mathbf{x})) = (f(\mathbf{x}) - y)^2$:
\[
\frac{1}{n} \sum_{i=1}^{n} \big(f(\mathbf{x}_i) - y_i\big)^2. \tag{1.6}
\]
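As a small illustration, here is one way these training sample errors could be computed in Python; the helper names are ours, NumPy is assumed, and f stands for any callable that maps a single instance to a prediction.

import numpy as np

def training_error_01(f, X, y):
    # 0-1 loss: fraction of training instances on which f predicts a class different from the label.
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != np.asarray(y))

def training_error_squared(f, X, y):
    # Squared loss, Equation (1.6): (1/n) * sum_i (f(x_i) - y_i)^2 over the training sample.
    predictions = np.array([f(x) for x in X])
    return np.mean((predictions - np.asarray(y)) ** 2)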
Other loss functions will be discussed as we encounter them later in the topic.
It might be tempting to seek the f that minimizes training sample error. However, this
strategy is flawed: such an f will tend to overfit the particular training sample. That is, it will likely
fit itself to the statistical noise in the particular training sample. It will learn more than just the
true relationship between X and Y. Such an overfitted predictor will have small training sample
error, but is likely to perform less well on future test data. A sub-area within machine learning called
computational learning theory studies the issue of overfitting. It establishes rigorous connections
between the training sample error and the true error, using a formal notion of complexity such as
the Vapnik-Chervonenkis dimension or Rademacher complexity. We provide a concise discussion
in Section 8.1. Informed by computational learning theory, one reasonable training strategy is to
seek an f that “almost” minimizes the training sample error, while regularizing f so that it is not too
complex in a certain sense. Interested readers can find the references in the bibliographical notes.
To estimate f's future performance, one can use a separate sample of labeled instances, called
the test sample: $\{(\mathbf{x}_j, y_j)\}_{j=n+1}^{n+m} \overset{i.i.d.}{\sim} P(\mathbf{x}, y)$. A test sample is not used during training, and therefore
provides a faithful (unbiased) estimation of future performance.
Definition 1.12. Test sample error. The corresponding test sample error for classification with
0-1 loss is
\[
\frac{1}{m} \sum_{j=n+1}^{n+m} \big(f(\mathbf{x}_j) \neq y_j\big), \tag{1.7}
\]
and for regression with squared loss is
\[
\frac{1}{m} \sum_{j=n+1}^{n+m} \big(f(\mathbf{x}_j) - y_j\big)^2. \tag{1.8}
\]
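To make the train/test separation concrete, here is a minimal sketch of estimating the test sample error of Equation (1.7) in Python; train_classifier is a hypothetical training routine, and the other names are ours.

import numpy as np

def test_error_01(f, X_test, y_test):
    # Equation (1.7): fraction of held-out test instances that f misclassifies.
    predictions = np.array([f(x) for x in X_test])
    return np.mean(predictions != np.asarray(y_test))

# Illustrative usage: the first n labeled instances train the classifier,
# while the remaining m are held out and touched only at evaluation time.
#   f = train_classifier(X[:n], y[:n])              # hypothetical training routine
#   error = test_error_01(f, X[n:n + m], y[n:n + m])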
In the remainder of the topic, we focus on classification due to its prevalence in semi-supervised
learning research. Most ideas discussed also apply to regression, though.
As a concrete example of a supervised learning method, we now introduce a simple classification algorithm: k-nearest-neighbor (kNN).
Algorithm 1.13. k-nearest-neighbor classifier.
Input: Training data (x_1, y_1), ..., (x_n, y_n); distance function d();
number of neighbors k; test instance x
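One way such a classifier can be implemented is sketched below in Python, assuming the common choices of Euclidean distance for d() and a majority vote among the labels of the k nearest training instances; the function name is ours.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, d=None):
    # Predict the class of test instance x from its k nearest training instances.
    if d is None:
        # Assumed default distance function d(): Euclidean distance.
        d = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    # Distance from the test instance to every training instance.
    distances = [d(xi, x) for xi in X_train]
    # Indices of the k closest training instances.
    nearest = np.argsort(distances)[:k]
    # Majority class among the k nearest labels (ties resolved by whichever class is seen first).
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

Illustrative usage: y_hat = knn_predict(X_train, y_train, x, k=3).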
 