Introduction to Statistical Machine Learning - Introduction to Semi-Supervised Learning


Depending on the domain of label $y$, supervised learning problems are further divided into classification and regression:

Definition 1.9. Classification. Classification is the supervised learning problem with discrete classes $\mathcal{Y}$. The function $f$ is called a classifier.

Definition 1.10. Regression. Regression is the supervised learning problem with continuous $\mathcal{Y}$. The function $f$ is called a regression function.

What exactly is a good $f$? The best $f$ is by definition

$$f^* = \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ c(\mathbf{x}, y, f(\mathbf{x})) \right], \tag{1.3}$$

where argmin means “finding the $f$ that minimizes the following quantity”. $\mathbb{E}_{(\mathbf{x}, y) \sim P}[\cdot]$ is the expectation over random test data drawn from $P$. Readers not familiar with this notation may wish to consult Appendix A. $c(\cdot)$ is a loss function that determines the cost or impact of making a prediction $f(\mathbf{x})$ that is different from the true label $y$. Some typical loss functions will be discussed shortly. Note that we limit our attention to some function family $\mathcal{F}$, mostly for computational reasons. If we remove this limitation and consider all possible functions, the resulting $f^*$ is the Bayes optimal predictor, the best one can hope for on average. For the distribution $P$, this function will incur the lowest possible loss when making predictions. The quantity $\mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ c(\mathbf{x}, y, f^*(\mathbf{x})) \right]$ is known as the Bayes error.
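To make the expectation in (1.3) concrete, the following is a minimal sketch that computes the Bayes optimal predictor and the Bayes error exactly for a tiny discrete distribution. The distribution `P`, the helper names, and the use of a 0-1 style loss are assumptions chosen for illustration; none of them come from the text.

```python
# Hypothetical toy distribution P(x, y): x in {0, 1, 2}, y in {0, 1}.
# The probabilities are made up for illustration and sum to 1.
P = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.10, (1, 1): 0.20,
    (2, 0): 0.05, (2, 1): 0.25,
}

def loss(x, y, prediction):
    """c(x, y, f(x)): 1 if the prediction differs from the true label, else 0."""
    return int(prediction != y)

def expected_loss(f, P, loss):
    """E_{(x,y)~P}[c(x, y, f(x))], computed exactly by summing over P."""
    return sum(p * loss(x, y, f(x)) for (x, y), p in P.items())

def bayes_optimal_predictor(P, loss):
    """With no restriction on f, choose the best label separately for each x."""
    xs = sorted({x for x, _ in P})
    ys = sorted({y for _, y in P})
    table = {}
    for x in xs:
        # Expected loss contributed at this x if we predict `guess`.
        table[x] = min(
            ys,
            key=lambda guess: sum(P.get((x, y), 0.0) * loss(x, y, guess) for y in ys),
        )
    return lambda x: table[x]

f_star = bayes_optimal_predictor(P, loss)
bayes_error = expected_loss(f_star, P, loss)
print([f_star(x) for x in (0, 1, 2)], bayes_error)
```

Even the best possible predictor has nonzero expected loss here, because at every value of x both labels occur with positive probability.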

However, the Bayes optimal predictor may not be in $\mathcal{F}$ in general. Our goal is to find the $f \in \mathcal{F}$ that is as close to the Bayes optimal predictor as possible.
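As a rough illustration of a Bayes optimal predictor that lies outside the function family, the sketch below restricts attention to threshold classifiers and compares the best of them with the unrestricted optimum. The distribution `P`, the threshold family, and all function names are hypothetical choices made for this example.

```python
# Hypothetical distribution over x in {0, 1, 2} and y in {0, 1}; sums to 1.
P = {
    (0, 0): 0.05, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.10,
    (2, 0): 0.10, (2, 1): 0.25,
}

def risk(f):
    """Expected loss of f under P, charging 1 for each wrong prediction."""
    return sum(p * (f(x) != y) for (x, y), p in P.items())

def make_threshold(t):
    """A member of the assumed family F: predict 1 exactly when x >= t."""
    return lambda x: int(x >= t)

# Best predictor inside the restricted family F.
thresholds = [0, 1, 2, 3]
best_t = min(thresholds, key=lambda t: risk(make_threshold(t)))
best_in_family_risk = risk(make_threshold(best_t))

def bayes(x):
    """Unrestricted optimum: the more probable label at each x."""
    return int(P[(x, 1)] > P[(x, 0)])

bayes_risk = risk(bayes)
print(best_t, best_in_family_risk, bayes_risk)
```

In this toy setup the unrestricted optimum predicts 1, 0, 1 for x = 0, 1, 2, a pattern no single threshold can reproduce, so the best member of the family has strictly higher expected loss than the Bayes error.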

It is worth noting that the underlying distribution $P(\mathbf{x}, y)$ is unknown to us. Therefore, it is not possible to directly find $f^*$, or even to measure any predictor $f$'s performance, for that matter. Here lies the fundamental difficulty of statistical machine learning: one has to generalize the prediction from a finite training sample to any unseen test data. This is known as induction.
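In a simulation, unlike in practice, $P$ can be fixed by hand, which makes it possible to see how much a finite sample can mislead. The sketch below is only an illustration under that assumption: the distribution in `draw`, the predictor `f`, and the sample sizes are all invented for this example.

```python
import random

def draw(rng):
    """Draw one (x, y) pair from a synthetic P, invented for this sketch."""
    x = rng.random()                     # x ~ Uniform[0, 1)
    p_one = 0.8 if x > 0.5 else 0.2      # P(y = 1 | x)
    y = 1 if rng.random() < p_one else 0
    return x, y

def f(x):
    """A fixed predictor: guess class 1 on the right half of the interval."""
    return 1 if x > 0.5 else 0

def average_loss(n, rng):
    """Fraction of wrong predictions on n fresh draws from P."""
    wrong = 0
    for _ in range(n):
        x, y = draw(rng)
        wrong += int(f(x) != y)
    return wrong / n

rng = random.Random(0)
# By construction f is wrong with probability 0.2 at every x,
# so its true expected loss is 0.2.
small_estimates = [average_loss(20, rng) for _ in range(5)]
large_estimate = average_loss(200_000, rng)
print(small_estimates, large_estimate)
```

The estimates from 20 draws scatter around 0.2, while the estimate from a very large sample settles close to it. With real data only the small, noisy view is available, and the true value cannot be checked.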

To proceed, a seemingly reasonable approximation is to gauge $f$'s performance using training sample error. That is, to replace the unknown expectation by the average over the training sample:

Definition 1.11. Training sample error. Given a training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, the training sample error is

$$\frac{1}{l} \sum_{i=1}^{l} c(\mathbf{x}_i, y_i, f(\mathbf{x}_i)). \tag{1.4}$$
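The training sample error in (1.4) is simply an average of per-example losses, which can be sketched in a few lines. The function names, the tiny regression sample, and the squared loss used here are assumptions for illustration; this section has not yet introduced any specific loss.

```python
def training_sample_error(f, sample, loss):
    """Average of c(x_i, y_i, f(x_i)) over the training sample, as in (1.4)."""
    l = len(sample)  # number of training examples
    return sum(loss(x, y, f(x)) for x, y in sample) / l

def squared_loss(x, y, prediction):
    """An illustrative loss for regression (an assumption, not from the text)."""
    return (prediction - y) ** 2

def f(x):
    """A hypothetical regression function."""
    return 2.0 * x

# Made-up training sample of (x_i, y_i) pairs.
sample = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5)]

error = training_sample_error(f, sample, squared_loss)
print(error)
```

Passing the loss as an argument mirrors the definition, which leaves $c$ unspecified: the same averaging code serves classification or regression once a suitable loss is supplied.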

For classification, one commonly used loss function is the 0-1 loss $c(\mathbf{x}, y, f(\mathbf{x})) \equiv (f(\mathbf{x}) \neq y)$:

$$\frac{1}{l} \sum_{i=1}^{l} (f(\mathbf{x}_i) \neq y_i), \tag{1.5}$$
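Under the 0-1 loss, the training sample error in (1.5) reduces to the fraction of training examples the classifier gets wrong. A minimal sketch, assuming a made-up one-dimensional sample and a hypothetical threshold classifier:

```python
def zero_one_training_error(f, sample):
    """Fraction of training examples with f(x_i) != y_i, as in (1.5)."""
    mistakes = sum(int(f(x) != y) for x, y in sample)
    return mistakes / len(sample)

def f(x):
    """Hypothetical classifier: predict class 1 when x is at least 0.45."""
    return int(x >= 0.45)

# Made-up labeled sample of (x_i, y_i) pairs.
sample = [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.5, 0)]

error = zero_one_training_error(f, sample)
print(error)
```

Only the last example, with x = 0.5 and label 0, is misclassified, so one mistake out of five examples gives the reported error.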
