1.1 THE DATA
Definition 1.2. Instance. An instance $\mathbf{x}$ represents a specific object. The instance is often represented by a $D$-dimensional feature vector $\mathbf{x} = (x_1, \dots, x_D) \in \mathbb{R}^D$, where each dimension is called a feature. The length $D$ of the feature vector is known as the dimensionality of the feature vector.
The feature representation is an abstraction of the objects. It essentially ignores all other infor-
mation not represented by the features. For example, two little green men with the same weight and
height, but with different names, will be regarded as indistinguishable by our feature representation.
Note we use boldface $\mathbf{x}$ to denote the whole instance, and $x_d$ to denote the $d$-th feature of $\mathbf{x}$. In our example, an instance is a specific little green man; the feature vector consists of $D = 2$ features: $x_1$ is the weight, and $x_2$ is the height. Features can also take discrete values. When there are multiple instances, we will use $x_{id}$ to denote the $i$-th instance's $d$-th feature.
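To make the notation concrete, here is a minimal Python sketch; the weight and height values are invented for illustration, and nothing in the text prescribes them:

```python
import numpy as np

# Each little green man is an instance, represented by a D = 2
# feature vector: x_1 = weight, x_2 = height (values made up).
x1 = np.array([80.0, 152.0])  # first little green man
x2 = np.array([80.0, 152.0])  # second one: different name, same features

# The feature representation ignores everything not captured by the
# features (such as names), so these two instances are indistinguishable:
print(np.array_equal(x1, x2))  # True
```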
Definition 1.3. Training Sample. A training sample is a collection of instances $\{\mathbf{x}_i\}_{i=1}^{n} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, which acts as the input to the learning process. We assume these instances are sampled independently from an underlying distribution $P(\mathbf{x})$, which is unknown to us. We denote this by $\{\mathbf{x}_i\}_{i=1}^{n} \stackrel{i.i.d.}{\sim} P(\mathbf{x})$, where i.i.d. stands for independent and identically distributed.
In our example, the training sample consists of $n = 100$ instances $\mathbf{x}_1, \dots, \mathbf{x}_{100}$. A training sample is the “experience” given to a learning algorithm. What the algorithm can learn from it, however, varies. In this chapter, we introduce two basic learning paradigms: unsupervised learning and supervised learning.
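As an illustration of the sampling assumption, the sketch below simulates a training sample of $n = 100$ i.i.d. instances. The Gaussian form of $P(\mathbf{x})$ and its parameters are assumptions made purely so the example runs; in practice $P(\mathbf{x})$ is unknown to the learner:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate an i.i.d. training sample {x_i}, i = 1..n, with n = 100
# instances and D = 2 features (weight, height). The Gaussian P(x)
# below is an assumption for illustration only.
n, D = 100, 2
mean = np.array([80.0, 152.0])
cov = np.array([[25.0, 10.0],
                [10.0, 36.0]])
X = rng.multivariate_normal(mean, cov, size=n)  # shape (n, D)

# X[i, d] plays the role of x_{id}: the d-th feature of instance i.
print(X.shape)  # (100, 2)
```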
1.2 UNSUPERVISED LEARNING
Definition 1.4. Unsupervised learning. Unsupervised learning algorithms work on a training sample with $n$ instances $\{\mathbf{x}_i\}_{i=1}^{n}$. There is no teacher providing supervision as to how individual instances should be handled; this is the defining property of unsupervised learning. Common unsupervised learning tasks include:

• clustering, where the goal is to separate the $n$ instances into groups;
• novelty detection, which identifies the few instances that are very different from the majority;
• dimensionality reduction, which aims to represent each instance with a lower dimensional feature vector while maintaining key characteristics of the training sample (the latter two tasks are sketched in code after this list).
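To give a feel for the latter two tasks, here is a minimal sketch on assumed synthetic data: dimensionality reduction via an SVD-based projection (one common approach, not the only one), and novelty detection via a naive distance-from-the-mean score (real methods are more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 5))   # n = 100 instances, D = 5 (synthetic)

# Dimensionality reduction sketch: project onto the top 2 principal
# directions obtained from the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T               # each instance now has 2 features

# Novelty detection sketch: flag the ~5% of instances farthest from
# the sample mean as candidate novelties.
scores = np.linalg.norm(Xc, axis=1)
novel = scores > np.quantile(scores, 0.95)

print(Z.shape, int(novel.sum()))  # (100, 2) and roughly 5 instances
```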
Among the unsupervised learning tasks, the one most relevant to this topic is clustering , which
we discuss in more detail.
Definition 1.5. Clustering. Clustering splits $\{\mathbf{x}_i\}_{i=1}^{n}$ into $k$ clusters, such that instances in the same cluster are similar, and instances in different clusters are dissimilar. The number of clusters $k$ may be specified by the user, or may be inferred from the training sample itself.
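Below is a minimal sketch of one clustering method, k-means (Lloyd's algorithm); the two-blob data and the choice $k = 2$ are assumptions for illustration, and the definition above does not commit to any particular algorithm:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic training sample: two well-separated blobs (assumed data).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([6.0, 6.0], 1.0, size=(50, 2))])

k = 2
# Initialize centers at k distinct instances chosen at random.
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    # Assignment step: each instance joins its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its cluster.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers.round(1))  # expected: two centers near (0, 0) and (6, 6)
```

Instances assigned the same label form a cluster; on data this well separated, the recovered centers typically match the two blobs.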