1.1 THE DATA
Definition 1.2. Instance. An instance $\mathbf{x}$ represents a specific object. The instance is often represented by a $D$-dimensional feature vector $\mathbf{x} = (x_1, \dots, x_D) \in \mathbb{R}^D$, where each dimension is called a feature. The length $D$ of the feature vector is known as the dimensionality of the feature vector.
The feature representation is an abstraction of the objects. It essentially ignores all other infor-
mation not represented by the features. For example, two little green men with the same weight and
height, but with different names, will be regarded as indistinguishable by our feature representation.
Note we use boldface $\mathbf{x}$ to denote the whole instance, and $x_d$ to denote the $d$-th feature of $\mathbf{x}$. In our example, an instance is a specific little green man; the feature vector consists of $D = 2$ features: $x_1$ is the weight, and $x_2$ is the height. Features can also take discrete values. When there are multiple instances, we will use $x_{id}$ to denote the $i$-th instance's $d$-th feature.
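To make the notation concrete, here is a minimal Python sketch; the weight and height values are invented for illustration, and nothing in the text prescribes them:

```python
import numpy as np

# Each little green man is an instance, represented by a D = 2
# feature vector: x_1 = weight, x_2 = height (values made up).
x1 = np.array([80.0, 152.0])  # first little green man
x2 = np.array([80.0, 152.0])  # second one: different name, same features

# The feature representation ignores everything not captured by the
# features (such as names), so these two instances are indistinguishable:
print(np.array_equal(x1, x2))  # True
```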
Definition 1.3. Training Sample. A training sample is a collection of instances $\{\mathbf{x}_i\}_{i=1}^{n} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, which acts as the input to the learning process. We assume these instances are sampled independently from an underlying distribution $P(\mathbf{x})$, which is unknown to us. We denote this by $\{\mathbf{x}_i\}_{i=1}^{n} \stackrel{i.i.d.}{\sim} P(\mathbf{x})$, where i.i.d. stands for independent and identically distributed.
In our example, the training sample consists of $n = 100$ instances $\mathbf{x}_1, \dots, \mathbf{x}_{100}$. A training sample is the “experience” given to a learning algorithm. What the algorithm can learn from it, however, varies. In this chapter, we introduce two basic learning paradigms: unsupervised learning and supervised learning.
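As an illustration of the sampling assumption, the sketch below simulates a training sample of $n = 100$ i.i.d. instances. The Gaussian form of $P(\mathbf{x})$ and its parameters are assumptions made purely so the example runs; in practice $P(\mathbf{x})$ is unknown to the learner:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulate an i.i.d. training sample {x_i}, i = 1..n, with n = 100
# instances and D = 2 features (weight, height). The Gaussian P(x)
# below is an assumption for illustration only.
n, D = 100, 2
mean = np.array([80.0, 152.0])
cov = np.array([[25.0, 10.0],
                [10.0, 36.0]])
X = rng.multivariate_normal(mean, cov, size=n)  # shape (n, D)

# X[i, d] plays the role of x_{id}: the d-th feature of instance i.
print(X.shape)  # (100, 2)
```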
1.2 UNSUPERVISED LEARNING
Definition 1.4. Unsupervised learning. Unsupervised learning algorithms work on a training sample with $n$ instances $\{\mathbf{x}_i\}_{i=1}^{n}$. There is no teacher providing supervision as to how individual instances should be handled; this is the defining property of unsupervised learning. Common unsupervised learning tasks include:

• clustering, where the goal is to separate the $n$ instances into groups;
• novelty detection, which identifies the few instances that are very different from the majority;
• dimensionality reduction, which aims to represent each instance with a lower dimensional feature vector while maintaining key characteristics of the training sample (the latter two tasks are sketched in code after this list).
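To give a feel for the latter two tasks, here is a minimal sketch on assumed synthetic data: dimensionality reduction via an SVD-based projection (one common approach, not the only one), and novelty detection via a naive distance-from-the-mean score (real methods are more sophisticated):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 5))   # n = 100 instances, D = 5 (synthetic)

# Dimensionality reduction sketch: project onto the top 2 principal
# directions obtained from the SVD of the centered data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T               # each instance now has 2 features

# Novelty detection sketch: flag the ~5% of instances farthest from
# the sample mean as candidate novelties.
scores = np.linalg.norm(Xc, axis=1)
novel = scores > np.quantile(scores, 0.95)

print(Z.shape, int(novel.sum()))  # (100, 2) and roughly 5 instances
```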
Among the unsupervised learning tasks, the one most relevant to this topic is clustering , which
we discuss in more detail.
Definition 1.5. Clustering. Clustering splits $\{\mathbf{x}_i\}_{i=1}^{n}$ into $k$ clusters, such that instances in the same cluster are similar, and instances in different clusters are dissimilar. The number of clusters $k$ may be specified by the user, or may be inferred from the training sample itself.
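Below is a minimal sketch of one clustering method, k-means (Lloyd's algorithm); the two-blob data and the choice $k = 2$ are assumptions for illustration, and the definition above does not commit to any particular algorithm:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic training sample: two well-separated blobs (assumed data).
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([6.0, 6.0], 1.0, size=(50, 2))])

k = 2
# Initialize centers at k distinct instances chosen at random.
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    # Assignment step: each instance joins its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its cluster.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers.round(1))  # expected: two centers near (0, 0) and (6, 6)
```

Instances assigned the same label form a cluster; on data this well separated, the recovered centers typically match the two blobs.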