12.1 The Machine-Learning Model
In this brief section we introduce the framework for machine-learning algorithms and give
the basic definitions.
12.1.1 Training Sets
The data to which a machine-learning (often abbreviated ML) algorithm is applied is called a training set. A training set consists of a set of pairs (x, y), called training examples, where
• x is a vector of values, often called a feature vector. Each value, or feature, can be categorical (values are taken from a set of discrete values, such as {red, blue, green}) or numerical (values are integers or real numbers).
• y is the label, the classification value for x.
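As a minimal sketch of this definition, a training set can be written as a list of (x, y) pairs. The type aliases, the helper is_categorical, the feature values, and the labels "accept" and "reject" are all hypothetical, chosen only to show one categorical and two numerical features side by side.

```python
from typing import List, Tuple, Union

# Assumption: categorical values are stored as strings, numerical ones as int or float.
Feature = Union[str, int, float]
FeatureVector = Tuple[Feature, ...]
Label = str
TrainingExample = Tuple[FeatureVector, Label]

# Hypothetical training set: the first feature is categorical (a color),
# the second and third are numerical (an integer and a real number).
training_set: List[TrainingExample] = [
    (("red", 3, 0.5), "accept"),
    (("blue", 7, 1.25), "reject"),
    (("green", 1, 2.0), "accept"),
]


def is_categorical(value: Feature) -> bool:
    """Treat string-valued features as categorical, everything else as numerical."""
    return isinstance(value, str)


for x, y in training_set:
    kinds = ["categorical" if is_categorical(v) else "numerical" for v in x]
    print(x, kinds, y)
```

Representing each example as a tuple keeps the feature vector x and its label y together, which mirrors the (x, y) notation used in the text.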
The objective of the ML process is to discover a function y = f(x) that best predicts the value of y associated with each value of x. The type of y is in principle arbitrary, but there are several common and important cases.
(1) y is a real number. In this case, the ML problem is called regression.
(2) y is a boolean value true-or-false, more commonly written as +1 and −1, respectively. In this case the problem is binary classification.
(3) y is a member of some finite set. The members of this set can be thought of as “classes,” and each member represents one class. The problem is multiclass classification.
(4) y is a member of some potentially infinite set, for example, a parse tree for x, which is interpreted as a sentence.
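The first three cases can be told apart, roughly, by looking at the labels alone. The function below is a hypothetical heuristic written for illustration, not a standard routine: the name problem_type and its rules are assumptions. It does not attempt case (4), since a finite sample of labels cannot show that the underlying set is infinite.

```python
from typing import Hashable, Sequence


def problem_type(labels: Sequence[Hashable]) -> str:
    """Guess the kind of ML problem from the labels y of a training set."""
    if not labels:
        raise ValueError("need at least one label")
    distinct = set(labels)
    # Case (2): every label is +1 or -1.
    if distinct <= {+1, -1}:
        return "binary classification"
    # Case (1): every label is a real number (stored here as a float).
    if all(isinstance(y, float) for y in labels):
        return "regression"
    # Case (3): otherwise treat the distinct labels as a finite set of classes.
    return "multiclass classification"


print(problem_type([2.5, 3.1, 0.7]))
print(problem_type([+1, -1, +1]))
print(problem_type(["Beagle", "Chihuahua", "Dachshund"]))
```

The binary check comes first so that labels written as 1.0 and -1.0 are still read as the +1/−1 encoding rather than as regression targets.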
12.1.2 Some Illustrative Examples
Consider data on the height and weight of dogs in three classes: Beagles, Chihuahuas, and Dachshunds. We can think of this data as a training set, provided the data includes the variety of the dog along with each height-weight pair. Each pair (x, y) in the training set consists of a feature vector x of the form [height, weight]. The associated label y is the variety of the dog. An example of a training-set pair would be ([5 inches, 2 pounds], Chihuahua).
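The dog example can be sketched in code as follows. Only the pair ([5, 2], Chihuahua) comes from the text; the other five examples are made-up values for illustration. The function f uses a nearest-neighbor rule, an assumption chosen here because it is short, not necessarily the method the text has in mind for learning f.

```python
import math
from typing import List, Tuple

# Each training example is ([height in inches, weight in pounds], variety).
Example = Tuple[List[float], str]

training_set: List[Example] = [
    ([5.0, 2.0], "Chihuahua"),    # the pair given in the text
    ([6.0, 4.0], "Chihuahua"),    # hypothetical
    ([8.0, 20.0], "Dachshund"),   # hypothetical
    ([9.0, 25.0], "Dachshund"),   # hypothetical
    ([14.0, 22.0], "Beagle"),     # hypothetical
    ([15.0, 28.0], "Beagle"),     # hypothetical
]


def f(x: List[float]) -> str:
    """Predict y for x as the label of the closest training example (1-nearest neighbor)."""
    nearest = min(training_set, key=lambda example: math.dist(example[0], x))
    return nearest[1]


print(f([5.5, 3.0]))
```

Because y ranges over three varieties, this is an instance of multiclass classification.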