regression, and probability density function (PDF) estimation, the data-based device has to learn some desired information (respectively, class labeling, functional description, and probability density) from the data.
To formalize the classification problem, we start by assuming that a dataset X_ds is available for the inductive design of the classifier, the design or training set. The training set X_ds can be viewed as an array whose rows correspond to data objects (e.g., individual electrocardiograms for the above electrocardiogram classification problem) and whose columns represent object attributes (measurements, features). We denote by n the number of objects (also called instances or cases) of X_ds. Each instance is represented by an ordered sequence of d attributes x_j from some space X (the input space of the classification system). The attributes can be numerical, in which case we always assume an underlying real-number domain, or nominal (categorical), taking values in some set B of categories. For the above electrocardiogram classification problem an instance is represented by electrocardiographic signal features (amplitudes and durations of signal waves), measured as real numbers, and by categorical features such as sex (B = {“male”, “female”}).
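As a rough sketch of how such a training set might be stored, assuming a few hypothetical electrocardiographic attributes (the feature names and values below are invented for illustration only):

```python
# Hypothetical training set X_ds for the electrocardiogram example:
# each row (tuple) is one instance, each position one attribute.
# Numerical attributes are real-valued; "sex" is nominal (categorical).
X_ds = [
    # (P-wave amplitude in mV, QRS duration in s, sex)
    (0.12, 0.08, "male"),
    (0.10, 0.11, "female"),
    (0.15, 0.09, "male"),
]

n = len(X_ds)      # number of instances
d = len(X_ds[0])   # number of attributes per instance
print(n, d)        # -> 3 3
```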
We will often be dealing with instances characterized solely by numerical attributes; in this case X_ds ⊂ X = R^d, and any instance x ∈ X is (represented as) an ordered sequence (d-tuple): x = (x_1, x_2, ..., x_j, ..., x_d). Sometimes we may find it convenient to use vector notation for x, x = [x_1 x_2 ... x_j ... x_d]^T, specifically when vector operations are required; X_ds is then represented by an n × d real matrix.
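A minimal sketch of this numeric representation, using NumPy (the particular values and the use of NumPy are illustrative assumptions, not part of the text):

```python
import numpy as np

# X_ds as an n x d real matrix: rows are instances, columns are (numerical) attributes.
X_ds = np.array([
    [0.12, 0.08, 62.0],
    [0.10, 0.11, 45.0],
    [0.15, 0.09, 71.0],
])

n, d = X_ds.shape            # n = 3 instances, d = 3 attributes
x = X_ds[0]                  # one instance as a d-tuple (x_1, ..., x_d)
x_col = x.reshape(-1, 1)     # column vector [x_1 x_2 ... x_d]^T for vector operations
print(n, d, x_col.shape)     # -> 3 3 (3, 1)
```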
Any attribute value x_j is a realization value of a random variable (r.v.) X_j, whose codomain is X_j; whether X_j denotes a codomain or a variable will be obvious from the context. Note that X_j may have a single Dirac-δ distribution, in which case X_j is in fact a deterministic variable (a degenerate random variable). We will also denote by X the d-dimensional r.v. whose codomain is X and whose realization values are the d-tuples x = (x_1, x_2, ..., x_j, ..., x_d); X will be characterized by a joint distribution of the X_j with cumulative distribution function F_X.
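For concreteness, the joint cumulative distribution function mentioned above is the standard one (this definition is supplied here for convenience; it is not spelled out in the text):

F_X(x) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_d ≤ x_d), for x = (x_1, x_2, ..., x_d) ∈ X.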
Throughout the text, all data instances in X_ds are assumed to have been obtained by an independent and identically distributed (i.i.d.) sampling process from a d-dimensional joint probability distribution with cumulative distribution function F_X, characterizing a large (perhaps infinite) population of instances.
of instances. For numerical attributes defined in bounded intervals of
×
(as
the electrocardiographic measurements) one may still use the real line as
domain, by assigning zero probability outside the intervals.
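To make the i.i.d. sampling assumption concrete, the following sketch draws n independent instances from one fixed d-dimensional distribution; the Gaussian used here is only a stand-in for the unknown F_X:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3
mean = np.zeros(d)
cov = np.eye(d)

# The n rows are independent draws from the same d-dimensional distribution,
# i.e., i.i.d. realizations of the random variable X.
X_ds = rng.multivariate_normal(mean, cov, size=n)   # shape (n, d)
print(X_ds.shape)                                    # -> (100, 3)
```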
When confronted with unsupervised classification problems (popularly known as data clustering problems), i.e., when one wants the classification system to find a structuring solution that partitions the data into “meaningful” groups (clusters) according to certain criteria, the X_ds set, X_ds = {x_i = (x_i1, x_i2, ..., x_ij, ..., x_id); i = 1, ..., n}, is all that is required. Data clustering is a somewhat loose type of classification problem, since one may find a variety of solutions (unsupervised classifiers) depending on the criteria used.
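As one possible illustration of the clustering setting, the sketch below partitions the rows of X_ds into k groups with a plain k-means loop; k-means is just one of many unsupervised classifiers and is not singled out by the text:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Partition the rows of X (an n x d array) into k clusters; returns a label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # start from k random instances
    for _ in range(iters):
        # assign every instance to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the instances assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two loose groups of 2-D instances, purely for demonstration.
rng = np.random.default_rng(1)
X_ds = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(5.0, 1.0, (20, 2))])
print(kmeans(X_ds, k=2))
```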