analysis to help guess whether a customer with a given profile will buy a new computer.
A medical researcher wants to analyze breast cancer data to predict which one of three
specific treatments a patient should receive. In each of these examples, the data analysis
task is classification, where a model or classifier is constructed to predict class
(categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the
marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data.
These categories can be represented by discrete values, where the ordering among values
has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments
A, B, and C, where there is no ordering implied among this group of treatment regimes.
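The point about unordered codes can be illustrated with a minimal Python sketch. The helper
name encode_labels and the sample labels are assumptions made for illustration only. The
integer codes act purely as identifiers: any other one-to-one assignment would serve equally
well, and comparing codes with < or > carries no meaning.

```python
def encode_labels(labels):
    """Map each distinct class label to an arbitrary integer code (1, 2, 3, ...)."""
    # Sorting is only for reproducibility; it does not imply an ordering of classes.
    classes = sorted(set(labels))
    code_of = {label: i + 1 for i, label in enumerate(classes)}
    return [code_of[label] for label in labels], code_of


# Hypothetical class labels for four patients.
labels = ["treatment A", "treatment C", "treatment B", "treatment A"]
codes, code_of = encode_labels(labels)

# Inverting the mapping recovers the original labels exactly,
# which is all the codes are required to support.
label_of = {code: label for label, code in code_of.items()}
decoded = [label_of[c] for c in codes]

print(codes)
print(decoded == labels)
```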
Suppose that the marketing manager wants to predict how much a given customer
will spend during a sale at AllElectronics. This data analysis task is an example of numeric
prediction, where the model constructed predicts a continuous-valued function, or
ordered value, as opposed to a class label. This model is a predictor. Regression analysis
is a statistical methodology that is most often used for numeric prediction; hence the
two terms tend to be used synonymously, although other methods for numeric prediction
exist. Classification and numeric prediction are the two major types of prediction
problems. This chapter focuses on classification.
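As a contrast with classification, the following Python sketch shows a predictor that returns
a continuous value rather than a class label. It uses simple least-squares linear regression,
which is only one of several possible methods for numeric prediction. The function names
fit_line and predict, and the visit and spending figures, are hypothetical and chosen only
for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return a continuous (ordered) value, not a class label."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: number of prior store visits and amount spent at a sale.
visits = [1, 2, 3, 4]
spend = [30.0, 50.0, 70.0, 90.0]

model = fit_line(visits, spend)
print(predict(model, 5))
```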
8.1.2 General Approach to Classification
“How does classification work?” Data classification is a two-step process, consisting of a
learning step (where a classification model is constructed) and a classification step (where
the model is used to predict class labels for given data). The process is shown for the
loan application data of Figure 8.1. (The data are simplified for illustrative purposes.
In reality, we may expect many more attributes to be considered.)
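The two-step process can be sketched in Python as a pair of functions, one per step. This is
a deliberately naive lookup-table classifier, used here only to make the learn/classify split
visible; it is not the method of Figure 8.1. The names learn and classify, and the toy loan
tuples, are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn(training_set):
    """Learning step: build a model from (attribute_vector, class_label) pairs."""
    votes = defaultdict(Counter)
    overall = Counter()
    for attributes, label in training_set:
        votes[attributes][label] += 1
        overall[label] += 1
    # For each attribute vector seen in training, remember its most common label.
    table = {attrs: counts.most_common(1)[0][0] for attrs, counts in votes.items()}
    # Unseen attribute vectors fall back to the overall majority class.
    default = overall.most_common(1)[0][0]
    return table, default


def classify(model, attributes):
    """Classification step: use the model to predict a class label."""
    table, default = model
    return table.get(attributes, default)


# Hypothetical training data: (age, income) -> loan decision.
training_set = [
    (("youth", "low"), "risky"),
    (("youth", "low"), "risky"),
    (("middle_aged", "high"), "safe"),
    (("senior", "low"), "safe"),
    (("senior", "medium"), "safe"),
]

model = learn(training_set)
print(classify(model, ("youth", "low")))        # attribute vector seen in training
print(classify(model, ("middle_aged", "low")))  # unseen: majority-class fallback
```

Keeping the two steps as separate functions mirrors the text: the model returned by the
learning step is the only thing the classification step needs.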
In the first step, a classifier is built describing a predetermined set of data classes or
concepts. This is the learning step (or training phase), where a classification algorithm
builds the classifier by analyzing or “learning from” a training set made up of database
tuples and their associated class labels. A tuple, X, is represented by an n-dimensional
attribute vector, $\boldsymbol{X} = (x_1, x_2, \ldots, x_n)$, depicting n measurements made on the tuple
from n database attributes, respectively, $A_1, A_2, \ldots, A_n$.¹ Each tuple, X, is assumed to
belong to a predefined class as determined by another database attribute called the class
label attribute. The class label attribute is discrete-valued and unordered. It is categorical
(or nominal) in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are randomly sampled
from the database under analysis. In the context of classification, data tuples can be
referred to as samples, examples, instances, data points, or objects.²
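One way to picture a training tuple is as an attribute vector paired with a class label, with
the training set drawn at random from the database. The Python sketch below assumes a
hypothetical two-attribute loan database; the names LabeledTuple, ATTRIBUTES, and
draw_training_set are illustrative and do not come from the text.

```python
import random
from typing import NamedTuple, Tuple


class LabeledTuple(NamedTuple):
    x: Tuple[str, ...]  # attribute vector: one measurement per database attribute
    label: str          # value of the class label attribute (categorical, unordered)


# Hypothetical database attributes and tuples.
ATTRIBUTES = ("age", "income")

database = [
    LabeledTuple(("youth", "low"), "risky"),
    LabeledTuple(("youth", "medium"), "risky"),
    LabeledTuple(("middle_aged", "high"), "safe"),
    LabeledTuple(("middle_aged", "low"), "risky"),
    LabeledTuple(("senior", "low"), "safe"),
    LabeledTuple(("senior", "medium"), "safe"),
]


def draw_training_set(db, size, seed=0):
    """Randomly sample training tuples from the database, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return rng.sample(db, size)


training_tuples = draw_training_set(database, 4)
for t in training_tuples:
    print(dict(zip(ATTRIBUTES, t.x)), "->", t.label)
```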
¹ Each attribute represents a “feature” of X. Hence, the pattern recognition literature uses the term
feature vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our
notation, any variable representing a vector is shown in bold italic font; measurements depicting the
vector are shown in italic font (e.g., $\boldsymbol{X} = (x_1, x_2, x_3)$).
² In the machine learning literature, training tuples are commonly referred to as training samples.
Throughout this text, we prefer to use the term tuples instead of samples.