Classification: Basic Concepts - Data Mining: Concepts and Techniques


analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict class (categorical) labels, such as “safe” or “risky” for the loan application data; “yes” or “no” for the marketing data; or “treatment A,” “treatment B,” or “treatment C” for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.
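As a minimal sketch of this encoding, the Python snippet below maps the three treatment labels to integer codes and back. The sample label list and the names `code_of` and `label_of` are illustrative assumptions, not part of the original example. The codes act purely as identifiers, so any other assignment of distinct integers would serve equally well.

```python
# Hypothetical class labels for four patients (illustrative data only).
labels = ["treatment A", "treatment C", "treatment B", "treatment A"]

# Assign an arbitrary integer code to each distinct class label.
classes = sorted(set(labels))
code_of = {label: code for code, label in enumerate(classes, start=1)}
label_of = {code: label for label, code in code_of.items()}

# Encode the labels as integers, then decode them again.
encoded = [code_of[label] for label in labels]
decoded = [label_of[code] for code in encoded]

print(encoded)
print(decoded == labels)
```

Because the codes are nominal, comparisons such as "3 is greater than 1" carry no meaning here; only equality between codes is informative.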

Suppose that the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms tend to be used synonymously, although other methods for numeric prediction exist. Classification and numeric prediction are the two major types of prediction problems. This chapter focuses on classification.
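To make the contrast with classification concrete, here is a small sketch of numeric prediction using simple least-squares linear regression, which is only one of several possible methods. The income and spending figures, and the function names `fit_line` and `predict`, are hypothetical and chosen purely for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return a continuous-valued prediction rather than a class label."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: customer income (in thousands) and amount spent at a sale.
incomes = [20, 30, 40, 50]
spending = [110, 160, 210, 260]

model = fit_line(incomes, spending)
print(predict(model, 60))
```

The output of `predict` is an ordered, continuous value, whereas a classifier would return one of a fixed set of unordered labels.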

8.1.2 General Approach to Classification

“How does classification work?” Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data of Figure 8.1. (The data are simplified for illustrative purposes. In reality, we may expect many more attributes to be considered.)
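The two steps can be sketched in Python as a pair of functions: one that builds a model from labeled data and one that applies it. The classifier used here is a deliberately simple stand-in (it maps each value of a single attribute to the majority class seen with that value) and is not the method of Figure 8.1. The training data, the attribute names, and the functions `learn` and `classify` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn(training_set, attribute):
    """Learning step: build a model from (attributes, class label) pairs."""
    counts = defaultdict(Counter)
    for attributes, label in training_set:
        counts[attributes[attribute]][label] += 1
    # For each attribute value, keep the class label seen most often.
    rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall majority class for unseen attribute values.
    default = Counter(label for _, label in training_set).most_common(1)[0][0]
    return {"attribute": attribute, "rules": rules, "default": default}


def classify(model, attributes):
    """Classification step: predict a class label for one tuple."""
    value = attributes[model["attribute"]]
    return model["rules"].get(value, model["default"])


# Hypothetical loan application tuples with their class labels.
training_set = [
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
    ({"age": "senior", "income": "medium"}, "safe"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
]

model = learn(training_set, attribute="income")
print(classify(model, {"age": "senior", "income": "low"}))
```

Keeping `learn` and `classify` separate mirrors the text: the model is constructed once from training data and can then be reused on any number of new tuples.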

In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or “learning from” a training set made up of database tuples and their associated class labels. A tuple, $\boldsymbol{X}$, is represented by an $n$-dimensional attribute vector, $\boldsymbol{X} = (x_1, x_2, \ldots, x_n)$, depicting $n$ measurements made on the tuple from $n$ database attributes, respectively, $A_1, A_2, \ldots, A_n$.¹ Each tuple, $\boldsymbol{X}$, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.²
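One way to picture this representation is the sketch below, which stores each tuple as an attribute vector paired with a class label and draws training tuples at random from a small database. The attribute names, the records, and the names `LabeledTuple` and `sample_training_set` are hypothetical; the fixed seed is used only so that the example is repeatable.

```python
import random
from typing import NamedTuple, Tuple

# Database attributes A_1, ..., A_n (here n = 2); names are illustrative.
ATTRIBUTES = ("age", "income")


class LabeledTuple(NamedTuple):
    x: Tuple[str, ...]  # attribute vector X = (x_1, ..., x_n)
    label: str          # value of the class label attribute


# Hypothetical database under analysis.
DATABASE = [
    LabeledTuple(("youth", "low"), "risky"),
    LabeledTuple(("youth", "medium"), "risky"),
    LabeledTuple(("middle_aged", "high"), "safe"),
    LabeledTuple(("middle_aged", "low"), "risky"),
    LabeledTuple(("senior", "medium"), "safe"),
    LabeledTuple(("senior", "high"), "safe"),
]


def sample_training_set(database, size, seed=0):
    """Randomly sample training tuples without replacement."""
    rng = random.Random(seed)
    return rng.sample(database, size)


training_tuples = sample_training_set(DATABASE, size=4)
for t in training_tuples:
    # Pair each measurement x_i with its attribute A_i for display.
    print(dict(zip(ATTRIBUTES, t.x)), "->", t.label)
```

Each measurement `x[i]` corresponds to attribute `ATTRIBUTES[i]`, and the class label is kept apart from the attribute vector, as in the definition above.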

¹ Each attribute represents a “feature” of $\boldsymbol{X}$. Hence, the pattern recognition literature uses the term feature vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our notation, any variable representing a vector is shown in bold italic font; measurements depicting the vector are shown in italic font (e.g., $\boldsymbol{X} = (x_1, x_2, x_3)$).

² In the machine learning literature, training tuples are commonly referred to as training samples. Throughout this text, we prefer to use the term tuples instead of samples.
