Getting to Know Your Data - Data Mining: Concepts and Techniques


Example 2.21 Dissimilarity between ordinal attributes. Suppose that we have the sample data shown earlier in Table 2.2, except that this time only the object-identifier and the continuous ordinal attribute, test-2, are available. There are three states for test-2: fair, good, and excellent, that is, $M_f = 3$. For step 1, if we replace each value for test-2 by its rank, the four objects are assigned the ranks 3, 1, 2, and 3, respectively. Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5, and rank 3 to 1.0. For step 3, we can use, say, the Euclidean distance (Eq. 2.16), which results in the following dissimilarity matrix:

\[
\begin{bmatrix}
0   &     &     &   \\
1.0 & 0   &     &   \\
0.5 & 0.5 & 0   &   \\
0   & 1.0 & 0.5 & 0
\end{bmatrix}
\]

Therefore, objects 1 and 2 are the most dissimilar, as are objects 2 and 4 (i.e., $d(2, 1) = 1.0$ and $d(4, 2) = 1.0$). This makes intuitive sense since objects 1 and 4 are both excellent. Object 2 is fair, which is at the opposite end of the range of values for test-2.

Similarity values for ordinal attributes can be interpreted from dissimilarity as $sim(i, j) = 1 - d(i, j)$.
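The three steps of Example 2.21 can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `ordinal_dissimilarity` is hypothetical, the normalization `(rank - 1) / (M_f - 1)` is assumed because it reproduces the mapping of ranks 1, 2, 3 to 0.0, 0.5, 1.0 given above, and the Euclidean distance is written as an absolute difference because only one attribute is involved.

```python
def ordinal_dissimilarity(values, ordered_states):
    """Dissimilarity matrix for a single ordinal attribute.

    values         -- one state per object, e.g. ["excellent", "fair", ...]
    ordered_states -- all states from lowest to highest rank
    """
    # Step 1: replace each state by its rank (1 .. M_f).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)

    # Step 2: map ranks onto [0.0, 1.0] (assumed formula, see lead-in).
    z = [(rank[v] - 1) / (m_f - 1) for v in values]

    # Step 3: Euclidean distance; with one attribute this is |z_i - z_j|.
    n = len(z)
    return [[abs(z[i] - z[j]) for j in range(n)] for i in range(n)]


# Objects 1-4 with ranks 3, 1, 2, 3 as in the example.
test_2 = ["excellent", "fair", "good", "excellent"]
d = ordinal_dissimilarity(test_2, ["fair", "good", "excellent"])

# Similarity derived from dissimilarity: sim(i, j) = 1 - d(i, j).
sim = [[1 - x for x in row] for row in d]

for row in d:
    print(row)
```

Indices in the code are zero-based, so `d[1][0]` corresponds to d(2, 1) in the text and `d[3][1]` to d(4, 2).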

2.4.6 Dissimilarity for Attributes of Mixed Types

Sections 2.4.2 through 2.4.5 discussed how to compute the dissimilarity between objects described by attributes of the same type, where these types may be either nominal, symmetric binary, asymmetric binary, numeric, or ordinal. However, in many real databases, objects are described by a mixture of attribute types. In general, a database can contain all of these attribute types.

“So, how can we compute the dissimilarity between objects of mixed attribute types?” One approach is to group each type of attribute together, performing separate data mining (e.g., clustering) analysis for each type. This is feasible if these analyses derive compatible results. However, in real applications, it is unlikely that a separate analysis per attribute type will generate compatible results.

A preferable approach is to process all attribute types together, performing a single analysis. One such technique combines the different attributes into a single dissimilarity matrix, bringing all of the meaningful attributes onto a common scale of the interval [0.0, 1.0].

Suppose that the data set contains p attributes of mixed type. The dissimilarity $d(i, j)$ between objects i and j is defined as

\[
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}, \tag{2.22}
\]
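The weighted average in Eq. (2.22) can be illustrated with a short Python sketch. Everything beyond the formula itself is an assumption, since this excerpt does not define the terms: the indicator for attribute f is taken to be 0 when either value is missing (`None`) or when both values are 0 for an asymmetric binary attribute, and 1 otherwise; numeric attributes are scaled by their range (`spread`, assumed positive); nominal and binary attributes contribute 0 on a match and 1 on a mismatch. Ordinal attributes are assumed to be converted to values in [0.0, 1.0] beforehand and then treated as numeric with a spread of 1.0. The function names and sample data are hypothetical.

```python
def attribute_dissimilarity(kind, a, b, spread):
    """Per-attribute dissimilarity d_ij^(f) in [0, 1] (assumed conventions)."""
    if kind == "numeric":
        # Scale by the attribute's range (max - min over all objects).
        return abs(a - b) / spread
    # Nominal, symmetric binary, asymmetric binary: match or mismatch.
    return 0.0 if a == b else 1.0


def mixed_dissimilarity(x, y, kinds, spreads):
    """Eq. (2.22): indicator-weighted mean of per-attribute dissimilarities."""
    numerator = 0.0
    denominator = 0.0
    for a, b, kind, spread in zip(x, y, kinds, spreads):
        # Indicator = 0: a value is missing (assumption, see lead-in).
        if a is None or b is None:
            continue
        # Indicator = 0: 0/0 match on an asymmetric binary attribute.
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue
        numerator += attribute_dissimilarity(kind, a, b, spread)
        denominator += 1.0
    if denominator == 0.0:
        return None  # no attribute is comparable; d(i, j) is undefined
    return numerator / denominator


# Hypothetical objects: (nominal, pre-normalized ordinal, numeric).
kinds = ("nominal", "numeric", "numeric")
spreads = (1.0, 1.0, 40.0)  # the nominal entry is unused
x = ("red", 1.0, 10.0)
y = ("blue", 0.5, 20.0)

print(mixed_dissimilarity(x, y, kinds, spreads))  # (1 + 0.5 + 0.25) / 3
```

Because every per-attribute term lies in [0.0, 1.0] and the denominator counts only the attributes that take part, the result also stays in [0.0, 1.0], which is the common scale the text asks for.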
