Data Mining Process - Java Data Mining: Strategy, Standard, and Practice

Figure 3-6

Customer satisfaction data: histogram of target attribute. (a) Case Count; (b) Stratified Sample Case Count. Each panel plots the case count for the High, Medium, and Low categories.

relatively equal number of each category. Since we have few cases in

the high category, we will want to use all of those, and we may decide

to take 8,000 of the medium cases, and 6,000 of the low cases, as illus-

trated in Figure 3-6(b). The “correct” number of cases to specify is

more of an art than a science. Trying several variations can help iden-

tify an appropriate mix.
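
The per-category quotas described above can be sketched in Java as follows. This is a minimal illustration, not a prescribed method: the class name StratifiedSample, the method sample, the fixed random seed, and the population sizes in main are all assumptions made for the example. Only the quotas (all high cases, 8,000 medium, 6,000 low) come from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class StratifiedSample {

    /**
     * Randomly draws up to quota.get(category) cases from each category.
     * A category with no quota entry, or with fewer cases than its quota,
     * contributes all of its cases.
     */
    public static <T> Map<String, List<T>> sample(
            Map<String, List<T>> casesByCategory,
            Map<String, Integer> quota,
            long seed) {
        Random rng = new Random(seed); // fixed seed makes the draw repeatable
        Map<String, List<T>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<T>> entry : casesByCategory.entrySet()) {
            // Shuffle a copy so the caller's list is left untouched.
            List<T> shuffled = new ArrayList<>(entry.getValue());
            Collections.shuffle(shuffled, rng);
            int wanted = quota.getOrDefault(entry.getKey(), shuffled.size());
            int n = Math.max(0, Math.min(wanted, shuffled.size()));
            result.put(entry.getKey(), new ArrayList<>(shuffled.subList(0, n)));
        }
        return result;
    }

    /** Builds dummy case IDs 0..count-1 (for illustration only). */
    private static List<Integer> caseIds(int count) {
        List<Integer> ids = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            ids.add(i);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical population sizes; the actual counts are not given in the text.
        Map<String, List<Integer>> cases = new LinkedHashMap<>();
        cases.put("HIGH", caseIds(5_000));
        cases.put("MEDIUM", caseIds(40_000));
        cases.put("LOW", caseIds(120_000));

        // Quotas from the text: every HIGH case, 8,000 MEDIUM, 6,000 LOW.
        Map<String, Integer> quota = new LinkedHashMap<>();
        quota.put("MEDIUM", 8_000);
        quota.put("LOW", 6_000);

        Map<String, List<Integer>> drawn = sample(cases, quota, 42L);
        for (Map.Entry<String, List<Integer>> entry : drawn.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue().size());
        }
    }
}
```

Because choosing the quotas is a matter of judgment, keeping them in a map makes it easy to rerun the sample with several variations and compare the resulting models.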

Recoding

In several of the transformations above, we discussed replacing one

value with another. In its most general form, this is called recoding.

The binning transformation on categorical data mentioned above

(the 50 United States into regions) is also a form of recoding that

enables manual roll-up of data. Typically, recoding is performed on

categorical data (e.g., replacing the attribute values H, high, hi, and

“***” with HIGH). This can be useful for cleaning data, as in the previous

example, or to improve the interpretability of a model. For

example, when looking at the rules produced from market basket

analysis, it is much easier to understand a rule like “BEER implies

PIZZA” than one like “Prod-3425 implies Prod-5593.” Recoding can

also be useful for numerical data when non-numerical data is mixed

in with numerical data. For example, it is not uncommon to see “999”

or “ ” used for a missing age value. These may more appropriately

be replaced with null.
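
Both kinds of recoding can be sketched in a few lines of Java. The class and method names (Recoder, recodeCategory, recodeAge) are hypothetical, and two choices go beyond the text: matching category values case-insensitively after trimming, and treating any unparseable age as missing.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class Recoder {

    // Variant spellings from the text, all mapped to the single value HIGH.
    private static final Map<String, String> CATEGORY_MAP = new HashMap<>();
    static {
        for (String variant : new String[] {"H", "HIGH", "HI", "***"}) {
            CATEGORY_MAP.put(variant, "HIGH");
        }
    }

    /** Returns the recoded category, or the input unchanged if no mapping exists. */
    public static String recodeCategory(String raw) {
        if (raw == null) {
            return null;
        }
        // Assumption: surrounding whitespace and letter case are not significant.
        String key = raw.trim().toUpperCase(Locale.ROOT);
        return CATEGORY_MAP.getOrDefault(key, raw);
    }

    /** Returns the age as an Integer, or null when the value marks a missing age. */
    public static Integer recodeAge(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        // "999" and blank strings are the missing-value markers named in the text.
        if (trimmed.isEmpty() || trimmed.equals("999")) {
            return null;
        }
        try {
            return Integer.valueOf(trimmed);
        } catch (NumberFormatException e) {
            // Assumption: other non-numeric entries are also treated as missing.
            return null;
        }
    }

    public static void main(String[] args) {
        for (String value : new String[] {"H", "high", "hi", " ***", "Medium"}) {
            System.out.println("[" + value + "] -> " + recodeCategory(value));
        }
        for (String value : new String[] {"42", "999", " "}) {
            System.out.println("[" + value + "] -> " + recodeAge(value));
        }
    }
}
```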

Integrating data

An important step in data preparation often involves integrating two

or more datasets into one. If the data contain well-defined keys and

use the same data conventions, integrating the data can be as simple

as performing a database join on the two tables, producing a new

table or view. However, real data is seldom so clean and often requires

various data cleaning techniques, as noted in the previous section about

errors and outliers.
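
For the clean case, the join can be illustrated with an in-memory inner join on a shared key. This is only a sketch of the idea: in practice the join would normally run in the database, and the names used here (KeyJoin, innerJoin, custId, and the sample columns) are hypothetical. Rows are modeled as maps from column name to value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyJoin {

    /**
     * Inner join of two row lists on a shared key column.
     * Rows whose key is null or has no match on the other side are dropped.
     * If both sides share a non-key column name, the right-hand value wins.
     */
    public static List<Map<String, Object>> innerJoin(
            List<Map<String, Object>> left,
            List<Map<String, Object>> right,
            String key) {
        // Index the right-hand rows by key value for constant-time lookup.
        Map<Object, List<Map<String, Object>>> index = new HashMap<>();
        for (Map<String, Object> row : right) {
            Object k = row.get(key);
            if (k != null) {
                index.computeIfAbsent(k, unused -> new ArrayList<>()).add(row);
            }
        }

        List<Map<String, Object>> joined = new ArrayList<>();
        for (Map<String, Object> leftRow : left) {
            Object k = leftRow.get(key);
            if (k == null) {
                continue;
            }
            List<Map<String, Object>> matches =
                    index.getOrDefault(k, Collections.emptyList());
            for (Map<String, Object> rightRow : matches) {
                Map<String, Object> merged = new LinkedHashMap<>(leftRow);
                merged.putAll(rightRow);
                joined.add(merged);
            }
        }
        return joined;
    }

    /** Convenience builder for a two-column row (illustration only). */
    private static Map<String, Object> row(String c1, Object v1, String c2, Object v2) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put(c1, v1);
        r.put(c2, v2);
        return r;
    }

    public static void main(String[] args) {
        // Hypothetical customer and survey tables sharing the key "custId".
        List<Map<String, Object>> customers = new ArrayList<>();
        customers.add(row("custId", 1, "age", 34));
        customers.add(row("custId", 2, "age", 51));

        List<Map<String, Object>> surveys = new ArrayList<>();
        surveys.add(row("custId", 1, "satisfaction", "HIGH"));
        surveys.add(row("custId", 3, "satisfaction", "LOW"));

        for (Map<String, Object> r : innerJoin(customers, surveys, "custId")) {
            System.out.println(r);
        }
    }
}
```

The join only matches rows whose key values are exactly equal, which is why inconsistent key formats or data conventions have to be cleaned up before the datasets are combined.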
