Data Preprocessing - Data Mining: Concepts and Techniques

3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:

(a) Normalize the two attributes using z-score normalization.

(b) Calculate the correlation coefficient (Pearson's product moment coefficient). Are these two attributes positively or negatively correlated? Compute their covariance.
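The quantities asked for in Exercise 3.8 can be sketched in a few lines of Python. This is a minimal illustration, not a worked solution: the `age` and `fat` lists below are placeholder values chosen for the example (they are not the Exercise 2.4 data), and the population form of the standard deviation and covariance (dividing by n rather than n - 1) is an assumption.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def std(values):
    """Population standard deviation (square root of the variance)."""
    return math.sqrt(covariance(values, values))

def z_scores(values):
    """z-score normalization: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), std(values)
    return [(v - m) / s for v in values]

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (std(xs) * std(ys))

# Placeholder values for illustration only (NOT the Exercise 2.4 data).
age = [20, 30, 40, 50, 60]
fat = [10.0, 20.0, 15.0, 30.0, 35.0]

print(z_scores(age))
print(pearson(age, fat), covariance(age, fat))
```

A positive correlation coefficient (equivalently, a positive covariance) indicates that the two attributes tend to increase together; a negative one indicates that one tends to decrease as the other increases.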

3.9 Suppose a group of 12 sales price records has been sorted as follows:

5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.

Partition them into three bins by each of the following methods:

(a) equal-frequency (equal-depth) partitioning

(b) equal-width partitioning

(c) clustering
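Parts (a) and (b) of Exercise 3.9 can be checked with a short Python sketch. The function names are illustrative, the equal-frequency version assumes the number of values is divisible by the number of bins, and the equal-width version assumes the maximum value is placed in the last bin. Clustering, part (c), is not covered here.

```python
def equal_frequency_bins(sorted_values, k):
    """Split sorted values into k bins that each hold the same number of values.

    Assumes len(sorted_values) is divisible by k.
    """
    size = len(sorted_values) // k
    return [sorted_values[i * size:(i + 1) * size] for i in range(k)]

def equal_width_bins(sorted_values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = sorted_values[0], sorted_values[-1]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted_values:
        # The maximum value would land at index k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
print(equal_frequency_bins(prices, 3))
print(equal_width_bins(prices, 3))
```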

3.10 Use a flowchart to summarize the following procedures for attribute subset selection:

(a) stepwise forward selection

(b) stepwise backward elimination

(c) a combination of forward selection and backward elimination
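As a starting point for the flowchart in part (a), the control flow of stepwise forward selection can be sketched in Python. The `score` callable is hypothetical: it stands for whatever evaluation measure is chosen (for example, the accuracy of a classifier built on the given attribute subset), and the `min_gain` stopping rule is likewise an assumption. Backward elimination follows the same loop, starting from the full set and removing the attribute whose removal hurts the score least.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedy stepwise forward selection.

    Starts from the empty set and repeatedly adds the attribute that improves
    `score` the most, stopping when no attribute improves it by more than
    `min_gain`.
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Evaluate every remaining attribute as the next one to add.
        candidate, cand_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda pair: pair[1],
        )
        if cand_score - best <= min_gain:
            break  # no remaining attribute helps enough
        selected.append(candidate)
        remaining.remove(candidate)
        best = cand_score
    return selected

# Toy scoring function for illustration: each attribute has a fixed usefulness.
weights = {"a": 3, "b": 2, "c": 0}
toy_score = lambda subset: sum(weights[x] for x in subset)
print(forward_selection(["a", "b", "c"], toy_score))
```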

3.11 Using the data for age given in Exercise 3.3,

(a) Plot an equal-width histogram of width 10.

(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR, cluster sampling, and stratified sampling. Use samples of size 5 and the strata “youth,” “middle-aged,” and “senior.”
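The four sampling techniques in part (b) can be illustrated with Python's standard `random` module. This is a sketch under stated assumptions: the `ages` list is placeholder data (not the Exercise 3.3 values), the age cutoffs that define the three strata are invented for the example, and the clusters are simply consecutive runs of the list.

```python
import random

def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no element is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample WITH replacement: an element may be drawn again."""
    return rng.choices(data, k=n)

def cluster_sample(data, cluster_size, n_clusters, rng):
    """Group the data into clusters, then draw whole clusters without replacement."""
    clusters = [data[i:i + cluster_size] for i in range(0, len(data), cluster_size)]
    return rng.sample(clusters, n_clusters)

def stratified_sample(data, stratum_of, counts, rng):
    """Draw a fixed number of elements (without replacement) from each stratum."""
    strata = {}
    for v in data:
        strata.setdefault(stratum_of(v), []).append(v)
    return {name: rng.sample(strata[name], k) for name, k in counts.items()}

def stratum_of(age):
    # Assumed cutoffs for illustration; the exercise does not define them.
    if age < 30:
        return "youth"
    if age < 60:
        return "middle-aged"
    return "senior"

rng = random.Random(0)           # seeded so the run is repeatable
ages = list(range(15, 75, 5))    # placeholder ages: 15, 20, ..., 70

print(srswor(ages, 5, rng))
print(srswr(ages, 5, rng))
print(cluster_sample(ages, 4, 1, rng))
print(stratified_sample(ages, stratum_of, {"youth": 1, "middle-aged": 3, "senior": 1}, rng))
```

The per-stratum counts here add up to a sample of size 5 and roughly follow the stratum sizes in the placeholder data; other allocations are possible.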

3.12 ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization method. It relies on χ² analysis: Adjacent intervals with the least χ² values are merged together until the chosen stopping criterion is satisfied.

(a) Briefly describe how ChiMerge works.

(b) Take the IRIS data set, obtained from the University of California-Irvine Machine Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html), as a data set to be discretized. Perform data discretization for each of the four numeric attributes using the ChiMerge method. (Let the stopping criterion be: max-interval = 6.) You need to write a small program to do this to avoid clumsy numerical computation. Submit your simple analysis and your test results: split-points, final intervals, and the documented source program.
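The core loop of a ChiMerge-style program might look like the following Python sketch. It is a simplified illustration rather than a reference implementation: it starts with one interval per distinct value, uses only the max-interval stopping rule from part (b), ignores classes that do not occur in the pair of intervals being compared, and does not load the IRIS data. The function names are illustrative.

```python
from collections import Counter

def chi2(a, b):
    """Chi-square statistic for two adjacent intervals.

    `a` and `b` are Counters mapping class label -> count in the interval.
    Classes absent from both intervals are skipped (a simplification).
    """
    classes = set(a) | set(b)
    n = sum(a.values()) + sum(b.values())
    total = 0.0
    for row in (a, b):
        r = sum(row.values())                    # row total
        for c in classes:
            expected = r * (a[c] + b[c]) / n     # row total * column total / n
            total += (row[c] - expected) ** 2 / expected
    return total

def chimerge(values, labels, max_intervals):
    """Merge adjacent intervals with the lowest chi-square value until at most
    `max_intervals` remain. Returns the lower bound of each final interval."""
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    # One initial interval per distinct value: [lower bound, class counts].
    intervals = [[v, counts[v]] for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))            # most similar adjacent pair
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lo for lo, _ in intervals]

# Toy data for illustration: the class changes between 3 and 4.
print(chimerge([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"], 2))
```

For the exercise itself, the same function would be called once per numeric attribute with `max_intervals=6`, using the class label of each IRIS record as `labels`.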

3.13 Propose an algorithm, in pseudocode or in your favorite programming language, for the following:

(a) The automatic generation of a concept hierarchy for nominal data based on the number of distinct values of attributes in the given schema.

(b) The automatic generation of a concept hierarchy for numeric data based on the equal-width partitioning rule.
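One possible shape for both algorithms is sketched below in Python. For part (a), the sketch assumes the usual heuristic that an attribute with fewer distinct values sits higher in the hierarchy; for part (b), it assumes a fixed fan-out and depth supplied by the caller. The function names, the dictionary-based tree layout, and the sample rows are all illustrative choices.

```python
def nominal_hierarchy(rows, attributes):
    """Order attributes from the top of the hierarchy (fewest distinct values)
    to the bottom (most distinct values)."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    return sorted(attributes, key=lambda a: distinct[a])

def equal_width_hierarchy(lo, hi, fanout, depth):
    """Build nested equal-width intervals: each level splits its parent
    interval into `fanout` children of equal width."""
    node = {"range": (lo, hi), "children": []}
    if depth > 0:
        width = (hi - lo) / fanout
        for i in range(fanout):
            node["children"].append(
                equal_width_hierarchy(lo + i * width, lo + (i + 1) * width,
                                      fanout, depth - 1))
    return node

# Toy rows for illustration only.
rows = [
    {"street": "1 Main St", "city": "Urbana", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "country": "USA"},
    {"street": "3 Elm Rd", "city": "Chicago", "country": "USA"},
]
print(nominal_hierarchy(rows, ["street", "city", "country"]))
print(equal_width_hierarchy(0, 100, 2, 1))
```

The distinct-value heuristic does not always match the intended semantics of a schema, so a generated ordering should be reviewed before use.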
