3.8 Using the data for age and body fat given in Exercise 2.4, answer the following:
(a) Normalize the two attributes based on z-score normalization.
(b) Calculate the correlation coefficient (Pearson's product moment coefficient). Are
these two attributes positively or negatively correlated? Compute their covariance.
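As a starting point for this exercise, the quantities involved can be sketched in a few lines of Python. This is a minimal sketch, not a reference solution: it assumes the population form of the standard deviation and covariance (dividing by n), and the function names and the toy data are hypothetical, not the Exercise 2.4 values.

```python
import math

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def z_scores(values):
    """Map each value v to (v - mean) / std, using the population std."""
    mean = sum(values) / len(values)
    std = math.sqrt(covariance(values, values))
    return [(v - mean) / std for v in values]

def pearson(xs, ys):
    """Pearson's product moment coefficient: cov(x, y) / (std_x * std_y)."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical toy data for illustration only (not the Exercise 2.4 data).
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
```

A coefficient above 0 indicates positive correlation and one below 0 indicates negative correlation. The covariance always has the same sign as the coefficient.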
3.9 Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) clustering
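Parts (a) and (b) can be checked mechanically. The sketch below assumes one common convention for each method: equal-frequency bins take consecutive runs of the sorted values, and equal-width bins split the range [min, max] into k intervals of equal length, with the maximum value placed in the last bin. The function names are hypothetical, and clustering (part c) is not covered.

```python
def equal_frequency_bins(sorted_values, k):
    """Split sorted values into k consecutive bins of (nearly) equal size."""
    n = len(sorted_values)
    return [sorted_values[i * n // k:(i + 1) * n // k] for i in range(k)]

def equal_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land at index k, so clamp it to the last bin.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

prices = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
by_frequency = equal_frequency_bins(prices, 3)
by_width = equal_width_bins(prices, 3)
```

Under these assumptions the equal-width bins are 70 units wide, so nine of the twelve prices fall into the first bin. This shows how sensitive equal-width partitioning is to outliers such as 204 and 215.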
3.10 Use a flowchart to summarize the following procedures for attribute subset selection:
(a) stepwise forward selection
(b) stepwise backward elimination
(c) a combination of forward selection and backward elimination
3.11 Using the data for age given in Exercise 3.3,
(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
cluster sampling, and stratified sampling. Use samples of size 5 and the strata
“youth,” “middle-aged,” and “senior.”
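The four sampling techniques in part (b) can be sketched with Python's standard `random` module. This is an illustrative sketch only: the function names are hypothetical, the age values are made up rather than taken from Exercise 3.3, and the grouping into strata and clusters is an assumption.

```python
import random

def srswor(items, s, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(items, s)

def srswr(items, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return rng.choices(items, k=s)

def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and keep every tuple in them."""
    chosen = rng.sample(clusters, m)
    return [item for cluster in chosen for item in cluster]

def stratified_sample(strata, sizes, rng):
    """Draw an SRSWOR of the requested size from each stratum."""
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical ages and stratum boundaries, for illustration only.
rng = random.Random(0)
ages = [14, 17, 19, 23, 28, 34, 41, 47, 52, 58, 66, 71]
strata = {"youth": ages[:4], "middle-aged": ages[4:9], "senior": ages[9:]}
clusters = [ages[i:i + 3] for i in range(0, len(ages), 3)]
```

For a total sample of size 5, one possible allocation is `{"youth": 2, "middle-aged": 2, "senior": 1}`, roughly proportional to the stratum sizes in this toy data.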
3.12 ChiMerge [Ker92] is a supervised, bottom-up (i.e., merge-based) data discretization
method. It relies on χ² analysis: adjacent intervals with the least χ² values are merged
together until the chosen stopping criterion is satisfied.
(a) Briefly describe how ChiMerge works.
(b) Take the IRIS data set, obtained from the University of California-Irvine Machine
Learning Data Repository (www.ics.uci.edu/~mlearn/MLRepository.html), as a data
set to be discretized. Perform data discretization for each of the four numeric
attributes using the ChiMerge method. (Let the stopping criterion be max-interval
= 6.) You need to write a small program to do this to avoid clumsy numerical
computation. Submit your simple analysis and your test results: split-points, final
intervals, and the documented source program.
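The core of such a program is the χ² statistic for a pair of adjacent intervals and the loop that merges the lowest-scoring pair. The sketch below is a simplified outline, not Kerber's full procedure: it uses only the max-interval stopping rule, skips cells whose expected count is zero (an assumption made here to avoid division by zero), and represents each interval as a hypothetical `(lower_bound, class_counts)` pair. Loading the IRIS data and building the initial intervals are left out.

```python
def chi2_adjacent(a, b):
    """Chi-square statistic for two adjacent intervals.

    a and b are lists of per-class counts. The expected count of class j in
    an interval is (interval total) * (class j total) / (grand total).
    """
    n = sum(a) + sum(b)
    total = 0.0
    for row in (a, b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (a[j] + b[j]) / n
            if expected > 0:  # assumption: ignore cells with zero expectation
                total += (observed - expected) ** 2 / expected
    return total

def chimerge(intervals, max_intervals):
    """Merge the adjacent pair with the least chi-square until few enough remain."""
    intervals = [(lo, list(counts)) for lo, counts in intervals]
    while len(intervals) > max_intervals:
        scores = [chi2_adjacent(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))
        lo, left = intervals[i]
        _, right = intervals[i + 1]
        # The merged interval keeps the left bound and sums the class counts.
        intervals[i:i + 2] = [(lo, [x + y for x, y in zip(left, right)])]
    return intervals
```

The lower bounds of the intervals that remain after merging serve as the split points for the attribute.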
3.13 Propose an algorithm, in pseudocode or in your favorite programming language, for the
following:
(a) The automatic generation of a concept hierarchy for nominal data based on the
number of distinct values of attributes in the given schema.
(b) The automatic generation of a concept hierarchy for numeric data based on the
equal-width partitioning rule.
 