Data Preprocessing - Data Mining: Concepts and Techniques

was proposed in Siedlecki and Sklansky [SS88]. A wrapper approach to attribute selection is described in Kohavi and John [KJ97]. Unsupervised attribute subset selection is described in Dash, Liu, and Yao [DLY97].

For a description of wavelets for dimensionality reduction, see Press, Teukolsky, Vetterling, and Flannery [PTVF07]. A general account of wavelets can be found in Hubbard [Hub96]. For a list of wavelet software packages, see Bruce, Donoho, and Gao [BDG96]. Daubechies transforms are described in Daubechies [Dau92]. The book by Press et al. [PTVF07] includes an introduction to singular value decomposition for principal components analysis. Routines for PCA are included in most statistical software packages, such as SAS (www.sas.com/SASHome.html).

An introduction to regression and log-linear models can be found in several textbooks such as James [Jam85]; Dobson [Dob90]; Johnson and Wichern [JW92]; Devore [Dev95]; and Neter, Kutner, Nachtsheim, and Wasserman [NKNW96]. For log-linear models (known as multiplicative models in the computer science literature), see Pearl [Pea88]. For a general introduction to histograms, see Barbara et al. [BDF+97] and Devore and Peck [DP97]. For extensions of single-attribute histograms to multiple attributes, see Muralikrishna and DeWitt [MD88] and Poosala and Ioannidis [PI97]. Several references to clustering algorithms are given in Chapters 10 and 11 of this book, which are devoted to the topic.

A survey of multidimensional indexing structures is given in Gaede and Günther [GG98]. The use of multidimensional index trees for data aggregation is discussed in Aoki [Aok98]. Index trees include R-trees (Guttman [Gut84]), quad-trees (Finkel and Bentley [FB74]), and their variations. For discussion on sampling and data mining, see Kivinen and Mannila [KM94] and John and Langley [JL96].

There are many methods for assessing attribute relevance. Each has its own bias. The information gain measure is biased toward attributes with many values. Many alternatives have been proposed, such as gain ratio (Quinlan [Qui93]), which considers the probability of each attribute value. Other relevance measures include the Gini index (Breiman, Friedman, Olshen, and Stone [BFOS84]), the χ² contingency table statistic, and the uncertainty coefficient (Johnson and Wichern [JW92]). For a comparison of attribute selection measures for decision tree induction, see Buntine and Niblett [BN92]. For additional methods, see Liu and Motoda [LM98a], Dash and Liu [DL97], and Almuallim and Dietterich [AD91].
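The bias of information gain toward many-valued attributes, and the way gain ratio counteracts it, can be illustrated with a minimal sketch. The toy data and the names `row_id` and `weather` are hypothetical, and the gain ratio here is the usual "gain divided by split information" form, which is assumed rather than taken from the cited sources.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def info_gain(values, labels):
    """Reduction in class entropy after partitioning the tuples on an attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute-value distribution."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 8 tuples with a binary class label.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
row_id = list(range(8))                               # one distinct value per tuple
weather = ["a", "a", "b", "b", "a", "b", "a", "a"]    # two values, mostly predictive

for name, attr in [("row_id", row_id), ("weather", weather)]:
    print(name, round(info_gain(attr, labels), 3), round(gain_ratio(attr, labels), 3))
```

Under these assumptions, information gain ranks the useless identifier `row_id` above `weather`, because every partition it produces is trivially pure, while gain ratio reverses the ranking by penalizing the large number of distinct values.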

Liu et al. [LHTD02] performed a comprehensive survey of data discretization methods. Entropy-based discretization with the C4.5 algorithm is described in Quinlan [Qui93]. In Catlett [Cat91], the D-2 system binarizes a numeric feature recursively. ChiMerge by Kerber [Ker92] and Chi2 by Liu and Setiono [LS95] are methods for the automatic discretization of numeric attributes that both employ the χ² statistic. Fayyad and Irani [FI93] apply the minimum description length principle to determine the number of intervals for numeric discretization. Concept hierarchies and their automatic generation from categorical data are described in Han and Fu [HF94].
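As a rough illustration of how a χ²-based discretizer can use the statistic, the sketch below computes Pearson's χ² for the class counts of two adjacent intervals and merges the most similar pair. It is a simplified sketch in the spirit of ChiMerge, not Kerber's exact procedure: the function names are hypothetical, cells with zero expected count are simply skipped, and no significance threshold or stopping rule is applied.

```python
def chi2_adjacent(counts_a, counts_b):
    """Pearson chi-square for the 2 x k table of class counts in two adjacent intervals."""
    total = sum(counts_a) + sum(counts_b)
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    stat = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:  # simplification: skip cells with zero expected count
                stat += (observed - expected) ** 2 / expected
    return stat

def merge_most_similar(intervals):
    """One bottom-up step: merge the adjacent pair with the smallest chi-square."""
    scores = [chi2_adjacent(intervals[i], intervals[i + 1])
              for i in range(len(intervals) - 1)]
    i = scores.index(min(scores))
    merged = [a + b for a, b in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]

# Hypothetical class counts [class_0, class_1] for three adjacent intervals.
intervals = [[4, 0], [3, 1], [0, 4]]
print(merge_most_similar(intervals))
```

A low χ² value means the two intervals have similar class distributions, so merging them loses little class information. A full method would repeat the merge step until every remaining adjacent pair exceeds a chosen significance threshold.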
