was proposed in Siedlecki and Sklansky [SS88]. A wrapper approach to attribute selec-
tion is described in Kohavi and John [KJ97]. Unsupervised attribute subset selection is
described in Dash, Liu, and Yao [DLY97].
For a description of wavelets for dimensionality reduction, see Press, Teukolsky,
Vetterling, and Flannery [PTVF07]. A general account of wavelets can be found in Hubbard
[Hub96]. For a list of wavelet software packages, see Bruce, Donoho, and Gao [BDG96].
Daubechies transforms are described in Daubechies [Dau92]. The book by Press et al.
[PTVF07] includes an introduction to singular value decomposition for principal
components analysis. Routines for PCA are included in most statistical software packages,
such as SAS (www.sas.com/SASHome.html).
An introduction to regression and log-linear models can be found in several
textbooks such as James [Jam85]; Dobson [Dob90]; Johnson and Wichern [JW92];
Devore [Dev95]; and Neter, Kutner, Nachtsheim, and Wasserman [NKNW96]. For log-
linear models (known as multiplicative models in the computer science literature), see
Pearl [Pea88]. For a general introduction to histograms, see Barbará et al. [BDF+97]
and Devore and Peck [DP97]. For extensions of single-attribute histograms to multiple
attributes, see Muralikrishna and DeWitt [MD88] and Poosala and Ioannidis [PI97].
Several references to clustering algorithms are given in Chapters 10 and 11 of this book,
which are devoted to the topic.
A survey of multidimensional indexing structures is given in Gaede and Günther
[GG98]. The use of multidimensional index trees for data aggregation is discussed in
Aoki [Aok98]. Index trees include R-trees (Guttman [Gut84]), quad-trees (Finkel and
Bentley [FB74]), and their variations. For discussion on sampling and data mining, see
Kivinen and Mannila [KM94] and John and Langley [JL96].
There are many methods for assessing attribute relevance. Each has its own bias. The
information gain measure is biased toward attributes with many values. Many alterna-
tives have been proposed, such as gain ratio (Quinlan [Qui93]), which considers the
probability of each attribute value. Other relevance measures include the Gini index
(Breiman, Friedman, Olshen, and Stone [BFOS84]), the χ² contingency table
statistic, and the uncertainty coefficient (Johnson and Wichern [JW92]). For a comparison
of attribute selection measures for decision tree induction, see Buntine and Niblett
[BN92]. For additional methods, see Liu and Motoda [LM98a], Dash and Liu [DL97],
and Almuallim and Dietterich [AD91].
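The bias of information gain toward many-valued attributes, and the way gain ratio counters it, can be illustrated with a minimal sketch. It assumes the usual formulation in which gain ratio divides information gain by the split information, that is, the entropy of the attribute-value distribution. The toy data and the names record_id and flag are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def info_gain(values, labels):
    """Reduction in class entropy after partitioning records by attribute value."""
    n = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain divided by the split information of the attribute."""
    split_info = entropy(values)  # depends only on attribute-value probabilities
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 8 records with a binary class label.
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
record_id = list(range(8))                        # one distinct value per record
flag = ["a", "a", "b", "b", "a", "b", "b", "b"]   # two values, imperfect predictor

for name, attr in [("record_id", record_id), ("flag", flag)]:
    print(name, round(info_gain(attr, labels), 3), round(gain_ratio(attr, labels), 3))
```

On this toy data, information gain ranks the identifier-like attribute first, because every partition is trivially pure, whereas gain ratio ranks the two-valued attribute first, because the identifier's large split information penalizes it.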
Liu et al. [LHTD02] performed a comprehensive survey of data discretization
methods. Entropy-based discretization with the C4.5 algorithm is described in Quin-
lan [Qui93]. In Catlett [Cat91], the D-2 system binarizes a numeric feature recursively.
ChiMerge by Kerber [Ker92] and Chi2 by Liu and Setiono [LS95] are methods for the
automatic discretization of numeric attributes that both employ the
χ² statistic. Fayyad
and Irani [FI93] apply the minimum description length principle to determine the num-
ber of intervals for numeric discretization. Concept hierarchies and their automatic
generation from categorical data are described in Han and Fu [HF94].
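As a rough illustration of how a χ² statistic can guide discretization, the sketch below computes the Pearson χ² value for the class counts of two adjacent intervals. In a ChiMerge-style procedure, a low value suggests that the two intervals have similar class distributions and are candidates for merging. This is a simplified sketch, not the published algorithm: the function name chi_square, the example counts, and the choice to skip cells with zero expected count are assumptions made for illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows.

    Each row holds the per-class record counts of one interval.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # assumption: cells with zero expected count are skipped
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical class counts [class_0, class_1] for two adjacent intervals.
similar = [[4, 1], [8, 2]]     # identical class proportions in both intervals
different = [[5, 0], [0, 5]]   # opposite class distributions

print(chi_square(similar))     # low value: merging loses little class information
print(chi_square(different))   # high value: the boundary separates the classes
```

A bottom-up discretizer built on this idea would repeatedly merge the adjacent pair with the smallest statistic until every remaining pair exceeds a chosen significance threshold.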
 