1 Data Mining
In this introductory chapter we begin with the essence of data mining and a discussion of
how data mining is treated by the various disciplines that contribute to this field. We cover
“Bonferroni's Principle,” which is really a warning about overusing the ability to mine data.
This chapter is also the place where we summarize a few useful ideas that are not data mining
but are needed to understand some important data-mining concepts. These include the
TF.IDF measure of word importance, behavior of hash functions and indexes, and identities
involving e, the base of natural logarithms. Finally, we give an outline of the topics covered
in the balance of the book.
1.1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of “models” for
data. A “model,” however, can be one of several things. We mention below the most important
directions in modeling.
1.1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data
dredging” was a derogatory term referring to attempts to extract information that was not
supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to
extract what really isn't in the data. Today, “data mining” has taken on a positive meaning.
Now, statisticians view data mining as the construction of a statistical model, that is, an
underlying distribution from which the visible data is drawn.
EXAMPLE 1.1 Suppose our data is a set of numbers. This data is much simpler than data
that would be data-mined, but it will serve as an example. A statistician might decide that
the data comes from a Gaussian distribution and use a formula to compute the most likely
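As a rough illustration of the kind of computation Example 1.1 describes, the sketch below fits a Gaussian to a list of numbers using the standard maximum-likelihood estimates: the sample mean, and the standard deviation computed with a divisor of n rather than n - 1. The function name fit_gaussian and the sample values are assumptions made for this illustration; they do not come from the text.

```python
import math


def fit_gaussian(data):
    """Return the maximum-likelihood (mean, standard deviation) of a Gaussian.

    Hypothetical helper for illustration only. The MLE of the variance
    divides by n, not by n - 1.
    """
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum(data) / n
    # Average squared deviation from the mean (population variance).
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)


# Made-up sample values, chosen only so the arithmetic is easy to follow.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean, std = fit_gaussian(sample)
print(f"mean = {mean}, standard deviation = {std}")
```

Under this view, the fitted pair (mean, standard deviation) is the "model" of the data: it fully specifies the Gaussian from which the numbers are assumed to be drawn.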