Chapter 1
Data Mining
In this introductory chapter we begin with the essence of data mining and a
discussion of how data mining is treated by the various disciplines that contribute
to this field. We cover “Bonferroni's Principle,” which is really a warning about
overusing the ability to mine data. This chapter is also the place where we
summarize a few useful ideas that are not data mining but are useful in un-
derstanding some important data-mining concepts. These include the TF.IDF
measure of word importance, behavior of hash functions and indexes, and iden-
tities involving e, the base of natural logarithms. Finally, we give an outline of
the topics covered in the balance of the book.
1.1 What is Data Mining?
The most commonly accepted definition of “data mining” is the discovery of
“models” for data. A “model,” however, can be one of several things. We
mention below the most important directions in modeling.
1.1.1 Statistical Modeling
Statisticians were the first to use the term “data mining.” Originally, “data
mining” or “data dredging” was a derogatory term referring to attempts to
extract information that was not supported by the data. Section 1.2 illustrates
the sort of errors one can make by trying to extract what really isn't in the data.
Today, “data mining” has taken on a positive meaning. Now, statisticians view
data mining as the construction of a statistical model, that is, an underlying
distribution from which the visible data is drawn.
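As a rough illustration of what constructing a statistical model can mean in the simplest case, the following Python sketch assumes the data is a plain list of numbers and that a Gaussian is a reasonable family to fit. The function names `fit_gaussian` and `log_likelihood` and the sample values are hypothetical, introduced only for this illustration. The estimates are the standard maximum-likelihood ones: the sample mean, and the standard deviation computed with divisor n.

```python
import math
from statistics import fmean, pstdev


def fit_gaussian(data):
    """Return the maximum-likelihood (mean, standard deviation) of a Gaussian.

    Assumes `data` is a non-empty sequence of numbers.
    """
    mu = fmean(data)
    # pstdev divides by n (not n - 1), which matches the maximum-likelihood estimate.
    sigma = pstdev(data, mu)
    return mu, sigma


def log_likelihood(data, mu, sigma):
    """Log-likelihood of `data` under a Gaussian with parameters mu and sigma > 0."""
    constant = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(constant - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)


# Hypothetical sample, chosen only so the arithmetic is easy to check by hand.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = fit_gaussian(sample)
```

For this sample the fitted parameters should come out as a mean of 5.0 and a standard deviation of 2.0. Evaluating `log_likelihood` at those values gives a result at least as large as at any other choice of parameters, which is the sense in which they are the "most likely" ones.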
Example 1.1: Suppose our data is a set of numbers. This data is much
simpler than data that would be data-mined, but it will serve as an example. A
statistician might decide that the data comes from a Gaussian distribution and
use a formula to compute the most likely parameters of this Gaussian. The mean