Data Mining - Mining of Massive Datasets


Chapter 1

Data Mining

In this introductory chapter we begin with the essence of data mining and a discussion of how data mining is treated by the various disciplines that contribute to this field. We cover “Bonferroni's Principle,” which is really a warning about overusing the ability to mine data. This chapter is also the place where we summarize a few ideas that are not themselves data mining but are useful in understanding some important data-mining concepts. These include the TF.IDF measure of word importance, the behavior of hash functions and indexes, and identities involving e, the base of natural logarithms. Finally, we give an outline of the topics covered in the balance of the book.

1.1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data. A “model,” however, can be one of several things. We mention below the most important directions in modeling.

1.1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.” Originally, “data mining” or “data dredging” was a derogatory term referring to attempts to extract information that was not supported by the data. Section 1.2 illustrates the sort of errors one can make by trying to extract what really isn't in the data. Today, “data mining” has taken on a positive meaning. Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.
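To make “constructing a statistical model” concrete, here is a minimal sketch that assumes the data is a plain list of numbers and that a Gaussian is an appropriate model. The function name fit_gaussian and the sample values are hypothetical, chosen only for illustration. The sketch uses the standard maximum-likelihood estimates: the sample mean, and the square root of the average squared deviation from that mean (dividing by n, not n - 1).

```python
import math


def fit_gaussian(data):
    """Return the maximum-likelihood (mean, std) of a Gaussian for `data`.

    Hypothetical helper for illustration; assumes `data` is a non-empty
    sequence of numbers.
    """
    n = len(data)
    if n == 0:
        raise ValueError("need at least one observation")
    mean = sum(data) / n
    # The maximum-likelihood variance divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(variance)


# Made-up sample, not taken from the text.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, std = fit_gaussian(sample)
print(mean, std)  # prints: 5.0 2.0
```

Under these assumptions the two returned numbers are the entire model: they pick out one Gaussian from the family of all Gaussians as the distribution most likely to have produced the sample.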

Example 1.1 : Suppose our data is a set of numbers. This data is much simpler than data that would be data-mined, but it will serve as an example. A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian. The mean
