coupling. In Section 2.5 the two main questions of this chapter (what is data
mining and how does it work?) are answered.
2.2 Data Mining and Related Research Areas
Data mining emerged as a field only recently; a notable milestone was
the first ACM Conference on Knowledge Discovery in Databases, held in
August 1995 in Montreal, Canada 1 . The data mining research community grew out
of many related areas, including machine learning, artificial intelligence,
visualization, statistics, and analytics.
Data mining is often defined as the automated or convenient extraction of
patterns representing knowledge implicitly stored or captured in large databases,
data warehouses, the Web, other massive information repositories, or data
streams 2 . Unlike in statistics, where the data is collected specifically for the
purpose of testing a particular hypothesis or estimating the parameters of a model,
in data mining one usually starts with historical data that was not necessarily
collected with the purpose of analysis, but rather as a by-product of an operational
system. In this context, data mining is often referred to as secondary data
analysis 3 . Another major difference from traditional statistical methods is that data
mining aims at data-driven discovery: instead of the user stating which hypothesis
needs to be checked against the data, the data itself is used to generate the
hypotheses. As such, hypotheses generated by data mining do not have the same
status as those in statistics. The following example illustrates this difference using
the concept of a p-value from statistics.
Example 1. Suppose one tosses a coin 10 times, and 9 times the coin lands heads
up. Under the hypothesis that the coin is fair (equal probability of heads and
tails), the probability of seeing an outcome this skewed, i.e., the chance of
nine or more heads or nine or more tails, is approximately 2%. This
value is called the p-value of the observation; it expresses how likely it is to see an
outcome as extreme as observed, under the assumption that the hypothesis holds.
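The 2% figure can be checked by direct counting. The following is a minimal sketch, not part of the original text; the function name two_sided_p_value is a hypothetical helper introduced only for illustration. It sums the binomial probabilities of all head counts that lie at least as far from the expected value (half the tosses) as the observed count.

```python
from math import comb


def two_sided_p_value(n_tosses: int, n_heads: int) -> float:
    """Two-sided p-value of observing n_heads in n_tosses of a fair coin.

    Hypothetical helper for illustration: counts every outcome whose
    number of heads deviates from n_tosses / 2 at least as much as the
    observed one, then divides by the 2**n_tosses equally likely sequences.
    """
    observed_deviation = abs(n_heads - n_tosses / 2)
    extreme_sequences = sum(
        comb(n_tosses, k)
        for k in range(n_tosses + 1)
        if abs(k - n_tosses / 2) >= observed_deviation
    )
    return extreme_sequences / 2 ** n_tosses


# 9 heads out of 10: outcomes with 0, 1, 9, or 10 heads are at least as extreme.
p = two_sided_p_value(10, 9)
print(f"{p:.4f}")  # 22 / 1024, i.e. 0.0215
```

For 10 tosses, the outcomes with 0, 1, 9, or 10 heads account for 1 + 10 + 10 + 1 = 22 of the 1,024 possible sequences, which gives roughly 2%.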
If the p-value falls below a threshold, called the level of significance, we deem the
observation so extreme that we reject the hypothesis. To continue the
example, a data mining equivalent of this hypothesis test would be to analyze
a dataset consisting of the outcomes of 1,000 coins, each tossed 10
times. Even if all coins are fair, the data mining algorithm would mark
approximately 20 coins (1,000 times 2%) as "suspicious", because their tosses show a
disproportionately high number of heads or tails. Indeed, among 1,000 coin-toss
experiments, some are likely to have an exceptional outcome purely by chance.
If we ran a statistical test on each of those 20 suspicious coins using our
dataset, the hypothesis that the coin is fair would be rejected. The
problem with this setup is, however, that in order for a statistical test to be valid,
1 Fayyad, U.M., Uthurusamy, R. (1995).
2 Han, J., Kamber, M. (2006).
3 Hand, D., Mannila, H., Smyth, P. (2001).