What Is Data Mining and How Does It Work? - Discrimination and Privacy in the Information Society

coupling. In Section 2.5 the two main questions of this chapter (what is data

mining and how does it work?) are answered.

2.2 Data Mining and Related Research Areas

Data mining emerged as a field only recently; a notable milestone was

the first International Conference on Knowledge Discovery and Data Mining, held in

August 1995 in Montreal, Canada.¹ The data mining research community grew out

of many related areas, including machine learning, artificial intelligence,

visualization, statistics, and analytics.

Data mining is often defined as the automated or convenient extraction of

patterns representing knowledge implicitly stored or captured in large databases,

data warehouses, the Web, other massive information repositories, or data

streams.² Unlike in statistics, where data is collected specifically for the

purpose of testing a particular hypothesis or estimating the parameters of a model,

in data mining one usually starts with historical data that was not necessarily

collected with the purpose of analysis, but rather as a by-product of an operational

system. In this context, data mining is often referred to as secondary data

analysis.³ Another major difference from traditional statistical methods is that data

mining aims at data-driven discovery: instead of the user stating which hypothesis

needs to be checked against the data, the data itself is used to generate the

hypotheses. As such, hypotheses generated by data mining do not have the same

status as those in statistics. The following example illustrates this difference using

the concept of a p-value from statistics.

Example 1. Suppose one tosses a coin 10 times, and 9 times the coin lands heads

up. Under the hypothesis that the coin is fair (equal probability of heads and

tails), the probability of seeing an outcome at least this skewed, i.e., the chance of

having nine or more heads or nine or more tails, is approximately 2%. This

value is called the p-value of the observation; it expresses how likely it is to see an

outcome as extreme as observed, under the assumption that the hypothesis holds.
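
The 2% figure can be checked directly from the binomial distribution. The following is a minimal sketch, not part of the original text; the function name two_sided_p_value is chosen here purely for illustration. It counts every sequence of tosses in which the majority side appears at least as often as observed, and doubles the count to cover both heads and tails.

```python
from math import comb


def two_sided_p_value(n_tosses: int, n_heads: int) -> float:
    """Chance, for a fair coin, of a result at least as lopsided as n_heads."""
    # Size of the majority side (heads or tails, whichever occurred more often).
    k = max(n_heads, n_tosses - n_heads)
    # Number of toss sequences with k or more heads.
    tail = sum(comb(n_tosses, i) for i in range(k, n_tosses + 1))
    # Double for symmetry (k or more tails); cap at 1 for perfectly balanced results.
    return min(1.0, 2 * tail / 2 ** n_tosses)


p = two_sided_p_value(10, 9)
print(f"{p:.4f}")  # 0.0215, i.e. 22 out of 1024 sequences
```

For 10 tosses there are 1,024 equally likely sequences, of which 11 contain nine or more heads and 11 contain nine or more tails, giving 22/1024, or roughly 2%.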

If the p-value falls below a threshold, the level of significance, we deem the

observation to be so extreme that we reject the hypothesis. To continue the

example, a data mining equivalent of this hypothesis test would be that we analyze

a dataset consisting of the outcomes of 1,000 coins that have been tossed, each 10

times. Even if all coins are fair, the data mining algorithm would mark

approximately 20 coins as being “suspicious”, because their tosses show a

disproportionately high number of tails or heads. Indeed, looking at the statistics, it

is likely that among the 1,000 coin toss experiments, some will have an

exceptional outcome. For those 20 suspicious coins, if we ran a statistical

test on our dataset, the hypothesis that they are fair coins would be rejected. The

problem with this setup is, however, that in order for a statistical test to be valid,

1 Fayyad, U.M., Uthurusamy, R. (1995).

2 Han, J. and Kamber, M. (2006).

3 Hand, D., Mannila, H., Smyth, P. (2001).
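
The 1,000-coin scenario of Example 1 can also be explored by simulation. The sketch below is an illustration under stated assumptions rather than anything taken from the original text: the function name count_suspicious, the default seed, and the rule "flag a coin when one side shows nine or more times" are all chosen here to mirror the example.

```python
import random


def count_suspicious(n_coins=1000, n_tosses=10, threshold=9, seed=0):
    """Toss n_coins fair coins n_tosses times each; count the lopsided ones."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same count
    flagged = 0
    for _ in range(n_coins):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        # A coin looks "suspicious" if either side shows up threshold or more times.
        if max(heads, n_tosses - heads) >= threshold:
            flagged += 1
    return flagged


# Expected number of flagged coins: 1,000 coins times the 22/1024 chance per coin.
expected = 1000 * 22 / 1024
print(f"expected: about {expected:.1f} flagged coins")
print(f"simulated: {count_suspicious()} flagged coins")
```

Every simulated coin is fair by construction, yet a count in the neighborhood of 20 is still flagged, which is the point of the example: an outcome that is rare for one coin becomes routine when many coins are examined at once.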
