What Is Data Mining and How Does It Work? - Discrimination and Privacy in the Information Society

the data used in the test should be independent from the data that was used to

generate the hypothesis.

From a methodological point of view, another difference with statistics is that in

the data mining research field there is a much stronger focus on scalable

techniques that work for very large datasets; for instance, techniques that scale

linearly in the dataset size, in the sense that their running time is proportional to the data

size. Many statistical techniques do not scale well as they were developed initially

to work on small datasets.
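
To make the notion of linear scaling concrete, the following is a minimal sketch of a one-pass computation whose running time is proportional to the number of records. It uses Welford's running-update method for the mean and variance, which is chosen here purely as an illustration and is not taken from the text; the function name streaming_mean_variance is hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) using a single pass over the data.

    Each record is touched exactly once and only three numbers are kept in
    memory, so the running time grows linearly with the dataset size.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


if __name__ == "__main__":
    # The input may be a generator, so the full dataset never has to fit in memory.
    n, mean, var = streaming_mean_variance(float(i) for i in range(1, 1_000_001))
    print(n, mean, var)
```

Doubling the number of input values roughly doubles the work, which is the behaviour the paragraph above describes. A method that compared every record with every other record would instead need time quadratic in the dataset size.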

Most closely related to data mining is without doubt machine learning. There is

a big overlap between the two communities, and over time the difference has become

less relevant and the boundaries have begun to blur. Traditionally, machine learning

is about learning to perform a task, whereas data mining is more about “finding

knowledge from the data”. Both are tightly connected; on the one hand, in general,

useful knowledge extracted from given examples of a task will allow for

performing the task better, whereas on the other hand, during the learning process

of a task, knowledge about the task will have to be accumulated in one form or

another, from the examples, and be stored in the system. Given its task-oriented

nature, the machine learning (ML) community has historically had a strong focus on

supervised tasks, whereas data mining is more concerned with unsupervised tasks.

One important challenge the data mining community faces from this

perspective is that it is often difficult to quantify the quality of a result. In a

supervised context with a well-described task the quality of a solution is much

easier to assess, but in an unsupervised context questions like “When does a

discovered pattern represent useful knowledge?” are less obvious to answer.

Another field notorious for similar problems is data visualization; there,

too, it is hard to determine unambiguously whether a particular visualization is

informative.
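
The contrast in how results are assessed can be sketched in a few lines. The example below is only an illustration under stated assumptions: the data, the function names accuracy and within_cluster_ss, and the choice of the within-cluster sum of squares as an internal quality measure are all hypothetical and do not come from the text.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Supervised case: fraction of predictions that match the known labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def within_cluster_ss(points: Sequence[float], assignment: Sequence[int]) -> float:
    """Unsupervised case: sum of squared distances to each cluster's mean."""
    clusters: Dict[int, List[float]] = defaultdict(list)
    for point, cluster_id in zip(points, assignment):
        clusters[cluster_id].append(point)
    total = 0.0
    for members in clusters.values():
        centre = sum(members) / len(members)
        total += sum((m - centre) ** 2 for m in members)
    return total


if __name__ == "__main__":
    # With labels available, quality has one obvious definition.
    print(accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"]))

    # Without labels, an internal score can be computed, but it is ambiguous.
    points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
    two_groups = [0, 0, 0, 1, 1, 1]
    one_per_point = [0, 1, 2, 3, 4, 5]
    print(within_cluster_ss(points, two_groups))
    print(within_cluster_ss(points, one_per_point))
```

Placing every point in its own cluster drives the internal score to zero, which is the best possible value, yet such a clustering conveys no knowledge at all. The score alone therefore cannot say when a discovered pattern is useful.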

Another area closely related to data mining is that of data warehousing and

online analytical processing (OLAP). In the field of OLAP, a

myriad of highly performant data analysis techniques have been developed. A

main concept here is that of a data cube⁴, a conceptual model of the data as a

multidimensional cube that can be seen as an extension of a cross-table. OLAP,

however, is user-driven; it merely provides the user with the tools to quickly

generate the aggregates he or she selects and presents

them in a convenient form. Unlike data mining, OLAP has no notion of

exploratory search performed by a computer algorithm; the exploration is

completely determined by the user.
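
The idea of a data cube as an extension of a cross-table can be sketched as follows. This is a minimal illustration, not an OLAP implementation: the sales records, the function name data_cube, and the use of the string "ALL" to mark an aggregated-away dimension are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Sequence, Tuple

ALL = "ALL"  # placeholder marking a dimension that has been aggregated away


def data_cube(rows: Sequence[Tuple], n_dims: int) -> Dict[Tuple, float]:
    """Sum the measure (the field after the dimensions) for every subset of dimensions.

    Each input row contributes to 2**n_dims cells: one for every choice of
    which dimensions to keep and which to replace by ALL.
    """
    cube: Dict[Tuple, float] = defaultdict(float)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                key = tuple(dims[i] if i in kept else ALL for i in range(n_dims))
                cube[key] += measure
    return dict(cube)


if __name__ == "__main__":
    # Hypothetical sales facts: (region, product, year, amount)
    sales = [
        ("North", "Bikes", 2010, 100.0),
        ("North", "Cars", 2010, 250.0),
        ("South", "Bikes", 2010, 80.0),
        ("South", "Bikes", 2011, 120.0),
    ]
    cube = data_cube(sales, n_dims=3)
    print(cube[("North", ALL, ALL)])      # total for one region
    print(cube[("South", "Bikes", ALL)])  # one cell of the region x product cross-table
    print(cube[(ALL, ALL, ALL)])          # grand total
```

Every lookup here is chosen by the user, which mirrors the user-driven nature of OLAP: the code precomputes aggregates but does not itself search for interesting cells.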

2.3 Database Terminology

In this section we will provide an overview of some common terminology used

throughout the book. Unless stated otherwise, throughout the book it will be

assumed that data to be analyzed is available in a structured format, such as a

4 Gray, J. et al. (1997).
