Database Reference
the data used in the test should be independent from the data that was used to
generate the hypothesis.
From a methodological point of view, another difference with statistics is that
the data mining research field places a much stronger focus on scalable
techniques that work for very large datasets; for instance, techniques that scale
linearly in the dataset size, meaning that their running time is proportional to the
data size. Many statistical techniques do not scale well because they were
originally developed to work on small datasets.
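As an illustration of what linear scaling means in practice, the following minimal sketch computes the mean and variance of a dataset in a single pass, so that the running time grows in proportion to the number of values and the memory use stays constant. The update rule is the well-known online (Welford-style) formula; it is chosen here purely as an example and is not a technique named in the text. The function name streaming_mean_variance is hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) using one pass over the data.

    Each value is touched exactly once, so the running time is proportional
    to the number of values, and only three numbers are kept in memory.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# Works on any iterable, including a generator that never holds the
# full dataset in memory.
count, mean, variance = streaming_mean_variance(float(i) for i in range(1, 5))
```

Because the function accepts any iterable, the same code could be fed from a file or a database cursor; doubling the number of values roughly doubles the work.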
The field most closely related to data mining is without doubt machine learning. There is
a big overlap between the two communities, and over time the difference has become
less relevant and the boundaries have begun to blur. Traditionally, machine learning
is about learning to perform a task, whereas data mining is more about “finding
knowledge from the data”. The two are tightly connected. On the one hand, useful
knowledge extracted from given examples of a task will in general allow the task
to be performed better. On the other hand, while a task is being learned,
knowledge about the task has to be accumulated from the examples in one form or
another and stored in the system. Given its task-oriented nature, the ML
community has historically had a strong focus on
supervised tasks, whereas data mining is more concerned with unsupervised tasks.
One important challenge that the data mining community faces in this respect
is that it is often difficult to quantify the quality of a result. In a
supervised context with a well-described task the quality of a solution is much
easier to assess, but in an unsupervised context questions like “When does a
discovered pattern represent useful knowledge?” are harder to answer.
Data visualization is another field that notoriously has similar problems; there,
too, it is hard to determine unambiguously whether a particular visualization is
informative.
Another area closely related to data mining is that of data warehousing and
online analytical processing (OLAP). In the field of online analytical processing, a
myriad of high-performance data analysis techniques have been developed. A
central concept here is that of a data cube 4, a conceptual model of the data as a
multidimensional cube that can be seen as an extension of a cross-table. OLAP,
however, is user-driven; it merely provides the user with the tools to quickly
generate the aggregates of the data he or she selects and to present
them in a convenient display. Unlike in data mining, in OLAP there is no notion of
an exploratory search performed by a computer algorithm; the exploration is
completely determined by the user.
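To make the idea of a data cube as an extension of a cross-table more concrete, the following minimal sketch sums a measure over every combination of dimensions, using the placeholder value ALL for a dimension that has been aggregated away. The function data_cube, the field names, and the sales records are all hypothetical and invented for illustration; real OLAP systems rely on far more efficient storage and precomputation than this naive loop.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been aggregated away


def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions`.

    With d dimensions this yields 2**d group-bys: the full detail,
    every partial aggregate, and the grand total.
    """
    cube = defaultdict(float)
    for row in rows:
        for k in range(len(dimensions) + 1):
            for kept in combinations(dimensions, k):
                # Keep the value for selected dimensions, ALL for the rest.
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                cube[key] += row[measure]
    return dict(cube)


# Hypothetical sales records, invented for illustration only.
sales = [
    {"region": "North", "product": "A", "amount": 10.0},
    {"region": "North", "product": "B", "amount": 5.0},
    {"region": "South", "product": "A", "amount": 7.0},
]
cube = data_cube(sales, ["region", "product"], "amount")

north_total = cube[("North", ALL)]  # all products sold in the North
product_a_total = cube[(ALL, "A")]  # product A across all regions
grand_total = cube[(ALL, ALL)]      # the single overall aggregate
```

With two dimensions, the cells without ALL form the body of an ordinary cross-table and the cells containing ALL form its row and column totals; adding a third dimension turns the table into a cube. In keeping with the user-driven nature of OLAP, the code only looks up the aggregates that are asked for and performs no search of its own.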
2.3 Database Terminology
In this section we provide an overview of some common terminology used
throughout the topic. Unless stated otherwise, it is assumed throughout
that the data to be analyzed is available in a structured format, such as a
4 Gray, J. et al. (1997).