2.1 Foundations (Theory)-Oriented Frameworks
Frameworks of this type are based mainly on one of the following paradigms: (1) the statistical paradigms; (2) the data compression paradigm - "compress the dataset by finding some structure or knowledge for it"; (3) the machine learning paradigm - "let the data suggest a model", which can be seen as a practical alternative to the statistical paradigms' "fit a model to the data"; (4) the database paradigm - "there is no such thing as discovery, it is all in the power of the query language" [21]; and (5) the inductive databases paradigm - "locating interesting sentences from a given logic that are true in the database" [3].
The Statistical Paradigms
Generally, it is possible to consider the task of DM from the statistical point of
view, emphasizing the fact that DM techniques are applied to larger datasets
than is commonly done in applied statistics [17]. Thus an analysis of the appropriate statistical literature, where a strong analytical background has been accumulated, would solve most DM problems. Many DM tasks can naturally be formulated in statistical terms, and many statistical contributions can be used in DM in a fairly straightforward manner [16].
According to [7], there exist two basic statistical paradigms that are used in theoretical support for DM. The first paradigm is the so-called "Statistical experiment". It can be seen from three perspectives: Fisher's version, which uses the inductive principle of maximum likelihood; the Neyman-E. S. Pearson-Wald version, which is based on the principle of inductive behavior; and the Bayesian version, which is based on the principle of maximum posterior probability. An
evolved version of the “Statistical experiment” paradigm is the “Statistical
learning from empirical process” paradigm [39]. Generally, many DM tasks
can be seen as the task of finding the underlying joint distribution of variables
in the data. Good examples of this approach would be a Bayesian network or
a hierarchical Bayesian model, which give a short and understandable repre-
sentation of the joint distribution. DM tasks dealing with clustering and/or
classification fit easily into this approach.
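To make the contrast between the first and third perspectives concrete, the following sketch estimates a Bernoulli success probability both by maximum likelihood and by maximum posterior probability (MAP). The toy data and the Beta(2, 2) prior are illustrative assumptions of ours, not examples taken from [7] or [39].

def mle_bernoulli(successes, trials):
    # Fisher's perspective: the maximum likelihood estimate of p is the
    # value maximizing p**successes * (1 - p)**(trials - successes).
    return successes / trials

def map_bernoulli(successes, trials, alpha=2.0, beta=2.0):
    # Bayesian perspective: with a Beta(alpha, beta) prior the posterior is
    # Beta(alpha + successes, beta + trials - successes); its mode (the MAP
    # estimate) is returned below.
    return (successes + alpha - 1.0) / (trials + alpha + beta - 2.0)

if __name__ == "__main__":
    s, n = 7, 10                          # hypothetical data: 7 successes in 10 trials
    print("MLE:", mle_bernoulli(s, n))    # 0.70  - determined by the data alone
    print("MAP:", map_bernoulli(s, n))    # ~0.67 - shrunk toward the prior's mean of 0.5

As the amount of data grows, the two estimates converge, which is one way of seeing why these perspectives coexist within the same paradigm.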
The second statistical paradigm is called “Structural data analysis” and
can be associated with singular value decomposition methods, which are
broadly used, for example, in text mining applications.
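As a minimal illustration of this structural view, the sketch below applies a truncated SVD to a tiny term-document matrix in the spirit of latent semantic analysis; the matrix, the term labels, and the rank k = 2 are fabricated choices for the example, not taken from the cited sources.

import numpy as np

# Rows are terms, columns are documents (raw term counts).
A = np.array([
    [2, 1, 0, 0],   # "database"
    [1, 2, 0, 0],   # "query"
    [0, 0, 2, 1],   # "neuron"
    [0, 0, 1, 2],   # "synapse"
], dtype=float)

# Singular value decomposition: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the k largest singular values yields a rank-k summary of A,
# i.e. the low-dimensional structure this paradigm looks for.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation:")
print(np.round(A_k, 2))

In the rank-2 approximation the first two documents and the last two collapse onto two latent "topics", which is the kind of structure exploited in text mining applications.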
A deeper consideration of DM and statistics can be found in [14]. Here, we
only want to point out that the volume of the data being analyzed and the
different educational backgrounds of researchers are not the most important issues that distinguish the two areas. DM is an applied area of science, and limitations in available computational resources are a significant issue when applying results from traditional statistics to DM. An important point here is that the theoretical framework of statistics is not much concerned with data analysis as an iterative process that generally includes several