2.1 Foundations (Theory)-Oriented Frameworks
Frameworks of this type are based mainly on one of the following paradigms: (1) the statistical paradigms; (2) the data compression paradigm - "compress the dataset by finding some structure or knowledge for it"; (3) the machine learning paradigm - "let the data suggest a model", which can be seen as a practical alternative to the statistical paradigms' "fit a model to the data"; (4) the database paradigm - "there is no such thing as discovery, it is all in the power of the query language" [21]; and (5) the inductive databases paradigm - "locating interesting sentences from a given logic that are true in the database" [3].
The Statistical Paradigms
Generally, it is possible to consider the task of DM from the statistical point of
view, emphasizing the fact that DM techniques are applied to larger datasets
than is commonly done in applied statistics [17]. Thus an analysis of the appropriate statistical literature, where a strong analytical background has been accumulated, would solve most DM problems. Many DM tasks can naturally be formulated in statistical terms, and many statistical contributions can be used in DM in a fairly straightforward manner [16].
According to [7], there exist two basic statistical paradigms that are used in theoretical support for DM. The first paradigm is the so-called "Statistical experiment". It can be seen from three perspectives: Fisher's version, which uses the inductive principle of maximum likelihood; the Neyman-E. S. Pearson-Wald version, which is based on the principle of inductive behavior; and the Bayesian version, which is based on the principle of maximum posterior probability. An
evolved version of the “Statistical experiment” paradigm is the “Statistical
learning from empirical process” paradigm [39]. Generally, many DM tasks
can be seen as the task of finding the underlying joint distribution of variables
in the data. Good examples of this approach would be a Bayesian network or
a hierarchical Bayesian model, which give a short and understandable repre-
sentation of the joint distribution. DM tasks dealing with clustering and/or
classification fit easily into this approach.
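To make the contrast between the first and third perspectives concrete, the following sketch estimates a Bernoulli success probability both by maximum likelihood and by maximum posterior probability (MAP). The toy data and the Beta(2, 2) prior are illustrative assumptions of ours, not examples taken from [7] or [39].

def mle_bernoulli(successes, trials):
    # Fisher's perspective: the maximum likelihood estimate of p is the
    # value maximizing p**successes * (1 - p)**(trials - successes).
    return successes / trials

def map_bernoulli(successes, trials, alpha=2.0, beta=2.0):
    # Bayesian perspective: with a Beta(alpha, beta) prior the posterior is
    # Beta(alpha + successes, beta + trials - successes); its mode (the MAP
    # estimate) is returned below.
    return (successes + alpha - 1.0) / (trials + alpha + beta - 2.0)

if __name__ == "__main__":
    s, n = 7, 10                          # hypothetical data: 7 successes in 10 trials
    print("MLE:", mle_bernoulli(s, n))    # 0.70  - determined by the data alone
    print("MAP:", map_bernoulli(s, n))    # ~0.67 - shrunk toward the prior's mean of 0.5

As the amount of data grows, the two estimates converge, which is one way of seeing why these perspectives coexist within the same paradigm.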
The second statistical paradigm is called “Structural data analysis” and
can be associated with singular value decomposition methods, which are
broadly used, for example, in text mining applications.
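As a minimal illustration of this structural view, the sketch below applies a truncated SVD to a tiny term-document matrix in the spirit of latent semantic analysis; the matrix, the term labels, and the rank k = 2 are fabricated choices for the example, not taken from the cited sources.

import numpy as np

# Rows are terms, columns are documents (raw term counts).
A = np.array([
    [2, 1, 0, 0],   # "database"
    [1, 2, 0, 0],   # "query"
    [0, 0, 2, 1],   # "neuron"
    [0, 0, 1, 2],   # "synapse"
], dtype=float)

# Singular value decomposition: A = U diag(s) Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keeping only the k largest singular values yields a rank-k summary of A,
# i.e. the low-dimensional structure this paradigm looks for.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-2 approximation:")
print(np.round(A_k, 2))

In the rank-2 approximation the first two documents and the last two collapse onto two latent "topics", which is the kind of structure exploited in text mining applications.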
A deeper consideration of DM and statistics can be found in [14]. Here, we
only want to point out that the volume of the data being analyzed and the
different educational backgrounds of researchers are not the most important issues that distinguish the two areas. DM is an applied area of science, and limitations in available computational resources are a significant issue when applying results from traditional statistics to DM. An important point here is that the theoretical framework of statistics is not much concerned with data analysis as an iterative process that generally includes several