5 Kernel-Based Algorithms and Visualization for Interval Data Mining

Thanh-Nghi Do(1,2) and Francois Poulet(3)

1 CIT, CanTho University, VietNam
  dtnghi@cit.ctu.edu.vn
2 INRIA Futurs/LRI, Universite de Paris-Sud, Orsay, France
  Thanh-Nghi.Do@lri.fr
3 IRISA, Rennes, France
  Francois.Poulet@irisa.fr
Abstract. Our investigation aims at extending kernel methods to interval data mining and at using graphical methods to explain the obtained results. The interval data type is an interesting way to aggregate large datasets into smaller ones or to represent data with uncertainty. No algorithmic changes are required compared with the usual case of continuous data, other than a modification of the Radial Basis Function kernel evaluation; kernel-based algorithms can therefore deal easily with interval data. Numerical test results on real and artificial datasets show that the proposed methods give promising performance. We also use interactive graphical decision tree algorithms and visualization techniques to give insight into support vector machine results, so that the user gains a better understanding of the models' behavior.
5.1 Introduction
In recent years, real-world databases have grown rapidly [1], so the need to extract knowledge from very large databases is increasing. Data mining [2] can be defined as the pattern recognition step of the knowledge discovery in databases (KDD) process. It uses different algorithms for classification, regression, clustering or association rules. Support vector machine (SVM) algorithms, proposed by Vapnik [3], are a well-known class of algorithms based on the idea of kernel substitution. They have shown practical relevance for classification, regression and novelty detection tasks. Successful applications of SVM and other kernel-based methods have been reported in various fields such as facial recognition, text categorization and bioinformatics [4].
While SVM and kernel-based methods are a powerful paradigm, they have difficulty dealing with large datasets. The learning task is accomplished by solving a quadratic programming problem, so the computational cost of an SVM approach is at least quadratic in the number of training data points, and the memory requirement makes these methods intractable for very large datasets. We propose to scale up their training tasks based on the interval data concept [5]: we summarize the massive datasets into interval data. Then, we must adapt the kernel-based algorithms, e.g. SVM, to deal with