5 Kernel-Based Algorithms and Visualization for Interval Data Mining

Thanh-Nghi Do(1,2) and Francois Poulet(3)

1 CIT, CanTho University, VietNam
  dtnghi@cit.ctu.edu.vn
2 INRIA Futurs/LRI, Universite de Paris-Sud, Orsay, France
  Thanh-Nghi.Do@lri.fr
3 IRISA, Rennes, France
  Francois.Poulet@irisa.fr
Abstract. Our investigation aims at extending kernel methods to interval data mining and at using graphical methods to explain the obtained results. The interval data type is an interesting way to aggregate large datasets into smaller ones or to represent data with uncertainty. No algorithmic changes are required compared with the usual case of continuous data, other than a modification of the Radial Basis Function kernel evaluation; kernel-based algorithms can therefore deal easily with interval data. Numerical test results on real and artificial datasets show that the proposed methods give promising performance. We also use interactive graphical decision tree algorithms and visualization techniques to give insight into support vector machine results, so that the user gains a better understanding of the models' behavior.
5.1 Introduction
In recent years, real-world databases have grown rapidly [1], so the need to extract knowledge from very large databases is increasing. Data mining [2] can be defined as the pattern recognition step of the knowledge discovery in databases (KDD) process. It uses different algorithms for classification, regression, clustering or association rules. Support vector machine (SVM) algorithms, proposed by Vapnik [3], are a well-known class of algorithms based on the idea of kernel substitution. They have shown practical relevance for classification, regression and novelty detection tasks. Successful applications of SVM and other kernel-based methods have been reported in various fields such as facial recognition, text categorization and bioinformatics [4].
While SVM and kernel-based methods are a powerful paradigm, they have difficulty dealing with large datasets. The learning task is accomplished by solving a quadratic programming problem, so the computational cost of an SVM approach is at least quadratic in the number of training data points, and the memory requirement makes these methods intractable for very large datasets. We propose to scale up their training tasks based on the interval data concept [5]: we summarize the massive datasets into interval data. Then, we must adapt the kernel-based algorithms, e.g. SVM, to deal with