\[
\text{s.t.}\quad \sum_{i=1}^{m} (\alpha_i - \alpha_i^{*}) = 0, \qquad 0 \le \alpha_i,\ \alpha_i^{*} \le C \quad (i = 1, \ldots, m) \tag{5.6}
\]
The solution of (5.6) gives $\alpha_i$ and $\alpha_i^{*}$. Thus the regression function is given by:
\[
f(x) = \sum_{i=1}^{SV} (\alpha_i - \alpha_i^{*}) \, K(x, x_i) + b
\]
where the scalar b is determined by the support vectors.
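To make the formula concrete, the following is a minimal sketch of evaluating this regression function, assuming the dual coefficients $\alpha_i$, $\alpha_i^{*}$ and the bias b have already been obtained by solving (5.6); the function names and the generic kernel argument are illustrative only, not part of any particular toolkit.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # Standard RBF kernel on continuous vectors: K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svr_predict(x, support_vectors, alpha, alpha_star, b, kernel=rbf_kernel):
    # f(x) = sum_i (alpha_i - alpha_i^*) * K(x, x_i) + b, summed over the support vectors
    return sum((a - a_s) * kernel(x, x_i)
               for a, a_s, x_i in zip(alpha, alpha_star, support_vectors)) + b
```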
A survey [14] and reference [15] provide more details about SVMs and other kernel-based learning methods.
These SVMs deal only with continuous data. To handle interval data, no algorithmic changes are required beyond substituting the RBF kernel function for interval data described in Section 2 into the classical SVM algorithms, including SVC, one-class SVM and SVR (a hedged illustration of such a kernel is sketched below). All the benefits of the classical SVMs are retained, so they can be used to deal with interval data.
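The exact kernel for interval data is the one defined in Section 2 and is not reproduced here. As an illustration only, the sketch below plugs a Hausdorff-style distance between intervals into the usual RBF form, which is one common way to build such a kernel; the distance choice and function names are assumptions, not the definition used in this work.

```python
import numpy as np

def interval_rbf_kernel(lo_x, hi_x, lo_z, hi_z, gamma=0.5):
    # Illustrative interval kernel (assumption): replace the squared Euclidean
    # distance in the RBF kernel by a Hausdorff-style distance between intervals,
    # d([a, b], [c, d]) = max(|a - c|, |b - d|), accumulated over the features.
    d2 = np.sum(np.maximum(np.abs(lo_x - lo_z), np.abs(hi_x - hi_z)) ** 2)
    return np.exp(-gamma * d2)
```

Because the change is confined to the kernel, a function of this shape can be dropped into SVC, one-class SVM and SVR without touching the optimization routines.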
For the evaluation of our proposed approach, we have added the new non-
linear kernel for interval data to the publicly available toolkit, LibSVM [16]. The
software program is able to deal with interval data in classification, regression
and novelty detection tasks. To apply the SVM algorithms to the multi-class classification problem (more than two classes), LibSVM uses the one-against-one strategy. Assume that we have k classes; LibSVM then constructs k(k-1)/2 models, each of which separates the i-th class from the j-th class. To predict the class of a new data point, LibSVM simply predicts with each model and finds out which one separates the point furthest into the positive region (a minimal sketch of this step is given below). We have used datasets from Statlog [17],
the UCI machine learning repository [18], regression datasets [19] and Delve [20].
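As referenced above, here is a minimal sketch of the one-against-one prediction step, assuming each of the k(k-1)/2 models exposes a signed decision value (positive for its first class, negative for its second); the data structures are illustrative and are not LibSVM internals.

```python
def one_vs_one_predict(x, models):
    # models: list of ((class_i, class_j), decision_function) pairs, one per class pair.
    # Following the description above, pick the class whose model pushes x
    # furthest into the positive region, i.e. has the largest margin for x.
    best_class, best_margin = None, float("-inf")
    for (class_i, class_j), decision in models:
        value = decision(x)
        winner = class_i if value >= 0 else class_j
        if abs(value) > best_margin:
            best_class, best_margin = winner, abs(value)
    return best_class
```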
By using the k-means algorithm [9], the large datasets are aggregated into smaller ones. A data point in the interval datasets corresponds to a cluster; the low and high values of an interval are computed from the cluster's data points (a minimal aggregation sketch is given below). Some other methods for creating interval data can be found in [5].
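The sketch below shows one way such an aggregation could be done, assuming scikit-learn's KMeans and taking the per-feature minimum and maximum of each cluster as the interval bounds; the exact aggregation rule used in the experiments may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_to_intervals(X, n_clusters):
    # Cluster the continuous data, then turn each cluster into one interval-valued
    # point: per feature, [min, max] over the cluster's members (an assumed rule).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    lows, highs = [], []
    for c in range(n_clusters):
        members = X[labels == c]
        lows.append(members.min(axis=0))
        highs.append(members.max(axis=0))
    return np.array(lows), np.array(highs)
```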
Furthermore, we generated an uncertain dataset for evaluating our algorithm. This dataset, called Ringnoise, is 4-dimensional with 2 classes: class 1 is multivariate normal with mean 0 and covariance 4 times the identity matrix, and class 2 has unit covariance and mean (0.5, 0.5, 0.5, 0.5). Gaussian noise is then added with mean (0, 0, 0, 0) and covariance matrix σ_i I, where σ_i is randomly chosen from [0.1, 0.8] and I denotes the 4×4 identity matrix. The interval data concept can also represent the uncertainty in this dataset; a generation sketch is given below.
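A minimal sketch of how a Ringnoise-style dataset could be generated with NumPy, following the description above; the sample counts and the exact scheme for drawing each σ_i are assumptions, not details given in the text.

```python
import numpy as np

def make_ringnoise(n_per_class=500, seed=0):
    rng = np.random.default_rng(seed)
    dim = 4
    # Class 1: N(0, 4I); class 2: N((0.5, 0.5, 0.5, 0.5), I).
    x1 = rng.multivariate_normal(np.zeros(dim), 4 * np.eye(dim), n_per_class)
    x2 = rng.multivariate_normal(np.full(dim, 0.5), np.eye(dim), n_per_class)
    X = np.vstack([x1, x2])
    y = np.array([1] * n_per_class + [2] * n_per_class)
    # Gaussian noise with mean 0 and covariance sigma_i * I, sigma_i ~ U[0.1, 0.8],
    # drawn independently for each point (an assumption about how sigma_i is sampled).
    sigmas = rng.uniform(0.1, 0.8, size=len(X))
    X += rng.standard_normal(X.shape) * np.sqrt(sigmas)[:, None]
    return X, y, sigmas
```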
Table 5.1 presents the dataset descriptions and their aggregations (interval data). We report the cross-validation accuracy of the classification results and the mean squared error of the regression results in Table 5.2. The results of the novelty detection task are presented in Table 5.3, together with the number of outliers (the points furthest from the other data points in the dataset). To our knowledge, there is no other available algorithm able to deal with interval data in non-linear
 