reconstruct the i-th data point in D dimensions should also reconstruct its embedded manifold coordinates in d dimensions.
LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation X_i is mapped to a low-dimensional vector Y_i representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Y_i to minimize the embedding cost function:
\[
\Phi(Y) = \sum_i \Bigl| Y_i - \sum_j W_{ij} Y_j \Bigr|^2
\]
This cost function, like the previous one, is based on locally linear reconstruction errors, but here the weights W_ij are fixed while the coordinates Y_i are optimized. The embedding cost can then be minimized by solving a sparse N × N eigenvalue problem, whose bottom d non-zero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.
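As a concrete illustration, here is a minimal sketch of this step in Python with NumPy, assuming the reconstruction weights W from the earlier step are given as a dense N × N array (a practical implementation would keep W sparse). It uses the standard identity that the cost equals trace(Y^T M Y) with M = (I − W)^T (I − W), so the optimal coordinates are the bottom eigenvectors of M after discarding the constant eigenvector with eigenvalue zero:

    import numpy as np

    def lle_embedding(W, d):
        """Embedding step of LLE: given fixed reconstruction weights W
        (N x N, rows summing to one), return the N x d coordinates Y that
        minimize Phi(Y) = sum_i |Y_i - sum_j W_ij Y_j|^2."""
        N = W.shape[0]
        I = np.eye(N)
        M = (I - W).T @ (I - W)          # Phi(Y) = trace(Y^T M Y)
        # eigh returns the eigenvalues of the symmetric M in ascending order
        eigvals, eigvecs = np.linalg.eigh(M)
        # Skip the bottom eigenvector (constant, eigenvalue ~ 0); the next
        # d eigenvectors give orthogonal coordinates centered on the origin.
        return eigvecs[:, 1:d + 1]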
It is noteworthy that while the reconstruction weights for each data point are computed from its local neighborhood, the embedding coordinates are computed by an N × N eigensolver, a global operation that couples all data points in connected components of the graph defined by the weight matrix. The different dimensions in the embedding space can be computed successively; this is done simply by computing the bottom eigenvectors from the previous equation one at a time. But the computation is always coupled across data points. This is how the algorithm leverages overlapping local information to discover global structure. Implementation of the algorithm is fairly straightforward, as it has only one free parameter: the number of neighbors per data point, K.
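In practice, off-the-shelf implementations are available; for instance, scikit-learn exposes LLE with exactly these two choices, K and d, as its main parameters. A brief usage sketch follows, where the swiss-roll data set and the values K = 12, d = 2 are illustrative assumptions, not prescriptions from the text:

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1500)        # N = 1500 points in D = 3
    lle = LocallyLinearEmbedding(n_neighbors=12,  # K, the one free parameter
                                 n_components=2)  # d, the embedding dimension
    Y = lle.fit_transform(X)                      # N x 2 manifold coordinates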
6.3 Data Sampling
Sampling is used to ease the analysis and modeling of large data sets. In DM, data
sampling serves four purposes:
To reduce the number of instances submitted to the DM algorithm. In many cases, predictive learning can operate with 10-20% of the cases without a significant deterioration of performance; beyond that point, adding further cases brings little improvement. In descriptive analysis, however, it is better to have as many cases as possible.
To support the selection of only those cases in which the response is relatively homogeneous. When you have data sets in which different trends are clearly observable, or the examples can be easily separated, you can partition the data for different types of modeling. For instance, imagine learning a bank's loan-approval decision from the economic characteristics of a set of customers.