reconstruct the i-th data point in D dimensions should also reconstruct its embedded manifold coordinates in d dimensions.
LLE constructs a neighborhood-preserving mapping based on the above idea. In the final step of the algorithm, each high-dimensional observation X_i is mapped to a low-dimensional vector Y_i representing global internal coordinates on the manifold. This is done by choosing d-dimensional coordinates Y_i to minimize the embedding cost function:
\[
\Phi(Y) = \sum_i \Bigl| Y_i - \sum_j W_{ij} Y_j \Bigr|^2
\]
This cost function, like the previous one, is based on locally linear reconstruction errors, but here the weights W_ij are fixed while the coordinates Y_i are optimized. The embedding cost can then be minimized by solving a sparse N × N eigenvalue problem, whose bottom d non-zero eigenvectors provide an ordered set of orthogonal coordinates centered on the origin.
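As a concrete illustration, here is a minimal sketch of this step in Python with NumPy, assuming the reconstruction weights W from the earlier step are given as a dense N × N array (a practical implementation would keep W sparse). It uses the standard identity that the cost equals trace(Y^T M Y) with M = (I − W)^T (I − W), so the optimal coordinates are the bottom eigenvectors of M after discarding the constant eigenvector with eigenvalue zero:

    import numpy as np

    def lle_embedding(W, d):
        """Embedding step of LLE: given fixed reconstruction weights W
        (N x N, rows summing to one), return the N x d coordinates Y that
        minimize Phi(Y) = sum_i |Y_i - sum_j W_ij Y_j|^2."""
        N = W.shape[0]
        I = np.eye(N)
        M = (I - W).T @ (I - W)          # Phi(Y) = trace(Y^T M Y)
        # eigh returns the eigenvalues of the symmetric M in ascending order
        eigvals, eigvecs = np.linalg.eigh(M)
        # Skip the bottom eigenvector (constant, eigenvalue ~ 0); the next
        # d eigenvectors give orthogonal coordinates centered on the origin.
        return eigvecs[:, 1:d + 1]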
It is noteworthy that while the reconstruction weights for each data point are computed from its local neighborhood, the embedding coordinates are computed by an N × N eigensolver, a global operation that couples all data points in connected components of the graph defined by the weight matrix. The different dimensions in the embedding space can be computed successively; this is done simply by computing the bottom eigenvectors from the previous equation one at a time. But the computation is always coupled across data points. This is how the algorithm leverages overlapping local information to discover global structure. Implementation of the algorithm is fairly straightforward, as it has only one free parameter: the number of neighbors per data point, K.
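In practice, off-the-shelf implementations are available; for instance, scikit-learn exposes LLE with exactly these two choices, K and d, as its main parameters. A brief usage sketch follows, where the swiss-roll data set and the values K = 12, d = 2 are illustrative assumptions, not prescriptions from the text:

    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, _ = make_swiss_roll(n_samples=1500)        # N = 1500 points in D = 3
    lle = LocallyLinearEmbedding(n_neighbors=12,  # K, the one free parameter
                                 n_components=2)  # d, the embedding dimension
    Y = lle.fit_transform(X)                      # N x 2 manifold coordinates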
6.3 Data Sampling
Sampling is used to ease the analysis and modeling of large data sets. In DM, data
sampling serves four purposes:
To reduce the number of instances submitted to the DM algorithm. In many cases, predictive learning can operate with 10-20% of the cases without a significant deterioration of performance; beyond that point, adding further cases brings little improvement. In descriptive analysis, however, it is better to have as many cases as possible.
To support the selection of only those cases in which the response is relatively homogeneous. When you have data sets in which different trends are clearly observable, or the examples can be easily separated, you can partition the data for different types of modeling. For instance, imagine learning a bank's loan-approval decision from the economic characteristics of a set of customers.