pattern using a mapping function such as $\Phi(x)$. However, in practice, if the function maps the data into a very high-dimensional space, it becomes problematic to compute and to store the results of the mapping.
Note, however, that in (7.6) only pairwise inner products of examples are
used. Therefore, instead of mapping each sample to a higher dimension and
then performing the inner product, it is possible (for certain classes of map-
ping functions) to first compute the inner product between patterns and only
then compute the mapping for the resulting scalar. This is known as the kernel
trick [51].
Thus, in (7.6) we now replace the inner products $x_i^{\top} x_j$ with $k(x_i, x_j) = \Phi(x_i)^{\top} \Phi(x_j)$, where $k$ is known as the kernel function. Kernel functions
should conform to conditions known as the Mercer conditions [53]. Examples of
such functions include polynomials, radial basis functions (Gaussian functions),
and hyperbolic tangents.
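As a concrete illustration of the kernel trick, the following Python sketch (the function names are ours) compares an explicit degree-2 polynomial feature map with the equivalent kernel $k(x, z) = (x^{\top} z)^2$, which never forms the $d^2$-dimensional feature vectors; a Gaussian kernel, whose feature space is infinite-dimensional, is included as well:

```python
import numpy as np

def explicit_phi(x):
    """Explicit degree-2 polynomial feature map: all pairwise products x_i * x_j.
    For a d-dimensional input this produces d**2 features."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """Equivalent kernel k(x, z) = (x . z)**2, computed entirely in the input space."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (radial basis function) kernel; its feature space is
    infinite-dimensional, so the explicit mapping cannot be formed at all."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
x, z = rng.normal(size=50), rng.normal(size=50)

# Both routes give the same inner product, but the kernel route never
# builds the 2500-dimensional feature vectors.
print(np.dot(explicit_phi(x), explicit_phi(z)))  # map first, then take the inner product
print(poly_kernel(x, z))                         # kernel trick: stay in the input space
```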
SVMs have been studied extensively [18] and have been extended in many directions. Notable examples include the case of classes that are not separable, handled by introducing a penalty term that allows some training patterns to be incorrectly classified [54], and feature selection [55, 56].
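For reference, the soft-margin idea mentioned above is usually written with slack variables $\xi_i$ and a penalty weight $C$ (our notation, not necessarily that of [54]):

\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,
\]

where each $\xi_i$ measures by how much training pattern $i$ violates the margin and $C$ trades margin width against training errors.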
As noted above, training an SVM requires solving a quadratic optimization problem. Such problems have $O(n^3)$ time complexity and $O(n^2)$ space complexity [57]. In practice, the computational complexity is usually closer to $O(n^2)$ [58]. Recently, some approximation algorithms have claimed to further reduce the computational complexity of SVMs to $O(n)$ [57].
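To see why the $O(n^2)$ space requirement matters in practice, a quick back-of-envelope sketch (assuming a dense kernel matrix of 8-byte floats; the sample sizes are illustrative):

```python
# Memory needed to store a dense n x n kernel (Gram) matrix of 8-byte floats.
for n in (1_000, 10_000, 100_000):
    gib = n * n * 8 / 2**30
    print(f"n = {n:>7,}: kernel matrix ~ {gib:8.1f} GiB")
# n = 100,000 already requires ~74.5 GiB, which motivates the parallel and
# approximate solvers discussed below.
```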
Some research has been devoted to the solution of SVMs on parallel com-
puting nodes. In [58] an SVM solver is parallelized by training multiple SVMs,
each on a subset of the training data, and aggregating the resulting classifiers
into a single classifier. The training data is then redistributed to the classifiers
according to their performance and the process is iterated until convergence is
reached. A lower-level approach is taken in [59], where the quadratic opti-
mization problem is divided into smaller quadratic programs, each of which is
solved on a different node. The results are aggregated and the process is repeated
until convergence. Graf et al. [60] partition the data and solve an SVM for each
partition. The support vectors from each pair of classifiers are then aggregated
into a new training set, for which an SVM is solved. The process continues un-
til a single classifier remains. The aggregation process can be iterated using the
support vectors of the final classifier in the previous iteration to seed the new
classifiers.
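A rough single-machine sketch of such a cascade, using scikit-learn's SVC (the partition count, kernel choice, and helper name are our own; in [60] the partitions are solved on different nodes and the cascade can be iterated as described above):

```python
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_parts=8, C=1.0, kernel="rbf", seed=0):
    """Simplified cascade: solve an SVM per partition, keep only each model's
    support vectors, merge the support-vector sets pairwise, and repeat until
    a single classifier remains."""
    order = np.random.default_rng(seed).permutation(len(X))
    sets = [(X[idx], y[idx]) for idx in np.array_split(order, n_parts)]
    while True:
        sv_sets = []
        for Xi, yi in sets:
            clf = SVC(C=C, kernel=kernel).fit(Xi, yi)
            sv_sets.append((Xi[clf.support_], yi[clf.support_]))  # keep support vectors only
        if len(sv_sets) == 1:
            return clf                                            # single classifier remains
        # Aggregate the support vectors of each pair into a new training set.
        sets = [(np.vstack([sv_sets[i][0], sv_sets[i + 1][0]]),
                 np.concatenate([sv_sets[i][1], sv_sets[i + 1][1]]))
                for i in range(0, len(sv_sets) - 1, 2)]
        if len(sv_sets) % 2 == 1:                                 # odd set carries over
            sets.append(sv_sets[-1])
```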
An alternative approach [61] to parallelization of SVM solvers is to compute
the full kernel matrix in a distributed memory so that each node holds a slice of
the full kernel matrix, and execute a gradient descent-like algorithm to find the
Lagrange multipliers. This approach is advantageous when it is hard to separate
the data from different classes.
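The following Python sketch illustrates this style of solver (a simplified projected gradient ascent on the SVM dual that ignores the bias equality constraint; in [61] the rows of the kernel matrix, and hence the matrix-vector product below, are split across nodes):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Full kernel (Gram) matrix; this is the object that would be stored
    in slices across the distributed memory."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def dual_gradient_ascent(K, y, C=1.0, lr=0.01, n_iter=1000):
    """Maximize the dual W(a) = sum(a) - 0.5 * a'Qa with Q = diag(y) K diag(y),
    subject to 0 <= a_i <= C (the constraint sum(a_i * y_i) = 0 is omitted,
    i.e. this corresponds to an SVM without a bias term)."""
    alpha = np.zeros(len(y))
    Q = (y[:, None] * y[None, :]) * K
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha          # each node would compute its slice of Q @ alpha
        alpha = np.clip(alpha + lr * grad, 0.0, C)
    return alpha                        # Lagrange multipliers; nonzero entries mark support vectors
```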
Online SVMs have been extensively researched (see, for example, [62]). A
simple though effective idea is the Forgetron algorithm [63]. Here, a "budget"
of support vectors is prespecified. As each new example is received, it is first
classified according to the support vectors held in memory. If it is misclassified, it
is accepted into the learner's memory as a new support vector, the weight of all previously stored support vectors is reduced, and the oldest support vector is discarded once the budget is exceeded.
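The budget idea can be sketched as follows (a simplified budget kernel perceptron; the actual Forgetron [63] uses a carefully chosen per-round shrinking coefficient with a mistake-bound guarantee, whereas the fixed shrink factor below is an illustrative assumption):

```python
import numpy as np
from collections import deque

class BudgetKernelPerceptron:
    """Online kernel classifier with a fixed budget of support vectors,
    in the spirit of the Forgetron [63] (simplified update rule)."""

    def __init__(self, budget=50, gamma=1.0, shrink=0.95):
        self.budget, self.gamma, self.shrink = budget, gamma, shrink
        self.sv = deque()                            # (example, label, weight), oldest first

    def _kernel(self, x, z):
        return np.exp(-self.gamma * np.sum((x - z) ** 2))

    def predict(self, x):
        score = sum(w * yz * self._kernel(x, z) for z, yz, w in self.sv)
        return 1 if score >= 0 else -1

    def partial_fit(self, x, y):
        # 1. Classify with the support vectors currently held in memory.
        if self.predict(x) == y:
            return
        # 2. On a mistake, accept the example as a new support vector ...
        self.sv.append((x, y, 1.0))
        # 3. ... shrink the weight of every stored support vector ...
        self.sv = deque((z, yz, w * self.shrink) for z, yz, w in self.sv)
        # 4. ... and discard the oldest one if the budget is exceeded.
        if len(self.sv) > self.budget:
            self.sv.popleft()
```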