pattern using a mapping function such as $\Phi(x)$. However, in practice, if the function maps the data into a very high-dimensional space, it becomes problematic to compute and to store the results of the mapping.
Note, however, that in (7.6) only pairwise inner products of examples are
used. Therefore, instead of mapping each sample to a higher dimension and
then performing the inner product, it is possible (for certain classes of map-
ping functions) to first compute the inner product between patterns and only
then compute the mapping for the resulting scalar. This is known as the kernel
trick [51].
Thus, in (7.6) we now replace the inner products $x_i^{\top} x_j$ with $k(x_i, x_j) = \Phi(x_i)^{\top} \Phi(x_j)$, where $k$ is known as the kernel function. Kernel functions
should conform to conditions known as the Mercer conditions [53]. Examples of
such functions include polynomials, radial basis functions (Gaussian functions),
and hyperbolic tangents.
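As a concrete illustration of the kernel trick, the following Python sketch (the function names are ours) compares an explicit degree-2 polynomial feature map with the equivalent kernel $k(x, z) = (x^{\top} z)^2$, which never forms the $d^2$-dimensional feature vectors; a Gaussian kernel, whose feature space is infinite-dimensional, is included as well:

```python
import numpy as np

def explicit_phi(x):
    """Explicit degree-2 polynomial feature map: all pairwise products x_i * x_j.
    For a d-dimensional input this produces d**2 features."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """Equivalent kernel k(x, z) = (x . z)**2, computed entirely in the input space."""
    return np.dot(x, z) ** 2

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (radial basis function) kernel; its feature space is
    infinite-dimensional, so the explicit mapping cannot be formed at all."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
x, z = rng.normal(size=50), rng.normal(size=50)

# Both routes give the same inner product, but the kernel route never
# builds the 2500-dimensional feature vectors.
print(np.dot(explicit_phi(x), explicit_phi(z)))  # map first, then take the inner product
print(poly_kernel(x, z))                         # kernel trick: stay in the input space
```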
SVMs have been studied extensively [18] and have been extended in many directions. Notable examples include the case of classes that are not separable, handled by introducing a penalty term that allows some training patterns to be incorrectly classified [54], and feature selection [55, 56].
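For reference, the soft-margin idea mentioned above is usually written with slack variables $\xi_i$ and a penalty weight $C$ (our notation, not necessarily that of [54]):

\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\bigl(w^{\top}\Phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,
\]

where each $\xi_i$ measures by how much training pattern $i$ violates the margin and $C$ trades margin width against training errors.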
As noted above, training an SVM requires solving a quadratic optimization problem. Such problems have $O(n^3)$ time complexity and $O(n^2)$ space complexity [57]. In practice, the computational complexity is usually closer to $O(n^2)$ [58]. Recently, some approximation algorithms have claimed to further reduce the computational complexity of SVMs to $O(n)$ [57].
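To see why the $O(n^2)$ space requirement matters in practice, a quick back-of-envelope sketch (assuming a dense kernel matrix of 8-byte floats; the sample sizes are illustrative):

```python
# Memory needed to store a dense n x n kernel (Gram) matrix of 8-byte floats.
for n in (1_000, 10_000, 100_000):
    gib = n * n * 8 / 2**30
    print(f"n = {n:>7,}: kernel matrix ~ {gib:8.1f} GiB")
# n = 100,000 already requires ~74.5 GiB, which motivates the parallel and
# approximate solvers discussed below.
```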
Some research has been devoted to the solution of SVMs on parallel com-
puting nodes. In [58] an SVM solver is parallelized by training multiple SVMs,
each on a subset of the training data, and aggregating the resulting classifiers
into a single classifier. The training data is then redistributed to the classifiers
according to their performance and the process is iterated until convergence is
reached. A lower-level approach is taken in [59], where the quadratic opti-
mization problem is divided into smaller quadratic programs, each of which is
solved on a different node. The results are aggregated and the process is repeated
until convergence. Graf et al. [60] partition the data and solve an SVM for each
partition. The support vectors from each pair of classifiers are then aggregated
into a new training set, for which an SVM is solved. The process continues un-
til a single classifier remains. The aggregation process can be iterated using the
support vectors of the final classifier in the previous iteration to seed the new
classifiers.
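A rough single-machine sketch of such a cascade, using scikit-learn's SVC (the partition count, kernel choice, and helper name are our own; in [60] the partitions are solved on different nodes and the cascade can be iterated as described above):

```python
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_parts=8, C=1.0, kernel="rbf", seed=0):
    """Simplified cascade: solve an SVM per partition, keep only each model's
    support vectors, merge the support-vector sets pairwise, and repeat until
    a single classifier remains."""
    order = np.random.default_rng(seed).permutation(len(X))
    sets = [(X[idx], y[idx]) for idx in np.array_split(order, n_parts)]
    while True:
        sv_sets = []
        for Xi, yi in sets:
            clf = SVC(C=C, kernel=kernel).fit(Xi, yi)
            sv_sets.append((Xi[clf.support_], yi[clf.support_]))  # keep support vectors only
        if len(sv_sets) == 1:
            return clf                                            # single classifier remains
        # Aggregate the support vectors of each pair into a new training set.
        sets = [(np.vstack([sv_sets[i][0], sv_sets[i + 1][0]]),
                 np.concatenate([sv_sets[i][1], sv_sets[i + 1][1]]))
                for i in range(0, len(sv_sets) - 1, 2)]
        if len(sv_sets) % 2 == 1:                                 # odd set carries over
            sets.append(sv_sets[-1])
```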
An alternative approach [61] to parallelization of SVM solvers is to compute
the full kernel matrix in a distributed memory so that each node holds a slice of
the full kernel matrix, and execute a gradient descent-like algorithm to find the
Lagrange multipliers. This approach is advantageous when it is hard to separate
the data from different classes.
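The following Python sketch illustrates this style of solver (a simplified projected gradient ascent on the SVM dual that ignores the bias equality constraint; in [61] the rows of the kernel matrix, and hence the matrix-vector product below, are split across nodes):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Full kernel (Gram) matrix; this is the object that would be stored
    in slices across the distributed memory."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def dual_gradient_ascent(K, y, C=1.0, lr=0.01, n_iter=1000):
    """Maximize the dual W(a) = sum(a) - 0.5 * a'Qa with Q = diag(y) K diag(y),
    subject to 0 <= a_i <= C (the constraint sum(a_i * y_i) = 0 is omitted,
    i.e. this corresponds to an SVM without a bias term)."""
    alpha = np.zeros(len(y))
    Q = (y[:, None] * y[None, :]) * K
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha          # each node would compute its slice of Q @ alpha
        alpha = np.clip(alpha + lr * grad, 0.0, C)
    return alpha                        # Lagrange multipliers; nonzero entries mark support vectors
```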
Online SVMs have been extensively researched (see, for example, [62]). A
simple though effective idea is the Forgetron algorithm [63]. Here, a "budget"
of support vectors is prespecified. As each new example is received, it is first
classified according to the support vectors held in memory. If it is misclassified, it
is accepted into the learner's memory as a new support vector, the weight of all previously stored support vectors is reduced, and the oldest support vector is discarded once the budget is exceeded.
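The budget idea can be sketched as follows (a simplified budget kernel perceptron; the actual Forgetron [63] uses a carefully chosen per-round shrinking coefficient with a mistake-bound guarantee, whereas the fixed shrink factor below is an illustrative assumption):

```python
import numpy as np
from collections import deque

class BudgetKernelPerceptron:
    """Online kernel classifier with a fixed budget of support vectors,
    in the spirit of the Forgetron [63] (simplified update rule)."""

    def __init__(self, budget=50, gamma=1.0, shrink=0.95):
        self.budget, self.gamma, self.shrink = budget, gamma, shrink
        self.sv = deque()                            # (example, label, weight), oldest first

    def _kernel(self, x, z):
        return np.exp(-self.gamma * np.sum((x - z) ** 2))

    def predict(self, x):
        score = sum(w * yz * self._kernel(x, z) for z, yz, w in self.sv)
        return 1 if score >= 0 else -1

    def partial_fit(self, x, y):
        # 1. Classify with the support vectors currently held in memory.
        if self.predict(x) == y:
            return
        # 2. On a mistake, accept the example as a new support vector ...
        self.sv.append((x, y, 1.0))
        # 3. ... shrink the weight of every stored support vector ...
        self.sv = deque((z, yz, w * self.shrink) for z, yz, w in self.sv)
        # 4. ... and discard the oldest one if the budget is exceeded.
        if len(self.sv) > self.budget:
            self.sv.popleft()
```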