Biomedical Engineering Reference
In-Depth Information
descriptors. This setting is common for machine learning approaches. Neural net-
work simulations build descriptor-based models for class label prediction by deriving
pathways through arrays of computational neurons that best distinguish between pos-
itive and negative training examples. Once the model is built, it is used to predict the
class label (active versus inactive) of screening database compounds. However, the
model does not reveal why a compound is predicted to be active or inactive; this
information remains hidden. The same limitation applies to self-organizing maps
(SOMs), a special neural network architecture
designed to map compounds from descriptor reference spaces onto a two-dimensional
neuron grid. The SOM is trained to group positive and negative training examples on
distinct regions of the map and separate them from each other. Then, test compounds
are projected onto the SOM. Because SOMs start from higher-dimensional descriptor
spaces, this approach is also a dimension reduction method. Like other neural
networks, a trained SOM does not reveal why a compound was assigned to an active
region of the neuron grid. By contrast, decision trees separate training compounds
along descriptor pathways. Each descriptor represents a decision point to divide a
learning set along the tree. Typically, a yes/no decision records the presence or
absence of a feature or, alternatively, whether compounds fall into a specific value
range of a chosen numerical descriptor. During training, trees are constructed that
best separate active and inactive compounds
in terminal leaf nodes. A model is derived by recursively partitioning compounds
along a tree that yields a meaningful separation. When such a separation is found,
combinations of selected descriptors form pathways that are signatures of a given
biological activity, whereas other pathways enrich inactive compounds along the
tree. Importantly, pathways in a decision tree are directly interpretable as
feature/value-range sequences that establish classification rules, in contrast to
methods with black-box character.
The tree structure so derived is then used to screen a database for active compounds.
Ensembles of independently derived decision trees, capturing different descriptors
and pathways, are often combined to yield random forest models where predictions
made independently are subjected to consensus scoring schemes [83,84]. Such random
forest models typically improve LBVS performance beyond that of the individual
decision trees.
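The idea of interpretable descriptor pathways and random-forest consensus voting can be sketched in a few lines of Python. The descriptor names, thresholds, and trees below are purely illustrative assumptions, not taken from the text; a real implementation would learn the trees from training data rather than hard-code them.

```python
# Illustrative sketch (hypothetical descriptors and thresholds):
# each tree encodes one descriptor pathway as yes/no and value-range
# decisions; a "forest" combines independent trees by majority vote.

def tree_a(compound):
    if compound["has_aromatic_ring"]:          # yes/no feature decision
        if 1.0 <= compound["logP"] <= 4.0:     # numerical value-range decision
            return "active"
    return "inactive"

def tree_b(compound):
    if compound["h_bond_donors"] <= 5:
        if compound["mol_weight"] < 500.0:
            return "active"
    return "inactive"

def tree_c(compound):
    if compound["has_aromatic_ring"] and compound["mol_weight"] < 450.0:
        return "active"
    return "inactive"

def forest_predict(compound, trees):
    # Consensus scoring: simple majority vote over independent trees.
    votes = [tree(compound) for tree in trees]
    return max(set(votes), key=votes.count)

test_compound = {
    "has_aromatic_ring": True,
    "logP": 2.3,
    "h_bond_donors": 2,
    "mol_weight": 320.0,
}
print(forest_predict(test_compound, [tree_a, tree_b, tree_c]))  # -> active
```

Each pathway here is directly readable as a classification rule (e.g., aromatic ring present and logP between 1 and 4), which is precisely the interpretability advantage of trees over black-box models noted above.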
15.8.2 Support Vector Machines
Currently, the most popular machine learning approaches in LBVS are SVMs and
Bayesian classifiers. SVMs represent a class of algorithms that project training sets that
are not linearly separable into chemical reference spaces of higher dimensionality
where a separating hyperplane can be derived. Thus, SVMs are designed to depart in
the opposite direction from the low-dimensionality paradigm that provides the basis
for cell-based partitioning, as discussed above. In high-dimensional space
representations, SVMs construct a maximum-margin hyperplane that yields the largest
possible distance to the nearest positive and negative training examples. A key aspect
of SVM modeling is that high-dimensional descriptor spaces are never explicitly
constructed and mapped. Rather, a kernel function is applied to determine the degree of
similarity between compounds in a higher-dimensional representation. For example,
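A minimal toy illustration of this kernel idea, assuming a degree-2 polynomial kernel over two-dimensional descriptor vectors (an assumed example, not from the text): the kernel value computed in the original low-dimensional space equals a dot product in an explicitly mapped higher-dimensional feature space, so the mapping itself never has to be carried out.

```python
import math

# Degree-2 polynomial kernel k(x, y) = (x . y)^2,
# evaluated entirely in the ORIGINAL 2-D descriptor space.
def poly_kernel(x, y):
    dot = x[0] * y[0] + x[1] * y[1]
    return dot ** 2

# The corresponding explicit 3-D feature map
# phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
def explicit_map(x):
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

def dot3(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, y = (1.0, 2.0), (3.0, 0.5)
# Both routes give the same similarity value (16.0), but the kernel
# never constructs the higher-dimensional representation.
print(poly_kernel(x, y), dot3(explicit_map(x), explicit_map(y)))
```

This is why SVMs scale to very high-dimensional (even implicit, infinite-dimensional) descriptor spaces: only pairwise kernel values between compounds are ever computed.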