Biomedical Engineering Reference
[65]. Such distance functions can be applied to produce (distance-based) compound
rankings. Binary fingerprints consisting of m features can be rationalized as a special
case of an m-dimensional reference space, with each dimension having either unit
length or a length of zero.
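As a minimal sketch of distance-based ranking, the following treats each binary fingerprint as a point in an m-dimensional space with coordinates 0 or 1 and ranks database compounds by their distance to a reference. The fingerprints and compound names are hypothetical, and the Euclidean distance is an assumption made for illustration; the text does not prescribe a particular distance function.

```python
import math

def euclidean_distance(fp_a, fp_b):
    """Euclidean distance between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same number of features")
    # With 0/1 coordinates, each squared difference is 1 for a mismatch and
    # 0 otherwise, so the sum equals the number of mismatching features.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

def rank_by_distance(reference, database):
    """Return (name, distance) pairs sorted by increasing distance."""
    scored = [(name, euclidean_distance(reference, fp))
              for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical 8-feature fingerprints (illustrative values only).
reference = [1, 0, 1, 1, 0, 0, 1, 0]
database = {
    "cpd_A": [1, 0, 1, 1, 0, 0, 1, 0],  # identical to the reference
    "cpd_B": [1, 0, 1, 0, 0, 0, 1, 0],  # one mismatching feature
    "cpd_C": [0, 1, 0, 0, 1, 1, 0, 1],  # all eight features mismatch
}

ranking = rank_by_distance(reference, database)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

Compounds at the top of the ranking are closest to the reference in this fingerprint space. Other distance or similarity measures could be substituted without changing the ranking logic.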
Chemical space navigation is usually more complicated in high- than in low-
dimensional descriptor spaces and, accordingly, the paradigm of low dimensionality
has been a cornerstone of activity-oriented compound classification. Low dimension-
ality of chemical space representations can either be achieved by a priori design of
low-dimensional spaces through the use of orthogonal multicomponent descriptors
[66] or, alternatively, by dimension reduction of original high-dimensional descrip-
tor spaces. For dimension reduction, statistical methods such as principal component
analysis [67] or multidimensional scaling [68] are available. However, it has also been
shown that low-dimensional spaces are not essential for effective compound classifi-
cation and LBVS. For example, mapping algorithms have been introduced to assign
database compounds to combinations of descriptor value ranges that are defined by
sets of reference molecules in chemical spaces of iteratively increasing dimensionality
[69]. Dimensionality is increased by adding new descriptors in each iteration. Following
this approach, database compounds that are most similar to references are identified
by increasing the dimensionality and hence chemical resolution of descriptor spaces.
Dimension extension is terminated when only a small database selection set remains.
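The dimension extension idea described above can be illustrated with a deliberately simplified sketch. It is not the published mapping algorithm of [69]: here each descriptor contributes a single value range spanned by the reference molecules, descriptors are added in a fixed, user-supplied order, and the stopping threshold `max_selected` is a hypothetical parameter. All descriptor names and values are invented for illustration.

```python
def map_by_dimension_extension(references, database, descriptor_order, max_selected):
    """Filter database compounds by reference-defined value ranges,
    adding one descriptor (dimension) per iteration."""
    selected = dict(database)
    used = []
    for descriptor in descriptor_order:
        if len(selected) <= max_selected:
            break  # the selection set is small enough; stop extending
        values = [ref[descriptor] for ref in references]
        low, high = min(values), max(values)
        # Keep only compounds inside the range spanned by the references.
        selected = {name: desc for name, desc in selected.items()
                    if low <= desc[descriptor] <= high}
        used.append(descriptor)
    return selected, used

# Hypothetical reference molecules and database compounds.
references = [
    {"logP": 2.0, "MW": 300.0, "HBD": 1},
    {"logP": 3.0, "MW": 350.0, "HBD": 2},
]
database = {
    "d1": {"logP": 2.5, "MW": 320.0, "HBD": 1},
    "d2": {"logP": 2.8, "MW": 500.0, "HBD": 2},
    "d3": {"logP": 5.0, "MW": 310.0, "HBD": 1},
    "d4": {"logP": 2.2, "MW": 340.0, "HBD": 5},
}

selected, used = map_by_dimension_extension(
    references, database, ["logP", "MW", "HBD"], max_selected=2)
print(sorted(selected), used)
```

In this toy run the selection set shrinks from four to three compounds after the first descriptor and to two after the second, at which point the extension stops before the third descriptor is used.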
15.7.2 Clustering and Partitioning
Clustering is a classical approach to compound classification for which many differ-
ent algorithms have been introduced [70]. In LBVS, cluster analysis is carried out
by adding active reference compounds to a database and inspecting the clusters
into which the references fall. It is then assumed that compounds in these clusters might
be good candidates for hit identification. As a form of unsupervised learning, clus-
tering depends on distance relationships in chemical reference space, as discussed
above. Regardless of the specifics of a clustering algorithm (e.g., whether it is
hierarchical or nonhierarchical) [70], the clustering process involves pairwise
comparison of compound distances, which generally precludes applying
clustering methods to very large compound data sets. Consequently, as screening
databases have grown substantially in size, partitioning methods have become
popular for LBVS [71], especially cell-based partitioning, which is based on the low-
dimensionality paradigm for chemical space representations [66,71]. The principal
difference between clustering and partitioning is that the latter approach does not
depend on pairwise compound distance comparisons. Rather, in cell-based partition-
ing, compounds are assigned to subsections of orthogonal low-dimensional reference
space (e.g., descriptor spaces with six to eight dimensions). These sections, called
cells, are obtained by binning of descriptor axes. Assignment of compounds to such
cells is done based on their descriptor coordinates. Candidate compounds are then
selected that populate the same cell(s) as reference molecules.
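Cell-based partitioning as described above can be sketched as follows. This is a minimal illustration under stated assumptions: the axes are binned into equal-width intervals (other binning schemes are possible), only two dimensions are used for brevity rather than the six to eight mentioned in the text, and all names, axis ranges, and coordinates are hypothetical.

```python
def cell_index(coords, axis_ranges, bins_per_axis):
    """Map descriptor coordinates to a tuple of bin indices (the cell)."""
    cell = []
    for value, (low, high) in zip(coords, axis_ranges):
        width = (high - low) / bins_per_axis
        idx = int((value - low) / width)
        # Clamp values on or outside the axis limits into the edge bins.
        idx = min(max(idx, 0), bins_per_axis - 1)
        cell.append(idx)
    return tuple(cell)

def select_candidates(references, database, axis_ranges, bins_per_axis):
    """Return database compounds that share a cell with any reference."""
    reference_cells = {cell_index(c, axis_ranges, bins_per_axis)
                       for c in references.values()}
    return sorted(name for name, c in database.items()
                  if cell_index(c, axis_ranges, bins_per_axis) in reference_cells)

# Hypothetical 2-D descriptor space, each axis spanning 0 to 10 with 5 bins.
axis_ranges = [(0.0, 10.0), (0.0, 10.0)]
references = {"ref1": (1.0, 9.0)}
database = {
    "d1": (1.5, 8.5),  # same cell as ref1
    "d2": (2.5, 8.5),  # neighboring cell along the first axis
    "d3": (0.2, 9.9),  # same cell as ref1
    "d4": (5.0, 5.0),  # distant cell
}

candidates = select_candidates(references, database, axis_ranges, bins_per_axis=5)
print(candidates)
```

Each compound is assigned to its cell independently from its own coordinates, so the procedure needs a single pass over the database and no pairwise distance comparisons.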
Despite the long history of clustering in compound classification, the development of advanced
clustering algorithms continues to be an area of active research. Recently, emphasis has