Scientific Data Analysis - Scientific Data Management


typically utilize similarity values between a pair of compounds. This
similarity value is usually computed over a suitable descriptor-space
representation [4] of chemical compounds, which is typically derived from the
two-dimensional topological molecular graph of the chemical compounds. It has
been shown that when this similarity is high, these two-dimensional
descriptor-based methods are very effective in finding compounds that share
similar activity against a biomolecular target [2]. However, the task of
identifying hit compounds is complicated by the fact that the query might have
undesirable properties such as toxicity or bad ADME (absorption, distribution,
metabolism, and excretion) properties, or may be promiscuous [2]. These
properties will also be shared by most of the compounds similar to the query,
as they will correspond to very similar structures. To overcome this problem,
it is important to identify (i.e., rank high) as many chemical compounds as
possible that not only show the desired activity for the biomolecular target
but also have different structures (that is, come from diverse chemical
classes or chemotypes). Finding novel chemotypes using the information of
already known bioactive small molecules is termed scaffold hopping [2].
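
As a rough illustration of the direct, descriptor-based ranking described above, the following minimal sketch scores library compounds against a query. It assumes that each compound is represented as a set of integer feature identifiers and that similarity is measured with the Tanimoto coefficient; the function names, compound labels, and feature values are all hypothetical and are not taken from the cited work.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient: size of the intersection over size of the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_direct_similarity(query: frozenset, library: dict) -> list:
    """Return library compound ids ordered by decreasing similarity to the query."""
    scores = {cid: tanimoto(query, feats) for cid, feats in library.items()}
    # Ties are broken by compound id so the ordering is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical toy data: feature sets standing in for descriptor vectors.
library = {
    "c1": frozenset({1, 2, 3, 4}),
    "c2": frozenset({1, 2, 7}),
    "c3": frozenset({8, 9}),
}
query = frozenset({1, 2, 3})
ranking = rank_by_direct_similarity(query, library)
```

In this toy case the top-ranked compound shares nearly all of the query's features, which is exactly the situation in which undesirable properties of the query are likely to carry over.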

We developed techniques [21], inspired by research in social network analysis,
that measure the similarity between the query and a compound by taking into
account additional information beyond their direct descriptor-space-based
representation. These techniques derive indirect similarities by analyzing the
network connecting the query and the library compounds. This network is
determined using an undirected k-nearest-neighbor graph (NG) and an undirected
k-mutual-nearest-neighbor graph (MG). Both of these graphs contain a node for
each of the compounds as well as a node for the query. However, they differ in
the set of edges that they contain. In the k-nearest-neighbor graph there is
an edge between a pair of nodes corresponding to compounds c_i and c_j if c_i
is in the k-nearest-neighbor list of c_j or vice versa. In the
k-mutual-nearest-neighbor graph, an edge exists only when c_i is in the
k-nearest-neighbor list of c_j and c_j is in the k-nearest-neighbor list of
c_i. The indirect similarity between a pair of nodes is computed as the
Tanimoto coefficient of their adjacency lists, which assigns a high similarity
value to a pair of compounds if they have a large number of common similar
compounds. Thus, the indirect similarity between a pair of compounds will be
high if there are a large number of paths of length two connecting them in the
network.
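
The graph construction and the indirect similarity described above can be sketched as follows. This is an illustrative reading of the text, not the implementation from [21]: it assumes that a symmetric matrix of direct similarities is already available, that node 0 plays the role of the query, and that a node is not included in its own adjacency list (the text does not specify this). All function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def knn_lists(sim, k):
    """For each node, the set of its k most similar other nodes."""
    n = len(sim)
    lists = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: (-sim[i][j], j))
        lists.append(set(others[:k]))
    return lists


def build_graph(sim, k, mutual=False):
    """Adjacency sets of the undirected NG (mutual=False) or MG (mutual=True)."""
    nn = knn_lists(sim, k)
    adj = [set() for _ in sim]
    for i, j in combinations(range(len(sim)), 2):
        i_in_j, j_in_i = i in nn[j], j in nn[i]
        # NG needs either direction; MG needs both directions.
        keep = (i_in_j and j_in_i) if mutual else (i_in_j or j_in_i)
        if keep:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def indirect_similarity(adj, i, j):
    """Tanimoto coefficient of the adjacency lists of nodes i and j."""
    union = len(adj[i] | adj[j])
    return len(adj[i] & adj[j]) / union if union else 0.0


# Hypothetical direct similarities; node 0 stands in for the query.
sim = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
ng = build_graph(sim, k=1)
mg = build_graph(sim, k=1, mutual=True)
query_to_3 = indirect_similarity(ng, 0, 3)
```

In this toy example the query (node 0) and node 3 have a low direct similarity of 0.1, yet their NG-based indirect similarity is 0.5 because both are adjacent to node 2. The MG is sparser, since it keeps only the edge between nodes 0 and 1.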

The performance of indirect similarity-based retrieval strategies based on the
NG and MG graphs was compared with direct similarity based on the Tanimoto
coefficient [21]. The compounds were represented using different descriptor
spaces (GF, ECFP, etc.). The quantitative results showed that indirect
similarity is consistently, and in many cases substantially, better than
direct similarity. Figure 8.2 shows a part of our results in which we compare
MG-based indirect similarity to direct Tanimoto-coefficient similarity
searching using ECFP descriptors. It can be observed from the figure that
indirect similarity outperforms direct similarity for scaffold-hopping active
retrieval in five out of six datasets (COX2, A1A, CDK2, FXa, MAO, and PDE5) on
