approach is applicable to both chemical and biological data. An example of an ap-
plication which is relevant to both domains is that of finding relevant substructures
of molecules [ 25 , 40 ]. In the following, we will discuss some useful applications in
both domains.
12.1 Chemical Applications
Since frequent pattern mining is closely related to classification, as discussed
earlier, many methods have been developed for predictive tasks with the use of fre-
quent pattern mining. Examples of such tasks include carcinogenesis prediction [ 117 ]
and predictive toxicology evaluation [ 118 ]. Key characteristics of compounds can
often be captured by descriptor-based representations [ 24 , 72 ].
The properties which are tracked are generally structure-driven, and may correspond
to activity, toxicity, absorption, distribution, metabolism and excretion [ 24 ]. A nat-
ural way of mining these descriptors is with the use of algorithms such as frequent
subgraph mining. Frequent subgraphs of a chemical graph database are defined as
all subgraphs that are present in at least a certain minimum number of compounds
in the database. This threshold is essentially the minimum support requirement, and
the resulting frequent subgraphs define the descriptors for the compounds in the
database. The main challenge here is that the optimum value of the minimum support
may not be known a priori for a given database. Nevertheless, since different data
sets may contain different numbers of descriptors, with different supports, sizes,
and shapes, such an approach provides some flexibility through the use of the minimum
support parameter, as long as an effective approach for tuning is available. Such
descriptors are quite useful
for chemical compound classification, since they encode important properties of the
chemical compound, which may be very relevant to classification. An example of
such an approach is discussed in [ 41 ], which uses the descriptors defined by frequent
subgraphs for chemical compound classification.
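
The support-based definition above can be illustrated with a minimal sketch. It is not the method of [ 41 ]: to stay short and avoid general subgraph isomorphism, it restricts patterns to labeled simple paths of at most two bonds, and the toy compounds, the function names (path_patterns, frequent_patterns, descriptor_vectors) and the parameter max_edges are all assumptions made for illustration.

```python
from collections import defaultdict


def path_patterns(atoms, bonds, max_edges=2):
    """Collect the labeled simple paths (1..max_edges bonds) of one compound.

    Simplification: paths stand in for general subgraphs. Each path is
    stored as a tuple of atom labels, canonicalized so that a path and
    its reverse map to the same pattern.
    """
    adj = defaultdict(set)
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    found = set()

    def extend(path):
        if len(path) >= 2:  # keep only patterns with at least one bond
            labels = tuple(atoms[v] for v in path)
            found.add(min(labels, labels[::-1]))
        if len(path) - 1 < max_edges:
            for nxt in adj[path[-1]]:
                if nxt not in path:  # simple paths only
                    extend(path + [nxt])

    for start in atoms:
        extend([start])
    return found


def frequent_patterns(compounds, min_support, max_edges=2):
    """Keep the patterns present in at least min_support compounds."""
    support = defaultdict(int)
    for atoms, bonds in compounds:
        # A compound contributes at most 1 to the support of a pattern.
        for pattern in path_patterns(atoms, bonds, max_edges):
            support[pattern] += 1
    return {p: s for p, s in support.items() if s >= min_support}


def descriptor_vectors(compounds, patterns, max_edges=2):
    """Encode each compound as a 0/1 vector over the frequent patterns."""
    ordered = sorted(patterns)
    vectors = []
    for atoms, bonds in compounds:
        present = path_patterns(atoms, bonds, max_edges)
        vectors.append([1 if p in present else 0 for p in ordered])
    return ordered, vectors


# Hypothetical toy database (heavy atoms only): ethanol, methanol, ethane.
compounds = [
    ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    ({0: "C", 1: "O"}, [(0, 1)]),
    ({0: "C", 1: "C"}, [(0, 1)]),
]

frequent = frequent_patterns(compounds, min_support=2)
ordered, vectors = descriptor_vectors(compounds, frequent)
print(ordered)
print(vectors)
```

Lowering min_support to 1 in this toy database also admits the C-C-O path, which occurs only in the first compound. This is the sensitivity to the support threshold that the tuning remark above refers to. The resulting 0/1 vectors could then be passed to any standard classifier.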
12.2 Biological Applications
Biological data is available either in the form of sequence data or graph-structured
data. In both cases, frequent pattern mining methods can be very helpful in
discovering different kinds of insights. Much biological and microarray data can
be expressed as sequences in its most simplified form. In these cases, many
algorithms have been developed to determine useful frequent patterns from these
sequences [ 34 , 35 , 89 , 104 , 105 , 111 , 125 , 123 ]. One special characteristic of bio-
logical data is that the number of rows may not be too large, but each individual row
may be very long. As a result, row-enumeration techniques are often used in such
scenarios. Such patterns provide an idea of the characteristics of the underlying data,
and may also be used for other data mining tasks such as classification. The issue