Mining Spatial Association Rules for Composite Motif Discovery - Mathematical Approaches to Polymer Sequence Analysis and Related Problems


of a given length. Conversely, a probabilistic framework is more expressive, since it

relaxes the hard constraints of discrete frameworks and associates each module with

a score which is a combination (e.g., the sum) of motif and distance scores. The main

drawbacks of probabilistic frameworks are local optima and the harder interpretability of results.

A recent assessment of eight published methods for module discovery [21] has

shown that no single method performed consistently better than the others in all situations

and that there are still advances to be made in computational module discovery.

In this chapter, we propose an innovative approach to module discovery, which can

be a useful supplement or alternative to other well-known approaches. The idea is

to mine rules which define “strong” spatial associations between single motifs [27].

Single motifs might either be de novo discovered by traditional discovery algorithms

or taken from databases of known motifs.

The spatial relationships considered in this work are the order of motifs along

the DNA sequence and the inter-motif distance between each pair of consecutive

motifs, although the mining method proposed to generate spatial association rules

places no limitation on either the number or the nature of spatial relationships. The

association rule mining method is based on an inductive logic programming (ILP) [31]

formulation according to which both data and discovered patterns are represented in

a first-order logic formalism. This formulation also facilitates the accommodation

of diverse sources of domain (or background) knowledge which are expressed in a

declarative way. Indeed, ILP is particularly well suited to bioinformatics tasks due

to its ability both to take into account background knowledge and to work directly

with structured data [30]. This is confirmed by some notable successes in molecular

biology applications, such as predicting carcinogenesis [44, 45].
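The two spatial relationships named above (motif order and inter-motif distance between consecutive motifs) can be illustrated with a minimal Python sketch. It is not the chapter's ILP representation: the `Hit` type, the motif labels, and the choice to measure distance as the gap between the end of one motif and the start of the next are all assumptions made for illustration.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    motif: str   # motif label (hypothetical names)
    start: int   # 0-based start position on the sequence
    end: int     # end position, exclusive

def spatial_facts(hits):
    """Return (left_motif, right_motif, distance) for each consecutive pair.

    Assumption: distance is the gap between the end of the left motif
    and the start of the right one (negative if they overlap).
    """
    ordered = sorted(hits, key=lambda h: h.start)  # order along the sequence
    facts = []
    for left, right in zip(ordered, ordered[1:]):
        facts.append((left.motif, right.motif, right.start - left.end))
    return facts

hits = [Hit("m2", 40, 48), Hit("m1", 10, 18), Hit("m3", 100, 106)]
facts = spatial_facts(hits)
# facts == [("m1", "m2", 22), ("m2", "m3", 52)]
```

Each tuple plays the role of a ground fact about one sequence; in a first-order setting such facts would be written as logical atoms rather than Python tuples.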

The proposed approach is based on a discrete framework, which presents several

advantages, the most relevant being the straightforward interpretation of rules, but

also some disadvantages, such as the hard discretization of numerical inter-motif

distances or the choice of a minimum support threshold. To overcome these issues,

some computational solutions have been developed and tested.
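To see why hard discretization of inter-motif distances is a concern, consider the following sketch. It uses equal-width bins, which is an assumption chosen only for illustration; the function name `discretize` and the bin width of 50 are hypothetical and do not describe the solutions developed in the chapter.

```python
def discretize(distance, width=50):
    """Map a distance to the half-open bin [low, low + width) containing it."""
    low = (distance // width) * width
    return (low, low + width)

a = discretize(49)   # (0, 50)
b = discretize(51)   # (50, 100)
c = discretize(99)   # (50, 100)
```

Distances 49 and 51 differ by only two positions yet land in different bins, while 51 and 99 share a bin. A rule mined over such bins can therefore miss modules whose distances straddle a bin boundary.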

The specific features of this approach are:

• An original perspective of module discovery as a spatial association rule mining task;

• A logic-based approach where background knowledge can be expressed in a declarative way;

• A procedure for the automated selection of some parameters which are difficult to properly set;

• Some computational solutions to overcome the discretization issues of discrete approaches.

These features give our module discovery tool several advantages over

competing approaches. First, spatial association rules, which take

the form of A ⇒ C, provide insight both into the support of the module (represented

by A ∧ C) and into the confidence of possible predictions of C given A.
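As a rough illustration of these two measures, the sketch below computes the support of A ∧ C and the confidence of C given A using the standard association rule definitions. It is a propositional simplification, not the first-order formulation used in the chapter: each sequence is reduced to a set of labels, and the labels themselves (`m1`, `m2`, `m1_before_m2`) are hypothetical.

```python
def support(sequences, items):
    """Fraction of sequences that contain every item in `items`."""
    items = set(items)
    covered = sum(1 for seq in sequences if items <= seq)
    return covered / len(sequences)

def confidence(sequences, antecedent, consequent):
    """support(A and C) / support(A); 0.0 when A never occurs."""
    s_a = support(sequences, antecedent)
    if s_a == 0:
        return 0.0
    return support(sequences, set(antecedent) | set(consequent)) / s_a

# Toy data: each set lists the motifs and spatial relations seen in one sequence.
sequences = [
    {"m1", "m2", "m1_before_m2"},
    {"m1", "m2", "m1_before_m2"},
    {"m1", "m2"},
    {"m3"},
]
A = {"m1", "m2"}
C = {"m1_before_m2"}
rule_support = support(sequences, A | C)        # 2 of 4 sequences: 0.5
rule_confidence = confidence(sequences, A, C)   # 0.5 / 0.75, about 0.667
```

Here the consequent is a spatial relationship, matching the observation that predictions may concern spatial relationships as well as motif properties.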

Predictions may equally concern both properties of motifs (e.g., their type) and spatial

relationships (e.g., the inter-motif distance). Second, the declarative knowledge
