Mining Spatial Association Rules for Composite Motif Discovery - Mathematical Approaches to Polymer Sequence Analysis and Related Problems

representation facilitates the development and debugging of background knowledge

in collaboration with a domain expert. Moreover, knowledge expressed in a declarative

way is re-usable across different tasks and domains, thus easing the burden of

the knowledge engineering effort. Third, the use of first-order (or relational) logic

facilitates the representation of input sequences, whose structure can be arbitrarily

complex, and increases the explanatory power of discovered patterns, which are

relatively easy to interpret for domain experts. Fourth, computational solutions

devised for both the problem of selecting a minimum support threshold and the

problem of discretizing numerical data fulfill the twofold goal of improving the

quality of results and designing tools for the actual end-users, namely biologists.

Further significant advantages are:

No prior assumption is necessary either on the constituent motifs of a module or

on their spatial distribution;

Specific information on the bases occurring between two consecutive motifs is

not required.

This work also extends our previous study [48], where frequent patterns are

generated by means of the GSP algorithm [3]. The extension aims to: (1) find association

rules, which convey additional information with respect to frequent patterns;

(2) discover more significant inter-motif distances by means of a new discretization

algorithm which does not require input parameters; (3) automatically select the

best minimum support threshold; (4) filter redundant rules; (5) investigate a new

application of an ILP algorithm to a challenging bioinformatics task.

The chapter is organized as follows. Section 5.2 presents a formalization of the

problem, which is decomposed into two subproblems: (1) mining frequent sets of

motifs, and (2) mining spatial association rules. Input and output of each step of the

proposed approach are also reported. Section 5.3 describes the method for spatial

association rule mining. Section 5.4 presents the solution to some methodological

and architectural problems which affect the implementation of a module discovery

tool effectively usable by biologists. Section 5.5 is devoted to a case study, which

shows the application of the developed system. Finally, conclusions are drawn.

5.2 Mining Spatial Association Rules from Sequences

Before proceeding to a formalization of the problem, we first introduce some general

notions on association rules.

Association rules are a class of patterns that describe regularities or co-occurrence

relationships in a set T of homogeneous data structures (e.g., sets,

sequences and so on) [2]. Formally, an association rule R is expressed in the form

A ⇒ C, where A (the antecedent) and C (the consequent) are disjoint conditions

on properties of data structures (e.g., the presence of an item in a set). The meaning

of an association rule is quite intuitive: if a data structure satisfies A, then it is

likely to satisfy C. To quantify this likelihood, two statistical parameters are usually
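
As an illustration of the rule form A ⇒ C, the following minimal sketch treats each data structure in T as a set of items and each condition as "these items are present". It assumes that the likelihood is measured with support and confidence, the two measures most commonly used for association rules; the function names and the toy motif labels (m1, m2, m3) are hypothetical and are not taken from the chapter.

```python
from typing import FrozenSet, Sequence

ItemSet = FrozenSet[str]


def support(data: Sequence[ItemSet], items: ItemSet) -> float:
    """Fraction of data structures in `data` that contain every item in `items`."""
    if not data:
        return 0.0
    return sum(1 for t in data if items <= t) / len(data)


def confidence(data: Sequence[ItemSet], antecedent: ItemSet, consequent: ItemSet) -> float:
    """Estimate of how often C holds among the data structures that satisfy A."""
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")
    support_a = support(data, antecedent)
    if support_a == 0.0:
        return 0.0  # the rule is never applicable in this data set
    return support(data, antecedent | consequent) / support_a


# Toy data set T: each set lists the (hypothetical) motifs found in one sequence.
T = [
    frozenset({"m1", "m2", "m3"}),
    frozenset({"m1", "m2"}),
    frozenset({"m1", "m3"}),
    frozenset({"m2", "m3"}),
]

A = frozenset({"m1"})  # antecedent: motif m1 is present
C = frozenset({"m2"})  # consequent: motif m2 is present

rule_support = support(T, A | C)        # A and C hold together in 2 of 4 sets
rule_confidence = confidence(T, A, C)   # 2 of the 3 sets satisfying A also satisfy C
```

In this toy data set the rule m1 ⇒ m2 holds in half of the sequences and in two thirds of the sequences that contain m1. The sketch covers only plain item sets; it does not model the spatial relations between motifs that the chapter goes on to formalize.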
