Biomedical Engineering Reference
In-Depth Information
representation facilitates the development and debugging of background knowledge
in collaboration with a domain expert. Moreover, knowledge expressed in a declarative
way is re-usable across different tasks and domains, thus easing the burden of
the knowledge engineering effort. Third, the use of first-order (or relational) logic
facilitates the representation of input sequences, whose structure can be arbitrarily
complex, and increases the explanatory power of discovered patterns, which are
relatively easy to interpret for domain experts. Fourth, computational solutions
devised for both the problem of selecting a minimum support threshold and the
problem of discretizing numerical data fulfill the twofold goal of improving the
quality of results and designing tools for the actual end-users, namely biologists.
Further significant advantages are:
- No prior assumption is necessary either on the constituent motifs of a module or
  on their spatial distribution;
- Specific information on the bases occurring between two consecutive motifs is
  not required.
This work also extends our previous study [48], where frequent patterns are
generated by means of the GSP algorithm [3]. The extension aims to: (1) find
association rules, which convey additional information with respect to frequent patterns;
(2) discover more significant inter-motif distances by means of a new discretization
algorithm which does not require input parameters; (3) automatically select the
best minimum support threshold; (4) filter redundant rules; (5) investigate a new
application of an ILP algorithm to a challenging bioinformatics task.
The chapter is organized as follows. Section 5.2 presents a formalization of the
problem, which is decomposed into two subproblems: (1) mining frequent sets of
motifs, and (2) mining spatial association rules. Input and output of each step of the
proposed approach are also reported. Section 5.3 describes the method for spatial
association rule mining. Section 5.4 presents the solution to some methodological
and architectural problems which affect the implementation of a module discovery
tool effectively usable by biologists. Section 5.5 is devoted to a case study, which
shows the application of the developed system. Finally, conclusions are drawn.
5.2 Mining Spatial Association Rules from Sequences
Before proceeding to a formalization of the problem, we first introduce some general
notions on association rules.
Association rules are a class of patterns that describe regularities or co-occurrence
relationships in a set T of homogeneous data structures (e.g., sets,
sequences and so on) [2]. Formally, an association rule R is expressed in the form
A ⇒ C, where A (the antecedent) and C (the consequent) are disjoint conditions
on properties of data structures (e.g., the presence of an item in a set). The meaning
of an association rule is quite intuitive: if a data structure satisfies A, then it is
likely to satisfy C. To quantify this likelihood, two statistical parameters are usually