Mining Spatial Association Rules for Composite Motif Discovery - Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Biomedical Engineering Reference

In-Depth Information

computed, namely support and confidence . The former, denoted as sup .R;T/,es-

timates the probability P.A ^ C/ by means of the percentage of data structures

in T satisfying both A and C. The latter, denoted as conf .R;T/, estimates the

probability P.C j A/ by means of the percentage of data structures which satisfy

condition C, out of those which satisfy condition A. The task of association rule

mining consists in discovering all rules whose support and confidence values exceed

two respective minimum thresholds. When data structures describe spatial objects

together with their spatial relationships, mined association rules are called spatial ,

since conditions in either the antecedent or the consequent of a rule express some

form of spatial constraint.

We now give a formal statement of the module discovery problem, which is

decomposed into two subproblems as follows:

1. Given: AsetM of single motifs, a set T of sequences with annotations about

type and position of motifs in M and a minimum value min ,

Find: The collection

of all the sets S 1 , S 2 ;:::;S n of single motifs such that,

for each S i , at least min sequences in T contain all motifs in S i .

2. Given: AsetS 2 S

S

and two thresholds min and min ,

Find: Spatial association rules involving motifs in S, such that their support and

confidence are greater than min and min , respectively.

Single motifs in M can be either discovered de novo or taken from a single motif

database. Each S i 2 S

is called motif set .The support set of S i is the subset T S i of

sequences in T such that each sequence in T S i contains at least one occurrence of

each motif in S i . According to the statement of subproblem (1) j T S i j min . T S i is

used to evaluate both support and confidence of spatial association rules mentioned

in subproblem (2).

The proposed approach is two-stepped since it reflects this problem decompo-

sition. In the first step, motif sets which are frequent , i.e., have a support greater

than min , are extracted from sequences annotated with predictions for known single

motifs. Only information about the occurrence of motifs is considered, while spa-

tial distribution of motifs is ignored. This step has a manifold purpose: (1) enabling

biologists to guide deeper analysis only for sets of motifs which are deemed poten-

tially interesting; (2) filtering out sequences which do not include those interesting

sets of motifs; (3) lowering the computational cost of the second step.

In the second step, sequences that support specific frequent motif sets are ab-

stracted into sequences of spaced motifs . A sequence of spaced motifs is defined as

an ordered collection of motifs interleaved with inter-motif distances. Each inter-

motif distance measures the distance between the last nucleotide of a motif and

the first nucleotide of the next motif in the sequence. Spatial association rules are

mined from these abstractions. In order to deal with numerical information on the

inter-motif distance, a discretization algorithm is applied. The algorithm takes into

account the distribution of the examples and does not significantly depend on input

parameters as in the case of classical equal width or equal frequency discretization

algorithms. Details on both steps are reported below.

Mathematical Approaches to Polymer Sequence Analysis and Related Problems

Search WWH ::

Custom Search

Home