Biomedical Engineering Reference
In-Depth Information
computed, namely support and confidence . The former, denoted as sup .R;T/,es-
timates the probability P.A ^ C/ by means of the percentage of data structures
in T satisfying both A and C. The latter, denoted as conf .R;T/, estimates the
probability P.C j A/ by means of the percentage of data structures which satisfy
condition C, out of those which satisfy condition A. The task of association rule
mining consists in discovering all rules whose support and confidence values exceed
two respective minimum thresholds. When data structures describe spatial objects
together with their spatial relationships, mined association rules are called spatial ,
since conditions in either the antecedent or the consequent of a rule express some
form of spatial constraint.
We now give a formal statement of the module discovery problem, which is
decomposed into two subproblems as follows:
1. Given: AsetM of single motifs, a set T of sequences with annotations about
type and position of motifs in M and a minimum value min ,
Find: The collection
of all the sets S 1 , S 2 ;:::;S n of single motifs such that,
for each S i , at least min sequences in T contain all motifs in S i .
2. Given: AsetS 2 S
S
and two thresholds min and min ,
Find: Spatial association rules involving motifs in S, such that their support and
confidence are greater than min and min , respectively.
Single motifs in M can be either discovered de novo or taken from a single motif
database. Each S i 2 S
is called motif set .The support set of S i is the subset T S i of
sequences in T such that each sequence in T S i contains at least one occurrence of
each motif in S i . According to the statement of subproblem (1) j T S i j min . T S i is
used to evaluate both support and confidence of spatial association rules mentioned
in subproblem (2).
The proposed approach is two-stepped since it reflects this problem decompo-
sition. In the first step, motif sets which are frequent , i.e., have a support greater
than min , are extracted from sequences annotated with predictions for known single
motifs. Only information about the occurrence of motifs is considered, while spa-
tial distribution of motifs is ignored. This step has a manifold purpose: (1) enabling
biologists to guide deeper analysis only for sets of motifs which are deemed poten-
tially interesting; (2) filtering out sequences which do not include those interesting
sets of motifs; (3) lowering the computational cost of the second step.
In the second step, sequences that support specific frequent motif sets are ab-
stracted into sequences of spaced motifs . A sequence of spaced motifs is defined as
an ordered collection of motifs interleaved with inter-motif distances. Each inter-
motif distance measures the distance between the last nucleotide of a motif and
the first nucleotide of the next motif in the sequence. Spatial association rules are
mined from these abstractions. In order to deal with numerical information on the
inter-motif distance, a discretization algorithm is applied. The algorithm takes into
account the distribution of the examples and does not significantly depend on input
parameters as in the case of classical equal width or equal frequency discretization
algorithms. Details on both steps are reported below.
Search WWH ::




Custom Search