Database Reference
In-Depth Information
Fig. 4.5 Examples of
biological sequences
A T C C G
T A C A G T T C A G T A G C A
A T C C G C A C A G G T C A G T A G C A
A T C T G C A C A G G T C A
G C
G G A G C A
A G C A
A T C A G C A C A G G T C A
Pattern ( W 1 , W 2 )
= M 1 , g 1 , M 2 , g 2 , ... , M k
, where M i ,1
i k denote the max-
imal matching substrings shared by W 1 and W 2 . g i ,1
k denote the number
of mismatches in the i th gap. In this example of S 1 and S 2 , Pattern ( S 1 , S 2 )
i
=
.
For any two substrings S and P , it is observed that (1) Pattern ( S , P ) is uniquely
defined, and (2) Pattern ( S , P )= Pattern ( P , S ).
One can therefore define the notion of a strand , which is a set of substrings that
share one same matching pattern.
AT CCG ,1, ACAG ,1, TCAGTTGCA
Definition 4.8
A set U of substrings U
={
S 1 , ... , S k }
is a strand if and only if
(1) for any two pairs of substrings
{
S i 1 , S j 1 }
and
{
S i 2 , S j 2 }
of U , Pattern ( S i 1 , S j 1 )
=
Pattern ( S i 2 , S j 2 ) .
Based on the idea of a strand, the following approach can be used to decide if
a given substring P is a FAS. Find all the closed valid strands of P and let the
union of them be X . P is a FAS if and only if the cardinality of X is at least θ .
Consider the example of Fig. 4.5 in which the error tolerance is 0 . 1 and minimum
frequency threshold is 4. Both strands
are valid. Suppose
these two strands are also closed, then combining them one gets a support set of size
4, satisfying the frequency requirement. As such, S 2 is a FAS.
On the highest level, the algorithm works in two steps.
{
S 1 , S 2 }
and
{
S 2 , S 3 , S 4 }
1. Growing Strand
Compute a set of closed valid strands initially. Mine out all closed valid strands
by iteratively growing current ones on both ends. Let the result set be X .
2. Grouping Strand
For each distinct substring P in the closed valid strands of X , group all the strands
which contain P by taking the union of them. If the result of the union contains
at least θ members, report P as a FAS.
The set of initial strands is the set of all maximal exact repeats. More precisely, for
each initial strand U , Pat ( U )
0 and U is closed. These initial
strands are computed by InitStrand using the suffix tree of the input sequence S .
Similar approach has been used in REPuter [ 14 ] to mine exact repeats. A suffix tree
is a data structure that compactly encodes the internal structure of a string. As such,
it can be used to solve some complicated string problems in linear time. In particular,
it enables us to mine out all frequent maximal exact-matching substrings of S with a
running time linear in the length of S . When growing a current strand, the algorithm
scans the entire tape and, for each strand encountered, checks on both ends to see if
the current strand can be grown by assembling neighboring strands.
=
M 1
, Miss ( U )
=
 
Search WWH ::




Custom Search