Database Reference
In-Depth Information
Fig. 4.5
Examples of
biological sequences
A T C C G
T
A C A G
T
T C A G T A G C A
A T C
C
G C A C A G G T C A G
T
A G C A
A T C T G C A C A G G T C A
G C
G G A G C A
A G C A
A T C A G C A C A G G T C A
Pattern
(
W
1
,
W
2
)
=
M
1
,
g
1
,
M
2
,
g
2
,
...
,
M
k
, where
M
i
,1
≤
i
≤
k
denote the max-
imal matching substrings shared by
W
1
and
W
2
.
g
i
,1
k
denote the number
of mismatches in the
i
th gap. In this example of
S
1
and
S
2
,
Pattern
(
S
1
,
S
2
)
≤
i
≤
=
.
For any two substrings
S
and
P
, it is observed that (1)
Pattern
(
S
,
P
) is uniquely
defined, and (2)
Pattern
(
S
,
P
)=
Pattern
(
P
,
S
).
One can therefore define the notion of a
strand
, which is a set of substrings that
share one same matching pattern.
AT CCG
,1,
ACAG
,1,
TCAGTTGCA
Definition 4.8
A set U of substrings U
={
S
1
,
...
,
S
k
}
is a strand if and only if
(1) for any two pairs of substrings
{
S
i
1
,
S
j
1
}
and
{
S
i
2
,
S
j
2
}
of U , Pattern
(
S
i
1
,
S
j
1
)
=
Pattern
(
S
i
2
,
S
j
2
)
.
Based on the idea of a strand, the following approach can be used to decide if
a given substring
P
is a FAS. Find all the closed valid strands of
P
and let the
union of them be
X
.
P
is a FAS if and only if the cardinality of
X
is at least
θ
.
Consider the example of Fig.
4.5
in which the error tolerance is 0
.
1 and minimum
frequency threshold is 4. Both strands
are valid. Suppose
these two strands are also closed, then combining them one gets a support set of size
4, satisfying the frequency requirement. As such,
S
2
is a FAS.
On the highest level, the algorithm works in two steps.
{
S
1
,
S
2
}
and
{
S
2
,
S
3
,
S
4
}
1.
Growing Strand
Compute a set of closed valid strands initially. Mine out all closed valid strands
by iteratively growing current ones on both ends. Let the result set be
X
.
2.
Grouping Strand
For each distinct substring
P
in the closed valid strands of
X
, group all the strands
which contain
P
by taking the union of them. If the result of the union contains
at least
θ
members, report
P
as a FAS.
The set of initial strands is the set of all maximal exact repeats. More precisely, for
each initial strand
U
,
Pat
(
U
)
0 and
U
is closed. These initial
strands are computed by
InitStrand
using the suffix tree of the input sequence
S
.
Similar approach has been used in REPuter [
14
] to mine exact repeats. A suffix tree
is a data structure that compactly encodes the internal structure of a string. As such,
it can be used to solve some complicated string problems in linear time. In particular,
it enables us to mine out all frequent maximal exact-matching substrings of
S
with a
running time linear in the length of
S
. When growing a current strand, the algorithm
scans the entire tape and, for each strand encountered, checks on both ends to see if
the current strand can be grown by assembling neighboring strands.
=
M
1
,
Miss
(
U
)
=