Information Technology Reference
In-Depth Information
N
. The goal is to sample
U
and
L
, so as to obtain
two sequences
U
' and
L
', each of length
M < N
.
Initially, it is assumed that uniform sampling is
performed. In this case, each time the (
i
×
N
/
M
)-th
element of
U
and
L
is selected, where 1 ≤
i
≤
M
.
When the LB Keogh between the query sequence
Q
and a data sequence is computed, each phrase
C
of length
N
in
Q
is considered. Thus, each phrase
has to be sampled in the same way as
U
and
L
.
This leads to a sampled phrase
C
'. Therefore, the
lower-bound measure
B
', is given as:
that the sampling of
U
and
L
does not produce
any false negatives.
The separate sampling of
U
and
L
presents the
requirement of having to store the positions from
which elements are being selected in
U
'
and
L
'. If
the positions are stored explicitly, then this doubles
the amount of information kept (2M numbers for
storing
U
'
and
L
'
and additionally 2
M
numbers for
storing the positions of selected elements). Since
this information is propagated during querying,
traffic is increased. For this reason an alternative
representation is proposed. To represent
U
' , a
bitmap of length
N
(the phrase length) is used.
Each bit corresponds to an element in
U
. If an
element is selected in the sample
U
', then its bit
is set to 1, otherwise it is set to 0. Therefore, the
combination of the bitmap and the
M
values that
are selected in
U
'
are used to represent
U
'. The
same applies for
L
'. This representation proves
to be efficient: the space required for
U
'
is
M
+
2
(
C
'
−
U
'
)
f
C
'
>
U
'
i
i
i
i
M
∑
=
2
LB'
=
(
C
'
−
L
'
)
f
C
'
<
L
'
(2)
i
i
i
i
i
1
0
otherwise
In the aforementioned equation, the third case
(i.e., when
L
i
' ≤
C
i
≤
U
i
') does not contribute in
the computation of
B
' . The problem of uniform
sampling is that, as it selects elements without
following any particular criterion, it tends to
select many elements from
U
and
L
that result
to this third case. Therefore,
B
'
may become a
significantly bad underestimation of
B
that would
have been computed if sampling was not used.
The underestimation of the lower-bound value
will result to an increase in false-alarms, which
will in turn incur high traffic.
To overcome this problem, an alternative
sampling method is therein proposed.
U
and
L
are
sampled separately. Initially, the elements of U are
stored in ascending order. In
U
'
the first
M
ele-
ments of this ordering are selected. Respectively,
L
is sorted in descending order and the first
M
elements in
L
'
are selected. The intuition is that
the selection of the smallest
M
values of
U
, helps
in increasing the number of occurrences of the
first case (i.e., when
C
i
>
U
i
'), since the smallest
the value of
U
i
' is, the more expected is to have a
C
i
' larger than it. Accordingly, it is easy to prove
(Karydis, Nanopoulos, & Manolopoulos, 2005)
N
/
8
bytes
2
.
The plain representation requires 5
M
bytes
(since it requires only one integer, i.e., 4 bytes,
to store the position of each selected element).
Thus, the proposed method is advantageous
when
N
<32
M
, i.e., for sample larger than about
3% (experimentation showed that samples with
size 10% are the best choice).
In the following, existing algorithms for
searching in P2P networks that can be used in
a modified version within the framework are
presented.
The BFSS Algorithm
The simplest similarity searching algorithm is on
the basis of breadth-first-search over the nodes of
the P2P network. The adapted algorithm, which
uses the proposed sampling and representation
methods, is denoted as BFSS (breadth-first-search
with sampling). Each time, the current node
n
is
considered. A TTL (time-to-leave or Maxhop)
value denotes how many valid hops are remain-
ing for
n
, whereas T
s
is the user-defined similarity
Search WWH ::
Custom Search