Information Technology Reference
In-Depth Information
N . The goal is to sample U and L , so as to obtain
two sequences U ' and L ', each of length M < N .
Initially, it is assumed that uniform sampling is
performed. In this case, each time the ( i × N / M )-th
element of U and L is selected, where 1 ≤ i M .
When the LB Keogh between the query sequence
Q and a data sequence is computed, each phrase C
of length N in Q is considered. Thus, each phrase
has to be sampled in the same way as U and L .
This leads to a sampled phrase C '. Therefore, the
lower-bound measure B ', is given as:
that the sampling of U and L does not produce
any false negatives.
The separate sampling of U and L presents the
requirement of having to store the positions from
which elements are being selected in U ' and L '. If
the positions are stored explicitly, then this doubles
the amount of information kept (2M numbers for
storing U ' and L ' and additionally 2 M numbers for
storing the positions of selected elements). Since
this information is propagated during querying,
traffic is increased. For this reason an alternative
representation is proposed. To represent U ' , a
bitmap of length N (the phrase length) is used.
Each bit corresponds to an element in U . If an
element is selected in the sample U ', then its bit
is set to 1, otherwise it is set to 0. Therefore, the
combination of the bitmap and the M values that
are selected in U ' are used to represent U '. The
same applies for L '. This representation proves
to be efficient: the space required for U ' is M +
2
(
C
'
U
'
)
f
C
'
>
U
'
i
i
i
i
M
=
2
LB'
=
(
C
'
L
'
)
f
C
'
<
L
'
(2)
i
i
i
i
i
1
0
otherwise
In the aforementioned equation, the third case
(i.e., when L i ' ≤ C i U i ') does not contribute in
the computation of B ' . The problem of uniform
sampling is that, as it selects elements without
following any particular criterion, it tends to
select many elements from U and L that result
to this third case. Therefore, B ' may become a
significantly bad underestimation of B that would
have been computed if sampling was not used.
The underestimation of the lower-bound value
will result to an increase in false-alarms, which
will in turn incur high traffic.
To overcome this problem, an alternative
sampling method is therein proposed. U and L are
sampled separately. Initially, the elements of U are
stored in ascending order. In U ' the first M ele-
ments of this ordering are selected. Respectively,
L is sorted in descending order and the first M
elements in L ' are selected. The intuition is that
the selection of the smallest M values of U , helps
in increasing the number of occurrences of the
first case (i.e., when C i > U i '), since the smallest
the value of U i ' is, the more expected is to have a
C i ' larger than it. Accordingly, it is easy to prove
(Karydis, Nanopoulos, & Manolopoulos, 2005)
N
/
8
bytes 2 .
The plain representation requires 5 M bytes
(since it requires only one integer, i.e., 4 bytes,
to store the position of each selected element).
Thus, the proposed method is advantageous
when N <32 M , i.e., for sample larger than about
3% (experimentation showed that samples with
size 10% are the best choice).
In the following, existing algorithms for
searching in P2P networks that can be used in
a modified version within the framework are
presented.
The BFSS Algorithm
The simplest similarity searching algorithm is on
the basis of breadth-first-search over the nodes of
the P2P network. The adapted algorithm, which
uses the proposed sampling and representation
methods, is denoted as BFSS (breadth-first-search
with sampling). Each time, the current node n is
considered. A TTL (time-to-leave or Maxhop)
value denotes how many valid hops are remain-
ing for n , whereas T s is the user-defined similarity
Search WWH ::




Custom Search