Information Technology Reference
In-Depth Information
Sampling Framework
comprises the U and L sequences of the query
sequence. A node receiving these sequences
computes the LB value between its documents
and envelope. When a LB value is smaller than
the user-specified similarity threshold, then the
query sequence is propagated to this node and the
actual DTW distance is computed between the
query and corresponding document.
As acoustic data tend to be very large, the num-
ber of elements in a phrase of even few seconds
can be several hundred thousands. The length
of the U and L sequences is equal to the length
of the query sequence, meaning that a straight-
forward approach, which directly propagates U
and L sequences between nodes, will result into
an extremely large traffic over the P2P network.
Moreover, the computation of LB in each node
can become rather costly, violating the need of
a P2P network to burden the participating nodes
as little as possible.
A two-fold optimization scheme is used for
the reduction of the traffic over the P2P network
when querying music documents by content. The
scheme works as follows:
The work of Karydis, Nanopoulos, and Manolo-
poulos (2005) examines the problem of search-
ing for similar acoustic data over unstructured
decentralized P2P networks using Dynamic Time
Warping (DTW) as a distance measure.
DTW can express similarity between two time
series even if they are out of phase in the time axis,
or they do not have the same length. The DTW
distance D DTW ( S , T ) between time series S and T is
essentially a way to map S to T and vice-versa. The
most important disadvantage of the DTW method
is that it does not satisfy the triangular inequality,
which is a desirable property for constructing ef-
ficient indexing schemes and pruning the search
space. Moreover, the calculation of D DTW ( S , T ) is
significantly more CPU intensive than the calcu-
lation of D Euclidean ( S,T ). Therefore, a performance
improvement is the definition of a lower bound,
that would take advantage of indexing schemes
and avoid the computation of DTW when there
is a guarantee that the two time series are not
similar. One such lower bound, as proposed in
Keogh and Ratanamahatana (2005), is termed
LB_Keogh and defined as follows:
It reduces the length of the envelope's se-
quences by sampling them. However, plain
sampling can be ineffective, since it leads
to underestimation of LB, thus a sampling
method to reduce the length of the sequences
without significantly affecting the computa-
tion of LB or introducing false-negatives is
provided.
n
[ i ]) 2
2
(
T
[
i
]
U
[
i
)
T
[
i
]
U
[
i
]
i
=
1
n
=
[ i ]) 2
2
LB_Keogh (S,T) =
(
T
[
i
]
L
[
i
)
T
[
i
]
<
L
[
i
]
(1)
i
=
1
0
otherwise
The scheme uses (whenever possible) a
compact representation of the sampled se-
quences of the envelope. The representation
comprises a form of compression for the
sequences, but does not burden the nodes
with the cost of decompression.
where U and L is the upper and lower bound
respectively for the time series S . Essentially, for
each i , the upper bound guarantees that U [ i ] ≥ S [ i ]
and the lower bound guarantees that L[i] ≤ S[i].
In Keogh and Ratanamahatana (2005) it has been
proven that LB_Keogh ≤ D DTW (S,T), and therefore
the distance measure LB_Keogh(S,T) can be effec-
tively used for pruning, resulting in considerably
less number of D DTW computations.
Due to the selected similarity model, the
information that is propagated between nodes
Sampling and Representation Methods
Let the considered phrase length be equal to N . The
length of each query Q , and therefore of its upper
( U ) and lower ( L ) sequences, will also be equal to
 
Search WWH ::




Custom Search