Database Reference
In-Depth Information
effective in practice, especially when there is significant evolution of the
underlying data stream.
While sampling has several advantages in terms of simplicity and
preservation of multi-dimensional correlations, it loses its effectiveness in
the presence of data sparsity. For example, a query which contains very
few data points is unlikely to be accurate with the use of a sampling ap-
proach. However, this is a general problem with most techniques which
are effective at counting frequent elements, but are not quite as effective
at counting rare or distinct elements in the data stream.
Sketches:
Sketches use some properties of random sampling in order
to perform counting tasks in data streams. Sketches are most useful
when the
domain size
of a data stream is very large. In such cases,
the number of possible distinct elements become very large, and it is no
longer possible to track them in space-constrained scenarios. There are
two broad classes of sketches:
projection based
and
hash based
. We will
discuss each of them in turn.
Projection based sketches are constructed on the broad idea of ran-
dom projection [54]. The most well known projection-based sketch is
the AMS sketch [17, 18], which we will discuss below. It has been shown
in [54], that by by randomly sampling subspaces from multi-dimensional
data, it is possible to compute
-accurate projections of the data with
high probability. This broad idea can easily be extended to the mas-
sive domain case, by viewing each distinct item as a dimension, and the
counts on these items as the corresponding values. The main problem is
that the vector for performing the projection cannot be maintained ex-
plicitly since the length of such a vector would be of the same size as the
number of distinct elements. In fact, since the sketch-based method is
most relevant in the distinct element scenario, such an approach defeats
the purpose of keeping a synopsis structure in the first place.
Let us assume that the random projection is performed using
k
sketch
vectors, and
r
i
represents the
j
th vector for the
i
th item in the domain
being tracked. In order to achieve the goal of ecient synopsis construc-
tion, we store the random vectors implicitly in the form of a seed, and
this can be used to dynamically generate the vector. The main idea
discussed in [49] is that it is possible to generate random vectors with
a seed of size
O
(log(
N
)), provided that one is willing to work with the
restriction that
r
i
∈{−
should be 4-wise independent. The sketch
is computed by adding
r
i
to the
j
th component of the sketch for the
i
th
item. In the event that the incoming item has frequency
f
,weaddthe
value
f
1
,
+1
}
r
i
. Let us assume that there are a total of
k
sketch components
which are denoted by (
s
1
...s
k
). Some key properties of the pseudo-
·