Database Reference
In-Depth Information
effective in practice, especially when there is significant evolution of the
underlying data stream.
While sampling has several advantages in terms of simplicity and
preservation of multi-dimensional correlations, it loses its effectiveness in
the presence of data sparsity. For example, a query which contains very
few data points is unlikely to be accurate with the use of a sampling ap-
proach. However, this is a general problem with most techniques which
are effective at counting frequent elements, but are not quite as effective
at counting rare or distinct elements in the data stream.
Sketches: Sketches use some properties of random sampling in order
to perform counting tasks in data streams. Sketches are most useful
when the domain size of a data stream is very large. In such cases,
the number of possible distinct elements become very large, and it is no
longer possible to track them in space-constrained scenarios. There are
two broad classes of sketches: projection based and hash based . We will
discuss each of them in turn.
Projection based sketches are constructed on the broad idea of ran-
dom projection [54]. The most well known projection-based sketch is
the AMS sketch [17, 18], which we will discuss below. It has been shown
in [54], that by by randomly sampling subspaces from multi-dimensional
data, it is possible to compute -accurate projections of the data with
high probability. This broad idea can easily be extended to the mas-
sive domain case, by viewing each distinct item as a dimension, and the
counts on these items as the corresponding values. The main problem is
that the vector for performing the projection cannot be maintained ex-
plicitly since the length of such a vector would be of the same size as the
number of distinct elements. In fact, since the sketch-based method is
most relevant in the distinct element scenario, such an approach defeats
the purpose of keeping a synopsis structure in the first place.
Let us assume that the random projection is performed using k sketch
vectors, and r i represents the j th vector for the i th item in the domain
being tracked. In order to achieve the goal of ecient synopsis construc-
tion, we store the random vectors implicitly in the form of a seed, and
this can be used to dynamically generate the vector. The main idea
discussed in [49] is that it is possible to generate random vectors with
a seed of size O (log( N )), provided that one is willing to work with the
restriction that r i ∈{−
should be 4-wise independent. The sketch
is computed by adding r i to the j th component of the sketch for the i th
item. In the event that the incoming item has frequency f ,weaddthe
value f
1 , +1
}
r i . Let us assume that there are a total of k sketch components
which are denoted by ( s 1 ...s k ). Some key properties of the pseudo-
·
Search WWH ::




Custom Search