Database Reference
In-Depth Information
may be acceptable to generate approximate solutions for many problems
by summarizing the data in a time and space-ecient way. In recent
years a number of synopsis structures have been developed, which can
be used in conjunction with a variety of mining and query processing
techniques [41]. Some key synopsis methods include those of sampling,
wavelets, sketches and histograms. The key challenges which arise in the
context of synopsis construction of data streams are as follows:
Broad Applicability: The synopsis structure is typically used as an
intermediate representation, which is then leveraged for a variety of data
mining and data management problems. Therefore, the synopsis struc-
ture should be c0onstructed in such a way that it has applicability across
a wide range of problems.
One-pass constraint: As in all data stream algorithms, the one-pass
constraint is critical to synopsis construction algorithms. We would like
to design all synopsis construction algorithms in one pass, and this is
not the case for most traditional methods. In fact, even simply methods
such as sampling need to be re-designed in order to handle the one-pass
constraint.
Time and Space E ciency: Since data streams have a very large vol-
ume, it is essential to create the synopsis in a time- and space-ecient
way. In this sense, some of the probabilistic techniques such as sketches
are extremely effective for counting-based applications, since they re-
quire constant-space for provable probabilistic accuracy. In other words,
the time- and space-eciency depends only upon the accuracy of the
approach rather than the length of the data stream.
Data Stream Evolution: Since the stream evolves over time, a synop-
sis structure which is constructed from the overall behavior of the data
stream is not quite as effective as one which uses recent history. Con-
sequently, it is often more effective to create synopsis structures which
either work with sliding windows, or use some decay-based approach in
order to weight the data stream points.
One key characteristic of many of the above methods is that while
they work effectively in the 1-dimensional case, they often lose their
effectiveness in the multi-dimensional case either because of data spar-
sity or because of inter-attribute correlations. Next, we will discuss the
broad classes of techniques which are used for synopsis construction in
data streams. Each of these techniques have their own advantages in
different scenarios, and we will take care to provide an overview of the
different array of methods which are used for synopsis construction in
data streams. The broad techniques which are used for synopsis con-
structionindatastreamsareasfollows:
Reservoir Sampling: Sampling methods are widely used for tradi-
Search WWH ::




Custom Search