the Reduce step. One such application is the so-called “attribution” process.
In this setting, users identified by a unique identifier engage in a number of
events before possibly engaging in an event of interest, called a “conversion”
in this context. The attribution process is concerned with the events that
occurred within some window before the final conversion event. Because most
users will not convert (low single-digit percentages of conversion are
normal), a Bloom Filter containing the IDs of the users who did convert can
be used in the Map step of an attribution Map-Reduce job. Even with a
relatively high false positive rate of 10 percent, this filtering still tends to
reduce the amount of data sent to the Reduce step by 80 to 90 percent.
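The following is a minimal sketch of that map-side filtering idea, assuming a simple Bloom filter built with hashlib-based double hashing; the field names (user_id, event), the BloomFilter class, and the map_attribution helper are illustrative rather than taken from any particular Map-Reduce framework.

import hashlib
import math


class BloomFilter:
    """A simple Bloom filter sized for an expected item count and error rate."""

    def __init__(self, expected_items, error_rate=0.1):
        # Standard sizing formulas: m bits and k hash functions.
        self.m = max(1, int(-expected_items * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, int(round((self.m / expected_items) * math.log(2))))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-1 digest (double hashing).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def map_attribution(events, converters):
    """Map step: emit (user_id, event) only for users who appear to have converted.

    False positives let a few non-converters through, but those can be
    discarded in the Reduce step; true converters are never dropped.
    """
    for event in events:
        if event["user_id"] in converters:
            yield event["user_id"], event


if __name__ == "__main__":
    # Build the filter from the (small) set of converting user IDs.
    converted_ids = ["user-42", "user-99"]
    converters = BloomFilter(expected_items=len(converted_ids), error_rate=0.1)
    for uid in converted_ids:
        converters.add(uid)

    stream = [
        {"user_id": "user-42", "event": "click"},
        {"user_id": "user-7", "event": "impression"},   # most likely filtered out
        {"user_id": "user-99", "event": "impression"},
    ]
    for key, value in map_attribution(stream, converters):
        print(key, value)

Because the filter never produces false negatives, every converting user's events reach the Reduce step; the only cost of the 10 percent false positive rate is a small amount of extra data that the reducer must ignore.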
Conclusion
One of the main challenges of processing streaming data is keeping up
with the number of events to be processed. Even with the advent of the
high-performance solid-state disk (SSD), this data must generally be stored
in main memory (RAM) to achieve acceptable performance. If the data to be
stored is simple, such as sums or averages, this does not present a problem.
When the data to be stored becomes more complicated, such as the number of
unique values in the stream, problems arise. Attempting to store
the data directly can result in storage requirements that are proportional to
the size of the data stream and can quickly overrun the available RAM.
This chapter has presented a number of methods for storing certain values
such as sets and their size in such a way that the memory usage is controlled
by the application rather than the data, ensuring that RAM requirements
can be met. The downside of these techniques is that they introduce
estimation error into the computed values. In some cases this error may not be
tolerable, but because the error is also a function of the storage allotted, the
application can usually tune it to acceptable levels.