Chapter 6
Storing Streaming Data
One of the primary reasons for building a streaming data system is to allow
decoupled communication and access between different components of the
system. A key component is the storage and backup mechanism for both raw data
and data that has been processed by one or more of the processing
environments covered in the previous chapter.
Processing the data is one thing, but for it to be delivered to the end user it
needs to be stored somewhere. That storage location could be the processing
system, using something like Storm's Distributed Remote Procedure Calls
(DRPC) and in-bolt memory storage. However, in a production environment
this simply isn't practical. First, the data usually need to persist for a time,
which means the memory requirements become prohibitive. Second, it
means that maintenance for the processing system necessitates an outage
of any external interfaces, despite the fact that the two have nothing to do
with each other. Finally, it is usually desirable to persist results to secondary
or tertiary storage (disks or “cloud” storage devices) so that the data may be more easily
analyzed for long-term trends.
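The separation described above can be illustrated with a minimal sketch. It assumes a SQLite file as a stand-in for whatever external store is eventually chosen; the names ExternalStore, increment, get, and demo are hypothetical and do not come from Storm or any other framework. The processing side writes results to the store and is then shut down, while a separate reader, playing the role of the API or UI, still sees the data.

```python
import os
import sqlite3
import tempfile


class ExternalStore:
    """Hypothetical durable store; a SQLite file stands in for a real database."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(key TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )
        self.conn.commit()

    def increment(self, key):
        # Create the row if it is missing, then bump the counter.
        self.conn.execute(
            "INSERT OR IGNORE INTO counts (key, n) VALUES (?, 0)", (key,)
        )
        self.conn.execute("UPDATE counts SET n = n + 1 WHERE key = ?", (key,))
        self.conn.commit()

    def get(self, key):
        row = self.conn.execute(
            "SELECT n FROM counts WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else 0

    def close(self):
        self.conn.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "results.db")

    # Processing side: consume events and persist the running counts.
    writer = ExternalStore(path)
    for event in ["a", "b", "a"]:
        writer.increment(event)
    writer.close()  # simulate taking the processing system down for maintenance

    # Serving side: a separate connection reads the results independently.
    reader = ExternalStore(path)
    result = (reader.get("a"), reader.get("b"), reader.get("missing"))
    reader.close()
    return result
```

Because the results live outside the processing component, closing the writer does not affect the reader. This is the property that in-bolt memory storage cannot provide.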
This chapter considers how to store data after it has been processed. There
are a number of storage options available for processing systems that need
to deliver their data to some sort of front-end interface, typically either an
application programming interface (API) or a user interface (UI). Although
there are dozens of potential options, this chapter surveys some of the more
common choices. Systems with different philosophies and constraints are
intentionally chosen to highlight how the options differ. This allows for more
informed decision-making when considering storage for specific applications.
For longer-term storage and analysis, a batch system often makes more
sense than a streaming data system. Largely, this is because a streaming
system accepts relatively expensive storage, such as main memory, in
exchange for lower random access latency. A batch system takes the opposite
side of this trade-off, choosing high-capacity storage with high random access
latency, such as traditional spinning platters. Fortunately, whereas a batch
system's random access performance is usually not sufficient for streaming