Storing Streaming Data - Real-Time Analytics

Chapter 6

Storing Streaming Data

One of the primary reasons for building a streaming data system is to allow
decoupled communication and access between different parts of the
system. A key component is the storage and backup mechanism for both raw data
and data that has been processed by one or more of the processing
environments covered in the previous chapter.

Processing the data is one thing, but for it to be delivered to the end user it
needs to be stored somewhere. That storage location could be the processing
system itself, using something like Storm's Distributed Remote Procedure Calls
(DRPC) and in-bolt memory storage. However, in a production environment
this simply isn't practical. First, the data usually need to persist for a time,
which means the memory requirements become prohibitive. Second, it
means that maintenance on the processing system necessitates an outage
of any external interfaces, even though the two have nothing to do
with each other. Finally, it is usually desirable to persist results to durable
storage (disks or “cloud” storage devices) so that the data may be more easily
analyzed for long-term trends.
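
The second point, decoupling the front end from the processing system, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `FileResultStore`, `process_events`, `serve_count`, and `demo` are hypothetical names, and a single JSON file stands in for whatever external store a real deployment would use. It is not how Storm or any particular database works; it only shows that results written to an external store remain readable after the writer has gone away.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Optional


class FileResultStore:
    """Hypothetical external store: one JSON file holding all counters."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _load(self) -> Dict[str, int]:
        # An absent file simply means nothing has been stored yet.
        if not os.path.exists(self.path):
            return {}
        with open(self.path, "r", encoding="utf-8") as fh:
            return json.load(fh)

    def increment(self, key: str, amount: int = 1) -> None:
        data = self._load()
        data[key] = data.get(key, 0) + amount
        # Write to a temporary file, then rename it over the old one,
        # so a reader never sees a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        os.replace(tmp, self.path)

    def get(self, key: str) -> Optional[int]:
        return self._load().get(key)


def process_events(store: FileResultStore, events: Iterable[str]) -> None:
    """Stands in for the processing system: counts events by type."""
    for event in events:
        store.increment(event)


def serve_count(store: FileResultStore, key: str) -> int:
    """Stands in for the API layer: it reads only from the store."""
    value = store.get(key)
    return 0 if value is None else value


def demo() -> int:
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "counts.json")
        process_events(FileResultStore(path), ["click", "view", "click"])
        # The writer's store object is gone; a fresh handle still sees the data.
        return serve_count(FileResultStore(path), "click")


if __name__ == "__main__":
    print(demo())
```

Because `serve_count` depends only on the store, the processing side could be stopped, upgraded, or restarted without taking the read path down, which is the property in-bolt memory storage cannot offer.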

This chapter considers how to store data after it has been processed. There
are a number of storage options available for processing systems that need
to deliver their data to some sort of front-end interface, typically either an
application programming interface (API) or a user interface (UI). Although
there are dozens of potential options, this chapter surveys some of the more
common choices. Systems with different philosophies and constraints are
intentionally chosen to highlight these differences, which allows for more
informed decision-making when considering storage for specific applications.

For longer-term storage and analysis, a batch system often makes more
sense than a streaming data system. Largely, this is because a streaming
system accepts relatively expensive storage, such as main
memory, in exchange for lower random access latency. A batch system takes the
opposite side of this trade-off, choosing high-capacity storage with high random
access latency, such as traditional spinning platters. Fortunately, whereas a batch
system's random access performance is usually not sufficient for streaming
