Designing Real-Time Streaming Architectures - Real-Time Analytics


documentation only considers the map-reduce case. This makes it difficult
for an inexperienced user to get set up. A user already familiar with the
maintenance of a YARN cluster will have much more success with Samza and
should consider it for implementing a real-time application. In many ways
it is a cleaner design than Storm, and its native support of Kafka makes it
easy to integrate into a Kafka-based environment.

For first-time users, Storm is likely to be the more successful choice.
Although it too can be hosted in a YARN cluster, that is not the typical
deployment. The non-YARN deployment discussed in this book is much easier
for new users to understand, and it is relatively easy to manage. The only
disadvantage is that Storm does not include native support for either Kafka
or Flume. Chapter 5 addresses the integration, but the plug-ins follow a
development cycle quite different from that of the mainline Storm code,
which can cause incompatibilities around the time of a release.

Storage

For relatively small build-outs, such as a proof of concept or a fairly
low-volume application, Redis is the obvious choice for data storage. It
offers a rich set of abstractions beyond the simple key-value store that
allow for sophisticated storage. It is easy to install and configure, and it
requires almost no ongoing maintenance (it is even available as a turnkey
option from various cloud-based providers). It also has a wide array of
available clients.

The two drawbacks of Redis are that it has not really addressed the problem
of horizontal scalability for writes and that it is limited to the available
random access memory (RAM) of the master server. Like many of the tools in
this book, it offers a master-slave style of replication with the ability to
fail over to one of the replicas using Redis Sentinel. However, this still
requires that all writes go to a single server. There are several client-side
and proxy projects, such as Twitter's Twemproxy, that attempt to work around
this limitation by sharding keys across servers, but there is no native Redis
solution as of yet. There is an ongoing clustering project, but there is no
timeline for its stable release.
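
To make the workaround concrete, the following is a minimal sketch of how a
client might spread writes across several independent Redis masters using a
consistent-hash ring. It is an illustration only and is not Twemproxy's
actual implementation: the ShardRing class, the node addresses, and the
choice of SHA-256 with 64 virtual points per node are all assumptions made
for this example. No Redis connection is opened; the sketch only decides
which server a given key would be written to.

```python
import bisect
import hashlib


class ShardRing:
    """Hypothetical consistent-hash ring mapping keys to Redis masters."""

    def __init__(self, nodes, points_per_node=64):
        # Each node is placed on the ring at several pseudo-random points
        # so that keys are spread reasonably evenly across nodes.
        ring = []
        for node in nodes:
            for i in range(points_per_node):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(value):
        # Use the first 8 bytes of SHA-256 as a 64-bit ring position.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key):
        # The owner is the first ring point clockwise from the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._hashes, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


# Assumed addresses of three independent Redis masters.
NODES = ["redis-a:6379", "redis-b:6379", "redis-c:6379"]
ring = ShardRing(NODES)

# Every client that builds the same ring sends a given key to the same server.
target = ring.node_for("user:1234:clicks")
```

Because every client computes the same mapping, each master only receives
writes for its own share of the keyspace. The cost is that operations that
span keys on different servers are no longer atomic, which is part of why
this approach is a workaround rather than a substitute for native clustering.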

If very large amounts of infrequently accessed data need to be stored,
Cassandra is an excellent choice. Early versions of Cassandra suffered from
a “query language” that was essentially just an RPC layer on the internal
