columns. In that sense, there could be millions of columns. In contrast, SQL
Server is limited to 1,024 columns.
Architecturally, HBase belongs to the master/slave family of distributed
Hadoop implementations. It also relies heavily on ZooKeeper (an Apache
project we discuss shortly).
Flume
Flume is the StreamInsight of the Hadoop ecosystem. As you would expect,
it is a distributed system that collects, aggregates, and moves large volumes
of streaming event data into HDFS. Flume is also fault tolerant and can be
tuned for failover and recovery. In general, however, faster recovery
comes at the cost of some performance; so, as with most things, a balance
needs to be found.
The Flume architecture consists of the following components:
• Client
• Source
• Channel
• Sink
• Destination
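These components are wired together declaratively in an agent's properties file. The fragment below is an illustrative sketch, not taken from this book: the agent and component names (agent1, src1, ch1, sink1) are invented, the netcat source and HDFS path are placeholders, and exact property keys can vary by Flume version and component type.

```properties
# Name the components on this agent (names are arbitrary)
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens for newline-separated events on a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: an in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: drains the channel and delivers events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.sink1.channel = ch1
```

Swapping the memory channel for a file channel is what gives the durable, recoverable behavior described below: events survive an agent restart because they are persisted to disk.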
Events flow from the client to the source. The source is the first Flume
component. The source inspects the event and then farms it out to one
or more channels for processing. Each channel is consumed by a sink. In
Hadoop parlance, the event is "drained" by the sink. The channel provides
the separation between source and sink and is also responsible for managing
recovery by persisting events to the file system if required.
Once an event is drained, it is the sink's responsibility to deliver the
event to the destination. A number of different sinks are available,
including an HDFS sink. For the Integration Services users out there
familiar with the term backpressure, you can think of the channel as the
component that handles backpressure: if the source is receiving events
faster than they can be drained, it is the channel's responsibility to grow and
manage that accumulation of events.
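The channel's role as a buffer between source and sink can be sketched in a few lines. This is a minimal model, not Flume's actual API: the Channel and Sink classes and their method names are invented for illustration. The key idea is that a bounded buffer decouples the producing side from the draining side, which is how backpressure is absorbed.

```python
from queue import Queue


class Channel:
    """Bounded buffer sitting between a source and a sink (illustrative)."""

    def __init__(self, capacity):
        # When the buffer is full, put() blocks the source:
        # that blocking is the backpressure.
        self.buffer = Queue(maxsize=capacity)

    def put(self, event):
        self.buffer.put(event)

    def take(self):
        return self.buffer.get()

    def empty(self):
        return self.buffer.empty()


class Sink:
    """Drains events from a channel and delivers them to a destination."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []  # stand-in for a real destination such as HDFS

    def drain(self):
        self.channel.take_and_deliver = None  # no-op placeholder removed below
        self.delivered.append(self.channel.take())


# A source farms ten events out to the channel; the sink drains them.
channel = Channel(capacity=100)
sink = Sink(channel)

for i in range(10):
    channel.put({"event_id": i})

while not channel.empty():
    sink.drain()

print(len(sink.delivered))
```

In real Flume the source and sink run on separate threads, so the channel's bounded capacity is what slows a fast source down to the rate the sink can sustain.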
A single pass through a source, channel, and sink is known as a hop. The
components for a hop exist in a single JVM called an agent. However, Flume