What the Hadoop!
At a very high level, Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure. In simpler terms, just imagine that you've got dozens, or even hundreds (or thousands!) of individual computers racked and networked together. Each computer (often referred to as a node in Hadoop-speak) has its own processors and a dozen or so 2TB or 3TB hard disk drives. All of these nodes run software that unifies them into a single cluster, where, instead of seeing the individual computers, you see an extremely large volume where you can store your data. The beauty of this Hadoop system is that you can store anything in this space: millions of digital image scans of mortgage contracts, days and weeks of security camera footage, trillions of sensor-generated log records, or all of the operator transcription notes from a call center.

This ingestion of data, without worrying about the data model, is actually a key tenet of the NoSQL movement (this is referred to as "schema later"). In contrast, the traditional SQL and relational database world depends on the opposite approach ("schema now"), where the data model is of utmost concern upon data ingest. This is where the flexibility of Hadoop becomes even more apparent: it's not just a place where you can dump many files. There are Hadoop-based databases where you can store records in a variety of models: relational, columnar, and key/value. In other words, with data in Hadoop, you can go from completely unstructured to fully relational, and any point in between.
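The "schema later" idea can be sketched in a few lines of Python: raw records are accepted as-is at ingest time, and structure is imposed only when the data is read. The log format and field names below are hypothetical, chosen just to illustrate the contrast with "schema now" validation-on-write.

```python
import json

# "Schema later": ingest raw clickstream lines without validating them.
# (Hypothetical log lines; a real cluster would hold billions of these.)
RAW_LOG = [
    '{"user": "u1", "page": "/cart", "action": "add_item"}',
    '{"user": "u2", "page": "/checkout"}',      # missing "action" -- still accepted
    'corrupted line that is not JSON at all',   # still accepted at ingest time
]

def ingest(lines):
    """Store everything verbatim; no data model is enforced on write."""
    return list(lines)

def read_with_schema(stored):
    """Impose structure only at read time, skipping records that don't fit."""
    records = []
    for line in stored:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # unparseable lines are ignored at read time, not rejected at write
        records.append({
            "user": rec.get("user"),
            "page": rec.get("page"),
            "action": rec.get("action", "view"),  # default for missing fields
        })
    return records

stored = ingest(RAW_LOG)       # all 3 lines land in storage
parsed = read_with_schema(stored)  # only the 2 parseable lines come back structured
```

A "schema now" system would do the `read_with_schema` work at ingest and reject the corrupted line outright; here the decision is deferred until someone actually queries the data.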
The data storage system that we describe here is known as the Hadoop Distributed File System (HDFS).
Let's go back to this imaginary Hadoop cluster with many individual nodes. Suppose that your business uses this cluster to store all of the clickstream log records for its e-commerce site. Your Hadoop cluster is using the BigInsights distribution, and you and your fellow analysts decide to run some sessionization analytics against this data to isolate common patterns for customers who leave abandoned shopping carts (we call this use case last mile optimization). When you run this application, Hadoop sends copies of your application logic to each individual computer in the cluster, to be run against data that's local to each computer. So instead of moving the data to your application, Hadoop moves your application to the data.
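This "ship the code to the data" model can be sketched as a Hadoop Streaming-style mapper: the same small program is copied to every node and run against the log blocks stored locally there, reading lines on stdin and emitting tab-separated key/value pairs on stdout for a reducer to aggregate. The log layout and event names below are hypothetical.

```python
import sys

def map_cart_events(lines):
    """Mapper logic: emit (session_id, event) pairs for cart-related events.

    In Hadoop Streaming, this same script runs on every node in the cluster,
    each instance processing only the portion of the log stored on that node.
    """
    out = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed records
        session_id, event = parts
        if event in ("add_to_cart", "checkout"):
            out.append(f"{session_id}\t{event}")
    return out

if __name__ == "__main__":
    # Hadoop Streaming feeds log lines to the mapper on stdin; a reducer
    # (not shown) would then flag sessions that added items to the cart
    # but never reached checkout -- the abandoned carts.
    for pair in map_cart_events(sys.stdin):
        print(pair)
```

The point of the sketch is data locality: nothing in the mapper knows or cares which node it runs on, so Hadoop is free to schedule it wherever the relevant file blocks already live.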