What the Hadoop!
At a very high level, Hadoop is a distributed file system and data processing engine that is designed to handle extremely high volumes of data in any structure. In simpler terms, just imagine that you've got dozens, or even hundreds (or thousands!) of individual computers racked and networked together. Each computer (often referred to as a node in Hadoop-speak) has its own processors and a dozen or so 2TB or 3TB hard disk drives. All of these nodes run software that unifies them into a single cluster, where, instead of seeing the individual computers, you see an extremely large volume where you can store your data. The beauty of this Hadoop system is that you can store anything in this space: millions of digital image scans of mortgage contracts, days and weeks of security camera footage, trillions of sensor-generated log records, or all of the operator transcription notes from a call center.

This ingestion of data, without worrying about the data model, is actually a key tenet of the NoSQL movement (this is referred to as "schema later"). In contrast, the traditional SQL and relational database world depends on the opposite approach ("schema now"), where the data model is of utmost concern upon data ingest. This is where the flexibility of Hadoop becomes even more apparent: it's not just a place where you can dump many files. There are Hadoop-based databases where you can store records in a variety of models: relational, columnar, and key/value. In other words, with data in Hadoop, you can go from completely unstructured to fully relational, and any point in between.
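The "schema later" idea can be sketched in a few lines of Python: raw records are accepted as-is at ingest time, and structure is imposed only when the data is read. The log format and field names below are hypothetical, chosen just to illustrate the contrast with "schema now" validation-on-write.

```python
import json

# "Schema later": ingest raw clickstream lines without validating them.
# (Hypothetical log lines; a real cluster would hold billions of these.)
RAW_LOG = [
    '{"user": "u1", "page": "/cart", "action": "add_item"}',
    '{"user": "u2", "page": "/checkout"}',      # missing "action" -- still accepted
    'corrupted line that is not JSON at all',   # still accepted at ingest time
]

def ingest(lines):
    """Store everything verbatim; no data model is enforced on write."""
    return list(lines)

def read_with_schema(stored):
    """Impose structure only at read time, skipping records that don't fit."""
    records = []
    for line in stored:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # unparseable lines are ignored at read time, not rejected at write
        records.append({
            "user": rec.get("user"),
            "page": rec.get("page"),
            "action": rec.get("action", "view"),  # default for missing fields
        })
    return records

stored = ingest(RAW_LOG)       # all 3 lines land in storage
parsed = read_with_schema(stored)  # only the 2 parseable lines come back structured
```

A "schema now" system would do the `read_with_schema` work at ingest and reject the corrupted line outright; here the decision is deferred until someone actually queries the data.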
The data storage system that we describe here is known as the Hadoop Distributed File System (HDFS).
Let's go back to this imaginary Hadoop cluster with many individual nodes. Suppose that your business uses this cluster to store all of the clickstream log records for its e-commerce site. Your Hadoop cluster is using the BigInsights distribution, and you and your fellow analysts decide to run some sessionization analytics against this data to isolate common patterns for customers who leave abandoned shopping carts (we call this use case last mile optimization). When you run this application, Hadoop sends copies of your application logic to each individual computer in the cluster, to be run against data that's local to each computer. So instead of moving the data to your application, Hadoop moves your application to the data.
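This "ship the code to the data" model can be sketched as a Hadoop Streaming-style mapper: the same small program is copied to every node and run against the log blocks stored locally there, reading lines on stdin and emitting tab-separated key/value pairs on stdout for a reducer to aggregate. The log layout and event names below are hypothetical.

```python
import sys

def map_cart_events(lines):
    """Mapper logic: emit (session_id, event) pairs for cart-related events.

    In Hadoop Streaming, this same script runs on every node in the cluster,
    each instance processing only the portion of the log stored on that node.
    """
    out = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed records
        session_id, event = parts
        if event in ("add_to_cart", "checkout"):
            out.append(f"{session_id}\t{event}")
    return out

if __name__ == "__main__":
    # Hadoop Streaming feeds log lines to the mapper on stdin; a reducer
    # (not shown) would then flag sessions that added items to the cart
    # but never reached checkout -- the abandoned carts.
    for pair in map_cart_events(sys.stdin):
        print(pair)
```

The point of the sketch is data locality: nothing in the mapper knows or cares which node it runs on, so Hadoop is free to schedule it wherever the relevant file blocks already live.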