Keep Everything
The first rule for the Hadoop developer is to “keep everything.” This means
keeping the data in its raw form. Cleansing operations, transformations, and
aggregations of data remove subtle nuances in the data that may hold value.
Therefore, a Hadoop developer is motivated to keep everything and work
from this base data set.
The advantages here are clear. If I always have access to the raw data
online, my analysis is not restricted, and I have complete creative freedom
to explore the data.
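
To make the point concrete, here is a minimal sketch in plain Python rather than actual Hadoop code. The event records, field names, and function names are all hypothetical and exist only for illustration. It shows an aggregate that answers one question while discarding detail, and a later question that can only be answered because the raw records were kept.

```python
from collections import defaultdict

# Hypothetical raw click events, kept exactly as received (illustrative data).
RAW_EVENTS = [
    {"user": "a", "page": "/home", "ms": 120},
    {"user": "a", "page": "/cart", "ms": 340},
    {"user": "b", "page": "/home", "ms": 95},
    {"user": "b", "page": "/home", "ms": 4100},
]


def hits_per_page(events):
    """Aggregate: number of events per page. User and timing detail is dropped."""
    counts = defaultdict(int)
    for event in events:
        counts[event["page"]] += 1
    return dict(counts)


def slow_hits(events, threshold_ms):
    """A later question that needs the raw timings, not the page counts."""
    return [event for event in events if event["ms"] > threshold_ms]


# The aggregate is derived from the raw data; the raw data is never modified.
summary = hits_per_page(RAW_EVENTS)

# This question cannot be answered from `summary` alone.
outliers = slow_hits(RAW_EVENTS, 1000)
```

If only `summary` had been stored, the slow request would be invisible. Because the raw events are still available, any new aggregate can be derived later without going back to the source system.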
Coming from a relational database background, I was actually quite envious
of Hadoop's ability to offer this option. Often, it just isn't feasible to hold
this volume of data in a database. The scale at which Hadoop can hold
data online is quite unlike anything seen by relational database technology.
Even the largest data warehouses will be in the low petabyte (PB) range.
Facebook's Hadoop cluster, by contrast, had more than 100 PB of data under
management and was growing at 0.5 PB a day. Those figures were released a
couple of years ago.
An esteemed colleague and friend of mine in the database community,
Thomas Kejser, once proposed an architecture that leveraged Hadoop as a
giant repository for all data received by the warehouse, thus removing the
need for an operational data store (ODS), data vault, or third normal form
model. He argues that the data warehouse could always easily rehydrate
a data feed if needed down the road. We'll discuss this in more detail later in
this chapter. However, take a look at Figure 10.1 to get your creative juices
flowing.
NOTE
You can read more about Thomas' architecture on his blog:
http://blog.kejser.org/2011/08/30/the-big-picture-edwdw-architecture/
 