Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions

Keep Everything

The first rule for the Hadoop developer is to “keep everything.” This means

keeping the data in its raw form. Cleansing operations, transformations, and

aggregations of data remove subtle nuances in the data that may hold value.

Therefore, a Hadoop developer is motivated to keep everything and work

from this base data set.

The advantages here are clear. If I always have access to the raw data

online, my analysis is not restricted, and I have complete creative freedom

to explore the data.

Coming from a relational database background, I was actually quite envious

of Hadoop's ability to offer this option. Often, it just isn't feasible to hold

this volume of data in a database. The scale at which Hadoop can hold

data online is quite unlike anything seen by relational database technology.

Even the largest data warehouses will be in the low petabyte (PB) range.

Facebook's Hadoop cluster by contrast had 100 PB+ of data under

management and was growing at 0.5 PB a day. Those figures were released a

couple of years ago.

An esteemed colleague and friend of mine in the database community,

Thomas Kejser, once proposed an architecture that leveraged Hadoop as a

giant repository for all data received by the warehouse, thus removing the

need for an operational data store (ODS), data vault, or third normal form

model. He argues that the data warehouse could always easily rehydrate

a data feed if needed down the road. We'll discuss this in more detail later in

this chapter. However, take a look at Figure 10.1 to get your creative juices

flowing.
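
To make the rehydration idea a little more concrete, here is a minimal Python sketch. It is not Thomas's design and does not use any Hadoop API: the RawStore class, the rehydrate function, the "sales" feed, and the to_sale transform are all hypothetical names chosen for illustration, and an in-memory dictionary stands in for the Hadoop repository. The point is only that if the raw records are kept untouched, a cleansed feed can be rebuilt at any time by replaying them through a transform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

RawRecord = Dict[str, str]


@dataclass
class RawStore:
    """Append-only stand-in for the Hadoop repository (hypothetical)."""

    _feeds: Dict[str, List[RawRecord]] = field(default_factory=dict)

    def land(self, feed: str, records: Iterable[RawRecord]) -> None:
        # Keep every record exactly as received; no cleansing happens here.
        self._feeds.setdefault(feed, []).extend(dict(r) for r in records)

    def replay(self, feed: str) -> List[RawRecord]:
        # Return copies so callers cannot alter the stored raw data.
        return [dict(r) for r in self._feeds.get(feed, [])]


def rehydrate(
    store: RawStore, feed: str, transform: Callable[[RawRecord], dict]
) -> List[dict]:
    """Rebuild a warehouse-ready feed by re-running a transform over raw data."""
    return [transform(r) for r in store.replay(feed)]


def to_sale(raw: RawRecord) -> dict:
    # Illustrative cleansing step: trim and upper-case the name, cast the amount.
    return {
        "customer": raw["customer"].strip().upper(),
        "amount": float(raw["amount"]),
    }


store = RawStore()
store.land(
    "sales",
    [
        {"customer": " alice ", "amount": "10.50"},
        {"customer": "Bob", "amount": "4.00"},
    ],
)

# The warehouse can ask for the feed again whenever it needs it.
sales = rehydrate(store, "sales", to_sale)
```

Because the transform is applied on the way out rather than on the way in, a different transform can be run over the same raw records later without any loss of the original detail.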

NOTE

You can read more about Thomas' architecture on his blog
