The data warehouse concept represents the philosophy of spending a great deal of effort upfront to extract and merge data from various silos for effective analysis. Technologies like Hadoop tend to represent a different philosophy, embracing huge, unstructured datasets and processing data in an ad hoc fashion. Why bother with the seemingly impossible task of constantly extracting, normalizing, and structuring data when we can use tools like Hadoop's MapReduce framework to answer questions about huge amounts of data whenever the need arises? Why not simply dump unstructured data into Hadoop's distributed filesystem each day and write a custom MapReduce function whenever it's necessary to query the data?
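The pattern behind such a custom query is simple: a map phase emits key/value pairs from raw records, and a reduce phase aggregates the values per key. As a rough illustration only, here is a plain-Python stand-in for that pattern, counting log lines per day; a real job would be distributed by Hadoop (for example, via Hadoop Streaming), and the log format shown is an invented example:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (key, 1) pairs -- here the key is the date token
    that begins each (hypothetical) log line."""
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines in the raw dump
            yield parts[0], 1

def reduce_phase(pairs):
    """Sum the counts for each key, as a reducer would do
    after the shuffle/sort step groups pairs by key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Unstructured log lines, as they might land in the filesystem each day:
logs = [
    "2023-04-01 user login ok",
    "2023-04-01 user upload failed",
    "2023-04-02 user login ok",
]
counts = reduce_phase(map_phase(logs))
print(counts)  # {'2023-04-01': 2, '2023-04-02': 1}
```

The appeal is that no schema is designed in advance: the structure lives in the throwaway map and reduce functions, written only when a question needs answering.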
There's an interesting culture clash between the traditional enterprise world of the data warehouse and the world symbolized by newer, open-source projects such as Hadoop. Users of each are beginning to find ways in which these technologies complement one another. One thing that distributed processing technologies are starting to reveal is that perhaps data silos aren't such a bad thing after all.
Data Silos Can Be Good
Like many of the terms you hear in the enterprise data world, such as “business intelligence” or “ETL,” the term “data silo” should be up for scrutiny. Data silos have a bad reputation because of all the challenges we've discussed previously. As data sizes grow, providing a single repository for all of an organization's data is not always the most practical solution. In a world without any feasible way to handle the massive amounts of data being generated across an organization, the only practical solution has been to make a clean, streamlined copy of aggregated data and store it in a data warehouse.
Now even small organizations are beginning to have access to technology that can process large amounts of data on demand, and storage keeps getting cheaper as well. This means it is becoming economically feasible to simply keep data in whatever silo best fits the use case and run reporting tools against it. In this model, the focus is on keeping data where it works best and finding ways to process it for analysis when the need arises. In other words, as one business-intelligence software developer once told me: “Data silos exist because they are useful.”
The reality is that data warehousing doesn't solve every data silo challenge effectively. Nor is it practical to approach every data problem ad hoc, assuming Hadoop will be the answer to every use case. Dumping raw, unstructured data from operational data stores into a distributed filesystem lacks some of the discipline imposed by good data warehouse design. Certainly, there are good reasons to use data warehousing technology, and until a single database system can serve all aspects of an organization's data needs, data must be moved from system to system to be used effectively.
Similarly, managing the complexity of ad hoc systems requires a great deal of effort. Building custom systems that bridge disparate data sources can be difficult and requires a different sort of expertise.