The data warehouse concept represents the philosophy of spending a great deal of effort upfront to extract and merge data from various silos for effective analysis. Technologies like Hadoop tend to represent a different philosophy, embracing huge, unstructured datasets and processing data in an ad hoc fashion. Why bother with the seemingly impossible task of constantly extracting, normalizing, and structuring data when we can use tools like Hadoop's MapReduce framework to answer questions about huge amounts of data whenever the need arises? Why not simply dump unstructured data into Hadoop's distributed filesystem each day and write a custom MapReduce function whenever it's necessary to query the data?
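The pattern behind such a custom query is simple: a map phase emits key/value pairs from raw records, and a reduce phase aggregates the values per key. As a rough illustration only, here is a plain-Python stand-in for that pattern, counting log lines per day; a real job would be distributed by Hadoop (for example, via Hadoop Streaming), and the log format shown is an invented example:

```python
from collections import defaultdict

def map_phase(lines):
    """Emit (key, 1) pairs -- here the key is the date token
    that begins each (hypothetical) log line."""
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines in the raw dump
            yield parts[0], 1

def reduce_phase(pairs):
    """Sum the counts for each key, as a reducer would do
    after the shuffle/sort step groups pairs by key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Unstructured log lines, as they might land in the filesystem each day:
logs = [
    "2023-04-01 user login ok",
    "2023-04-01 user upload failed",
    "2023-04-02 user login ok",
]
counts = reduce_phase(map_phase(logs))
print(counts)  # {'2023-04-01': 2, '2023-04-02': 1}
```

The appeal is that no schema is designed in advance: the structure lives in the throwaway map and reduce functions, written only when a question needs answering.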
There's an interesting culture clash between the traditional enterprise world of the data warehouse and the world symbolized by newer, open-source projects such as Hadoop. Users of each are beginning to find ways in which these technologies complement one another. One thing that distributed processing technologies are starting to reveal is that perhaps data silos aren't such a bad thing after all.
Data Silos Can Be Good
Like many of the terms you hear in the enterprise data world, such as “business intelligence” or “ETL,” the term “data silo” should be up for scrutiny. Data silos have a bad reputation because of all the challenges we've discussed previously. As data sizes grow, providing a single repository for all of an organization's data is not always the most practical solution. In a world without any feasible way to handle the massive amounts of data being generated across an organization, the only practical solution has been to make a clean, streamlined copy of aggregated data and store it in a data warehouse.
Now even small organizations are beginning to have access to technology that can process large amounts of data on demand, and storage keeps getting cheaper as well. This means it is becoming economically feasible to simply keep data in whatever silo best fits the use case and run reporting tools against it. In this model, the focus is on keeping data where it works best and finding ways to process it for analysis when the need arises. In other words, as one business-intelligence software developer once told me: “Data silos exist because they are useful.”
The reality is that data warehousing doesn't solve every data silo challenge effectively. Nor is it practical to approach every data problem ad hoc, assuming Hadoop will be the answer to every use case. Dumping raw, unstructured data from operational data stores into a distributed filesystem lacks some of the discipline imposed by good data warehouse design. Certainly, there are good reasons to use data warehousing technology, and until a single database system can serve all aspects of an organization's data needs, data must be moved from system to system to be used effectively.
Similarly, managing the complexity of ad hoc systems requires a great deal of effort. Building custom systems that bridge disparate data sources can be difficult and requires a different sort of expertise.