Since the dawn of the information age, clueless managers have approached beleaguered
software engineers with requests to take on impossible data processing tasks.
Invariably, these requests involve data interoperability between incompatible systems.
Someone from the upper ranks will say, “Why can't System X simply talk to System Y? All
we need to do is take the gigabytes of data that we collect each day on our Web site,
merge them with the gigabytes of data we store in our data warehouse, and send the
results to our visualization tool. Oh, and we are going to need you to go ahead and
get it to us by 8 a.m. tomorrow.”
In response, the engineers bring up the issues of incompatible formats, inconsistent
data, the time it takes to process all those records, and how it's not their job, and
finally they remind management that none of these problems would have happened if
the company hadn't purchased that particular database.
If there's one thing that our woefully ignorant management and our oppressed
worker bees may both be able to agree on, it is that there can be a great deal of
value in combining large datasets in useful ways. The term “unlock” is often used to
describe the process of processing, joining, and transforming data in order to discover
a previously unknown fact or relationship. I think that this metaphor is misleading as
it assumes that there is always a kind of buried treasure at the end of a data processing
rainbow. Perhaps a better way of looking at our data is to imagine that it exists to help
us answer questions and tell stories. Data needs to be transformed into a state in which
the stories can be told more completely, and sometimes our questions need to be asked
in the right language.
Transformations
In Chapter 8, we took a look at how easy it can be to build MapReduce-based data
pipelines using Hadoop's Streaming API and Python MapReduce frameworks such
as mrjob. For relatively simple processing tasks that require only a few MapReduce
steps (such as the ubiquitous example of loading text from many files and counting the
unique words within), it's easy to manually define mapper and reducer phases using a
scripting language such as Python.
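As a sketch of that ubiquitous word-count example, the mapper and reducer phases can be written as plain Python functions. The small driver below simulates Hadoop's shuffle step in-process so the example runs without a cluster; in a real pipeline the same `mapper` and `reducer` logic would be wired up through the Streaming API or a framework like mrjob.

```python
from collections import defaultdict

def mapper(line):
    # Emit a (word, 1) pair for every whitespace-separated token.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum the partial counts for a single word.
    yield word, sum(counts)

def run_pipeline(lines):
    # Simulate Hadoop's shuffle: group mapper output by key,
    # then hand each key's list of values to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key, values in sorted(groups.items()):
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

counts = run_pipeline(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

The shuffle simulation is the only piece Hadoop would normally provide; the mapper and reducer bodies are exactly what you would hand to a streaming job.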
In practice, data pipelines can become very complex. For example, a common
pipeline challenge that large companies face is the need to merge data from a variety
of disparate databases in a timely manner. In some cases, records in the individual
data sources may be linked by a particular key: for example, a common date or email
address. Records may need to be filtered by a particular value, or certain values in
certain records must be “normalized” and set to a common value. A large number of
individual MapReduce steps might be needed to arrive at the desired result.
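One MapReduce pattern for this kind of merge is a reduce-side join: each mapper normalizes the join key and tags records with their source, and the reducer combines everything that shares a key. The sketch below uses hypothetical record layouts (a web log and a CRM export linked by email address) and simulates the shuffle in-process for illustration.

```python
from collections import defaultdict

# Hypothetical sample records from two disparate sources,
# linked by a shared email-address key.
web_logs = [{"email": "a@example.com", "page": "/home"},
            {"email": "b@example.com", "page": "/buy"}]
crm = [{"email": "A@Example.com", "name": "Alice"}]

def mapper(source, record):
    # "Normalize" the join key to a common value (lowercase email)
    # and tag each record with the source it came from.
    key = record["email"].lower()
    yield key, (source, record)

def reducer(key, tagged_records):
    # Merge all records that share the same normalized key.
    merged = {"email": key}
    for source, record in tagged_records:
        for field, value in record.items():
            if field != "email":
                merged[field] = value
    yield key, merged

# In-process stand-in for the shuffle phase.
groups = defaultdict(list)
for source, records in (("web", web_logs), ("crm", crm)):
    for record in records:
        for key, value in mapper(source, record):
            groups[key].append(value)

joined = {}
for key, values in groups.items():
    for out_key, merged in reducer(key, values):
        joined[out_key] = merged
```

After the join, `joined["a@example.com"]` carries fields from both sources, while the record with no CRM match passes through with only its web-log fields. Filtering by value would be one more mapper; it is easy to see how the step count grows.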
In these cases, maintaining MapReduce-based data-transformation code can be
complicated. Passing around key-value pairs using streaming scripts is manageable for