In Chapter 9, “Building Data Transformation Workflows with Pig and Cascading,” we
took a look at how to build transformation pipelines with Hadoop. Here is another case
of using overlapping technologies: One can choose to build a pipeline solution program-
matically using Cascading or stick to a higher-level abstraction by using Apache Pig. Both
technologies were developed independently and cover some of the same use cases. A software
company building a data processing application on top of Hadoop would probably
use Cascading, whereas a data analyst is more likely to use a tool such as Pig.
In the midst of this flux, developers, analysts, and data scientists all have practical
problems to solve and rarely have unlimited time or money to solve them. In some
cases, we'll want to pay someone else to do the hard work; in others, we will need to dig
in and build our own solutions. Let's take a look at some basic guidelines to help us
figure out how to navigate the constantly evolving data landscape.
Understanding Your Data Problem
Let's revisit a central theme of this book, which is to understand the data problem
before working out a solution. It seems obvious, but when working with difficult data
problems many pitfalls can be avoided by fleshing out both the end goals and the audience
you are trying to serve.
When deciding on what to build in-house and what to outsource, business strategists
often talk about understanding your organization's “core competencies.” Technology
strategist and author Geoffrey Moore, well known for his work on innovation and tech-
nology adoption cycles, has written a great deal about this concept. In Moore's view,
organizations should concentrate energy on building core technologies—those that help
an organization differentiate itself from others. 1 In contrast, other activities are “contex-
tual”; they can help the organization maintain parity with everyone else, but will not
help it distinguish itself. The conventional wisdom around this concept is that an organization
should devote energy to building unique technology that provides competitive
differentiation. While doing this, it should outsource other technology problems to
outside vendors. This concept is key for small organizations trying to focus on work
that will help them gain a foothold in the market. Why administer a mail server when
it's much easier and likely cheaper overall to pay for a Web-based email service?
Although this idea comes from specific, corporate business management cases, the
concept is not just for corporations. A small research organization, an academic unit,
a game startup, and a data journalist each have a particular core focus. In each of these
cases, it's likely that the user would want the technology to get out of the way as much
as possible. Even the hackers who love to dabble in the latest and greatest (and believe
me, I know how fun it can be) have deadlines to meet and budgets to stay below.
The key lesson here is to understand whether the data challenge you are facing is
one that you must solve yourself or one that is commonplace enough that someone has
1. Moore, Geoffrey A. “Managing Inertia in Your Enterprise.” Chap. 11 in Dealing with Darwin:
How Great Companies Innovate at Every Phase of Their Evolution. New York: Portfolio, 2008.
 
 