proof-of-concept projects. Finally, go through the process of understanding potential
pitfalls when you need to scale up data processing tasks.
What Have You Already Invested In?
Before you do anything, understand the technologies in which you've already made
investments. Do you already have access to an internal data center? As we've seen,
there are many advantages to using clusters of virtualized servers in the cloud over
physical hardware in house. These include flexibility in pricing models and the ability to expand or contract the number of nodes as necessary. These advantages might not apply if your organization has already made an investment in physical hardware and maintenance.
Your organizational culture will also help dictate which data technologies you ultimately use. Is your group already proficient in using a particular database or platform?
If so, consider sticking with the technology that your team knows best, even if it is not
commonly accepted as the most scalable or cost-effective solution. An example of this
approach can be found in the posts of the Server Fault blog, home of the engineering team behind the popular Web site Stack Exchange. In one post, entitled “Why Stack Exchange Isn't
in the Cloud,” Kyle Brandt explains, “We don't just love programming and our Web
applications. We get excited learning about computer hardware, operating systems,
history, computer games, and new innovations.” 2 The post goes on to explain that the
entire engineering team has the skills and interest to maintain hardware, and this core
competency helps determine what they do. Obviously, the Stack Exchange team has
the experience and organizational culture to handle the tasks and optimize the costs of
infrastructure management. Other organizations may not have this level of expertise
or passion for handling hardware.
Starting Small
You've clearly defined your use case and your audience, and you've scoped out
your existing resources. Now it's time to collect and crunch those large, valuable
datasets—right?
A common red herring in the world of data science is to immediately start big. It's
dangerous to latch on to a trendy Big Data technology to solve a problem that could
just as easily have been approached with traditional database tools or desktop software.
Organizations are feeling the pressure to derive value from large datasets. The Apache
Hadoop project has been hyped to no end by technology media as the be-all and end-all accessible solution to a variety of data use cases. However, the Hadoop framework,
which provides an accessible way to distribute MapReduce-based jobs over a cluster of
servers, is not always the best solution for the job.
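To make this concrete, consider the canonical MapReduce example of counting term frequencies. On a modest sample of data, the same map step (tokenize each line) and reduce step (sum the counts per term) fit in a few lines of standard-library Python; the following sketch is purely illustrative, and the sample filename is a hypothetical placeholder.

from collections import Counter

def count_terms(path):
    # "Map" step: tokenize each line into terms.
    # "Reduce" step: sum the occurrences of each term.
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts.update(line.split())
    return counts

# Print the ten most frequent terms in a small sample file.
for term, count in count_terms("sample.txt").most_common(10):
    print(term, count)

If a job like this runs comfortably on a single machine, distributing it across a Hadoop cluster adds operational complexity without adding value; the cluster earns its keep only when the data no longer fits on one machine.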
When trying to make a decision about software, one strategy that often pays off is
to build proof-of-concept solutions using small samples of data. In doing so, the goal
2. http://blog.serverfault.com/2011/11/17/why-stack-exchange-isn%E2%80%99t-in-the-cloud/