is to remove as many variables as possible in order to evaluate the pain points of building
a system from scratch. If prohibitive factors emerge when deploying a solution
for even a small subset of the data, then a larger data challenge should certainly be
solved using a commercial solution.
Proof-of-concept projects can even be handled on single workstations. Processing
data using scripting tools such as Python, sed, and awk on a local machine can
sometimes be all that's needed. Many of the distributed-data tools featured in this book,
such as Apache Hadoop, can be run in single-server mode locally on a workstation.
Even better, much of the same code used to define batch processes can be reused in
a cluster environment.
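A workstation-scale proof of concept can be as small as a few lines of script. As a minimal sketch, the following Python snippet tallies status codes from hypothetical two-field "path status" log lines, much as one might with awk; the field layout and sample data are assumptions for illustration:

```python
from collections import Counter

def count_status_codes(lines):
    """Tally status codes from whitespace-delimited log lines.

    The two-field "path status" layout is an assumption made for
    this sketch, not a fixed log format.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

# A small in-memory sample; a real run would iterate over a file object
sample = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
]
print(count_status_codes(sample))
```

If a batch job like this answers the question at hand, there may be no need for distributed infrastructure at all; if it does not, the same aggregation logic maps naturally onto a Hadoop-style map and reduce step.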
Planning for Scale
From your explorations with a proof of concept, you might have gotten some ideas
about the types of skills necessary to build your tools with existing open-source tech-
nologies. Perhaps, for example, you now have a plan for analyzing the last month's
worth of data you have collected. What happens when you want to analyze all the data
you've collected for the past five years? Will the technology that you are evaluating be
easy to use when data sizes grow? Will you need additional hardware, personnel, or
organizational practices in the event of additional data growth?
A common pattern when dealing with ever-growing data sizes is to start with a
familiar, mature technology, only to face the need to change it radically as data
sizes grow. An example is beginning a data collection and processing challenge using
a well-known relational database (such as MySQL) on a single server. Then, as the
problems of scale begin to appear, the solution must be moved to an entirely different
infrastructure, often involving a nonrelational database. Depending on the problem
being solved, some commercially available solutions become prohibitively expensive at
scale or may not even perform acceptably beyond certain data sizes.
Some database designs lend themselves well to being distributed across multiple
machines (see Chapter 3, “Building a NoSQL-Based Web App to Collect Crowd-
Sourced Data”). However, the amount of effort required to actually implement them
can be nontrivial. In such cases, it may make sense to purchase the services of a cloud-
based nonrelational database solution (such as Amazon's DynamoDB) rather than invest
the effort to administer an ever-growing cluster of Redis machines. In
summary, never commit to a course of action before having a plan for dealing with data
as it grows. If working with a commercial solution, determine the vendor's recommended
limits, and make sure there is a path to additional technologies if your data
challenge might overwhelm those limits.
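One way to reduce the cost of such a migration is to keep application code behind a small storage interface, so that the backing store can be swapped later without rewriting callers. The interface and class names below are invented for illustration; this is a minimal sketch, not a prescribed design:

```python
class KeyValueStore:
    """Illustrative minimal storage interface (names are assumptions)."""

    def put(self, key, value):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError

class InMemoryStore(KeyValueStore):
    """Dict-backed store, adequate for a single-machine proof of concept."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

# A Redis- or DynamoDB-backed class exposing the same put/get methods
# could replace InMemoryStore without changes to calling code.
store = InMemoryStore()
store.put("user:42", "alice")
print(store.get("user:42"))
```

Whether this indirection is worthwhile depends on how likely a backend change is; the point is that a later decision to move to a managed service need not ripple through application logic.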
My Own Private Data Center
The state of the art in large-scale data analysis software is often based on distributed
systems of many servers deployed directly on physical hardware or as virtual machines
and linked together in a network. As network communication can often be the