is to remove as many variables as possible in order to evaluate the pain points of building
a system from scratch. If prohibitive factors emerge when deploying a solution
for even a small subset of the data, then a larger data challenge should certainly be
solved using a commercial solution.
Proof-of-concept projects can even be handled on single workstations. Processing
data using scripting tools such as Python, sed, and awk on a local machine can
sometimes be all that's needed. Many of the distributed-data tools featured in this book,
such as Apache Hadoop, can be run in single-server mode locally on a workstation.
Even better, much of the same code used to define batch processes can be reused in
a cluster environment.
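A workstation-scale proof of concept can be as small as a few lines of script. As a minimal sketch, the following Python snippet tallies status codes from hypothetical two-field "path status" log lines, much as one might with awk; the field layout and sample data are assumptions for illustration:

```python
from collections import Counter

def count_status_codes(lines):
    """Tally status codes from whitespace-delimited log lines.

    The two-field "path status" layout is an assumption made for
    this sketch, not a fixed log format.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts

# A small in-memory sample; a real run would iterate over a file object
sample = [
    "/index.html 200",
    "/missing 404",
    "/index.html 200",
]
print(count_status_codes(sample))
```

If a batch job like this answers the question at hand, there may be no need for distributed infrastructure at all; if it does not, the same aggregation logic maps naturally onto a Hadoop-style map and reduce step.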
Planning for Scale
From your explorations with a proof of concept, you might have gotten some ideas
about the types of skills necessary to build your tools with existing open-source tech-
nologies. Perhaps, for example, you now have a plan for analyzing the last month's
worth of data you have collected. What happens when you want to analyze all the data
you've collected for the past five years? Will the technology that you are evaluating be
easy to use when data sizes grow? Will you need additional hardware, personnel, or
organizational practices in the event of additional data growth?
A common pattern when dealing with ever-growing data sizes is to start with a
familiar, mature technology, only to face the need to change it radically as data
sizes grow. An example is beginning a data collection and processing challenge using
a well-known relational database (such as MySQL) on a single server. Then, as the
problems of scale begin to appear, the solution must be moved to an entirely different
infrastructure, often involving a nonrelational database. Depending on the problem
being solved, some commercially available solutions become prohibitively expensive at
scale or may not even perform acceptably beyond certain data sizes.
Some database designs lend themselves well to being distributed across multiple
machines (see Chapter 3, “Building a NoSQL-Based Web App to Collect Crowd-
Sourced Data”). However, the amount of effort required to actually implement them
can be nontrivial. In such cases, it may make sense to purchase the services of a cloud-
based nonrelational database solution (such as Amazon's DynamoDB) rather than invest
the effort to administer an ever-growing cluster of Redis machines. In
summary, never commit to a course of action before having a plan for dealing with data
as it grows. If working with a commercial solution, determine the vendor's recommended
limits, and make sure there is a path to additional technologies if your data
challenge might overwhelm those limits.
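One way to reduce the cost of such a migration is to keep application code behind a small storage interface, so that the backing store can be swapped later without rewriting callers. The interface and class names below are invented for illustration; this is a minimal sketch, not a prescribed design:

```python
class KeyValueStore:
    """Illustrative minimal storage interface (names are assumptions)."""

    def put(self, key, value):
        raise NotImplementedError

    def get(self, key):
        raise NotImplementedError

class InMemoryStore(KeyValueStore):
    """Dict-backed store, adequate for a single-machine proof of concept."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

# A Redis- or DynamoDB-backed class exposing the same put/get methods
# could replace InMemoryStore without changes to calling code.
store = InMemoryStore()
store.put("user:42", "alice")
print(store.get("user:42"))
```

Whether this indirection is worthwhile depends on how likely a backend change is; the point is that a later decision to move to a managed service need not ripple through application logic.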
My Own Private Data Center
The state of the art in large-scale data analysis software is often based on distributed
systems of many servers deployed directly on physical hardware or as virtual machines
and linked together in a network. As network communication can often be the