bottleneck in these systems, latency may be reduced by having the machines in close
proximity to one another. This means we need to have direct control of the com-
puter hardware itself to solve our problem. We will also need space, plenty of power, a
backup power supply, security, and cooling systems. In other words, we need to build
and maintain our own data center.
Or do we? Computing is on its way to becoming a utility, and in the future, a lot
of the computing resources we consume will be available in much the same way as
water and power: metered service right out of the tap. For many software applica-
tions, most of the heavy lifting will take place on platforms or virtual machines with
the bulk of processing taking place far away in large data centers. This trend is already
very visible on the Web and with mobile applications. From Yelp to Netflix to your
favorite social games, how many apps on your smartphone are essentially just interfaces
to cloud-based services?
Unfortunately, many hurdles must be overcome before the cloud can become the
de facto home of data processing. A common mantra of large data processing is to
make sure that processing takes place as close to the data as possible. This concept is
what makes the design of Hadoop so attractive: data is distributed across server
nodes, and processing takes place on the nodes where the data resides. In order to use cloud systems for the
processing of large amounts of in-house data, data would need to be moved using the
relatively small bandwidth of the Internet. Similarly, data generated by an application
hosted on one cloud provider might need to be moved to another cloud service for
processing. These steps take time and reduce the overall performance of the system in
comparison to a solution in which the data is accessible in a single place. Most impor-
tantly, there are a range of security, compliance, and regulatory concerns that need to
be addressed when moving data from one place to another.
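A quick back-of-envelope calculation illustrates how much the network matters. The sketch below, with illustrative (not sourced) dataset sizes and link speeds, compares the time to move a dataset over a typical Internet uplink versus a local data-center network:

```python
# Back-of-envelope estimate of bulk data transfer time.
# The dataset size and link speeds are assumptions for illustration only.

def transfer_time_hours(size_tb: float, link_mbps: float) -> float:
    """Hours needed to move size_tb terabytes over a link_mbps link."""
    size_bits = size_tb * 1e12 * 8           # terabytes -> bits
    seconds = size_bits / (link_mbps * 1e6)  # bits / (bits per second)
    return seconds / 3600

# 10 TB over a 100 Mbps Internet uplink vs. a 10 Gbps data-center LAN
print(round(transfer_time_hours(10, 100), 1))     # ~222.2 hours (over 9 days)
print(round(transfer_time_hours(10, 10_000), 1))  # ~2.2 hours
```

Even ignoring protocol overhead and contention, shipping tens of terabytes across the public Internet takes days, while the same transfer inside a data center takes hours; this is the arithmetic behind the "move the computation, not the data" mantra.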
The disadvantages of using a public cloud include an inability to make changes to
the underlying infrastructure, and this loss of control can even result in greater costs
overall. Maintaining your own hardware also provides some flexibility in wringing
every last bit of performance from the system. For some applications, this might be a
major concern.
Currently, it's possible to access cloud computing resources that are maintained in
off-site data centers as a service. These are sometimes referred to by the slightly
misleading term private clouds. It is also possible to lease dedicated servers in data
centers that don't share hardware with other customers. These private clouds can often
provide more control over the underlying hardware, leading to the potential for higher
performance data processing.
A potential advantage of not dealing with physical infrastructure is that more time
can be devoted to data analysis. If your company is building a Web application, why
divert engineering resources to the administrative overhead of managing the security
and networking needed to run a cluster of computers? In
reality, depending on the type of application being built, managing clusters of virtual
server instances in the cloud might be just as time consuming as managing physical
hardware. To truly avoid the overhead of managing infrastructure, the best solution is
to use data processing as a service tools (discussed later in this chapter).
 