provides the ability to search the entire Internet, purchase any product from any seller
anywhere in the world, or provide social networking services for anyone on the planet
with access to the Internet. The massive scale of the World Wide Web, as well as the
constantly accelerating growth in the total number of Internet users, presented an
almost impossible task for software engineers: finding solutions that potentially could
be scaled to the needs of every human being to collect, store, and process the world's
data.
Traditional data analysis software, such as spreadsheets and relational databases, as
reliable and widespread as it has been, was generally designed to be used on a single
machine. To scale these systems to unprecedented size, computer scientists needed to
build software that could run on clusters of machines.
The Big Data Trade-Off
Because of the incredible task of dealing with the data needs of the World Wide
Web and its users, Internet companies and research organizations realized that a new
approach to collecting and analyzing data was necessary. Since off-the-shelf, commodity
computer hardware was getting cheaper every day, it made sense to think about
distributing database software across many readily available servers built from
commodity parts. Data processing and information retrieval could be farmed out to a
collection of smaller computers linked together over a network. This type of computing
model is generally referred to as distributed computing. In many cases, deploying
a large number of small, cheap servers in a distributed computing system can be more
economically feasible than buying a custom-built single machine with the same
computation capabilities.
While the hardware model for tackling massive scale data problems was being
developed, database software started to evolve as well. The relational database model,
for all of its benefits, runs into limitations that make it challenging to deploy in a
distributed computing network. First of all, sharding a relational database across
multiple machines can often be a nontrivial exercise. Because of the need to coordinate
between the various machines in a cluster, maintaining a state of data consistency at
any given moment can become tricky. Furthermore, most relational databases are designed
to guarantee data consistency; in a distributed network, where that guarantee requires
constant coordination among machines, this design can become a liability.
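To make the sharding difficulty concrete, here is a minimal sketch of hash-based key partitioning, the simplest way to split records across machines. All names below are hypothetical illustrations, not the API of any particular database:

```python
import hashlib

def shard_for_key(key: str, num_shards: int) -> int:
    """Map a record key to one of num_shards servers via a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Every node that applies this function independently routes the same
# key to the same shard, with no central coordinator required.
shard = shard_for_key("user:1001", num_shards=4)
```

The routing itself is trivial; the nontrivial part the text alludes to is everything around it, such as redistributing data when the number of shards changes and keeping any operation that touches multiple shards consistent.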
Software designers began to make trade-offs to accommodate the advantages of
using distributed networks to address the scale of the data coming from the Internet.
Perhaps the overall rock-solid consistency of the relational database model was less
important than making sure there was always a machine in the cluster available to
process a small bit of data; the system could always bring the data back into a
consistent state eventually. Does the data actually have to be indexed? Why use a fixed
schema at all? Maybe databases could simply store individual records, each with a
different schema, and possibly with redundant data.
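The idea of storing individual records without a fixed schema can be sketched as a simple key-value document store. This is a hypothetical illustration of the concept, not the interface of any specific system:

```python
# Each record is a free-form document; no fixed schema is enforced.
store = {}

def put(record_id, record):
    """Store a record under an ID, whatever fields it happens to have."""
    store[record_id] = record

put("u1", {"name": "Ada", "email": "ada@example.com"})
put("u2", {"name": "Grace", "last_login": "2013-04-01", "tags": ["admin"]})
# Both records coexist even though their fields differ.
```

A relational table would reject the second record for having columns the first lacks; a schema-per-record store simply accepts both, at the cost of pushing schema interpretation onto whoever reads the data later.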
 
 