not for a single user on a desktop, but for anyone who had access to an Internet connection. Much of the growth of the open-source MySQL database was due to the availability of easy-to-use integrations with Web-friendly scripting languages such as Perl and PHP.
Codd's concept of database structure requires an upfront understanding of the data: schemas and relationships must be defined before a single record is inserted. The relational model also demands a bit of work on the part of the database software itself. Consider the process of an application writing a record to a relational database; the data might be represented using more than one table. The database itself must take care to ensure that the data is consistent after the write has occurred, which incurs a bit of computational overhead.
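As a minimal sketch of what this looks like in practice, consider the following example using Python's built-in sqlite3 module; the orders and order_items tables and their columns are invented purely for illustration. The schema must be declared before any row can be inserted, and a logical record that spans two tables is written inside a single transaction so that the database can keep both tables consistent.

import sqlite3

conn = sqlite3.connect(":memory:")

# The schema must exist before a single row can be inserted.
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL
    );
    CREATE TABLE order_items (
        item_id  INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        product  TEXT NOT NULL,
        quantity INTEGER NOT NULL
    );
""")

# One logical record (an order plus its line items) spans two tables.
# Wrapping the writes in a single transaction lets the database guarantee
# that either both tables are updated or neither is.
with conn:
    cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", ("alice",))
    conn.execute(
        "INSERT INTO order_items (order_id, product, quantity) VALUES (?, ?, ?)",
        (cur.lastrowid, "widget", 3),
    )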
However, as the user base of a Web site grows, so does the need to handle the scale of data. Some of the early Web pioneers such as Amazon and Google found that relational databases were not always the right tool for the job. The priorities of existing relational database systems were geared more toward consistency than availability.
Consider an online messaging system in which users post and share comments publicly with other users. A relational design might define a table to keep track of individual users, with each user assigned a unique identifier. In order to facilitate message sharing, we would also need a table relating each posted message to information about its target recipient. Although heavily simplified, this type of system is not unlike the many comment and blog systems currently used on the Web.
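A rough sketch of such a schema, again using Python's sqlite3 module, might look like the following; the table and column names are hypothetical, chosen only to mirror the description above.

import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- Each user is assigned a unique identifier.
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );

    -- Each posted message is related to its author and target recipient.
    CREATE TABLE messages (
        message_id   INTEGER PRIMARY KEY,
        sender_id    INTEGER NOT NULL REFERENCES users(user_id),
        recipient_id INTEGER NOT NULL REFERENCES users(user_id),
        body         TEXT NOT NULL
    );
""")

# Listing everything shared with a given user requires joining both tables.
rows = conn.execute("""
    SELECT u.name, m.body
    FROM messages AS m
    JOIN users AS u ON u.user_id = m.sender_id
    WHERE m.recipient_id = ?
""", (42,)).fetchall()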
Now imagine that the Web site has gone viral and that millions of users access this online system at all times. How can we handle the scale? With computing prices dropping every day, servers and hard disks are available to handle quite a lot of transactional processing. At some point, however, a single machine might not be able to handle the load of many thousands of queries every second. Furthermore, Web traffic, log data, and other factors may mean that, over time, it is simply not possible to keep upgrading a single server. The need for higher capacity and greater data throughput requires other strategies.
Although commodity computer hardware tends to become cheaper over time, continually upgrading to ever more massive server hardware has historically been economically infeasible: spending twice as much money on a single huge machine may not provide double the performance. In contrast, smaller, more modest servers remain inexpensive. In general, it makes more economic sense to scale horizontally: in other words, to simply add more cheap machines to the system rather than try to run a single relational database on one expensive, massive server.
In order to guarantee the performance of this Web application, one might consider splitting the relational tables across a collection of machines. Perhaps each table could reside on a different machine; it might be possible to split, or shard, individual tables of a relational database system off to a dedicated server. At some point, however, the table with the most data might again become too large to host on a single machine. Situations like this create bottlenecks in our system. When faced with the onslaught of Web-scale data, the popular relational database model begins to create very challenging problems.
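To make the sharding idea concrete, here is a purely hypothetical sketch of application-level routing; the server names and the hashing scheme are invented for illustration. Each table lives on its own dedicated server, and the rows of the largest table are further split across several machines by hashing a shard key.

from zlib import crc32

# Hypothetical mapping of tables to dedicated database servers. The messages
# table has outgrown a single machine, so its rows are spread across three.
TABLE_SERVERS = {
    "users":    ["db-users.example.internal"],
    "messages": ["db-msg-0.example.internal",
                 "db-msg-1.example.internal",
                 "db-msg-2.example.internal"],
}

def server_for(table: str, shard_key: int) -> str:
    """Pick the server that holds a given row of a given table."""
    servers = TABLE_SERVERS[table]
    # Hash the shard key so rows are spread evenly across the servers.
    return servers[crc32(str(shard_key).encode()) % len(servers)]

# The application must now know where a row lives before it can query it,
# and a join between users and messages can no longer be answered by any
# single server.
host = server_for("messages", shard_key=42)

Routing logic like this pushes work that the relational database once handled, such as joins and consistency across tables, into the application itself.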
 