NoSQL: What It Is and Why You Need It - Professional NoSQL - page 5

CHALLENGES OF RDBMS

The challenges of RDBMS for massive Web-scale data processing aren't specific to a product but pertain to the entire class of such databases. RDBMS assumes a well-defined structure in data. It assumes that the data is dense and largely uniform. RDBMS builds on a prerequisite that the properties of the data can be defined up front and that its interrelationships are well established and systematically referenced. It also assumes that indexes can be consistently defined on data sets and that such indexes can be uniformly leveraged for faster querying. Unfortunately, RDBMS starts to show signs of giving way as soon as these assumptions don't hold true. RDBMS can certainly deal with some irregularities and lack of structure, but in the context of massive sparse data sets with loosely defined structures, RDBMS appears a forced fit. With massive data sets, the typical storage mechanisms and access methods also get stretched. Denormalizing tables, dropping constraints, and relaxing transactional guarantees can help an RDBMS scale, but after these modifications an RDBMS starts resembling a NoSQL product.
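
To make the "forced fit" concrete, here is a minimal sketch in Python. It assumes a hypothetical product catalog (the column names and items are invented for illustration) and compares how many cells a single fixed-schema table would leave as NULL against storing only the attributes each record actually defines. It is an illustration of sparsity, not a benchmark of any particular database.

```python
# Hypothetical catalog: every item defines only a few of the possible attributes.
ALL_COLUMNS = ["name", "price", "isbn", "author",
               "screen_size", "wattage", "fabric", "shoe_size"]

items = [
    {"name": "Novel", "price": 12.0, "isbn": "000-0", "author": "A. Writer"},
    {"name": "Monitor", "price": 199.0, "screen_size": 27},
    {"name": "Lamp", "price": 25.0, "wattage": 40},
]


def to_fixed_rows(records, columns):
    """Lay records out as fixed-width rows, as one relational table would.

    Attributes a record does not define become None (standing in for NULL).
    """
    return [tuple(rec.get(col) for col in columns) for rec in records]


def null_fraction(rows):
    """Return the share of cells in the fixed-width layout that are NULL."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


rows = to_fixed_rows(items, ALL_COLUMNS)
fixed_cells = len(rows) * len(ALL_COLUMNS)     # cells the fixed schema allocates
sparse_cells = sum(len(rec) for rec in items)  # cells actually holding data

print(f"fixed schema cells: {fixed_cells}, populated cells: {sparse_cells}")
print(f"NULL fraction in fixed layout: {null_fraction(rows):.2f}")
```

With only three items and eight columns, more than half of the fixed-width cells are already empty. Each new product type adds columns that every other row must carry, which is the pressure that pushes designs toward a schema-per-record layout.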

Flexibility comes at a price. NoSQL alleviates the problems that RDBMS imposes and makes it easy to work with large sparse data, but in turn takes away the power of transactional integrity and flexible indexing and querying. Ironically, one of the features most missed in NoSQL is SQL, and product vendors in the space are making all sorts of attempts to bridge this gap.

Google has, over the past few years, built out a massively scalable infrastructure for its search engine and other applications, including Google Maps, Google Earth, GMail, Google Finance, and Google Apps. Google's approach was to solve the problem at every level of the application stack. The goal was to build a scalable infrastructure for parallel processing of large amounts of data. Google therefore created a full mechanism that included a distributed filesystem, a column-family-oriented data store, a distributed coordination system, and a MapReduce-based parallel algorithm execution environment. Graciously enough, Google published and presented a series of papers explaining some of the key pieces of its infrastructure. The most important of these publications are as follows:

➤ Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. “The Google File System”; pub. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003. URL: http://labs.google.com/papers/gfs.html

➤ Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”; pub. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004. URL: http://labs.google.com/papers/mapreduce.html

➤ Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed Storage System for Structured Data”; pub. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006. URL: http://labs.google.com/papers/bigtable.html
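
The MapReduce programming model named in the list above can be sketched in a few lines of Python. This is a single-process illustration of the map, shuffle, and reduce steps applied to word counting; it is not Google's implementation, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the sample documents are assumptions made for this example. A real deployment would spread the map and reduce calls across many machines and read its input from a distributed filesystem.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def run_mapreduce(documents):
    """Run the three steps sequentially over a dict of doc_id -> text."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_phase(doc_id, text))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "the fox",
}
counts = run_mapreduce(docs)
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, neither step depends on the others' results, which is what allows the work to be parallelized across a cluster.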
