CHALLENGES OF RDBMS
The challenges of RDBMS for massive Web-scale data processing aren't specific to
a product but pertain to the entire class of such databases. An RDBMS assumes a
well-defined structure in data. It assumes that the data is dense and largely uniform.
An RDBMS builds on the prerequisite that the properties of the data can be defined
up front and that its interrelationships are well established and systematically
referenced. It also assumes that indexes can be consistently defined on data sets and
that such indexes can be uniformly leveraged for faster querying. Unfortunately,
an RDBMS starts to show signs of giving way as soon as these assumptions don't hold
true. An RDBMS can certainly deal with some irregularities and lack of structure, but
in the context of massive sparse data sets with loosely defined structures, an RDBMS
appears to be a forced fit. With massive data sets the typical storage mechanisms and
access methods also get stretched. Denormalizing tables, dropping constraints,
and relaxing transactional guarantees can help an RDBMS scale, but after these
modifications an RDBMS starts resembling a NoSQL product.
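To make the "dense and uniform" assumption concrete, the following minimal Python sketch compares how many cells a fixed-schema table would hold against how many values a sparse, per-record layout needs. The product records and attribute names are hypothetical and serve only as an illustration; no particular database product stores data exactly this way.

```python
# Hypothetical records: each one has only a few of the many possible attributes.
records = [
    {"id": 1, "name": "laptop", "ram_gb": 16},
    {"id": 2, "name": "novel", "author": "A. Writer", "pages": 320},
    {"id": 3, "name": "t-shirt", "size": "M", "color": "blue"},
]

# Fixed-schema view: the table needs one column per attribute seen anywhere,
# and every row carries every column, so missing attributes become None (NULL).
columns = sorted({key for record in records for key in record})
dense_rows = [tuple(record.get(col) for col in columns) for record in records]

total_cells = len(dense_rows) * len(columns)
null_cells = sum(cell is None for row in dense_rows for cell in row)

# Sparse view: each record stores only the attributes it actually has.
stored_cells = sum(len(record) for record in records)

print(f"fixed schema: {total_cells} cells, {null_cells} of them NULL")
print(f"sparse layout: {stored_cells} values stored")
```

With only three records, nearly half of the fixed-schema cells are already empty; the gap widens as the number of distinct attributes grows relative to the attributes each record actually uses.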
Flexibility comes at a price. NoSQL alleviates the problems that RDBMS imposes
and makes it easy to work with large sparse data, but in turn takes away the power
of transactional integrity and flexible indexing and querying. Ironically, one of
the features most missed in NoSQL is SQL, and product vendors in the space are
making all sorts of attempts to bridge this gap.
Google has, over the past few years, built out a massively scalable infrastructure for its search engine
and other applications, including Google Maps, Google Earth, Gmail, Google Finance, and Google
Apps. Google's approach was to solve the problem at every level of the application stack. The
goal was to build a scalable infrastructure for parallel processing of large amounts of data. Google
therefore created a complete stack that included a distributed filesystem, a column-family-oriented
data store, a distributed coordination system, and a MapReduce-based parallel algorithm execution
environment. Graciously enough, Google published and presented a series of papers explaining some
of the key pieces of its infrastructure. The most important of these publications are as follows:
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. “The Google File System”; pub.
19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
URL: http://labs.google.com/papers/gfs.html
Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on
Large Clusters”; pub. OSDI'04: Sixth Symposium on Operating System Design and
Implementation, San Francisco, CA, December 2004.
URL: http://labs.google.com/papers/mapreduce.html
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed
Storage System for Structured Data”; pub. OSDI'06: Seventh Symposium on Operating
System Design and Implementation, Seattle, WA, November 2006.
URL: http://labs.google.com/papers/bigtable.html