to tweak them to suit its needs. Some have argued that this is simply legacy that has stuck, but the performance numbers clearly show that Facebook has figured out how to make it scalable.
Like Facebook, Twitter and LinkedIn have adopted polyglot persistence. Twitter, for example,
uses MySQL and Cassandra actively. Twitter also uses a graph database named FlockDB for maintaining relationships, such as who's following whom and whom you receive phone notifications
from. Twitter's popularity and data volume have grown immensely over the years. Kevin Weil's
September 2010 presentation (www.slideshare.net/kevinweil/analyzing-big-data-at-twitter-web-20-expo-nyc-sep-2010) claims that tweets and direct messages add up to 12 TB a day, which, extrapolated linearly, amounts to over 4 petabytes a year. These numbers are bound to keep growing as more people adopt Twitter and use tweets to communicate with the world. Manipulating this volume of data is a huge challenge. Twitter
uses Hadoop and MapReduce functionality to analyze the large data set. Twitter leverages the high-
level language Pig (http://pig.apache.org/) for data analysis; Pig statements compile into MapReduce jobs that run on a Hadoop cluster. A lot of Twitter's core storage still depends on MySQL, which is heavily used across multiple features. Cassandra is used for a select few use cases, such as storing geocentric data.
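To make the Hadoop and MapReduce side of this concrete, here is a minimal sketch, not Twitter's actual code, of the kind of job such analysis reduces to: counting tweets per day from a hypothetical tab-separated log of (tweet id, timestamp, user, text) records. The input layout, field positions, and paths are assumptions made purely for illustration.

// A minimal sketch (not Twitter's actual code): count tweets per day from a
// hypothetical tab-separated log of (tweet id, timestamp, user, text) records.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerDay {

    // Map phase: emit (day, 1) for every tweet record.
    public static class DayMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text day = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length < 2 || fields[1].length() < 10) {
                return; // skip malformed records
            }
            day.set(fields[1].substring(0, 10)); // assume an ISO timestamp; keep yyyy-MM-dd
            context.write(day, ONE);
        }
    }

    // Reduce phase (also used as a combiner): sum the counts for each day.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text day, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(day, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "tweets-per-day");
        job.setJarByClass(TweetsPerDay.class);
        job.setMapperClass(DayMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of tweet logs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. a reports directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A Pig script expressing the same GROUP BY and COUNT would compile into an equivalent pair of map and reduce phases, which is what makes Pig attractive for analysis at this scale.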
LinkedIn, like Twitter, relies on a host of different types of data stores. Jay Kreps provided a preview of the large-scale data architecture and data manipulation at LinkedIn at the 2010 Hadoop Summit. The slides from that presentation are available online at www.slideshare.net/ydn/6-data-applicationlinkedinhadoopsummmit2010. LinkedIn uses Hadoop for many large-scale analytics
jobs, such as probabilistically predicting the people you may know. The data set acted upon by the Hadoop cluster is fairly large, typically more than 120 billion relationships a day, and the computation is carried out by around 82 Hadoop jobs that require over 16 TB of intermediate data. The probabilistic graphs are copied from the batch offline storage to a live NoSQL cluster. The NoSQL database is Voldemort, a key/value store modeled on Amazon's Dynamo. The relationship
graph data is read-only, so Voldemort's eventually consistent model doesn't cause any problems. The relationship data is processed in batch mode but filtered through a faceted search in real time. These filters may, for example, exclude people whom a member has indicated they don't know.
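As a rough illustration of the serving side of this architecture, and not LinkedIn's actual code, the following sketch reads a precomputed "people you may know" candidate list from a Voldemort store using Voldemort's standard Java client; the store name, key format, and bootstrap URL are assumptions made for the example.

import java.util.List;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class PeopleYouMayKnowLookup {

    public static void main(String[] args) {
        // Bootstrap against the Voldemort cluster (the URL is an assumption for this example).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // The batch Hadoop jobs populate this read-only store, keyed by member id;
        // the store name and key format are made up for illustration.
        StoreClient<String, List<String>> store = factory.getStoreClient("people-you-may-know");

        Versioned<List<String>> candidates = store.get("member:12345");
        if (candidates != null) {
            for (String candidate : candidates.getValue()) {
                System.out.println(candidate); // candidate member ids to show in the UI
            }
        }
    }
}

Because the store is bulk-loaded offline and then served read-only, a momentarily stale read during replication is harmless, which is why eventual consistency is acceptable in this use case.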
Looking at Facebook, Twitter, and LinkedIn, it becomes clear that polyglot persistence has its benefits and leads to an optimal stack, where each data store is used appropriately for the use case at hand.
Data Warehousing and Business Intelligence
An entire category of applications is built to store and manipulate archived data sets. Usually, these
data warehouses are built out of old transactional data, which is typically referred to as fact data.
Data in the warehouse is then analyzed and manipulated to uncover patterns or decipher trends. All such archived and warehoused data is read-only, and the transactional requirements for such data stores are minimal. These data sets have traditionally been stored in special-purpose data
stores, which have the capability to store large volumes of data and analyze the data on the basis of
multiple dimensions.
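To ground the terminology, the following toy sketch (the fact record, dimension, and measure names are invented) rolls immutable fact records up along a single dimension by summing a measure; a warehouse performs the same kind of aggregation across many dimensions, over far larger volumes, and in a specialized store rather than application code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FactRollup {

    // An immutable fact record: one historical transaction, never updated in place.
    record SaleFact(String region, String product, String day, double amount) {}

    // Roll the facts up along one dimension (region here), summing the amount measure.
    static Map<String, Double> totalByRegion(List<SaleFact> facts) {
        Map<String, Double> totals = new HashMap<>();
        for (SaleFact fact : facts) {
            totals.merge(fact.region(), fact.amount(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<SaleFact> facts = List.of(
                new SaleFact("EMEA", "widget", "2010-09-01", 120.0),
                new SaleFact("EMEA", "gadget", "2010-09-01", 80.0),
                new SaleFact("APAC", "widget", "2010-09-02", 200.0));
        // Prints the roll-up, e.g. {APAC=200.0, EMEA=200.0} (map order is unspecified).
        System.out.println(totalByRegion(facts));
    }
}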
With the advent of Hadoop, some of this large-scale analytics is done by MapReduce-based jobs. The MapReduce model of analytics is being enriched by the availability of querying tools like Hive and high-level workflow-definition languages like Pig. Added to this, innovation in the MapReduce space keeps expanding. The Apache Mahout project, for example, builds a machine-learning library on top of Hadoop's MapReduce infrastructure.