to tweak them to suit its needs. Some have argued that this is simply legacy that has stuck, but the performance numbers clearly show that Facebook has figured out how to make it scalable.
Like Facebook, Twitter and LinkedIn have adopted polyglot persistence. Twitter, for example,
uses MySQL and Cassandra actively. Twitter also uses a graph database named FlockDB for maintaining relationships, such as who's following whom and whom you receive phone notifications
from. Twitter's popularity and data volume have grown immensely over the years. Kevin Weil's
September 2010 presentation (www.slideshare.net/kevinweil/analyzing-big-data-at-twitter-web-20-expo-nyc-sep-2010) claims that tweets and direct messages add up to 12 TB a day, which, extrapolated linearly, amounts to over 4 petabytes a year. These numbers are bound to keep growing as more people adopt Twitter and use tweets to communicate with the world. Manipulating this volume of data is a huge challenge. Twitter
uses Hadoop and MapReduce functionality to analyze the large data set. Twitter leverages the high-
level language Pig (http://pig.apache.org/) for data analysis; Pig statements compile into MapReduce jobs that run on a Hadoop cluster. A lot of Twitter's core storage still depends on MySQL, which is heavily used across multiple features. Cassandra is used for a select few use cases, such as storing geocentric data.
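To make the Hadoop and MapReduce side of this concrete, here is a minimal sketch, not Twitter's actual code, of the kind of job such analysis reduces to: counting tweets per day from a hypothetical tab-separated log of (tweet id, timestamp, user, text) records. The input layout, field positions, and paths are assumptions made purely for illustration.

// A minimal sketch (not Twitter's actual code): count tweets per day from a
// hypothetical tab-separated log of (tweet id, timestamp, user, text) records.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetsPerDay {

    // Map phase: emit (day, 1) for every tweet record.
    public static class DayMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text day = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length < 2 || fields[1].length() < 10) {
                return; // skip malformed records
            }
            day.set(fields[1].substring(0, 10)); // assume an ISO timestamp; keep yyyy-MM-dd
            context.write(day, ONE);
        }
    }

    // Reduce phase (also used as a combiner): sum the counts for each day.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text day, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(day, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "tweets-per-day");
        job.setJarByClass(TweetsPerDay.class);
        job.setMapperClass(DayMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a directory of tweet logs
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. a reports directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A Pig script expressing the same GROUP BY and COUNT would compile into an equivalent pair of map and reduce phases, which is what makes Pig attractive for analysis at this scale.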
LinkedIn, like Twitter, relies on a host of different types of data stores. Jay Kreps provided a preview of the large-scale data architecture and data manipulation at LinkedIn at the 2010 Hadoop Summit. The slides from that presentation are available online at www.slideshare.net/ydn/6-data-applicationlinkedinhadoopsummmit2010. LinkedIn uses Hadoop for many large-scale analytics
jobs, such as probabilistically predicting the people you may know. The data set acted upon by the Hadoop cluster is fairly large, typically more than 120 billion relationships a day, and the computation is carried out by around 82 Hadoop jobs that require over 16 TB of intermediate data. The probabilistic graphs are copied from the batch offline storage to a live NoSQL cluster. The NoSQL database is Voldemort, a key/value store modeled on Amazon's Dynamo. The relationship
graph data is read-only, so Voldemort's eventually consistent model doesn't cause any problems. The relationship data is processed in batch mode but filtered through a faceted search in real time. These filters may, for example, exclude people whom a member has indicated they don't know.
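As a rough illustration of the serving side of this architecture, and not LinkedIn's actual code, the following sketch reads a precomputed "people you may know" candidate list from a Voldemort store using Voldemort's standard Java client; the store name, key format, and bootstrap URL are assumptions made for the example.

import java.util.List;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class PeopleYouMayKnowLookup {

    public static void main(String[] args) {
        // Bootstrap against the Voldemort cluster (the URL is an assumption for this example).
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // The batch Hadoop jobs populate this read-only store, keyed by member id;
        // the store name and key format are made up for illustration.
        StoreClient<String, List<String>> store = factory.getStoreClient("people-you-may-know");

        Versioned<List<String>> candidates = store.get("member:12345");
        if (candidates != null) {
            for (String candidate : candidates.getValue()) {
                System.out.println(candidate); // candidate member ids to show in the UI
            }
        }
    }
}

Because the store is bulk-loaded offline and then served read-only, a momentarily stale read during replication is harmless, which is why eventual consistency is acceptable in this use case.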
Looking at Facebook, Twitter, and LinkedIn, it becomes clear that polyglot persistence has its benefits and leads to an optimal stack, where each data store is used appropriately for the use case at hand.
Data Warehousing and Business Intelligence
An entire category of applications is built to store and manipulate archived data sets. Usually, these
data warehouses are built out of old transactional data, which is typically referred to as fact data.
Data in the warehouse is then analyzed and manipulated to uncover patterns or decipher trends. All such archived and warehoused data is read-only, and the transactional requirements for such data stores are minimal. These data sets have traditionally been stored in special-purpose data
stores, which have the capability to store large volumes of data and analyze the data on the basis of
multiple dimensions.
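To ground the terminology, the following toy sketch (the fact record, dimension, and measure names are invented) rolls immutable fact records up along a single dimension by summing a measure; a warehouse performs the same kind of aggregation across many dimensions, over far larger volumes, and in a specialized store rather than application code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FactRollup {

    // An immutable fact record: one historical transaction, never updated in place.
    record SaleFact(String region, String product, String day, double amount) {}

    // Roll the facts up along one dimension (region here), summing the amount measure.
    static Map<String, Double> totalByRegion(List<SaleFact> facts) {
        Map<String, Double> totals = new HashMap<>();
        for (SaleFact fact : facts) {
            totals.merge(fact.region(), fact.amount(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<SaleFact> facts = List.of(
                new SaleFact("EMEA", "widget", "2010-09-01", 120.0),
                new SaleFact("EMEA", "gadget", "2010-09-01", 80.0),
                new SaleFact("APAC", "widget", "2010-09-02", 200.0));
        // Prints the roll-up, e.g. {APAC=200.0, EMEA=200.0} (map order is unspecified).
        System.out.println(totalByRegion(facts));
    }
}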
With the advent of Hadoop, some of this large-scale analytics is done by MapReduce-based jobs. The MapReduce model of analytics is being enriched by the availability of querying tools like Hive and high-level workflow-definition languages like Pig. Added to this, innovation in the MapReduce space keeps expanding. The Apache Mahout project, for example, builds a machine-learning library on top of Hadoop's MapReduce infrastructure.