Hadoop and Cascading at ShareThis
ShareThis is a sharing network that makes it simple to share any online content. With the
click of a button on a web page or browser plug-in, ShareThis allows users to seamlessly
access their contacts and networks from anywhere online and share the content via email,
IM, Facebook, Digg, mobile SMS, and similar services, without ever leaving the current
page. Publishers can deploy the ShareThis button to tap into the service's universal sharing
capabilities to drive traffic, stimulate viral activity, and track the sharing of online content.
ShareThis also simplifies social media services by reducing clutter on web pages and
providing instant distribution of content across social networks, affiliate groups, and communities.
As ShareThis users share pages and information through the online widgets, a continuous stream of events enters the ShareThis network. These events are first filtered and processed,
and then handed to various backend systems, including AsterData, Hypertable, and Katta.
The volume of these events can be huge: too large to process with traditional systems. This data can also be very “dirty” thanks to “injection attacks” from rogue systems, browser
bugs, or faulty widgets. For this reason, the developers at ShareThis chose to deploy Hadoop as the preprocessing and orchestration frontend to their backend systems. They also
chose to use Amazon Web Services to host their servers on the Elastic Compute Cloud (EC2) and to provide long-term storage on the Simple Storage Service (S3), with an eye toward leveraging Elastic MapReduce (EMR).
In this overview, we will focus on the “log processing pipeline” (Figure 24-9). This pipeline simply takes data stored in an S3 bucket, processes it (as described shortly), and
stores the results back into another bucket. The Simple Queue Service (SQS) is used to coordinate the events that mark the start and completion of data processing runs. Downstream, other processes pull data to load into AsterData, pull URL lists from Hypertable to
source a web crawl, or pull crawled page data to create Lucene indexes for use by Katta.
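The coordination pattern just described, reading from one S3 bucket, processing, writing to another, with SQS messages marking the start and completion of a run, can be sketched as follows. This is a minimal illustration only: it uses in-memory stand-ins for S3 and SQS, and the bucket names, message fields, and processing step are hypothetical, not ShareThis's actual code.

```python
import json
import queue

# In-memory stand-ins for the S3 buckets and the SQS queue. In the real
# pipeline these would be AWS API calls; the object key and field names
# here are illustrative assumptions.
input_bucket = {"logs/2009-06-01.log": "raw event data"}
output_bucket = {}
sqs = queue.Queue()

def process(raw):
    # Placeholder for the filtering/cleaning step described in the text.
    return raw.upper()

# A producer marks the start of a processing run.
sqs.put(json.dumps({"event": "start", "key": "logs/2009-06-01.log"}))

# The pipeline worker: pull a start event, process the named object,
# store the result in the output bucket, then signal completion so
# downstream consumers (AsterData loaders, crawlers, indexers) can act.
msg = json.loads(sqs.get())
if msg["event"] == "start":
    key = msg["key"]
    output_bucket[key] = process(input_bucket[key])
    sqs.put(json.dumps({"event": "complete", "key": key}))

done = json.loads(sqs.get())
print(done["event"])       # completion marker for downstream processes
print(output_bucket[key])  # processed payload now in the output bucket
```

The point of the queue is decoupling: the worker and its downstream consumers never call each other directly, they only exchange start/complete messages, which is what lets each backend system pull data at its own pace.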
Note that Hadoop is central to the ShareThis architecture. It is used to coordinate the processing and movement of data between architectural components.