Hadoop and Cascading at ShareThis
ShareThis is a sharing network that makes it simple to share any online content. With the
click of a button on a web page or browser plug-in, ShareThis allows users to seamlessly
access their contacts and networks from anywhere online and share the content via email,
IM, Facebook, Digg, mobile SMS, and similar services, without ever leaving the current
page. Publishers can deploy the ShareThis button to tap into the service's universal sharing
capabilities to drive traffic, stimulate viral activity, and track the sharing of online content.
ShareThis also simplifies social media services by reducing clutter on web pages and
providing instant distribution of content across social networks, affiliate groups, and communities.
As ShareThis users share pages and information through the online widgets, a continuous stream of events enters the ShareThis network. These events are first filtered and processed,
and then handed to various backend systems, including AsterData, Hypertable, and Katta.
The volume of these events can be huge: too large to process with traditional systems. This data can also be very “dirty” thanks to “injection attacks” from rogue systems, browser
bugs, or faulty widgets. For this reason, the developers at ShareThis chose to deploy Hadoop as the preprocessing and orchestration frontend to their backend systems. They also
chose to use Amazon Web Services to host their servers on the Elastic Compute Cloud (EC2) and to provide long-term storage on the Simple Storage Service (S3), with an eye toward leveraging Elastic MapReduce (EMR).
In this overview, we will focus on the “log processing pipeline” (Figure 24-9). This pipeline simply takes data stored in an S3 bucket, processes it (as described shortly), and
stores the results back into another bucket. The Simple Queue Service (SQS) is used to coordinate the events that mark the start and completion of data processing runs. Downstream, other processes pull data to load into AsterData, pull URL lists from Hypertable to
source a web crawl, or pull crawled page data to create Lucene indexes for use by Katta.
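The coordination pattern just described, reading from one S3 bucket, processing, writing to another, with SQS messages marking the start and completion of a run, can be sketched as follows. This is a minimal illustration only: it uses in-memory stand-ins for S3 and SQS, and the bucket names, message fields, and processing step are hypothetical, not ShareThis's actual code.

```python
import json
import queue

# In-memory stand-ins for the S3 buckets and the SQS queue. In the real
# pipeline these would be AWS API calls; the object key and field names
# here are illustrative assumptions.
input_bucket = {"logs/2009-06-01.log": "raw event data"}
output_bucket = {}
sqs = queue.Queue()

def process(raw):
    # Placeholder for the filtering/cleaning step described in the text.
    return raw.upper()

# A producer marks the start of a processing run.
sqs.put(json.dumps({"event": "start", "key": "logs/2009-06-01.log"}))

# The pipeline worker: pull a start event, process the named object,
# store the result in the output bucket, then signal completion so
# downstream consumers (AsterData loaders, crawlers, indexers) can act.
msg = json.loads(sqs.get())
if msg["event"] == "start":
    key = msg["key"]
    output_bucket[key] = process(input_bucket[key])
    sqs.put(json.dumps({"event": "complete", "key": key}))

done = json.loads(sqs.get())
print(done["event"])       # completion marker for downstream processes
print(output_bucket[key])  # processed payload now in the output bucket
```

The point of the queue is decoupling: the worker and its downstream consumers never call each other directly, they only exchange start/complete messages, which is what lets each backend system pull data at its own pace.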
Note that Hadoop is central to the ShareThis architecture. It is used to coordinate the processing and movement of data between architectural components.