Cascading - Hadoop: The Definitive Guide


Figure 24-9. The ShareThis log processing pipeline

With Hadoop as the frontend, all the event logs can be parsed, filtered, cleaned, and organized by a set of rules before ever being loaded into the AsterData cluster or used by any other component. AsterData is a clustered data warehouse that can support large datasets and that allows for complex ad hoc queries using a standard SQL syntax. ShareThis chose to clean and prepare the incoming datasets on the Hadoop cluster and then to load that data into the AsterData cluster for ad hoc analysis and reporting. Though that process would have been possible with AsterData, it made a lot of sense to use Hadoop as the first stage in the processing pipeline to offset load on the main data warehouse.
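
The parse, filter, and clean stage described above can be sketched in plain Java. This is only an illustration of the idea of applying small, separate rules to raw log lines before loading them elsewhere; it is not the ShareThis code and does not use the Cascading or Hadoop APIs. The tab-separated log format, the `Event` fields, and the names `LogPrep`, `parse`, `isValid`, `clean`, and `prepare` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical sketch: each rule is a small standalone method, and
// prepare() chains them in the order parse -> filter -> clean.
class LogPrep {

    /** One event; the three fields are assumed for illustration only. */
    static final class Event {
        final String timestamp;
        final String type;
        final String url;

        Event(String timestamp, String type, String url) {
            this.timestamp = timestamp;
            this.type = type;
            this.url = url;
        }
    }

    // Parse rule: split an assumed tab-separated line into three fields.
    // Lines with any other field count are treated as malformed.
    static Optional<Event> parse(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length != 3) {
            return Optional.empty();
        }
        return Optional.of(new Event(parts[0], parts[1], parts[2]));
    }

    // Filter rule: drop events whose type or URL is blank.
    static boolean isValid(Event e) {
        return !e.type.trim().isEmpty() && !e.url.trim().isEmpty();
    }

    // Clean rule: trim whitespace and normalize the event type to lowercase.
    static Event clean(Event e) {
        return new Event(e.timestamp.trim(),
                         e.type.trim().toLowerCase(Locale.ROOT),
                         e.url.trim());
    }

    // Apply the rules in order; only events that pass every rule are kept.
    static List<Event> prepare(List<String> rawLines) {
        List<Event> out = new ArrayList<>();
        for (String line : rawLines) {
            parse(line)
                .filter(LogPrep::isValid)
                .map(LogPrep::clean)
                .ifPresent(out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "2010-01-01T00:00:00Z\t SHARE \thttp://example.com/a ",
            "not a valid line",
            "2010-01-01T00:00:05Z\t\thttp://example.com/b");
        // Print the surviving events as tab-separated rows, ready for loading.
        for (Event e : prepare(raw)) {
            System.out.println(e.timestamp + "\t" + e.type + "\t" + e.url);
        }
    }
}
```

Keeping each rule in its own method means a new rule can be added or tested in isolation, which is the same property the text attributes to building the pipeline out of separate processing steps.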

Cascading was chosen as the primary data processing API to simplify the development process, codify how data is coordinated between architectural components, and provide the developer-facing interface to those components. This represents a departure from more “traditional” Hadoop use cases, which essentially just query stored data. Cascading and Hadoop together provide a better and simpler structure for the complete solution, end to end, and thus provide more value to the users.

For the developers, Cascading made it easy to start with a simple unit test (created by subclassing cascading.ClusterTestCase) that did simple text parsing and then to layer in more processing rules while keeping the application logically organized for maintenance. Cascading aided this organization in a couple of ways. First, standalone opera-
