of analysis. Those aspects are particularly important for the scale of operations at firms
such as Twitter, eBay, LinkedIn, Etsy, etc., where Scalding is deployed.
Keep in mind that Apache Hadoop is based on the MapReduce research made public
by Google nearly a decade ago. MapReduce became an important component of Google's
internal technology for large-scale batch workflows. Meanwhile, Google has continued
to evolve its infrastructure; estimates place its current technology stack at least three
generations beyond the original MapReduce work. The public sees only portions of that
massive R&D effort (e.g., in papers about Dremel, Pregel, etc.).
What becomes clear from the published works is that Google scientists and engineers
leverage advanced techniques based on abstract algebra, linear algebra for very large
sparse matrices, sketches, etc., to build robust, efficient infrastructure at massive scale.
Scalding represents a relatively public view of comparable infrastructure.
Let's start here with a few simple examples in Scalding. Given a few subtle changes in
the code, some of our brief examples can be turned into state-of-the-art parallel
processing at scale. For instance, check out the PageRank implementation shown in the
Scalding source, and also these sample recommender systems written by Twitter.
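To make that concrete, here is a minimal sketch of the canonical "word count" job in
Scalding's fields-based API. The WordCount class name and the --input/--output
argument names are illustrative choices for this sketch, not requirements of the library:

import com.twitter.scalding._

// Read lines of text, split each line into lowercase words,
// then group by word and count the occurrences of each.
// Input and output paths arrive as the --input and --output arguments.
class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

The same dozen lines run unchanged in local mode or across a Hadoop cluster, which is
precisely the property that the more ambitious examples exploit.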
Getting Started with Scalding
The best resource for getting started with Scalding is the project wiki page on GitHub.
In addition to Git and Java, which were set up in Chapter 1, you will need to have a few
other platforms and tools installed for the examples in this chapter:
Scala
The current version of Scalding works with Scala versions 2.8.1, 2.9.1, and 2.9.2.
Simple Build Tool, a.k.a. SBT
Must be version 0.11.3.
Ruby
Required for the scald.rb script; use the most recent stable release.
Also, be sure to put the executable for sbt in your PATH.
The scald.rb script provides a limited command-line interface (CLI) for Scalding, so
that one can conveniently compile and launch apps. Keep in mind that this is not a build
system. For any serious work, you are better off using a build tool such as Gradle to
create a “fat jar” that includes all the class dependencies that are not available on your
Hadoop cluster. More about that later.
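In the meantime, as a rough sketch of the fat-jar approach, here is how sbt itself can
produce one via the sbt-assembly plugin, as an alternative to the Gradle route just
mentioned. The project name, artifact coordinates, and version numbers below are
assumptions; match them to your own environment and cluster:

// project/plugins.sbt -- add the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.8.3")

// build.sbt -- versions shown are assumptions, not requirements
import AssemblyKeys._

assemblySettings

name := "scalding-examples"

scalaVersion := "2.9.2"

libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.8.1"

// mark Hadoop as "provided" so the cluster's own jars are used at runtime
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.3" % "provided"

Running sbt assembly then yields a single jar with your job classes plus the Scalding
and Cascading dependencies, suitable for submission with hadoop jar.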
 
 