of analysis. Those aspects are particularly important for the scale of operations at firms
such as Twitter, eBay, LinkedIn, Etsy, etc., where Scalding is deployed.
Keep in mind that Apache Hadoop is based on the MapReduce research made public
by Google nearly a decade ago. MapReduce became an important component of Google's
internal technology for large-scale batch workflows. Meanwhile, Google has continued
to evolve its infrastructure; estimates place its current technology stack at least three
generations beyond the original MapReduce work. The public sees only portions of that
massive R&D effort (e.g., in papers about Dremel, Pregel, etc.).
What becomes clear from the published works is that Google scientists and engineers
leverage advanced techniques based on abstract algebra, linear algebra for very large
sparse matrices, sketches, etc., to build robust, efficient infrastructure at massive scale.
Scalding represents a relatively public view of comparable infrastructure.
Let's start here with a few simple examples in Scalding. Given a few subtle changes in
the code, some of our brief examples can be turned into state-of-the-art parallel
processing at scale. For instance, check out the PageRank implementation shown in the
Scalding source, and also these sample recommender systems written by Twitter.
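To make that concrete, here is a minimal sketch of the canonical "word count" job in
Scalding's fields-based API. The WordCount class name and the --input/--output
argument names are illustrative choices for this sketch, not requirements of the library:

import com.twitter.scalding._

// Read lines of text, split each line into lowercase words,
// then group by word and count the occurrences of each.
// Input and output paths arrive as the --input and --output arguments.
class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

The same dozen lines run unchanged in local mode or across a Hadoop cluster, which is
precisely the property that the more ambitious examples exploit.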
Getting Started with Scalding
The best resource for getting started with Scalding is the project wiki page on GitHub.
In addition to Git and Java, which were set up in Chapter 1, you will need to have a few
other platforms and tools installed for the examples in this chapter:
Scala
The current version of Scalding works with Scala versions 2.8.1, 2.9.1, and 2.9.2.
Simple Build Tool, a.k.a. SBT
Must be version 0.11.3.
Ruby
Required for the scald.rb script; use the most recent stable release.
Also, be sure to put the executable for sbt in your PATH.
The scald.rb script provides a limited command-line interface (CLI) for Scalding, so
that one can conveniently compile and launch apps. Keep in mind that this is not a build
system. For any serious work, you are better off using a build tool such as Gradle to
create a “fat jar” that includes all the class dependencies that are not available on your
Hadoop cluster. More about that later.
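In the meantime, as a rough sketch of the fat-jar approach, here is how sbt itself can
produce one via the sbt-assembly plugin, as an alternative to the Gradle route just
mentioned. The project name, artifact coordinates, and version numbers below are
assumptions; match them to your own environment and cluster:

// project/plugins.sbt -- add the sbt-assembly plugin (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.8.3")

// build.sbt -- versions shown are assumptions, not requirements
import AssemblyKeys._

assemblySettings

name := "scalding-examples"

scalaVersion := "2.9.2"

libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.8.1"

// mark Hadoop as "provided" so the cluster's own jars are used at runtime
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.3" % "provided"

Running sbt assembly then yields a single jar with your job classes plus the Scalding
and Cascading dependencies, suitable for submission with hadoop jar.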
 
 