Chapter 18. Crunch
Apache Crunch is a higher-level API for writing MapReduce pipelines. The main advantages it offers over plain MapReduce are its focus on programmer-friendly Java types like String and plain old Java objects, a richer set of data transformation operations, and multistage pipelines (no need to explicitly manage individual MapReduce jobs in a workflow).
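To make these advantages concrete, here is a minimal sketch of a Crunch word-count pipeline; the input and output paths are hypothetical placeholders, but the Pipeline, DoFn, and PCollection types are the core Crunch primitives:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import static org.apache.crunch.types.writable.Writables.strings;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // A pipeline backed by the MapReduce execution engine.
    Pipeline pipeline = new MRPipeline(WordCount.class);

    // Read text lines as a collection of plain Java Strings.
    PCollection<String> lines = pipeline.readTextFile("input.txt");

    // Split each line into words; the DoFn is ordinary Java,
    // embedded right in the program.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, strings());

    // Count occurrences of each word and write the result out.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, "output");
    pipeline.done();
  }
}
```

Note that the whole multistage job is expressed as one program; Crunch plans and runs the underlying MapReduce jobs itself.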
In these respects, Crunch looks a lot like a Java version of Pig. One day-to-day source of friction in using Pig, which Crunch avoids, is that the language used to write user-defined functions (Java or Python) is different from the language used to write Pig scripts (Pig Latin), which makes for a disjointed development experience as one switches between the two different representations and languages. By contrast, Crunch programs and UDFs are written in a single language (Java or Scala), and UDFs can be embedded right in the programs. The overall experience feels very much like writing a non-distributed program. Although it has many parallels with Pig, Crunch was inspired by FlumeJava, the Java library developed at Google for building MapReduce pipelines.
NOTE

FlumeJava is not to be confused with Apache Flume, covered in Chapter 14, which is a system for collecting streaming event data. You can read more about FlumeJava in "FlumeJava: Easy, Efficient Data-Parallel Pipelines" by Craig Chambers et al.
Because they are high level, Crunch pipelines are highly composable, and common functions can be extracted into libraries and reused in other programs. This is different from MapReduce, where it is very difficult to reuse code: most programs have custom mapper and reducer implementations, apart from simple cases such as where an identity function or a simple sum (LongSumReducer) is called for. Writing a library of mappers and reducers for different types of transformations, like sorting and joining operations, is not easy in MapReduce, whereas in Crunch it is very natural. For example, there is a library class, org.apache.crunch.lib.Sort, with a sort() method that will sort any Crunch collection that is passed to it.
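Used from a pipeline, that library method reduces a full sorting job to a single call; the following sketch assumes hypothetical input and output paths:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Sort;

public class SortLines {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(SortLines.class);

    // Read lines, sort the whole collection in ascending order,
    // and write the sorted result back out. Paths are placeholders.
    PCollection<String> lines = pipeline.readTextFile("input.txt");
    PCollection<String> sorted = Sort.sort(lines);
    pipeline.writeTextFile(sorted, "sorted-output");
    pipeline.done();
  }
}
```

The same Sort.sort() call works on any Crunch collection, which is exactly the kind of reusable building block that is hard to package up in plain MapReduce.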
Although Crunch was initially written to run using Hadoop's MapReduce execution engine, it is not tied to it, and in fact you can run a Crunch pipeline using Apache Spark (see Chapter 19) as the distributed execution engine. Different engines have different characteristics: Spark, for example, is more efficient than MapReduce if there is a lot of intermediate data to be passed between jobs, since it can retain the data in memory rather than material-