Chapter 18. Crunch
Apache Crunch is a higher-level API for writing MapReduce pipelines. The main advantages it offers over plain MapReduce are its focus on programmer-friendly Java types like String and plain old Java objects, a richer set of data transformation operations, and multistage pipelines (no need to explicitly manage individual MapReduce jobs in a workflow).
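To make these advantages concrete, here is a minimal sketch of a Crunch word-count pipeline; the input and output paths are hypothetical placeholders, but the Pipeline, DoFn, and PCollection types are the core Crunch primitives:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import static org.apache.crunch.types.writable.Writables.strings;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // A pipeline backed by the MapReduce execution engine.
    Pipeline pipeline = new MRPipeline(WordCount.class);

    // Read text lines as a collection of plain Java Strings.
    PCollection<String> lines = pipeline.readTextFile("input.txt");

    // Split each line into words; the DoFn is ordinary Java,
    // embedded right in the program.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, strings());

    // Count occurrences of each word and write the result out.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, "output");
    pipeline.done();
  }
}
```

Note that the whole multistage job is expressed as one program; Crunch plans and runs the underlying MapReduce jobs itself.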
In these respects, Crunch looks a lot like a Java version of Pig. One day-to-day source of friction in using Pig, which Crunch avoids, is that the language used to write user-defined functions (Java or Python) is different from the language used to write Pig scripts (Pig Latin), which makes for a disjointed development experience as one switches between the two different representations and languages. By contrast, Crunch programs and UDFs are written in a single language (Java or Scala), and UDFs can be embedded right in the programs. The overall experience feels very much like writing a non-distributed program. Although it has many parallels with Pig, Crunch was inspired by FlumeJava, the Java library developed at Google for building MapReduce pipelines.
NOTE

FlumeJava is not to be confused with Apache Flume, covered in Chapter 14, which is a system for collecting streaming event data. You can read more about FlumeJava in "FlumeJava: Easy, Efficient Data-Parallel Pipelines" by Craig Chambers et al.
Because they are high level, Crunch pipelines are highly composable, and common functions can be extracted into libraries and reused in other programs. This is different from MapReduce, where it is very difficult to reuse code: most programs have custom mapper and reducer implementations, apart from simple cases such as where an identity function or a simple sum (LongSumReducer) is called for. Writing a library of mappers and reducers for different types of transformations, like sorting and joining operations, is not easy in MapReduce, whereas in Crunch it is very natural. For example, there is a library class, org.apache.crunch.lib.Sort, with a sort() method that will sort any Crunch collection that is passed to it.
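Used from a pipeline, that library method reduces a full sorting job to a single call; the following sketch assumes hypothetical input and output paths:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.lib.Sort;

public class SortLines {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(SortLines.class);

    // Read lines, sort the whole collection in ascending order,
    // and write the sorted result back out. Paths are placeholders.
    PCollection<String> lines = pipeline.readTextFile("input.txt");
    PCollection<String> sorted = Sort.sort(lines);
    pipeline.writeTextFile(sorted, "sorted-output");
    pipeline.done();
  }
}
```

The same Sort.sort() call works on any Crunch collection, which is exactly the kind of reusable building block that is hard to package up in plain MapReduce.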
Although Crunch was initially written to run using Hadoop's MapReduce execution engine, it is not tied to it, and in fact you can run a Crunch pipeline using Apache Spark (see Chapter 19) as the distributed execution engine. Different engines have different characteristics: Spark, for example, is more efficient than MapReduce if there is a lot of intermediate data to be passed between jobs, since it can retain the data in memory rather than material-