Enter Apache Crunch
Bringing together and analyzing such disparate datasets creates many demands, but a few
stood out:
▪ We needed to split many processing steps into modules that could easily be
assembled into a sophisticated pipeline.
▪ We needed to offer a higher-level programming model than raw MapReduce.
▪ We needed to work with the complex structure of medical records, which have
several hundred unique fields and several levels of nested substructures.
We explored a variety of options in this case, including Pig, Hive, and Cascading. Each of
these worked well, and we continue to use Hive for ad hoc analysis, but they were
unwieldy when applying arbitrary logic to our complex data structures. Then we heard of
Crunch (see Chapter 18), a project led by Josh Wills that is similar to the FlumeJava system
from Google. Crunch offers a simple Java-based programming model and static type
checking of records, a perfect fit for our community of Java developers and the type of
data we were working with.