Composable Data at Cerner - Hadoop: The Definitive Guide


Enter Apache Crunch

Bringing together and analyzing such disparate datasets creates a lot of demands, but a few stood out:

▪ We needed to split many processing steps into modules that could easily be assembled into a sophisticated pipeline.

▪ We needed to offer a higher-level programming model than raw MapReduce.

▪ We needed to work with the complex structure of medical records, which have several hundred unique fields and several levels of nested substructures.

We explored a variety of options in this case, including Pig, Hive, and Cascading. Each of these worked well, and we continue to use Hive for ad hoc analysis, but they were unwieldy when applying arbitrary logic to our complex data structures. Then we heard of Crunch (see Chapter 18), a project led by Josh Wills that is similar to the FlumeJava system from Google. Crunch offers a simple Java-based programming model and static type checking of records, a perfect fit for our community of Java developers and the type of data we were working with.
