Chapter 5. Analytic Helpers
Now that you've ingested data into your Hadoop cluster, what's next? Usually you'll want to
start by simply cleansing or transforming your data. This could be as simple as reformatting
fields and removing corrupt records, or it could involve all manner of complex aggregation,
enrichment, and summarization. Once you've cleaned up your data, you may be satisfied to
simply push it into a more traditional data store, such as a relational database, and consider
your big data work done. On the other hand, you may want to continue working with
your data, running specialized machine-learning algorithms to categorize it or perhaps
performing some sort of geospatial analysis.
In this chapter, we're going to talk about two types of tools:
MapReduce interfaces
General-purpose tools that make it easier to process your data
Analytic libraries
Focused-purpose libraries that include functionality to make it easier to analyze your data
MapReduce Interfaces
In the early days of Hadoop, the only way to process the data in your system was to work
with MapReduce in Java, but this approach presented a couple of major problems:
▪ Your analytic writers need to understand not only your business and your data, but
also Java code.
▪ Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query.
For example, the process of developing and testing a simple analytic written directly in
MapReduce might look something like the following for a developer (a sketch of such a
job appears after these steps):
1. Write about a hundred lines of Java MapReduce.
2. Compile the code into a JAR (Java archive) file.
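To make step 1 concrete, here is a minimal sketch of the kind of Java MapReduce a developer might write for the cleansing scenario described earlier: a map-only job that keeps well-formed records and drops corrupt ones. The class names, the comma-delimited input, and the five-field validity check are illustrative assumptions, not taken from any particular codebase; only the Hadoop classes (Job, Mapper, and the file input/output formats) are real APIs.

// A minimal, hypothetical record-cleansing job in classic Java MapReduce.
// The CSV format and five-field check are assumptions for illustration.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanseRecords {

  // Emits only well-formed lines; any line that does not parse as a
  // five-field CSV record is treated as corrupt and silently dropped.
  public static class CleanseMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 5) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cleanse records");
    job.setJarByClass(CleanseRecords.class);
    job.setMapperClass(CleanseMapper.class);
    job.setNumReduceTasks(0);  // map-only: no aggregation needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this trivial job cannot simply be typed into a console: it has to be compiled, packaged into a JAR, and submitted to the cluster (for example, with hadoop jar cleanse.jar CleanseRecords input output) before a single record is processed, which is exactly the friction the interfaces discussed in this chapter aim to remove.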