Chapter 5. Analytic Helpers
Now that you've ingested data into your Hadoop cluster, what's next? Usually you'll want to
start by simply cleansing or transforming your data. This could be as simple as reformatting
fields and removing corrupt records, or it could involve all manner of complex aggregation,
enrichment, and summarization. Once you've cleaned up your data, you may be satisfied to
simply push it into a more traditional data store, such as a relational database, and consider
your big data work done. On the other hand, you may want to continue working with
your data, running specialized machine-learning algorithms to categorize it or perhaps
performing some sort of geospatial analysis.
In this chapter, we're going to talk about two types of tools:
MapReduce interfaces
General-purpose tools that make it easier to process your data
Analytic libraries
Focused-purpose libraries that include functionality to make it easier to analyze your data
MapReduce Interfaces
In the early days of Hadoop, the only way to process the data in your system was to work
with MapReduce in Java, but this approach presented a couple of major problems:
▪ Your analytic writers need to understand not only your business and your data, but
also Java code.
▪ Pushing a Java archive to Hadoop is more time-consuming than simply authoring a query.
For example, the process of developing and testing a simple analytic written directly in
MapReduce might look something like the following for a developer (a sketch of such a
job appears after these steps):
1. Write about a hundred lines of Java MapReduce.
2. Compile the code into a JAR (Java archive) file.
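To make step 1 concrete, here is a minimal sketch of the kind of Java MapReduce a developer might write for the cleansing scenario described earlier: a map-only job that keeps well-formed records and drops corrupt ones. The class names, the comma-delimited input, and the five-field validity check are illustrative assumptions, not taken from any particular codebase; only the Hadoop classes (Job, Mapper, and the file input/output formats) are real APIs.

// A minimal, hypothetical record-cleansing job in classic Java MapReduce.
// The CSV format and five-field check are assumptions for illustration.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanseRecords {

  // Emits only well-formed lines; any line that does not parse as a
  // five-field CSV record is treated as corrupt and silently dropped.
  public static class CleanseMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 5) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cleanse records");
    job.setJarByClass(CleanseRecords.class);
    job.setMapperClass(CleanseMapper.class);
    job.setNumReduceTasks(0);  // map-only: no aggregation needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this trivial job cannot simply be typed into a console: it has to be compiled, packaged into a JAR, and submitted to the cluster (for example, with hadoop jar cleanse.jar CleanseRecords input output) before a single record is processed, which is exactly the friction the interfaces discussed in this chapter aim to remove.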