Summary
Analysts, whether they are managers, testers, or researchers, need to find meaning in their data. To do so, they need to be able to move, load, extract, and transform all or part of that data. Although the simple examples for Impala, Hive, and Spark in this chapter may not yield any revelations about your company's data, they do demonstrate the building blocks of analytics and provide an overview of the capabilities of these tools.
In this chapter I have shown that you can represent data on HDFS as a database table and use Hive QL or Impala
SQL to query and transform that data. If you combine the steps in this chapter with tools like Sqoop and Flume
(covered in Chapter 6), you can start to build ETL chains to source, move, and modify your data, step by step.
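As a reminder of that pattern, the sketch below maps files already stored on HDFS to an external Hive table and then derives a new table from it with a query. The table names, column names, and HDFS path are illustrative assumptions rather than values from the chapter's examples, and the same statements should work in Impala with little or no change.

-- Map existing HDFS files to a queryable table; names and path are assumed for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  log_date   STRING,
  user_id    STRING,
  url        STRING,
  bytes_sent INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs';

-- A simple transformation step: summarize traffic per user into a new table.
CREATE TABLE user_traffic AS
SELECT user_id,
       COUNT(*)        AS requests,
       SUM(bytes_sent) AS total_bytes
FROM   web_logs
GROUP  BY user_id;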
If you find that you need real-time processing rather than batch processing, you might consider using Apache
Spark. Following the example installation in this chapter, you can start using Spark on your cluster. The Spark SQL example also shows how to use SQL to process data held in memory on your Spark cluster.
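As a hedged sketch of that idea, a query like the one below could be submitted to Spark SQL once the in-memory data has been registered as a table (for example, passed to sqlContext.sql() in Spark 1.x, or to spark.sql() or the spark-sql shell in later releases). The user_traffic table and its columns are the assumed names carried over from the Hive sketch above, not the chapter's own example.

-- Run against data registered as an in-memory table; table and column names are assumptions.
SELECT   user_id,
         total_bytes
FROM     user_traffic
WHERE    total_bytes > 1000000
ORDER BY total_bytes DESC
LIMIT    10;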
The next chapter covers the ETL tools Pentaho and Talend, which can be used to visually manipulate Hadoop- and Spark-based data. They integrate with MapReduce and with Hadoop tools like Pig, Sqoop, and Hive, and they can be used to create and schedule ETL chains built from a combination of the Hadoop tools that have been introduced so far.
 