Summary
Analysts, whether they are managers, testers, or researchers, need to find meaning in their data. To do so, they need to be able to move, load, extract, and transform all or part of that data. Although the simple examples for Impala, Hive, and Spark in this chapter may not yield any revelations about your company's data, they do demonstrate the building blocks of analytics and provide an overview of the capabilities of these tools.
In this chapter I have shown that you can represent data on HDFS as a database table and use Hive QL or Impala
SQL to query and transform that data. If you combine the steps in this chapter with tools like Sqoop and Flume
(covered in Chapter 6), you can start to build ETL chains to source, move, and modify your data, step by step.
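As a reminder of that pattern, the sketch below maps files already stored on HDFS to an external Hive table and then derives a new table from it with a query. The table names, column names, and HDFS path are illustrative assumptions rather than values from the chapter's examples, and the same statements should work in Impala with little or no change.

-- Map existing HDFS files to a queryable table; names and path are assumed for illustration.
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  log_date   STRING,
  user_id    STRING,
  url        STRING,
  bytes_sent INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs';

-- A simple transformation step: summarize traffic per user into a new table.
CREATE TABLE user_traffic AS
SELECT user_id,
       COUNT(*)        AS requests,
       SUM(bytes_sent) AS total_bytes
FROM   web_logs
GROUP  BY user_id;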
If you find that you need real-time processing rather than batch processing, you might consider using Apache
Spark. Following the example installation in this chapter, you can start using Spark on your cluster. The Spark SQL example also shows how to use SQL to process data held in memory on your Spark cluster.
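As a hedged sketch of that idea, a query like the one below could be submitted to Spark SQL once the in-memory data has been registered as a table (for example, passed to sqlContext.sql() in Spark 1.x, or to spark.sql() or the spark-sql shell in later releases). The user_traffic table and its columns are the assumed names carried over from the Hive sketch above, not the chapter's own example.

-- Run against data registered as an in-memory table; table and column names are assumptions.
SELECT   user_id,
         total_bytes
FROM     user_traffic
WHERE    total_bytes > 1000000
ORDER BY total_bytes DESC
LIMIT    10;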
The next chapter covers the ETL tools Pentaho and Talend, which can be used to visually manipulate Hadoop- and Spark-based data. They integrate with MapReduce and with Hadoop tools like Pig, Sqoop, and Hive, and they can be used to create and schedule ETL chains built from a combination of the Hadoop tools that have been introduced so far.
 