Scheduling: Chapter 5
The big data requirement for scheduling covers two needs: sharing resources and determining when tasks will run. For sharing Hadoop-based resources, Chapter 5 introduces Hadoop's Capacity and Fair schedulers. It also introduces Apache Oozie, showing how simple ETL tasks can be created from Hadoop components such as Apache Sqoop and Apache Pig, and finally demonstrates how Oozie tasks can be scheduled.
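For a flavor of how a workflow is submitted and tracked outside the Oozie web console, the following is a minimal sketch using Oozie's Java client API; the server URL, HDFS paths, and property values are hypothetical placeholders, not examples taken from the chapter.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (placeholder host and port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Point the job at a workflow definition already stored in HDFS (placeholder paths).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/sqoop-pig");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8032");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10000);
        }
        System.out.println("Workflow " + jobId + " finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}

In practice, the workflow.xml at the application path would chain the Sqoop and Pig actions together, and an Oozie coordinator would run the workflow on a recurring schedule.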
Data Movement: Chapter 6
Big data systems require tools that can move a variety of data types safely and without data loss. Chapter 6 introduces the Apache Sqoop tool for moving data into and out of relational databases, provides an example of how Apache Flume can be used to process log-based data, and introduces Apache Storm for data stream processing.
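To give a sense of what a Sqoop import involves, here is a minimal, hypothetical Java sketch that drives Sqoop 1's tool runner programmatically (assuming Sqoop 1.4.x and its dependencies on the classpath); the JDBC URL, credentials, table, and HDFS paths are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",  // placeholder source database
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop.pw",       // keeps the password off the command line
            "--table", "orders",                            // placeholder table to import
            "--target-dir", "/data/raw/orders",             // HDFS destination directory
            "--num-mappers", "4"                            // degree of parallel import
        };
        // runTool parses the arguments exactly as the sqoop shell command would
        // and returns the tool's exit code (0 on success).
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}

The same arguments work unchanged with the sqoop command-line client, which is how the tool is more commonly invoked.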
Monitoring: Chapter 7
The need for system monitoring tools in a big data system is discussed in Chapter 7. The chapter introduces the Hue tool as a single location from which to access a wide range of Apache Hadoop functionality. It also demonstrates the Ganglia and Nagios resource-monitoring and alerting tools.
Cluster Management: Chapter 8
Cluster managers are introduced in Chapter 8 by using the Apache Ambari tool to install Hortonworks HDP 2.1 and Cloudera Manager to install Cloudera CDH5. A brief overview is then given of their functionality.
Analysis: Chapter 9
Big data analysis requires the ability to monitor data trends in real time. To that end, Chapter 9 introduces Apache Spark, a real-time, in-memory distributed processing system, and shows Spark SQL in use by way of an example. It also includes a practical demonstration of the features of the Apache Hive and Cloudera Impala query languages.
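To give a feel for Spark SQL, here is a minimal sketch assuming a current Spark release with the SparkSession API (newer than the Spark version the book covers); the input path and query are illustrative placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        // Local session for demonstration; on a cluster, the master is set by the launcher.
        SparkSession spark = SparkSession.builder()
            .appName("SparkSqlSketch")
            .master("local[*]")
            .getOrCreate();

        // Load JSON event records (placeholder path) and infer their schema.
        Dataset<Row> events = spark.read().json("hdfs:///data/events.json");

        // Expose the data as a temporary view so it can be queried with SQL.
        events.createOrReplaceTempView("events");

        Dataset<Row> byStatus = spark.sql(
            "SELECT status, COUNT(*) AS hits FROM events GROUP BY status ORDER BY hits DESC");
        byStatus.show();

        spark.stop();
    }
}

Because the result is an ordinary Dataset, the same query could equally continue through Spark's functional API rather than SQL.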
ETL: Chapter 10
Although ETL was briefly introduced in Chapter 5, Chapter 10 discusses the need for graphical tools for building and managing ETL chains: tools with a visual interface that can be used to build data-processing tasks and monitor their progress. It therefore introduces the Pentaho and Talend graphical ETL tools for big data, investigates their visual, object-based approach to ETL task creation, and shows that these tools offer an easier path into the world of MapReduce development.
Reports: Chapter 11
Big data systems need reporting tools. In Chapter 11, some reporting tools are discussed and a typical dashboard is built using the Splunk/Hunk tool. The data-quality capabilities of Talend are also investigated by using its profiling function.