Scheduling: Chapter 5
The big data requirement for scheduling covers two needs: sharing resources and determining when tasks will run. For sharing Hadoop-based resources, Chapter 5 introduces Hadoop's Capacity and Fair schedulers. It also introduces Apache Oozie, showing how simple ETL tasks can be created from Hadoop components such as Apache Sqoop and Apache Pig, and finally demonstrates how Oozie tasks can be scheduled.
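For a flavor of how a workflow is submitted and tracked outside the Oozie web console, the following is a minimal sketch using Oozie's Java client API; the server URL, HDFS paths, and property values are hypothetical placeholders, not examples taken from the chapter.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the Oozie server (placeholder host and port).
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        // Point the job at a workflow definition already stored in HDFS (placeholder paths).
        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/sqoop-pig");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8032");

        // Submit and start the workflow, then poll until it leaves the RUNNING state.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10000);
        }
        System.out.println("Workflow " + jobId + " finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}

In practice, the workflow.xml at the application path would chain the Sqoop and Pig actions together, and an Oozie coordinator would run the workflow on a recurring schedule.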
Data Movement: Chapter 6
Big data systems require tools that can move a variety of data types safely and without data loss. Chapter 6 introduces the Apache Sqoop tool for moving data into and out of relational databases, provides an example of how Apache Flume can be used to process log-based data, and introduces Apache Storm for data stream processing.
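To give a sense of what a Sqoop import involves, here is a minimal, hypothetical Java sketch that drives Sqoop 1's tool runner programmatically (assuming Sqoop 1.4.x and its dependencies on the classpath); the JDBC URL, credentials, table, and HDFS paths are placeholders.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/sales",  // placeholder source database
            "--username", "etl_user",
            "--password-file", "/user/etl/.sqoop.pw",       // keeps the password off the command line
            "--table", "orders",                            // placeholder table to import
            "--target-dir", "/data/raw/orders",             // HDFS destination directory
            "--num-mappers", "4"                            // degree of parallel import
        };
        // runTool parses the arguments exactly as the sqoop shell command would
        // and returns the tool's exit code (0 on success).
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}

The same arguments work unchanged with the sqoop command-line client, which is how the tool is more commonly invoked.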
Monitoring: Chapter 7
The need for system monitoring tools in a big data system is discussed in Chapter 7. The chapter introduces the Hue tool as a single location from which to access a wide range of Apache Hadoop functionality. It also demonstrates the Ganglia and Nagios resource-monitoring and alerting tools.
Cluster Management: Chapter 8
Cluster managers are introduced in Chapter 8 by using the Apache Ambari tool to install Hortonworks HDP 2.1 and Cloudera Manager to install Cloudera CDH5. A brief overview is then given of their functionality.
Analysis: Chapter 9
Big data analysis requires the ability to monitor data trends in real time. To that end, Chapter 9 introduces Apache Spark, a real-time, in-memory distributed processing system, and shows Spark SQL in use by way of an example. It also includes a practical demonstration of the features of the Apache Hive and Cloudera Impala query languages.
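To give a feel for Spark SQL, here is a minimal sketch assuming a current Spark release with the SparkSession API (newer than the Spark version the book covers); the input path and query are illustrative placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        // Local session for demonstration; on a cluster, the master is set by the launcher.
        SparkSession spark = SparkSession.builder()
            .appName("SparkSqlSketch")
            .master("local[*]")
            .getOrCreate();

        // Load JSON event records (placeholder path) and infer their schema.
        Dataset<Row> events = spark.read().json("hdfs:///data/events.json");

        // Expose the data as a temporary view so it can be queried with SQL.
        events.createOrReplaceTempView("events");

        Dataset<Row> byStatus = spark.sql(
            "SELECT status, COUNT(*) AS hits FROM events GROUP BY status ORDER BY hits DESC");
        byStatus.show();

        spark.stop();
    }
}

Because the result is an ordinary Dataset, the same query could equally continue through Spark's functional API rather than SQL.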
ETL: Chapter 10
Although ETL was briefly introduced in Chapter 5, Chapter 10 discusses the need for graphical tools for building and managing ETL chains: tools with a visual interface that can be used to build data-processing tasks and monitor their progress. It therefore introduces the Pentaho and Talend graphical ETL tools for big data, investigates their visual, object-based approach to ETL task creation, and shows that these tools offer an easier path into the world of MapReduce development.
Reports: Chapter 11
Big data systems need reporting tools. In Chapter 11, some reporting tools are discussed and a typical dashboard is built using the Splunk/Hunk tool. The data-quality capabilities of Talend are also investigated by using its profiling function.