Chapter 11
Reporting with Hadoop
Because the potential storage capacity of a Hadoop cluster is so large, you need some means to track both the
data held on the cluster and the data feeds moving data into and out of it. In addition, you need to consider the
locations where data might reside on the cluster—that is, in HDFS, Hive, HBase, or Impala. Knowing you should track
your data only spawns more questions, however: What type of reporting might be required and in what format? Is a
dashboard needed to post the status of data at any given moment? Are graphs or tables helpful to show the state of a
data source for a given time period, such as the days in a week?
Building on the ETL work carried out in Chapter 10, this chapter will help you sort out the answers to those
questions by demonstrating how to build a range of simple reports using HDFS- and Hive-based data. Although you
may end up using completely different tools, reporting methods, and data content to construct the reports for your
real-world data, the building blocks presented here will provide insight into the tasks on Hadoop that apply to many
other scenarios as well.
This chapter will give you a basic overview of Hunk (the Hadoop version of Splunk) and Talend from a
report-generation point of view. It will show you how to source the software, install it, use it, and create
reports. Some basic errors and their solutions will be presented, along with some simple dashboards to monitor the
data. The chapter begins with an introduction to Hunk.
Note: Reports show the state of given data sources in a variety of forms (tables, pie charts, etc.) and might also
aggregate data to show totals or use colors to represent data from different sources. Dashboards provide a single-page
view or overview of a system's status and might also contain charts with key indicators to show the overall state of
its data.
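As a rough illustration of the kind of aggregation a report performs, the following Python sketch sums row counts per data source and day to produce a small table. The feed log, its field layout, and the function name rows_per_source_per_day are hypothetical and invented for this example; they do not come from Hunk, Talend, or the cluster described in this chapter.

```python
from collections import Counter
from datetime import date

# Hypothetical feed log: (source name, load date, rows loaded).
# These records are made up purely for illustration.
feed_log = [
    ("hdfs", date(2014, 6, 2), 1200),
    ("hive", date(2014, 6, 2), 800),
    ("hdfs", date(2014, 6, 3), 950),
    ("hive", date(2014, 6, 3), 400),
    ("hdfs", date(2014, 6, 3), 50),
]


def rows_per_source_per_day(records):
    """Sum the row counts for each (source, ISO date) pair."""
    totals = Counter()
    for source, day, rows in records:
        totals[(source, day.isoformat())] += rows
    return totals


totals = rows_per_source_per_day(feed_log)

# Print a simple table: one line per source and day.
for (source, day), rows in sorted(totals.items()):
    print(f"{source:<6}{day:<12}{rows:>8}")
```

A reporting tool does the same grouping and totaling at much larger scale, then renders the result as a table or chart rather than printed text.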
Hunk
Hunk is the Hadoop version of Splunk (www.splunk.com), and it can be used to create reports and dashboards to
examine the state of the data on a Hadoop cluster. The tool offers search, reporting, alerts, and dashboards from
a web-based user interface. Let's look at the installation and uses of Hunk, as well as some simple reports and
dashboards.
Installing Hunk
By way of example, I install Hunk onto the CentOS 6 Linux host hc2nn and connect it to the Cloudera CDH5 Hadoop
cluster on the same node. Before downloading the Splunk software, though, I must first create an account and register
my details. I source Hunk from www.splunk.com/goto/downloadhunk.
 
 