This issue was caused by my not installing all of the third-party libraries when prompted to do so. The solution is simple: click Finish, accept the licensing, and then wait patiently for the libraries to install.
Two errors were caused by my setting up the Hive connection incorrectly. Specifically, I received the following error:
Failed to run analysis: rawtrans_analysys
Error message:
Error while processing statement: Failed: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
and the following error appeared in the Hive log file /var/log/hive/hadoop-cmf-hive-HIVEMETASTORE-hc2nn.semtech-solutions.co.nz.log.out:
assuming we are not on mysql: ERROR: syntax error at or near "@@"
The port number should have been set to 10000, the HiveServer2 port. I had used the value 9083, which is the port defined in the property hive.metastore.uris in the file hive-site.xml under the directory /etc/hive/conf.cloudera.hive.
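A quick way to see where the 9083 value comes from is to pull the metastore URI out of hive-site.xml. The sketch below is illustrative only: it recreates a minimal hive-site.xml fragment (the host name and file location are assumptions, not copies of the real cluster file) and extracts the port, which belongs to the metastore, not to HiveServer2.

```shell
# Recreate a minimal hive-site.xml fragment for illustration
# (real file would be under /etc/hive/conf.cloudera.hive/).
cat > /tmp/hive-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hc2nn:9083</value>
  </property>
</configuration>
EOF

# Extract the port after the last colon in the thrift URI.
port=$(grep -A1 'hive.metastore.uris' /tmp/hive-site-sample.xml \
       | grep '<value>' | sed 's/.*://; s#</value>.*##')
echo "metastore port: $port"   # this is the metastore port; HiveServer2 listens on 10000 by default
```

The point of the check: a thrift URI ending in 9083 identifies the metastore service, so a Talend Hive connection should not reuse that number for its HiveServer2 address.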
There was also the following error regarding an RPM component:
There was an error creating the RPM file:
Could not find valid RPM application:
RPM-building tools are not available on the system
The error occurred because an RPM build component was missing from the CentOS Linux host on which Talend was installed. The solution was to install the missing component using the yum install command.
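A diagnostic along the following lines confirms whether the build tool is present. The assumption here (not stated in the original error) is that the missing piece is the rpmbuild binary, which on CentOS is provided by the rpm-build package.

```shell
# Check whether the rpmbuild tool is on the PATH; if not, suggest the fix.
# Package name rpm-build is an assumption for CentOS; confirm with: yum search rpm-build
if command -v rpmbuild >/dev/null 2>&1; then
    echo "rpmbuild found: $(command -v rpmbuild)"
else
    echo "rpmbuild not found - run: sudo yum install rpm-build"
fi
```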
Finally, this short error occurred while I was installing the Talend client software; it implied that the Talend install file called “dist” was corrupted:
Unable to execute validation program
I don't know how it happened, but I solved the problem by removing the Talend software release directory and
extracting the tar archive a second time.
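The remove-and-re-extract fix can be sketched as follows. The file and directory names here are hypothetical placeholders, and the tar -t listing step is an addition of mine: it verifies that the archive itself is readable before the old extraction is deleted.

```shell
TARBALL="talend-install.tar.gz"   # assumed name; substitute the actual download
RELDIR="talend-release"           # assumed name of the extracted release directory

# -t lists the archive without extracting; a corrupt download fails here.
if tar -tzf "$TARBALL" >/dev/null 2>&1; then
    rm -rf "$RELDIR"              # remove the possibly corrupted extraction
    tar -xzf "$TARBALL"           # extract a fresh copy
    echo "re-extracted $TARBALL"
else
    echo "archive unreadable - download $TARBALL again"
fi
```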
Summary
Relational database systems encounter data-quality problems, and they use data-quality rules to solve those
problems. Hadoop Hive has the potential to hold an extremely large amount of data—a great deal larger than
traditional relational database systems and at a lower unit cost. As the data volume rises, however, so does the
potential for encountering data-quality issues.
Tools like Talend, and the reports it can produce, offer the ability to connect to Hive and, via external tables, to HDFS-based data. Talend can run user-defined data-quality checks against that Hive data. The examples presented here offer only a small taste of the functionality that is available. Likewise, Splunk/Hunk can generate reports and create dashboards to monitor data. After working through the Splunk/Hunk and Talend application examples provided in this chapter, you might consider investigating the Tableau and Pentaho applications for big data as well.
You now have the tools to begin creating your own Hadoop-based systems. As you go forward, remember to
check the Apache and tool supplier websites. Consult their forums and ask questions if you encounter problems. As
you find your own solutions, post them as well, so as to help other members of the Hadoop community.