I solved this problem on YARN by changing the value of the parameter yarn.app.mapreduce.am.resource.mb in
the file yarn-site.xml, under the directory /etc/hadoop/conf. After making the change, I needed to restart the cluster
for the new value to take effect.
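For reference, the entry in yarn-site.xml takes the standard Hadoop property form shown below. The value of 1024 MB
is only an illustration, not the figure I used; size the Application Master memory to suit your own cluster:
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1024</value>
</property>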
The next error occurred when I ran Talend from a Windows 7 host and connected to a CentOS 6
Linux-based CDH5 cluster:
Application application_1413095146783_0004 failed 2 times due to AM Container for
appattempt_1413095146783_0004_000002 exited with exitCode: 1 due to: Exception from
container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
/bin/bash: line 0: fg: no job control
org.apache.hadoop.util.Shell$ExitCodeException: /bin/bash: line 0: fg: no job control
This was not a problem with Talend itself; a known fix appears in Hortonworks HDP 2. I assume that it will soon
be fixed in other cluster stacks like CDH, but at the time of this writing, I used the Talend application only on Linux.
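If you must submit jobs from a Windows client in the meantime, one commonly cited workaround (not one I tested for
this chapter) is to enable cross-platform submission in the client-side mapred-site.xml. This property exists only in
Hadoop releases that include the MAPREDUCE-4052 fix, so check whether your distribution supports it:
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>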
Finally, the following error occurred because the configuration settings in the Talend tPigLoad step were incomplete.
2014-10-14 17:56:13,123 INFO [main] org.apache.hadoop.yarn.client.RMProxy: Connecting to
ResourceManager at /0.0.0.0:8030
2014-10-14 17:56:14,241 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8030. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-14 17:56:15,242 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server:
0.0.0.0/0.0.0.0:8030. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep
(maxRetries=10, sleepTime=1000 MILLISECONDS)
Because the Resource Manager scheduler address was not set in the tPigLoad step, YARN defaulted to the address
0.0.0.0:8030, and so the job hung until it timed out.
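To avoid this, set the scheduler address explicitly. As a sketch, the equivalent client-side property in yarn-site.xml
would look like the following, where the host name rmserver is a placeholder for your own Resource Manager host. In
Talend, you supply the same address in the tPigLoad component's Hadoop configuration; the exact field name varies
by Talend version:
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>rmserver:8030</value>
</property>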
Summary
You can use visual, drag-and-drop, Map Reduce-enabled ETL tools, such as Pentaho Data Integrator and Talend Open
Studio, for big data processing. These tools let you build ETL chains for big data by logically connecting the
functional elements that they provide. This chapter covered only a fraction of the functionality that they offer.
Both include an abundance of Map Reduce components that you can combine to create more permutations of
functionality than I could possibly examine in these pages.
I created the examples in this chapter using a combination of a Hadoop cluster, which I built using Cloudera's
CDH5 cluster manager, and the visual, big data-enabled ETL tools Pentaho and Talend. I think that the errors I
encountered are either configuration-based or will be solved by later cluster stack releases. Remember to check the
company websites for application updates and the supplier forums for problem solutions. If you don't see a solution
to your ETL problem, don't be afraid to ask questions; also, consider simplifying your algorithms as a way to zero in on
the cause of a problem.
Just as I believe that cluster managers reduce problems and ongoing costs when creating and managing Hadoop
clusters, so I think tools like Pentaho and Talend will save you money. They provide a quick entry point to the world of
Hadoop-based Map Reduce. I am not suggesting that they can replace low-level Map Reduce programming, because
I'm sure that eventually you will encounter complex problems that require you to delve into API-level code. Rather, these
tools provide a good starting point and an easier path into the complex domain of Map Reduce.
 