If the wrong type is specified for a key field, an error message similar to the following is generated:
{"type":"TASK_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskFailed":{"taskid":"task_1
412385899407_0008_m_000000","taskType":"MAP","finishTime":1412403861583,"error":",
Error: java.io.IOException: org.pentaho.hadoop.mapreduce.converter.TypeConversionException: \n
Error converting to Long: 1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202\n
For input string: \"1995,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.2,7,28,40,1760,202\"\n\n
In this case, a string key field was incorrectly being treated as a numeric (Long) value.
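To make the cause easier to see, the sketch below expresses the same mismatch in plain Hadoop Java terms; the class, method, and field names are purely illustrative and are not part of the PDI job. The mapper emits the whole CSV record as a Text key, so declaring the map output key class as LongWritable instead of Text produces exactly this kind of conversion failure. In the PDI transformation, the place to check is the key and value types defined on the Map Reduce input and output steps.

// Illustrative Hadoop mapper: the key is the whole CSV record, i.e. a string.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class KeyTypeExample {

  public static class CsvMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // The CSV record is used as the key, so its declared type must be Text (a string).
      context.write(record, new IntWritable(1));
    }
  }

  // Hypothetical job setup: the declared key class must match what the mapper emits.
  public static void configure(Job job) {
    job.setMapperClass(CsvMapper.class);
    job.setMapOutputKeyClass(Text.class);            // correct: the key is a string
    // job.setMapOutputKeyClass(LongWritable.class); // wrong: causes "Error converting to Long"
    job.setMapOutputValueClass(IntWritable.class);
  }
}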
An error in the configuration of the PDI Map Reduce job can cause the following error message:
commons.vfs.FileNotFoundException: Could not read from
"file:///yarn/nm/usercache/mikejf12/appcache/application_1412471201309_0001/
container_1412471201309_0001_01_000013/job.jar"
/yarn/nm/usercache/mikejf12/appcache/application_1412471201309_0001/
container_1412471201309_0001_01_000001
because it is a not a file.
Although it looks like some kind of Hadoop configuration error, it is not. It was again caused by setting the wrong
data type on Map Reduce variable values. Just follow the example installation and configuration in this section and
you will be fine.
Finally, a lack of available memory on the Hadoop Resource Manager host Linux machine produces an error like
the following:
2014-10-07 18:08:57,674 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator:
Reduce slow start threshold not met. completedMapsForReduceSlowstart 1
To resolve a problem like this, try reducing the Resource Manager memory usage in the CDH Manager so that it does not exceed the memory available on the host.
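If you manage the cluster configuration files directly rather than through the CDH Manager screens, the YARN memory limits below are the usual ones to reduce. This yarn-site.xml fragment is only a sketch with illustrative values; choose figures that fit within the physical memory actually available on each host.

<!-- yarn-site.xml (illustrative values only; size to the memory on your hosts) -->
<property>
  <!-- Total memory YARN may hand out as containers on each node -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2048</value>
</property>
<property>
  <!-- Largest single container the scheduler will allocate -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <!-- Smallest container allocation; requests are rounded up to this -->
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>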
Now that you understand how to develop a Map Reduce job using Pentaho, let's see how to create a similar job using Talend Open Studio. The illustrative example uses the same Hadoop CDH5 cluster both as a data source and for processing.
Talend Open Studio
Talend offers a popular big data visual ETL tool called Open Studio. Like Pentaho, Talend gives you the ability to create Map Reduce jobs against existing Hadoop clusters in a logical, step-by-step manner, by pulling pre-defined modules from a palette and linking them into an ETL chain. I describe how to source, install, and use Open Studio, as well as how to create a Pig-based Map Reduce job. Along the way, I point out a few common errors and their solutions.
 