My next step is to load the rawdata.txt file (tPigLoad_1). Remember that even though the Hadoop cluster may be
fully configured via XML-based site configuration files, the Talend job carries its own configuration information, as
shown in Figure 10-32. Note also that the Map/Reduce icon has been selected here, telling Talend that this will be a
MapReduce job. The same CDH5 cluster information has been specified; however, this time the host- and port-based
addresses have been set for the Resource Manager, the Job History server, and the Resource Manager scheduler. The
port values were suggested by Talend as defaults, and they match the default values chosen by the CDH5 Cluster
Manager installer. Again, I use the Linux hadoop account for the connection.
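For reference, the same three addresses correspond to standard properties in Hadoop's XML-based site configuration files. The sketch below is illustrative only: the host name (hc1nn) is a placeholder, and the port numbers shown are the stock CDH5/YARN defaults that Talend also suggests.

```xml
<!-- yarn-site.xml: Resource Manager and its scheduler
     (host name hc1nn is a placeholder; ports are YARN defaults) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hc1nn:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>hc1nn:8030</value>
</property>

<!-- mapred-site.xml: Job History server -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hc1nn:10020</value>
</property>
```

Whether these values live in the cluster's site files or in the Talend component, they must agree; a mismatched or missing scheduler address is exactly the kind of problem described next.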
Figure 10-32. Loading the HDFS raw data file
At this step I encountered an error message. Initially, failing to set the Resource Manager scheduler address
caused the Resource Manager-based job to time out and fail (see the "Potential Errors" section for more detail).
When loading a data file, you must also specify a schema indicating which columns the incoming data contains,
what they should be called, and what data types they have. For this example, I click the Edit Schema button (shown at
the top of Figure 10-32) to open the Schema window, shown in Figure 10-33.
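Under the hood, a tPigLoad component translates into a Pig Latin LOAD statement, and the schema defined in the Schema window corresponds to its AS clause. A minimal sketch of the idea, in which the HDFS path, delimiter, and column names and types are all placeholders rather than the book's actual schema:

```
-- Illustrative only: path, delimiter, and schema are hypothetical
rawdata = LOAD '/user/hadoop/rawdata.txt'
          USING PigStorage(',')
          AS (id:int, name:chararray, amount:float);
```

Each column you add in the Schema window becomes one name:type pair in the AS clause, which is why getting the names and types right here matters for every downstream component in the job.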