My next step is to load the rawdata.txt file (tPigLoad_1). Remember that even though the Hadoop cluster may be
fully configured via XML-based site configuration files, the Talend job carries its own configuration information, as
shown in Figure 10-32. Note also that the Map/Reduce icon has been selected here, telling Talend that this will be a
MapReduce job. The same CDH5 cluster information has been specified; however, this time the host- and port-based
addresses have been set for the Resource Manager, the Job History server, and the Resource Manager scheduler. The
port values were suggested by Talend as defaults, and they match the default values chosen by the CDH5 Cluster
Manager installer. Again, I use the Linux hadoop account for the connection.
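For reference, the same three addresses correspond to standard properties in Hadoop's XML-based site configuration files. The sketch below is illustrative only: the host name (hc1nn) is a placeholder, and the port numbers shown are the stock CDH5/YARN defaults that Talend also suggests.

```xml
<!-- yarn-site.xml: Resource Manager and its scheduler
     (host name hc1nn is a placeholder; ports are YARN defaults) -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>hc1nn:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>hc1nn:8030</value>
</property>

<!-- mapred-site.xml: Job History server -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hc1nn:10020</value>
</property>
```

Whether these values live in the cluster's site files or in the Talend component, they must agree; a mismatched or missing scheduler address is exactly the kind of problem described next.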
Figure 10-32. Loading the HDFS raw data file
At this step I encountered an error message. Initially, failing to set the Resource Manager scheduler address
caused the Resource Manager-based job to time out and fail (see the "Potential Errors" section for more detail).
When loading a data file, you must also specify a schema indicating which columns the incoming data contains,
what they should be called, and what data types they have. For this example, I click the Edit Schema button (shown at
the top of Figure 10-32) to open the Schema window, shown in Figure 10-33.
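Under the hood, a tPigLoad component translates into a Pig Latin LOAD statement, and the schema defined in the Schema window corresponds to its AS clause. A minimal sketch of the idea, in which the HDFS path, delimiter, and column names and types are all placeholders rather than the book's actual schema:

```
-- Illustrative only: path, delimiter, and schema are hypothetical
rawdata = LOAD '/user/hadoop/rawdata.txt'
          USING PigStorage(',')
          AS (id:int, name:chararray, amount:float);
```

Each column you add in the Schema window becomes one name:type pair in the AS clause, which is why getting the names and types right here matters for every downstream component in the job.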