Concatenate Input Files
Concatenation is another technique that can improve your MapReduce job performance. A MapReduce program handles a few large files more efficiently than many small ones, so you can concatenate many small files into a few larger ones. This needs to be done in the program code where you implement your own MapReduce job. By concatenating multiple small files so that each resulting file approaches one block in size, you make the data more efficient in terms of storage and data movement.
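As a minimal sketch, you could merge the small files with the standard org.apache.hadoop.fs.FileSystem API before submitting the job; the class name and paths below are hypothetical, not part of the book's code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatenateSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: many small files under /input/small are
        // combined into one larger file, /input/combined/part-0.
        Path smallDir = new Path("/input/small");
        Path combined = new Path("/input/combined/part-0");

        FSDataOutputStream out = fs.create(combined, true);
        try {
            for (FileStatus status : fs.listStatus(smallDir)) {
                if (status.isDir()) {
                    continue; // skip subdirectories
                }
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    // Append the small file's bytes to the combined file.
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}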
Avoid Spilling
All data in a Hadoop MapReduce job is handled as key-value pairs, and all input received by the user-defined method that constitutes the reduce task is guaranteed to be sorted by key. This sorting happens in two stages. The first stage is local to each mapper: as the mapper reads input data from one or more splits, it sorts the output it produces for the map phase. The second stage happens after a reducer has collected all of its data from one or more mappers, when that data is merged to produce the output of the shuffle phase.
Spilling during the map phase occurs when the mapper's output cannot be held entirely in memory until the final sort is performed. As each mapper reads input data from one or more splits, it needs an in-memory buffer to hold the unsorted key-value pairs it produces. If the Hadoop job configuration is not optimized for the type and size of the input data, the buffer can fill up before the mapper has finished reading its data. In that case, the mapper sorts the data already in the filled buffer, partitions it, serializes it, and writes (spills) it to disk. The result is referred to as a spill file.
A separate spill file is created each time a mapper has to spill data. Once all the data has been read and spilled, the mapper reads all the spill files again, merges and sorts the data, and writes it back out into a single file known as an attempt file.
If there is more than one spill, the entire map output must be read and written one extra time, resulting in three times (3x) the minimum I/O during the map phase, a phenomenon known as data I/O explosion. The goal is to spill only once (1x) during the map phase, which can be achieved only if you carefully select the correct configuration for your Hadoop MapReduce job.
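A quick way to tell whether a job spilled more than once is to compare the Spilled Records counter with the Map output records counter after the job finishes. The following helper is only a sketch: it assumes the Hadoop 2.x org.apache.hadoop.mapreduce.TaskCounter enum, and the class and method names are hypothetical.

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class SpillCheck {
    // Hypothetical helper: compares spilled records against map output records
    // for a completed job. Spilled > map output is a rough sign of extra spills.
    public static void reportSpills(Job job) throws IOException {
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long mapOutput = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();

        System.out.println("Map output records: " + mapOutput);
        System.out.println("Spilled records:    " + spilled);
        if (spilled > mapOutput) {
            System.out.println("Records were spilled more than once; "
                    + "consider increasing io.sort.mb.");
        }
    }
}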
For each data record, the memory buffer holds three parts. The first part is the offset tuple for the record, which requires 12 bytes per record and contains the partition number, the key offset, and the value offset. The second part is the indirect sort index, which requires 4 bytes per record. Together, these two parts constitute the metadata for a record, for a total of 16 bytes per record. The third part is the record itself, the serialized key-value pair, which requires R bytes, where R is the size of the serialized record in bytes.
If each mapper handles N records, the recommended value for the io.sort.mb parameter in mapred-site.xml is expressed as follows:
<property>
  <name>io.sort.mb</name><value>N*(16+R)/(1024*1024)</value>
</property>
By specifying your configuration in this way, you reduce the chance of unwanted spill operations.
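For example, using hypothetical numbers: if each mapper handles roughly N = 1,000,000 records and the average serialized key-value pair is R = 84 bytes, the buffer needs about 1,000,000 * (16 + 84) = 100,000,000 bytes, or roughly 95 MB, so you would set io.sort.mb to at least 96:

<property>
  <name>io.sort.mb</name><value>96</value>
</property>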
Hive Jobs
The best place to start looking into a Hive command failure is the Hive log file, which can be configured by editing the hive-site.xml file. The hive-site.xml file is located in the C:\apps\dist\hive-0.11.0.1.3.0.1-0302\conf\ directory. Listing 13-7 is a sample snippet that shows how you can specify the Hive log file path.
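As a minimal sketch of what such a property can look like, the hive.querylog.location setting controls where Hive writes its per-session query logs; the path shown here is hypothetical:

<property>
  <name>hive.querylog.location</name><value>C:\apps\dist\hive-0.11.0.1.3.0.1-0302\logs</value>
</property>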
 