Concatenate Input Files
Concatenation is another technique that can improve your MapReduce job performance. A MapReduce program handles a few large files more efficiently than many small ones, so you can concatenate many small files into a few larger ones. This needs to be done in the program code where you implement your own MapReduce job. By concatenating multiple small files so that each resulting file approaches one block in size, you make the data more efficient in terms of storage and data movement.
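As a minimal sketch, you could merge the small files with the standard org.apache.hadoop.fs.FileSystem API before submitting the job; the class name and paths below are hypothetical, not part of the book's code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ConcatenateSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: many small files under /input/small are
        // combined into one larger file, /input/combined/part-0.
        Path smallDir = new Path("/input/small");
        Path combined = new Path("/input/combined/part-0");

        FSDataOutputStream out = fs.create(combined, true);
        try {
            for (FileStatus status : fs.listStatus(smallDir)) {
                if (status.isDir()) {
                    continue; // skip subdirectories
                }
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    // Append the small file's bytes to the combined file.
                    IOUtils.copyBytes(in, out, conf, false);
                } finally {
                    in.close();
                }
            }
        } finally {
            out.close();
        }
    }
}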
Avoid Spilling
All data in a Hadoop MapReduce job is handled as key-value pairs, and all input received by the user-defined method that constitutes the reduce task is guaranteed to be sorted by key. This sorting happens in two stages. The first stage is local to each mapper: as the mapper reads input data from one or more splits, it sorts the output it produces for the map phase. The second stage happens after a reducer has collected all of its data from one or more mappers, when that data is merged to produce the output of the shuffle phase.
Spilling during the map phase occurs when the mapper's output cannot be held entirely in memory until the final sort is performed. As each mapper reads input data from one or more splits, it needs an in-memory buffer to hold the unsorted key-value pairs it produces. If the Hadoop job configuration is not optimized for the type and size of the input data, the buffer can fill up before the mapper has finished reading its data. In that case, the mapper sorts the data already in the filled buffer, partitions it, serializes it, and writes (spills) it to disk. The result is referred to as a spill file.
A separate spill file is created each time a mapper has to spill data. Once all the data has been read and spilled, the mapper reads all the spill files again, merges and sorts the data, and writes it back out into a single file known as an attempt file.
If there is more than one spill, the entire map output must be read and written one extra time, resulting in three times (3x) the minimum I/O during the map phase, a phenomenon known as data I/O explosion. The goal is to spill only once (1x) during the map phase, which can be achieved only if you carefully select the correct configuration for your Hadoop MapReduce job.
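A quick way to tell whether a job spilled more than once is to compare the Spilled Records counter with the Map output records counter after the job finishes. The following helper is only a sketch: it assumes the Hadoop 2.x org.apache.hadoop.mapreduce.TaskCounter enum, and the class and method names are hypothetical.

import java.io.IOException;

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class SpillCheck {
    // Hypothetical helper: compares spilled records against map output records
    // for a completed job. Spilled > map output is a rough sign of extra spills.
    public static void reportSpills(Job job) throws IOException {
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long mapOutput = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();

        System.out.println("Map output records: " + mapOutput);
        System.out.println("Spilled records:    " + spilled);
        if (spilled > mapOutput) {
            System.out.println("Records were spilled more than once; "
                    + "consider increasing io.sort.mb.");
        }
    }
}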
For each data record, the memory buffer holds three parts. The first part is the offset tuple for the record, which requires 12 bytes per record and contains the partition number, the key offset, and the value offset. The second part is the indirect sort index, which requires 4 bytes per record. Together, these two parts constitute the metadata for a record, for a total of 16 bytes per record. The third part is the record itself, the serialized key-value pair, which requires R bytes, where R is the size of the serialized record in bytes.
If each mapper handles N records, the recommended value for the io.sort.mb parameter in mapred-site.xml is expressed as follows:
<property>
  <name>io.sort.mb</name><value>N*(16+R)/(1024*1024)</value>
</property>
By specifying your configuration in this way, you reduce the chance of unwanted spill operations.
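For example, using hypothetical numbers: if each mapper handles roughly N = 1,000,000 records and the average serialized key-value pair is R = 84 bytes, the buffer needs about 1,000,000 * (16 + 84) = 100,000,000 bytes, or roughly 95 MB, so you would set io.sort.mb to at least 96:

<property>
  <name>io.sort.mb</name><value>96</value>
</property>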
Hive Jobs
The best place to start looking into a Hive command failure is the Hive log file, which can be configured by editing the hive-site.xml file. The hive-site.xml file is located in the C:\apps\dist\hive-0.11.0.1.3.0.1-0302\conf\ directory. Listing 13-7 is a sample snippet that shows how you can specify the Hive log file path.
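As a minimal sketch of what such a property can look like, the hive.querylog.location setting controls where Hive writes its per-session query logs; the path shown here is hypothetical:

<property>
  <name>hive.querylog.location</name><value>C:\apps\dist\hive-0.11.0.1.3.0.1-0302\logs</value>
</property>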
 