Troubleshooting Job Failures - Pro Microsoft HDInsight: Hadoop on Windows

■ Note Currently, HDInsight supports Gzip and BZ2 codecs.

Configure the Reducer Task Size

In the majority of MapReduce job-execution scenarios, after the map tasks finish, most of the nodes go idle while only a few nodes work to complete the reduce tasks. To make the reduce phase finish faster, you can increase the number of reducers to match the number of nodes or the total number of processor cores. The following SET command configures the number of reducers launched from a Hive job:

set mapred.reduce.tasks=<number>
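As a rough illustration of the setting above, the following sketch assumes a hypothetical cluster with 4 worker nodes of 4 cores each, so the reducer count is set to 16. The table name weblogs and its page column are also hypothetical, and the right value for a real cluster depends on its size and workload.

```sql
-- Assumption: hypothetical cluster with 4 worker nodes x 4 cores = 16 cores.
set mapred.reduce.tasks=16;

-- Hypothetical table and column; the GROUP BY forces a reduce phase,
-- which now runs with the reducer count configured above.
SELECT page, COUNT(*) AS hits
FROM weblogs
GROUP BY page;
```

The setting applies only to the current Hive session, so it can be tuned per job without changing the cluster-wide configuration.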

Implement Map Joins

Map joins in Hive are particularly useful when a single, huge table needs to be joined with a very small table. With a map join, the small table is loaded into memory through the distributed cache, which avoids a good deal of disk I/O. The SET commands in Listing 13-13 enable Hive to perform map joins and cache the small table in memory.

Listing 13-13. Hive SET options

set hive.auto.convert.join=true;

set hive.mapjoin.smalltable.filesize=40000000;

Another important configuration is the hive.mapjoin.smalltable.filesize setting, which is specified in bytes. By default, it is 25 MB (25000000), and if the smaller table exceeds this size, the joins you expected to run as map joins revert to common joins. In the preceding snippet, I have overridden the default setting and set it to 40 MB (40000000).
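To see how these settings apply to a query, here is a minimal sketch. The tables sales (large) and stores (small, assumed to be under the 40 MB threshold) and their columns are hypothetical names used only for illustration. Prefixing the query with EXPLAIN prints the plan instead of running it; if Hive converts the join, the plan should contain a Map Join Operator.

```sql
-- Settings from Listing 13-13.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=40000000;

-- Hypothetical tables: sales is the large table, stores is the small
-- lookup table assumed to fit under the 40 MB threshold.
EXPLAIN
SELECT s.store_id, st.store_name, s.amount
FROM sales s
JOIN stores st ON s.store_id = st.store_id;
```

Removing the EXPLAIN keyword runs the query itself, at which point the command-line output and hive.log can be checked for the map join messages.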

■ Note There are no reducers in map joins, because such a join can be completed during the map phase with a lot less data movement.

You can confirm that map joins are happening if you see the following:

• With a map join, there are no reducers because the join happens at the map level.

• From the command line, it'll report that a map join is being done because it is pushing a smaller table up to memory.

• And right at the end, there is a call out that it's converting the join into MapJoin.

The command-line output or the Hive logs will have snippets indicating that a map join has happened, as you can see in Listing 13-14.

Listing 13-14. hive.log file

2013-11-26 10:55:41 Starting to launch local task to process map join; maximum memory = 932118528

2013-11-26 10:55:45 Processing rows: 200000 Hashtable size: 199999 Memory usage: 145227488 rate: 0.158

2013-11-26 10:55:47 Processing rows: 300000 Hashtable size: 299999 Memory usage: 183032536 rate: 0.188
