Note
Currently, HDInsight supports Gzip and BZ2 codecs.
Configure the Reducer Task Size
In the majority of MapReduce job-execution scenarios, once the map tasks finish, most of the nodes go idle while only
a few nodes keep working to complete the reduce tasks. To make the reduce phase finish faster, you can increase the number of
reducers to match the number of nodes or the total number of processor cores. The following SET command
configures the number of reducers launched from a Hive job:
set mapred.reduce.tasks=<number>;
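As a rough sketch, suppose a cluster has 4 worker nodes with 8 cores each. These figures are hypothetical and do not come from the text above. Matching the reducer count to the total core count might then look like the following in a Hive session. The weblogs table and its client_ip column are assumed names used only for illustration.

```sql
-- Assumed cluster: 4 worker nodes x 8 cores = 32 cores (hypothetical figures).
set mapred.reduce.tasks=32;

-- Hypothetical table and column. The setting matters for queries that
-- have a reduce phase, such as this GROUP BY aggregation.
SELECT client_ip, COUNT(*) AS hits
FROM weblogs
GROUP BY client_ip;
```

Setting the value back to -1, which is the default, lets Hive estimate the reducer count on its own again.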
Implement Map Joins
Map joins in Hive are particularly useful when a single, huge table needs to be joined with a very small table. With a
map join, the small table is loaded into memory through the distributed cache, which avoids a good deal of
disk I/O. The SET commands in Listing 13-13 enable Hive to perform map joins and cache the small table in memory.
Listing 13-13. Hive SET options
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=40000000;
Another important configuration is the hive.mapjoin.smalltable.filesize setting. By default, it is 25
MB, and if the smaller table exceeds this size, the joins you expected to run as map joins revert to common joins. In the
preceding snippet, I have overridden the default setting and set it to 40 MB.
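To see these settings in context, the following sketch joins a large table to a small lookup table. The table names sales and stores and their columns are hypothetical. Only the two SET options come from Listing 13-13. Prefixing the query with EXPLAIN should show a map join operator in the plan when the conversion applies, though the exact plan text varies by Hive version.

```sql
-- Settings from Listing 13-13; the file size threshold is given in bytes.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=40000000;

-- Hypothetical tables: sales is large, and stores is assumed to be
-- small enough to fall under the threshold set above.
EXPLAIN
SELECT s.sale_id, s.amount, st.region
FROM sales s
JOIN stores st ON (s.store_id = st.store_id);
```

Running the same SELECT without EXPLAIN executes the join. If the stores table is larger than the threshold, Hive would be expected to plan a common join instead.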
There are no reducers in map joins, because such a join can be completed during the map phase with a lot less
data movement.
Note
You can confirm that map joins are happening if you see the following:
With a map join, there are no reducers because the join happens at the map level.
The command-line output reports that a map join is being done because it is pushing the
smaller table up into memory.
Right at the end, there is a callout that it is converting the join into a MapJoin.
The command-line output or the Hive logs will have snippets indicating that a map join has happened, as you
can see in Listing 13-14.
Listing 13-14. hive.log file
2013-11-26 10:55:41 Starting to launch local task to process map join;
maximum memory = 932118528
2013-11-26 10:55:45 Processing rows: 200000 Hashtable size: 199999
Memory usage: 145227488 rate: 0.158
2013-11-26 10:55:47 Processing rows: 300000 Hashtable size: 299999
Memory usage: 183032536 rate: 0.188
 
 