2013-11-26 10:55:49 Processing rows: 330936 Hashtable size: 330936
Memory usage: 149795152 rate: 0.166
2013-11-26 10:55:49 Dump the hashtable into file: file:/tmp/msgbigdata/
hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/
HashTable-Stage-4/MapJoin-mapfile01-.hashtable
2013-11-26 10:55:56 Upload 1 File to: file:/tmp/msgbigdata/
hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/
HashTable-Stage-4/MapJoin-mapfile01-.hashtable File size: 39685647
2013-11-26 10:55:56 End of local task; Time Taken: 13.203 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
Hive is a common choice in the Hadoop world. SQL users can get started with Hive very quickly, because its
schema-based data structures feel familiar to them and their knowledge of SQL syntax translates directly into writing HiveQL queries.
Pig Jobs
Pig is a set-based, data-transformation tool that works on top of Hadoop and cluster storage. Pig offers a
command-line shell for user input called Grunt, and its scripts are written in a language called Pig Latin. Pig can be run on the
name-node host or on a client machine, and it submits jobs that read data from HDFS/WASB and process it using
the MapReduce framework. The biggest advantage, again, is that it frees the developer from writing complex MapReduce
programs.
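To give a feel for the language, the following is a minimal Pig Latin sketch of such a job; the WASB paths, the field names, and the tab-delimited schema are hypothetical and would need to match your own data.

-- Load a tab-delimited log file from cluster storage (path and schema are illustrative).
logs = LOAD 'wasb:///example/data/sample.log' USING PigStorage('\t')
       AS (level:chararray, source:chararray, message:chararray);

-- Keep only the error records.
errors = FILTER logs BY level == 'ERROR';

-- Count the errors per source and write the result back to HDFS/WASB.
grouped = GROUP errors BY source;
counts = FOREACH grouped GENERATE group AS source, COUNT(errors) AS total;
STORE counts INTO 'wasb:///example/output/error-counts' USING PigStorage('\t');

These statements can be typed interactively at the Grunt prompt or saved in a .pig file and submitted as a batch job with the pig command.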
Configuration File
The configuration file for Pig is pig.properties, and it is found in the C:\apps\dist\pig-0.11.0.1.3.1.0-06\conf\
directory on the HDInsight name node. It contains several key parameters that control job submission and
execution. Listing 13-15 highlights a few of them.
Listing 13-15. pig.properties file
#Verbose print all log messages to screen (default to print only INFO and above to screen)
verbose=true
#Exectype local|mapreduce, mapreduce is default
exectype=mapreduce
#The following two parameters are to help estimate the reducer number
pig.exec.reducers.bytes.per.reducer=1000000000
pig.exec.reducers.max=999
#Performance tuning properties
pig.cachedbag.memusage=0.2
pig.skewedjoin.reduce.memusage=0.3
pig.exec.nocombiner=false
opt.multiquery=true
pig.tmpfilecompression=false
These properties let you control how Pig estimates the number of reducers for a job, along with several other
performance-tuning options dealing with internal dataset joins, temporary-file compression, and memory usage.
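When a script does not specify parallelism explicitly, Pig uses the two reducer properties above to estimate the reducer count: roughly the total input size divided by pig.exec.reducers.bytes.per.reducer, capped at pig.exec.reducers.max. You usually do not need to edit pig.properties cluster-wide to experiment with these values; Pig properties can typically be overridden per script with the set command, either at the Grunt prompt or at the top of a Pig Latin script. The values in the following sketch are illustrative only, not tuning recommendations.

-- Override a few pig.properties settings for this script only (illustrative values).
set pig.exec.reducers.bytes.per.reducer 500000000;
set pig.exec.reducers.max 50;
set pig.tmpfilecompression true;
set pig.cachedbag.memusage 0.25;

Keeping such overrides in the script leaves the cluster-wide defaults in pig.properties untouched, which is usually the safer choice on a shared HDInsight cluster.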
 