2013-11-26 10:55:49 Processing rows: 330936 Hashtable size: 330936
Memory usage: 149795152 rate: 0.166
2013-11-26 10:55:49 Dump the hashtable into file: file:/tmp/msgbigdata/
hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/
HashTable-Stage-4/MapJoin-mapfile01-.hashtable
2013-11-26 10:55:56 Upload 1 File to: file:/tmp/msgbigdata/
hive_2013-11-26_22-55-34_959_3143934780177488621/-local-10002/
HashTable-Stage-4/MapJoin-mapfile01-.hashtable File size: 39685647
2013-11-26 10:55:56 End of local task; Time Taken: 13.203 sec.
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
Launching Job 2 out of 2
Hive is a common choice in the Hadoop world. SQL users can get started with Hive very quickly, because its
schema-based data structures feel familiar to them and their knowledge of SQL syntax translates directly into writing HiveQL queries.
Pig Jobs
Pig is a set-based, data-transformation tool that works on top of Hadoop and cluster storage. Pig offers a
command-line shell for user input called Grunt, and its scripts are written in a language called Pig Latin. Pig can be run on the
name-node host or on a client machine, and it submits jobs that read data from HDFS/WASB and process it using
the MapReduce framework. The biggest advantage, again, is that it frees the developer from writing complex MapReduce
programs.
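To give a feel for the language, the following is a minimal Pig Latin sketch of such a job; the WASB paths, the field names, and the tab-delimited schema are hypothetical and would need to match your own data.

-- Load a tab-delimited log file from cluster storage (path and schema are illustrative).
logs = LOAD 'wasb:///example/data/sample.log' USING PigStorage('\t')
       AS (level:chararray, source:chararray, message:chararray);

-- Keep only the error records.
errors = FILTER logs BY level == 'ERROR';

-- Count the errors per source and write the result back to HDFS/WASB.
grouped = GROUP errors BY source;
counts = FOREACH grouped GENERATE group AS source, COUNT(errors) AS total;
STORE counts INTO 'wasb:///example/output/error-counts' USING PigStorage('\t');

These statements can be typed interactively at the Grunt prompt or saved in a .pig file and submitted as a batch job with the pig command.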
Configuration File
The configuration file for Pig is pig.properties, and it is found in the C:\apps\dist\pig-0.11.0.1.3.1.0-06\conf\
directory on the HDInsight name node. It contains several key parameters that control job submission and
execution. Listing 13-15 highlights a few of them.
Listing 13-15. pig.properties file
#Verbose print all log messages to screen (default to print only INFO and above to screen)
verbose=true
#Exectype local|mapreduce, mapreduce is default
exectype=mapreduce
#The following two parameters are to help estimate the reducer number
pig.exec.reducers.bytes.per.reducer=1000000000
pig.exec.reducers.max=999
#Performance tuning properties
pig.cachedbag.memusage=0.2
pig.skewedjoin.reduce.memusage=0.3
pig.exec.nocombiner=false
opt.multiquery=true
pig.tmpfilecompression=false
These properties let you control how Pig estimates the number of reducers for a job, along with several other
performance-tuning options dealing with internal dataset joins, temporary-file compression, and memory usage.
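When a script does not specify parallelism explicitly, Pig uses the two reducer properties above to estimate the reducer count: roughly the total input size divided by pig.exec.reducers.bytes.per.reducer, capped at pig.exec.reducers.max. You usually do not need to edit pig.properties cluster-wide to experiment with these values; Pig properties can typically be overridden per script with the set command, either at the Grunt prompt or at the top of a Pig Latin script. The values in the following sketch are illustrative only, not tuning recommendations.

-- Override a few pig.properties settings for this script only (illustrative values).
set pig.exec.reducers.bytes.per.reducer 500000000;
set pig.exec.reducers.max 50;
set pig.tmpfilecompression true;
set pig.cachedbag.memusage 0.25;

Keeping such overrides in the script leaves the cluster-wide defaults in pig.properties untouched, which is usually the safer choice on a shared HDInsight cluster.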
 