13/12/10 01:37:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/12/10 01:37:50 INFO mapred.JobClient: File Output Format Counters
13/12/10 01:37:50 INFO mapred.JobClient: Bytes Written=0
13/12/10 01:37:50 INFO mapred.JobClient: FileSystemCounters
13/12/10 01:37:50 INFO mapred.JobClient: WASB_BYTES_READ=3027416
13/12/10 01:37:50 INFO mapred.JobClient: FILE_BYTES_READ=3696
13/12/10 01:37:50 INFO mapred.JobClient: HDFS_BYTES_READ=792
13/12/10 01:37:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=296608
13/12/10 01:37:50 INFO mapred.JobClient: File Input Format Counters
13/12/10 01:37:50 INFO mapred.JobClient: Bytes Read=0
13/12/10 01:37:50 INFO mapred.JobClient: Map-Reduce Framework
13/12/10 01:37:50 INFO mapred.JobClient: Map input records=36153
13/12/10 01:37:50 INFO mapred.JobClient: Physical memory (bytes) snapshot=779915264
13/12/10 01:37:50 INFO mapred.JobClient: Spilled Records=0
13/12/10 01:37:50 INFO mapred.JobClient: CPU time spent (ms)=17259
13/12/10 01:37:50 INFO mapred.JobClient: Total committed heap usage (bytes)=2058092544
13/12/10 01:37:50 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2608484352
13/12/10 01:37:50 INFO mapred.JobClient: Map output records=36153
13/12/10 01:37:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=792
13/12/10 01:37:50 INFO mapreduce.ExportJobBase: Transferred 792 bytes in 53.6492 seconds (14.7626 bytes/sec)
13/12/10 01:37:50 INFO mapreduce.ExportJobBase: Exported 36153 records.
As you can see, Sqoop is a handy import/export tool for your cluster's data, making it easy to move data to and from a SQL Azure database. Sqoop lets you bring structured and unstructured data together in one place so that you can run analytics across the combined data set. For a complete reference of all the available Sqoop commands, visit the Apache documentation site at https://cwiki.apache.org/confluence/display/SQOOP/Home.
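As a quick illustration, a minimal export command takes the following shape. The server, database, credentials, table, and export directory shown here are hypothetical placeholders, not values from this chapter; substitute your own before running the command:

sqoop export --connect "jdbc:sqlserver://<yourserver>.database.windows.net:1433;database=<yourdatabase>" --username <youruser>@<yourserver> --password <yourpassword> --table <yourtable> --export-dir <path-to-export>

The --export-dir argument points at the HDFS (or WASB) directory holding the data to push out, and --table names the destination table in the SQL Azure database.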
The Pig Console
Pig is a set-based data transformation tool that works on top of the Hadoop stack, letting you filter, aggregate, and reshape data sets. Pig is most analogous to the Data Flow task in SQL Server Integration Services (SSIS), as discussed in Chapter 10.
Unlike SSIS, Pig does not have a control-flow system. Pig is written in Java and compiles its statements into Java .jar files that run as MapReduce jobs across the nodes of the Hadoop cluster, manipulating the data in a distributed way. Pig exposes a command-line shell called Grunt for executing Pig statements. To launch the Grunt shell, navigate to the c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin directory from the Hadoop Command Line and execute the pig command. That should launch the Grunt shell, as shown in Listing 6-12.
Listing 6-12. Launching the Pig Grunt shell
c:\apps\dist\pig-0.11.0.1.3.1.0-06\bin>pig
2013-12-10 01:48:10,150 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0.1.3.1.0-06
(r: unknown) compiled Oct 02 2013, 21:58:30
2013-12-10 01:48:10,151 [main] INFO org.apache.pig.Main - Logging error messages to:
C:\apps\dist\hadoop-1.2.0.1.3.1.0-06\logs\pig_1386640090147.log
2013-12-10 01:48:10,194 [main] INFO org.apache.pig.impl.util.Utils
- Default bootup file D:\Users\hadoopuser/.pigbootup not found
2013-12-10 01:48:10,513 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine
- Connecting to hadoop file system at: wasb://democlustercontainer@democluster.blob.core.windows.net
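Once the grunt> prompt appears, you can enter Pig Latin statements interactively. The following sketch shows the typical load/transform/dump pattern; the file path and schema here are hypothetical examples, not data from this chapter:

grunt> logs = LOAD 'wasb:///example/data/sample.log' USING PigStorage(' ') AS (level:chararray, message:chararray);
grunt> errors = FILTER logs BY level == 'ERROR';
grunt> counts = FOREACH (GROUP errors BY level) GENERATE group, COUNT(errors);
grunt> DUMP counts;

Note that Pig evaluates lazily: the LOAD, FILTER, and FOREACH statements only define the data flow, and no MapReduce job is submitted to the cluster until an output statement such as DUMP or STORE is executed.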