The output of the EXPLAIN operator is divided into three sections:
Logical Plan: Shows the chain of operators used to build the relations, along with data type validation. Any filters (such as NULL checks) that were applied early on also appear here.
Physical Plan: Shows how the logical operators are translated into physical operators, including any memory-optimization techniques that were used.
MapReduce Plan: Shows how the physical operators are grouped into the MapReduce jobs that actually run against the cluster's data.
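As a minimal sketch of how these three sections can be produced, the following Pig Latin statements run EXPLAIN on a grouped relation. The input path 'data' and the field names are placeholders chosen for illustration, not values from a real cluster.

```pig
-- Load a placeholder data set with three integer fields (hypothetical path and schema).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
-- Group on a composite key so that the plan contains more than a simple load.
B = GROUP A BY (f1, f2);
-- Print the logical, physical, and MapReduce plans used to compute B.
EXPLAIN B;
```

EXPLAIN only prints the plans; it does not run the job against the data.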
Illustrate Command
The ILLUSTRATE command is one of the best ways to debug Pig scripts. It attempts to provide a reader-friendly
representation of the data. ILLUSTRATE works by taking a sample of the input data and running it through the Pig script.
When it encounters operators that remove data (such as FILTER and JOIN), it makes sure that some
records pass through the operator and some do not. When necessary, it manufactures records that look similar to
those in the data set. For example, if you have a variable B, formed by grouping another variable A, running ILLUSTRATE
on B shows you the details of the underlying composite types. Type the following commands in the Pig
shell to check this out:
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
B = GROUP A BY (f1,f2);
ILLUSTRATE B;
This will give you output similar to what is shown here:
-------------------------------------------------------------------------------
| b     | group: tuple({f1: int,f2: int}) | a: bag({f1: int,f2: int,f3: int}) |
-------------------------------------------------------------------------------
|       | (8, 3)                          | {(8, 3, 4), (8, 3, 4)}            |
-------------------------------------------------------------------------------
You can use the ILLUSTRATE command to examine the structure of relation (or variable) B. Relation B has two fields.
The first field is named group and is of type tuple. The second field is named a, after relation A, and is of type bag.
Note
In Pig Latin terms, a variable is also called a relation.
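The way ILLUSTRATE handles operators that remove data can be seen with a small filter. This is only a sketch: it reuses the placeholder input 'data', and the relation name C and the threshold 5 are hypothetical values picked for illustration.

```pig
-- Load the same placeholder data set (hypothetical path and schema).
A = LOAD 'data' AS (f1:int, f2:int, f3:int);
-- Keep only the records whose first field is greater than 5 (arbitrary threshold).
C = FILTER A BY f1 > 5;
-- Show sample records for A and C so you can see which rows pass the filter and which do not.
ILLUSTRATE C;
```

If the sampled input happens to contain no record on one side of the condition, ILLUSTRATE may manufacture one so that both cases are represented.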
Sqoop Jobs
Sqoop is a bidirectional data-transfer tool between HDFS (or, again, WASB in the Azure HDInsight service) and relational
databases. In an HDInsight context, Sqoop is primarily used to import and export data between SQL Azure
databases and the cluster storage. When you run a Sqoop command, Sqoop in turn runs a MapReduce job on the
Hadoop cluster (map only, with no reduce task). There is no separate log file specific to Sqoop, so you need to
troubleshoot a Sqoop failure or performance issue in much the same way as a MapReduce failure or
performance issue.
 
 