Troubleshooting Job Failures - Pro Microsoft HDInsight: Hadoop on Windows

The EXPLAIN operator's output is segmented into three sections:

• Logical Plan: The Logical Plan gives you the chain of operators used to build the relations, along with data type validation. Any filters (such as NULL checks) that were applied early on also show up here.

• Physical Plan: The Physical Plan shows how the logical operators are translated into physical operators, along with any memory-optimization techniques that might have been used.

• MapReduce Plan: The MapReduce Plan shows how the physical operators are grouped into the MapReduce jobs that would actually work on the cluster's data.
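As a minimal sketch of how these three plans can be produced, the following Pig Latin statements build a small grouped relation and then ask for its plans. The input file 'data' and its three-integer schema are assumptions for illustration only; substitute your own path and schema.

```pig
-- Hypothetical input file 'data' with three integer fields.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- Group on two of the fields so the plan contains more than a simple load.
B = GROUP A BY (f1, f2);

-- Print the Logical, Physical, and MapReduce plans for relation B.
EXPLAIN B;
```

EXPLAIN only prints the plans for the named relation, which makes it a low-cost first check before submitting a long-running script.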

Illustrate Command

The ILLUSTRATE command is one of the best ways to debug Pig scripts. The command attempts to provide a reader-friendly representation of the data. ILLUSTRATE works by taking a sample of the input data and running it through the Pig script. As it encounters operators that remove data (such as FILTER and JOIN), it makes sure that some records pass through the operator and some do not. When necessary, it manufactures records that look similar to those in the data set. For example, if you have a variable B, formed by grouping another variable A, the ILLUSTRATE command on variable B shows you the details of the underlying composite types. Type the following commands in the Pig shell to check this out:

A = LOAD 'data' AS (f1:int, f2:int, f3:int);

B = GROUP A BY (f1,f2);

ILLUSTRATE B;

This will give you output similar to what is shown here:

----------------------------------------------------------------------
| b |group: tuple({f1: int,f2: int})|a: bag({f1: int,f2: int,f3: int})|
----------------------------------------------------------------------
|   | (8, 3)                        | {(8, 3, 4), (8, 3, 4)}          |
----------------------------------------------------------------------

You can use the ILLUSTRATE command to examine the structure of relation (or variable) B. Relation B has two fields. The first field is named group and is of type tuple. The second field is named a, after relation A, and is of type bag.

■ Note: A variable is also called a relation in Pig Latin terms.
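To see how ILLUSTRATE treats an operator that removes data, as described above, a sketch along the following lines can be tried in the Pig shell. The alias C and the threshold value 3 are hypothetical choices made for this example, and 'data' is the same assumed input file used earlier.

```pig
-- Hypothetical input file 'data' with three integer fields.
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- FILTER removes records; the alias C and the threshold 3 are illustrative assumptions.
C = FILTER A BY f3 > 3;

-- ILLUSTRATE shows sample records for A and which of them make it into C.
ILLUSTRATE C;
```

If the sampled records would all pass or all fail the filter, ILLUSTRATE may manufacture similar-looking records so that both outcomes are represented.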

Sqoop Jobs

Sqoop is a bi-directional data-transfer tool between HDFS (again, WASB in the Azure HDInsight service) and relational databases. In an HDInsight context, Sqoop is primarily used to import and export data between SQL Azure databases and the cluster storage. When you run a Sqoop command, Sqoop in turn runs a MapReduce job on the Hadoop cluster (map only, with no reduce task). There is no separate log file specific to Sqoop, so you troubleshoot a Sqoop failure or performance issue in much the same way as a MapReduce failure or performance issue.
