This tells Pig to run the script and also output a file named
original_script_name.substituted that has the original script but with all the
parameters fully substituted. Executing pig with the -dryrun option outputs the
same file but doesn't execute the script.
The exec and run commands allow you to run Pig Latin scripts from within the
Grunt shell, and they support parameter substitution using the same -param and
-param_file arguments; for example:
grunt> exec -param input=excite-small.log -param size=4 Myscript.pig
However, parameter substitution in exec and run doesn't support Unix commands,
and there's no debug or dryrun option.
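As a rough illustration of how the two parameters in the exec command above might be used, here is a hypothetical sketch of Myscript.pig. The script body, the three-field schema, and the output path are assumptions made for this example, since the original script is not shown in this section.

```pig
-- Hypothetical contents of Myscript.pig (not the original script).
-- $input and $size are replaced by the values given with -param.
log   = LOAD '$input' AS (user:chararray, time:long, query:chararray);
grpd  = GROUP log BY user;
cntd  = FOREACH grpd GENERATE group AS user, COUNT(log) AS cnt;
fltrd = FILTER cntd BY cnt > $size;
-- 'output' is an assumed output path.
STORE fltrd INTO 'output';
```

With the exec command shown above, '$input' would become 'excite-small.log' and $size would become 4 before the script runs.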
10.7.3 Multiquery execution
In the Grunt shell, a DUMP or STORE operation processes all previous statements
needed for the result. On the other hand, Pig optimizes and processes an entire Pig
script as a whole. This difference has no effect if your script has only one DUMP
or STORE command at the end. If your script has multiple DUMP or STORE commands,
multiquery execution improves efficiency by avoiding redundant evaluations. For
example, let's say you have a script that stores intermediate data:
a = LOAD ...
b = some transformation of a
STORE b ...
c = some further transformation of b
STORE c ...
If you enter the statements in Grunt, where there's no multiquery execution, it will
generate a chain of jobs on the STORE b command to compute b. On encountering
STORE c, Grunt will run another chain of jobs to compute c, but this time it will
evaluate both a and b again! You can manually avoid this reevaluation by inserting a
b = LOAD ... statement right after STORE b, to force the computation of c to use the
saved value of b. This works on the assumption that the stored value of b has not been
modified, because Grunt, by itself, has no way of knowing.
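The manual workaround can be sketched as follows. This is a minimal illustration only: the input file, schema, transformations, and the paths 'b_out' and 'c_out' are all hypothetical stand-ins for the elided statements in the outline above.

```pig
-- Hypothetical input and schema, standing in for "a = LOAD ...".
a = LOAD 'excite-small.log' AS (user:chararray, time:long, query:chararray);
-- Some transformation of a.
b = FILTER a BY query IS NOT NULL;
STORE b INTO 'b_out';
-- Redefine b from the stored file so that c is computed from the saved data
-- rather than by reevaluating a and the filter.
b = LOAD 'b_out' AS (user:chararray, time:long, query:chararray);
-- Some further transformation of b.
grpd = GROUP b BY user;
c = FOREACH grpd GENERATE group AS user, COUNT(b) AS cnt;
STORE c INTO 'c_out';
```

The reloaded b is only as trustworthy as the file it reads, so this pattern assumes nothing changes 'b_out' between the STORE and the second LOAD.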
On the other hand, if you run all the statements as a script, multiquery execution
can optimize the execution by intelligently handling intermediate data. Pig compiles
all the statements together and can locate the dependency and redundancy. Multiquery
execution is enabled by default and usually has no effect on the computed results. But
multiquery execution can fail if there are data dependencies that Pig is not aware of.
This is quite rare but can happen with, for example, UDFs. Consider this script:
STORE a INTO 'out1';
b = LOAD ...
c = FOREACH b GENERATE MYUDF($0,'out1');
STORE c INTO 'out2';
If the custom function MYUDF accesses a through the file out1, the Pig compiler
has no way of knowing that. Not seeing the dependency, the Pig compiler may
erroneously decide that it's OK to evaluate b and c before evaluating a. To disable
multiquery execution, run the pig command with the -M or -no_multiquery option.
 