The MainHandler class that responds to requests to the /mapreduce URL
simply spins up a background thread when it receives a POST request (which
is sent when you click the Run button). The background thread runs a
BigQuery job to extract the table to Google Cloud Storage and then runs an
AppEngine mapper pipeline to add the zip code and save the output back
in Google Cloud Storage. Finally, the background thread runs a Load job to
import the data back into BigQuery.
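As a rough illustration, a handler along these lines captures the flow just described. This is a minimal sketch, not the book's exact code: it assumes a webapp2 application running where spawning threads is permitted, and the three run_* helpers are hypothetical placeholders for the BigQuery and mapper operations discussed in the text.

```python
import threading
import webapp2

def run_extract_job():
    """Placeholder: start a BigQuery Extract job, return the GCS output path."""
    return 'gs://mybucket/extracted/data-*.json'

def run_mapper_pipeline(gcs_input):
    """Placeholder: run the AppEngine mapper that adds the zip code."""
    return 'gs://mybucket/transformed/data-*.json'

def run_load_job(gcs_output):
    """Placeholder: start a BigQuery Load job from the transformed files."""

def run_sequence():
    gcs_input = run_extract_job()
    gcs_output = run_mapper_pipeline(gcs_input)
    run_load_job(gcs_output)

class MainHandler(webapp2.RequestHandler):
    def post(self):
        # Return immediately so the request does not time out; the extract,
        # map, and load steps continue on the background thread.
        threading.Thread(target=run_sequence).start()
        self.response.write('MapReduce started')

app = webapp2.WSGIApplication([('/mapreduce', MainHandler)])
```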
There is a lot of boilerplate code that handles AppEngine MapReduce state
management. We have included a function that waits for MapReduce
pipeline completion so that it is clear how to poll for its status. The
in-code definition of the MapReduce pipeline, the middle operation, closely
mirrors the configuration in the controller.yaml file discussed previously.
We chain the steps together by specifying the outputs of each step as the
inputs to the next.
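A completion poller could look something like the following minimal sketch. It assumes the appengine-pipeline library is importable as pipeline and that Pipeline.from_id(), has_finalized, and was_aborted behave as in that library's documented API; the polling interval is an arbitrary choice.

```python
import logging
import time

import pipeline

def wait_for_pipeline(pipeline_id, poll_interval_secs=10):
    """Block until the pipeline with the given id has finalized."""
    while True:
        stage = pipeline.Pipeline.from_id(pipeline_id)
        if stage.has_finalized:
            if stage.was_aborted:
                logging.error('Pipeline %s was aborted', pipeline_id)
            return stage
        time.sleep(poll_interval_secs)
```

Sleeping in a loop is acceptable here only because the polling happens on the background thread, not inside a request handler.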
In this case, a simple sequential pipeline approach works well and makes
it easy to see what is happening. If the individual steps have more complex
dependencies or could run in parallel, then you could use the AppEngine
pipeline framework to orchestrate the steps. More information on the
pipeline framework is available at
https://code.google.com/p/appengine-pipeline/.
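To give a flavor of that alternative, here is a sketch of the three steps expressed as pipeline framework stages. The child pipeline classes are hypothetical stand-ins for the extract, map, and load operations; yielding one child's output into a later child is how the framework learns the dependency between them, so independent stages could run in parallel.

```python
import pipeline

class ExtractPipeline(pipeline.Pipeline):
    def run(self, table_name):
        # Placeholder: start the BigQuery Extract job; the return value
        # becomes this stage's output.
        return 'gs://mybucket/extracted/%s-*.json' % table_name

class ZipCodeMapperPipeline(pipeline.Pipeline):
    def run(self, gcs_input):
        # Placeholder: run the mapper that adds the zip code.
        return gcs_input.replace('extracted', 'transformed')

class LoadPipeline(pipeline.Pipeline):
    def run(self, gcs_output, table_name):
        # Placeholder: start the BigQuery Load job.
        pass

class TransformTablePipeline(pipeline.Pipeline):
    def run(self, table_name):
        # Passing one stage's future output as another's input chains them.
        extracted = yield ExtractPipeline(table_name)
        transformed = yield ZipCodeMapperPipeline(extracted)
        yield LoadPipeline(transformed, table_name)
```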
In addition, we have not covered cleaning up the files generated on GCS.
One option is to add code to the function to perform the deletion explicitly.
Alternatively, you can use the automatic lifecycle management available in
GCS to clean up the files after some duration. Documentation for this
feature is available at
https://developers.google.com/storage/docs/lifecycle.
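For the explicit-deletion option, a sketch along these lines would work with the AppEngine GCS client library (imported below as cloudstorage); its listbucket() and delete() calls are real, while the '/my-bucket/tmp/' prefix is a hypothetical location for the intermediate files.

```python
import cloudstorage as gcs

def cleanup_intermediate_files(prefix='/my-bucket/tmp/'):
    # listbucket() yields a GCSFileStat entry for every object under the
    # prefix; delete each one.
    for entry in gcs.listbucket(prefix):
        gcs.delete(entry.filename)
```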