The MainHandler class that responds to requests to the /mapreduce URL
simply spins up a background thread when it receives a POST request (which
is sent when you click the Run button). The background thread runs a
BigQuery job to extract the table to Google Cloud Storage and then runs an
AppEngine mapper pipeline to add the zip code and save the output back
in Google Cloud Storage. Finally, the background thread runs a Load job to
import the data back into BigQuery.
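As a rough illustration, a handler along these lines captures the flow just described. This is a minimal sketch, not the book's exact code: it assumes a webapp2 application running where spawning threads is permitted, and the three run_* helpers are hypothetical placeholders for the BigQuery and mapper operations discussed in the text.

```python
import threading
import webapp2

def run_extract_job():
    """Placeholder: start a BigQuery Extract job, return the GCS output path."""
    return 'gs://mybucket/extracted/data-*.json'

def run_mapper_pipeline(gcs_input):
    """Placeholder: run the AppEngine mapper that adds the zip code."""
    return 'gs://mybucket/transformed/data-*.json'

def run_load_job(gcs_output):
    """Placeholder: start a BigQuery Load job from the transformed files."""

def run_sequence():
    gcs_input = run_extract_job()
    gcs_output = run_mapper_pipeline(gcs_input)
    run_load_job(gcs_output)

class MainHandler(webapp2.RequestHandler):
    def post(self):
        # Return immediately so the request does not time out; the extract,
        # map, and load steps continue on the background thread.
        threading.Thread(target=run_sequence).start()
        self.response.write('MapReduce started')

app = webapp2.WSGIApplication([('/mapreduce', MainHandler)])
```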
There is a lot of boilerplate code that handles AppEngine MapReduce state
management. We have included a function that waits for MapReduce
pipeline completion so that it is clear how to poll for its status. The
in-code definition of the MapReduce pipeline, the middle operation, closely
mirrors the configuration in the controller.yaml file discussed previously.
We chain the steps together by specifying the outputs of each step as the
inputs to the next.
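A completion poller could look something like the following minimal sketch. It assumes the appengine-pipeline library is importable as pipeline and that Pipeline.from_id(), has_finalized, and was_aborted behave as in that library's documented API; the polling interval is an arbitrary choice.

```python
import logging
import time

import pipeline

def wait_for_pipeline(pipeline_id, poll_interval_secs=10):
    """Block until the pipeline with the given id has finalized."""
    while True:
        stage = pipeline.Pipeline.from_id(pipeline_id)
        if stage.has_finalized:
            if stage.was_aborted:
                logging.error('Pipeline %s was aborted', pipeline_id)
            return stage
        time.sleep(poll_interval_secs)
```

Sleeping in a loop is acceptable here only because the polling happens on the background thread, not inside a request handler.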
In this case, a simple sequential pipeline approach works well and makes
it easy to see what is happening. If the individual steps have more complex
dependencies or could run in parallel, then you could use the AppEngine
pipeline framework to orchestrate the steps. More information on the
pipeline framework is available at
https://code.google.com/p/appengine-pipeline/.
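To give a flavor of that alternative, here is a sketch of the three steps expressed as pipeline framework stages. The child pipeline classes are hypothetical stand-ins for the extract, map, and load operations; yielding one child's output into a later child is how the framework learns the dependency between them, so independent stages could run in parallel.

```python
import pipeline

class ExtractPipeline(pipeline.Pipeline):
    def run(self, table_name):
        # Placeholder: start the BigQuery Extract job; the return value
        # becomes this stage's output.
        return 'gs://mybucket/extracted/%s-*.json' % table_name

class ZipCodeMapperPipeline(pipeline.Pipeline):
    def run(self, gcs_input):
        # Placeholder: run the mapper that adds the zip code.
        return gcs_input.replace('extracted', 'transformed')

class LoadPipeline(pipeline.Pipeline):
    def run(self, gcs_output, table_name):
        # Placeholder: start the BigQuery Load job.
        pass

class TransformTablePipeline(pipeline.Pipeline):
    def run(self, table_name):
        # Passing one stage's future output as another's input chains them.
        extracted = yield ExtractPipeline(table_name)
        transformed = yield ZipCodeMapperPipeline(extracted)
        yield LoadPipeline(transformed, table_name)
```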
In addition, we have not covered cleaning up the files generated on GCS.
One option is to add code to the function to perform the deletion explicitly.
Alternatively, you can use the automatic lifecycle management available in
GCS to clean up the files after some duration. Documentation for this
feature is available at
https://developers.google.com/storage/docs/lifecycle.
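For the explicit-deletion option, a sketch along these lines would work with the AppEngine GCS client library (imported below as cloudstorage); its listbucket() and delete() calls are real, while the '/my-bucket/tmp/' prefix is a hypothetical location for the intermediate files.

```python
import cloudstorage as gcs

def cleanup_intermediate_files(prefix='/my-bucket/tmp/'):
    # listbucket() yields a GCSFileStat entry for every object under the
    # prefix; delete each one.
    for entry in gcs.listbucket(prefix):
        gcs.delete(entry.filename)
```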