BigQuery Integration
Now that you have seen how to MapReduce over files that live in GCS, you
can integrate with BigQuery by coordinating the MapReduce job with a pair
of BigQuery export and import jobs. You need to run a BigQuery export job
to materialize the contents of a table as a set of GCS files. When that job
completes, you can run the MapReduce job in AppEngine to produce output
files in GCS. Finally, you need to run a BigQuery import job to populate a
table with the contents of the output files.
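For concreteness, here is a minimal sketch of the two BigQuery jobs that bracket the MapReduce step, written against the BigQuery API v2 with the Python API client. The project, dataset, table, and GCS paths are placeholders rather than values taken from the sample application.

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'your-project-id'  # placeholder, not the sample app's project

credentials = GoogleCredentials.get_application_default()
bigquery = discovery.build('bigquery', 'v2', credentials=credentials)

def start_export(dataset_id, table_id, gcs_pattern):
    # Extract job: materialize the table as newline-delimited JSON files in GCS.
    body = {'configuration': {'extract': {
        'sourceTable': {'projectId': PROJECT_ID,
                        'datasetId': dataset_id,
                        'tableId': table_id},
        'destinationUris': [gcs_pattern],  # e.g. 'gs://my-bucket/input-*.json'
        'destinationFormat': 'NEWLINE_DELIMITED_JSON'}}}
    return bigquery.jobs().insert(projectId=PROJECT_ID, body=body).execute()

def start_import(gcs_pattern, dataset_id, table_id):
    # Load job: populate a table from the MapReduce output files in GCS.
    body = {'configuration': {'load': {
        'sourceUris': [gcs_pattern],  # e.g. 'gs://my-bucket/output-*.json'
        'sourceFormat': 'NEWLINE_DELIMITED_JSON',
        'destinationTable': {'projectId': PROJECT_ID,
                             'datasetId': dataset_id,
                             'tableId': table_id},
        'writeDisposition': 'WRITE_TRUNCATE'}}}
    return bigquery.jobs().insert(projectId=PROJECT_ID, body=body).execute()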
All the BigQuery-related plumbing should be familiar. The first section of
this chapter covered the details of extracting data from BigQuery tables, and
Chapter 6 covered loading data into tables. The main challenge is to run all
this in the AppEngine environment and coordinate it with a MapReduce job.
Ordinarily, you'd run into an AppEngine limitation: all requests must finish
within 60 seconds. Because BigQuery Load and Extract jobs may take longer
than 60 seconds, you would need to implement a complex timer-and-callback
mechanism to divide the longer-lived BigQuery jobs into smaller chunks.
However, to simplify the code, you can take advantage of AppEngine's
support for long-lived instances; you just need to write a function that
sequentially performs each step. Because long-lived instances are allowed
to spin up background threads and have no restrictions on the time spent
on an individual request, your background thread can simply poll for the
completion of each step.
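As an illustration of that polling approach, the helper below waits for a single BigQuery job to finish; a background thread could call it after starting each step. This is a sketch rather than the sample application's code, and it assumes a bigquery client object like the one built above.

import time

def wait_for_job(bigquery, project_id, job_id, poll_interval_secs=10):
    # Poll jobs.get until the job reaches the DONE state, then surface any error.
    while True:
        job = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()
        if job['status']['state'] == 'DONE':
            if 'errorResult' in job['status']:
                raise RuntimeError('BigQuery job failed: %s'
                                   % job['status']['errorResult'])
            return job
        time.sleep(poll_interval_secs)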
Configuring long-lived instances is covered in the AppEngine
documentation. The setup_appengine.py script creates a
controller.yaml file that defines a suitable AppEngine module. The
significant portions of this file are:
application: bigquery-mr-sample
module: controller
version: 1
runtime: python27
api_version: 1
threadsafe: yes
instance_class: B4
basic_scaling:
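To show how the pieces fit together, here is a hypothetical handler for a module like the one configured above: because the module uses basic scaling, a request handler can hand the whole pipeline to a background thread, which then runs each step sequentially and polls it to completion. The run_pipeline function and the /start route are illustrative names, not part of the sample application.

import webapp2
from google.appengine.api import background_thread

def run_pipeline():
    # Placeholder for the sequential steps described above:
    # 1. start the BigQuery export job and wait for it to finish
    # 2. run the MapReduce over the exported GCS files and wait for it
    # 3. start the BigQuery load job for the output files and wait for it
    pass

class StartPipelineHandler(webapp2.RequestHandler):
    def get(self):
        # Background threads are only available on manual or basic scaling modules.
        background_thread.start_new_background_thread(run_pipeline, [])
        self.response.write('pipeline started')

app = webapp2.WSGIApplication([('/start', StartPipelineHandler)])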