BigQuery Integration
Now that you have seen how to MapReduce over files that live in GCS, you
can integrate with BigQuery by coordinating the MapReduce job with a pair
of BigQuery export and import jobs. You need to run a BigQuery export job
to materialize the contents of a table as a set of GCS files. When that job
completes, you can run the MapReduce job in AppEngine to produce output
files in GCS. Finally, you need to run a BigQuery import job to populate a
table with the contents of the output files.
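For concreteness, here is a minimal sketch of the two BigQuery jobs that bracket the MapReduce step, written against the BigQuery API v2 with the Python API client. The project, dataset, table, and GCS paths are placeholders rather than values taken from the sample application.

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

PROJECT_ID = 'your-project-id'  # placeholder, not the sample app's project

credentials = GoogleCredentials.get_application_default()
bigquery = discovery.build('bigquery', 'v2', credentials=credentials)

def start_export(dataset_id, table_id, gcs_pattern):
    # Extract job: materialize the table as newline-delimited JSON files in GCS.
    body = {'configuration': {'extract': {
        'sourceTable': {'projectId': PROJECT_ID,
                        'datasetId': dataset_id,
                        'tableId': table_id},
        'destinationUris': [gcs_pattern],  # e.g. 'gs://my-bucket/input-*.json'
        'destinationFormat': 'NEWLINE_DELIMITED_JSON'}}}
    return bigquery.jobs().insert(projectId=PROJECT_ID, body=body).execute()

def start_import(gcs_pattern, dataset_id, table_id):
    # Load job: populate a table from the MapReduce output files in GCS.
    body = {'configuration': {'load': {
        'sourceUris': [gcs_pattern],  # e.g. 'gs://my-bucket/output-*.json'
        'sourceFormat': 'NEWLINE_DELIMITED_JSON',
        'destinationTable': {'projectId': PROJECT_ID,
                             'datasetId': dataset_id,
                             'tableId': table_id},
        'writeDisposition': 'WRITE_TRUNCATE'}}}
    return bigquery.jobs().insert(projectId=PROJECT_ID, body=body).execute()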
All the BigQuery-related plumbing should be familiar. The first section of
this chapter covered the details of extracting data from BigQuery tables, and
Chapter 6 covered loading data into tables. The main challenge is to run all
this in the AppEngine environment and coordinate it with a MapReduce job.
Ordinarily, you'd run into an AppEngine limitation: all requests must finish
within 60 seconds. Because BigQuery Load and Extract jobs may take longer
than 60 seconds, you would need to implement a complex timer-and-callback
mechanism to divide the longer-lived BigQuery jobs into smaller chunks.
However, to simplify the code, you can take advantage of AppEngine's
support for long-lived instances; you just need to write a function that
sequentially performs each step. Because long-lived instances are allowed
to spin up background threads and have no restrictions on the time spent
on an individual request, your background thread can simply poll for the
completion of each step.
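As an illustration of that polling approach, the helper below waits for a single BigQuery job to finish; a background thread could call it after starting each step. This is a sketch rather than the sample application's code, and it assumes a bigquery client object like the one built above.

import time

def wait_for_job(bigquery, project_id, job_id, poll_interval_secs=10):
    # Poll jobs.get until the job reaches the DONE state, then surface any error.
    while True:
        job = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()
        if job['status']['state'] == 'DONE':
            if 'errorResult' in job['status']:
                raise RuntimeError('BigQuery job failed: %s'
                                   % job['status']['errorResult'])
            return job
        time.sleep(poll_interval_secs)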
Configuring long-lived instances is covered in the AppEngine
documentation. The setup_appengine.py script creates a
controller.yaml file that defines a suitable AppEngine module. The
significant portions of this file are:
application: bigquery-mr-sample
module: controller
version: 1
runtime: python27
api_version: 1
threadsafe: yes
instance_class: B4
basic_scaling:
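To show how the pieces fit together, here is a hypothetical handler for a module like the one configured above: because the module uses basic scaling, a request handler can hand the whole pipeline to a background thread, which then runs each step sequentially and polls it to completion. The run_pipeline function and the /start route are illustrative names, not part of the sample application.

import webapp2
from google.appengine.api import background_thread

def run_pipeline():
    # Placeholder for the sequential steps described above:
    # 1. start the BigQuery export job and wait for it to finish
    # 2. run the MapReduce over the exported GCS files and wait for it
    # 3. start the BigQuery load job for the output files and wait for it
    pass

class StartPipelineHandler(webapp2.RequestHandler):
    def get(self):
        # Background threads are only available on manual or basic scaling modules.
        background_thread.start_new_background_thread(run_pipeline, [])
        self.response.write('pipeline started')

app = webapp2.WSGIApplication([('/start', StartPipelineHandler)])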