# First start all of the reader threads.
for index in range(len(partition_readers)):
    partition_readers[index].start(gcs_objects[index])
# Wait for all of the reader threads to complete.
for index in range(len(partition_readers)):
    partition_readers[index].wait_for_complete()
This code has two main parts: the PartitionReader class that downloads files as they become available, and the run_partitioned_extract_job method that launches the partitioned read and waits for it to complete. The PartitionReader extends Python's Thread object and periodically polls for new files to become available. Once the BigQuery job is complete, no more new files will arrive, so the thread exits after reading all remaining files.
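The polling loop described above can be sketched as follows. This is a simplified illustration, not the book's actual PartitionReader: the list_new_files, download, and job_done callables are hypothetical stand-ins for the real Google Cloud Storage listing, download, and BigQuery job-status calls.

```python
import threading
import time

class PollingReader(threading.Thread):
    """Sketch of a reader thread that polls for newly available files.

    list_new_files() returns file names not yet seen, download(name)
    fetches one file, and job_done() reports whether the extract job
    has finished. All three are hypothetical stand-ins.
    """
    def __init__(self, list_new_files, download, job_done,
                 poll_interval=0.01):
        super().__init__()
        self.list_new_files = list_new_files
        self.download = download
        self.job_done = job_done
        self.poll_interval = poll_interval

    def run(self):
        while True:
            # Check job status *before* listing, so a file that appears
            # between the two calls is still picked up on this pass.
            done = self.job_done()
            for name in self.list_new_files():
                self.download(name)
            if done:
                # The job finished and everything that remained has been
                # read; no more files can arrive, so exit.
                break
            time.sleep(self.poll_interval)

    def wait_for_complete(self):
        self.join()
```

Checking the job status before listing is what makes the exit safe: if job_done() returns True, any file the job produced is already visible to the final listing pass, so nothing is missed.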
The run_partitioned_extract_job method starts the job, and then starts all of the PartitionReader threads to read and download the files in parallel. It returns after waiting for all of those threads to complete.
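Assuming the job runner exposes a method that launches the extract job and each reader exposes the start() and wait_for_complete() methods shown earlier, the orchestration reduces to the following sketch. The names here (start_job and the fake classes in the usage note) are hypothetical; this is not the book's exact implementation.

```python
def run_partitioned_extract(job_runner, partition_readers, gcs_objects):
    """Sketch of the orchestration: launch the extract job, start one
    reader thread per output partition, then block until every reader
    has finished downloading. job_runner and the readers are
    hypothetical stand-ins for the book's classes."""
    job_runner.start_job()  # kick off the partitioned extract job
    # Each reader polls and downloads its own partition's files.
    for reader, gcs_object in zip(partition_readers, gcs_objects):
        reader.start(gcs_object)
    # Returns only after every partition has been fully downloaded.
    for reader in partition_readers:
        reader.wait_for_complete()
```

Starting all of the readers before waiting on any of them is what gives the parallelism: the downloads overlap with one another and with the extract job itself.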
The following code snippet shows how to use run_partitioned_extract_job to download the publicdata:samples.shakespeare table in three partitions:
$ python
>>> from extract_and_partitioned_read import run_partitioned_extract_job
>>> from job_runner import JobRunner
>>> from gcs_reader import GcsReader
>>> project_id='bigquery-e2e'
>>> gcs_bucket='bigquery-e2e'
>>> run_partitioned_extract_job(
...     JobRunner(project_id=project_id),
...     [GcsReader(gcs_bucket=gcs_bucket,
...                download_dir='/tmp/bigquery') for x in range(3)],
...     source_project_id='publicdata',
...     source_dataset_id='samples',
...     source_table_id='shakespeare')
[0] STARTING on gs://bigquery-e2e/