    threads.append(read_thread)
    threads[index].start()
for index in range(partition_count):
    threads[index].join()
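The fragment above is the tail of a loop that starts one reader thread per partition and then joins them all. A self-contained sketch of that start-then-join pattern might look like the following; `read_partition` here is a stand-in for the per-partition read function, not the book's actual implementation:

```python
import threading

# Sketch of the start-then-join pattern, assuming a hypothetical
# read_partition() function; the real one would read that partition's
# slice of the table.
partition_count = 4
results = [None] * partition_count

def read_partition(index):
    # Placeholder work standing in for the actual partition read.
    results[index] = index * index

threads = []
for index in range(partition_count):
    read_thread = threading.Thread(target=read_partition, args=(index,))
    threads.append(read_thread)
    threads[index].start()
for index in range(partition_count):
    threads[index].join()

print(results)  # → [0, 1, 4, 9]
```

Joining every thread before using `results` ensures all partitions have been read; without the second loop, the main thread could observe partially filled results.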
Extract Jobs Versus TableData.list() for Reading Data in Parallel
Both Extract jobs and TableData.list() let you read data from
tables in parallel. When should you use one versus the other? The
answer, unsurprisingly, depends on how you want to read the data. If
you want to read the table like a file—that is, read 1K bytes at a
time—you will likely want to use the output of an Extract job. Extract
produces files in Google Cloud Storage (GCS) that you can read
multiple times and in any byte range you choose, and you can download
the files using standard HTTP resumable download operations.
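Because the extracted files support arbitrary byte-range reads, you can split a file into fixed-size chunks and fetch each chunk in parallel with an HTTP `Range` header. A minimal sketch of the range computation, using a helper name (`byte_ranges`) of our own invention:

```python
# Sketch: splitting an extracted GCS file into inclusive byte ranges
# for parallel download. Each (start, end) pair would become an HTTP
# header of the form "Range: bytes=start-end".

def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte ranges covering total_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield (start, end)

# Example: a 10 MB file split into 4 MB chunks.
for start, end in byte_ranges(10 * 2**20, 4 * 2**20):
    print("Range: bytes=%d-%d" % (start, end))
```

Each range can then be downloaded independently (and retried independently), which is what makes the file-oriented path easy to parallelize.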
TableData.list(), however, lets you read a specific number of rows
but doesn't give you control over bytes. To read all the data, you need to
use a page token to fetch the next section of data. This means that you
can't just plug it in as-is to download your tables.
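The paging loop itself is simple: keep requesting pages and passing back the returned page token until none is returned. The sketch below uses an injectable `fetch_page` callable standing in for the actual tabledata.list() request; the `"rows"`/`"pageToken"` keys mirror the response shape, but the helper is our illustration, not library code:

```python
# Sketch of the TableData.list() paging pattern with a hypothetical
# fetch_page callable standing in for the API request.

def read_all_rows(fetch_page):
    rows, token = [], None
    while True:
        response = fetch_page(page_token=token)
        rows.extend(response.get("rows", []))
        token = response.get("pageToken")
        if token is None:  # no token means this was the last page
            return rows

# Fake two-page response to show the control flow.
pages = {None: {"rows": [1, 2], "pageToken": "p2"},
         "p2": {"rows": [3]}}
print(read_all_rows(lambda page_token: pages[page_token]))  # → [1, 2, 3]
```

Note that the token chain forces sequential fetching within one reader; parallelism with TableData.list() instead comes from giving each reader its own row range, as in the threaded example earlier.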
There are latency trade-offs as well. Extract jobs require you to wait for
the data to be produced, but when it is ready, you can download at the
speed of your Internet connection. TableData.list(), however, lets
you read data immediately, but the effective bandwidth will be lower
because the data has to be transcoded into your desired format
on the fly.
AppEngine MapReduce
There are a number of reasons you might want to extract data from
BigQuery. One common case is when a certain data transformation cannot
be expressed as a query within the service. For instance, it could be any
combination of the following: