Page tokens are the easiest to use because you get back a page token after
you start reading the table, which you can then use in your next request to
pick up where you left off. When no page token is returned, you have
reached the end of the table. A page token also pins your reads to a single
point in time, so if the table changes while you're reading from it, you
still see a stable snapshot of the table. The problem with page tokens,
however, is that they force you to read the entire table serially: you must
start at the beginning and read through to the end.
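As a rough sketch, the serial page-token loop looks like this (assuming an
already-authorized BigQuery service object from the Google API client; the
project, dataset, and table IDs and the process() function are placeholders,
not part of the book's code):

# Sketch: serial read with page tokens. The service object and
# process() function are assumed for illustration.
page_token = None
while True:
    response = service.tabledata().list(
        projectId='project_id', datasetId='dataset_id',
        tableId='table_id', pageToken=page_token).execute()
    process(response.get('rows', []))
    page_token = response.get('pageToken')
    if page_token is None:
        break  # No page token returned: end of the table.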
The other way of reading from a table, via a row index, lets you read any
rows you want, whenever you want. So if you want to skip to row 1,000,000
and read from there, you can just specify that row as the start index. This
makes it easier to read a table in parallel because each parallel worker
can skip to the index it wants. For instance, if you have 10,000,000
rows in the table and 10 workers, the first worker would start at row 0
and read the first 1,000,000 rows, the second worker would read rows
1,000,000 through 1,999,999, and so on. There are two problems with
reading a table this way: it is trickier to keep track of which rows each
worker should read, and if the table changes while you're reading it, you
may read inconsistent data.
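To make that bookkeeping concrete, here is one way to compute each worker's
slice and read it with the startIndex parameter (a sketch, again assuming an
authorized service object; all names are placeholders):

# Sketch: split total_rows evenly across num_workers; worker i reads
# rows [start, start + count) via the startIndex parameter.
def worker_slices(total_rows, num_workers):
    chunk = total_rows // num_workers
    slices = []
    for i in range(num_workers):
        start = i * chunk
        # The last worker also picks up any remainder rows.
        count = (total_rows - start) if i == num_workers - 1 else chunk
        slices.append((start, count))
    return slices

def read_slice(service, start, count, page_size=10000):
    rows = []
    while len(rows) < count:
        response = service.tabledata().list(
            projectId='project_id', datasetId='dataset_id',
            tableId='table_id', startIndex=start + len(rows),
            maxResults=min(page_size, count - len(rows))).execute()
        page = response.get('rows', [])
        if not page:
            break  # Short read: the table may have changed underneath us.
        rows.extend(page)
    return rows

# worker_slices(10000000, 10) yields (0, 1000000), (1000000, 1000000),
# and so on, matching the example above.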
Listing 12.4 shows a TableReader class that can be used to read an entire
table using TableData.list(). It reads one page at a time and then calls
out to a ResultHandler class that processes each page of results as it
arrives. If you use the FileResultHandler, it will write the results (still
in the F/V row format) to a local file. The section titled “TableData.list()” in
Chapter 5 shows how to translate the F/V format to a flat JSON format.
Listing 12.4: Reading a table with TableData.list() (table_reader.py)
import json
import os
import sys
import threading
import time
# Imports from the Google API client:
from apiclient.errors import HttpError
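The remainder of the listing is truncated in this excerpt. As a hypothetical
sketch of the structure the text describes, not the book's actual code, the
classes might look something like this (assuming an authorized service object
is passed in):

# Hypothetical sketch only; the real table_reader.py is not
# reproduced in full here.
class FileResultHandler:
    '''Writes each page of rows, still in F/V format, to a local file.'''
    def __init__(self, output_path):
        self.output = open(output_path, 'w')

    def handle_rows(self, rows):
        for row in rows:
            self.output.write(json.dumps(row) + '\n')

class TableReader:
    '''Reads an entire table one page at a time via TableData.list().'''
    def __init__(self, service, project_id, dataset_id, table_id):
        self.service = service
        self.project_id = project_id
        self.dataset_id = dataset_id
        self.table_id = table_id

    def read(self, result_handler):
        page_token = None
        while True:
            try:
                response = self.service.tabledata().list(
                    projectId=self.project_id,
                    datasetId=self.dataset_id,
                    tableId=self.table_id,
                    pageToken=page_token,
                    maxResults=10000).execute()
            except HttpError as err:
                print('Error reading table: %s' % err)
                return
            result_handler.handle_rows(response.get('rows', []))
            page_token = response.get('pageToken')
            if page_token is None:
                return  # No page token returned: end of the table.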