Page tokens are the easiest to use because you get back a page token after
you start reading the table, which you can then use in your next request to
pick up where you left off. When no page token is returned, you have
reached the end of the table. A page token also pins your reads to a single
point in time, so if the table changes while you're reading from it, you
still see a stable snapshot of the table. The problem with page tokens,
however, is that they force you to read the entire table serially: you must
start at the beginning and read through to the end.
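As a rough sketch, the serial page-token loop looks like this (assuming an
already-authorized BigQuery service object from the Google API client; the
project, dataset, and table IDs and the process() function are placeholders,
not part of the book's code):

# Sketch: serial read with page tokens. The service object and
# process() function are assumed for illustration.
page_token = None
while True:
    response = service.tabledata().list(
        projectId='project_id', datasetId='dataset_id',
        tableId='table_id', pageToken=page_token).execute()
    process(response.get('rows', []))
    page_token = response.get('pageToken')
    if page_token is None:
        break  # No page token returned: end of the table.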
The other way of reading from a table, via a row index, lets you read any
rows you want, whenever you want. So if you want to skip to row 1,000,000
and read from there, you can just specify that row as the start index. This
makes it easier to read a table in parallel because each parallel worker
can skip to the index it wants. For instance, if you have 10,000,000
rows in the table and 10 workers, the first worker would start at row 0
and read the first 1,000,000 rows, the second worker would read rows
1,000,000 through 1,999,999, and so on. There are two problems with
reading a table this way: it is trickier to keep track of which rows each
worker should read, and if the table changes while you're reading it, you
may read inconsistent data.
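To make that bookkeeping concrete, here is one way to compute each worker's
slice and read it with the startIndex parameter (a sketch, again assuming an
authorized service object; all names are placeholders):

# Sketch: split total_rows evenly across num_workers; worker i reads
# rows [start, start + count) via the startIndex parameter.
def worker_slices(total_rows, num_workers):
    chunk = total_rows // num_workers
    slices = []
    for i in range(num_workers):
        start = i * chunk
        # The last worker also picks up any remainder rows.
        count = (total_rows - start) if i == num_workers - 1 else chunk
        slices.append((start, count))
    return slices

def read_slice(service, start, count, page_size=10000):
    rows = []
    while len(rows) < count:
        response = service.tabledata().list(
            projectId='project_id', datasetId='dataset_id',
            tableId='table_id', startIndex=start + len(rows),
            maxResults=min(page_size, count - len(rows))).execute()
        page = response.get('rows', [])
        if not page:
            break  # Short read: the table may have changed underneath us.
        rows.extend(page)
    return rows

# worker_slices(10000000, 10) yields (0, 1000000), (1000000, 1000000),
# and so on, matching the example above.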
Listing 12.4 shows a TableReader class that can be used to read an entire
table using TableData.list(). It reads one page at a time and then calls
out to a ResultHandler class that processes each page of results as it
arrives. If you use the FileResultHandler, it will write the results (still
in the F/V row format) to a local file. The section titled “TableData.list()” in
Chapter 5 shows how to translate the F/V format to a flat JSON format.
Listing 12.4: Reading a table with TableData.list() (table_reader.py)
import json
import os
import sys
import threading
import time
# Imports from the Google API client:
from apiclient.errors import HttpError
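The remainder of the listing is truncated in this excerpt. As a hypothetical
sketch of the structure the text describes, not the book's actual code, the
classes might look something like this (assuming an authorized service object
is passed in):

# Hypothetical sketch only; the real table_reader.py is not
# reproduced in full here.
class FileResultHandler:
    '''Writes each page of rows, still in F/V format, to a local file.'''
    def __init__(self, output_path):
        self.output = open(output_path, 'w')

    def handle_rows(self, rows):
        for row in rows:
            self.output.write(json.dumps(row) + '\n')

class TableReader:
    '''Reads an entire table one page at a time via TableData.list().'''
    def __init__(self, service, project_id, dataset_id, table_id):
        self.service = service
        self.project_id = project_id
        self.dataset_id = dataset_id
        self.table_id = table_id

    def read(self, result_handler):
        page_token = None
        while True:
            try:
                response = self.service.tabledata().list(
                    projectId=self.project_id,
                    datasetId=self.dataset_id,
                    tableId=self.table_id,
                    pageToken=page_token,
                    maxResults=10000).execute()
            except HttpError as err:
                print('Error reading table: %s' % err)
                return
            result_handler.handle_rows(response.get('rows', []))
            page_token = response.get('pageToken')
            if page_token is None:
                return  # No page token returned: end of the table.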