Per-record errors are reported in a format similar to the errors
reported by JSON load jobs. There is no point in retrying invalid inserts,
but connection errors and other transient errors should be retried. If the
request contains only a couple of records and includes the insertId fields, it is
reasonable to retry the entire request; if the request has a large number
of records, it is more efficient to retry only the failed records.
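To make the retry logic concrete, the following sketch (not one of the book's
listings) re-sends only the rows whose errors look transient, using the
tabledata.insertAll() method of the Python API client. Treating the 'invalid'
reason as the only non-retryable error, and retrying just once, are simplifying
assumptions made for illustration.
def retry_failed_rows(tabledata, project_id, dataset_id,
                      table_id, rows):
    '''Send rows once, then re-send only the rows that failed
    with errors that look transient.'''
    response = tabledata.insertAll(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_id,
        body={'rows': rows}).execute()
    retry_rows = []
    for insert_error in response.get('insertErrors', []):
        reasons = [err.get('reason')
                   for err in insert_error.get('errors', [])]
        if 'invalid' in reasons:
            # Invalid records will never succeed, so skip them.
            continue
        retry_rows.append(rows[insert_error['index']])
    if retry_rows:
        response = tabledata.insertAll(
            projectId=project_id,
            datasetId=dataset_id,
            tableId=table_id,
            body={'rows': retry_rows}).execute()
    return response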
Just as with load jobs, there is a size limit on an individual request and rate
limits on the total number of requests.
Maximum record size: 100 KB
Maximum bytes per request: 1 MB
Table rate limit: 10,000 rows/second (enforced over 10 seconds)
Project rate limit: 100,000 rows/second
Record size and bytes per request refer to the size computed from the data in
the records, not the JSON-encoded size.
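These limits can also be checked (approximately) on the client before a request
is issued. The sketch below is not from the book: it assumes rows shaped as
{'insertId': ..., 'json': {...}} dictionaries and estimates record size from
the field values alone, which is only a rough stand-in for the size accounting
the service performs.
MAX_RECORD_BYTES = 100 * 1024     # 100 KB per record
MAX_REQUEST_BYTES = 1024 * 1024   # 1 MB per request

def approx_record_bytes(row):
    '''Rough estimate based on the field values, not the
    JSON-encoded request.'''
    return sum(len(str(value)) for value in row['json'].values())

def split_into_requests(rows):
    '''Yield lists of rows whose estimated total size fits in a
    single streaming insert request.'''
    batch, batch_bytes = [], 0
    for row in rows:
        row_bytes = approx_record_bytes(row)
        if row_bytes > MAX_RECORD_BYTES:
            raise ValueError('Record exceeds the per-record limit')
        if batch and batch_bytes + row_bytes > MAX_REQUEST_BYTES:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(row)
        batch_bytes += row_bytes
    if batch:
        yield batch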
To complete this section, look at how to perform inserts using the
Python client API. Listing 6.2 is a script that accepts a filename as an
argument. It tails the given file (that is, polls for data appearing at the
end), parses each line into a record, and performs an insert. Notice
that it uses the filename and the position of the record as the insertId, which
ensures that if the script is restarted on the same file, the records will not
be duplicated. This is not perfect: if the script is restarted after the
deduplication window has passed, the records will end up duplicated. Fixing
this behavior is left as an exercise for the reader. Another feature to note
is that the script builds batches of up to 10 records before submitting the
request, but only if the records are immediately available. This usually
increases throughput without delaying the delivery of records.
Listing 6.2: stream.py
def tail_and_insert(infile,
                    tabledata,
                    project_id,
                    dataset_id,
                    table_id):
    '''Tail a file and stream its lines to a BigQuery table.

    infile: file object to be tailed.
    tabledata: tabledata() collection of an authorized BigQuery
        service object.
    project_id, dataset_id, table_id: identify the table that
        receives the streamed rows.
    '''
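As a companion to the signature above, here is a minimal sketch of the behavior
the surrounding text describes. It is an illustration rather than the book's
actual code: it assumes the tabledata() collection of the Python API client, a
hypothetical single-column destination schema, and an extra filename argument
used only to build insertIds.
import time

def tail_and_insert_sketch(infile, tabledata, project_id,
                           dataset_id, table_id, filename):
    '''Tail infile and stream its lines, batching up to 10 rows.'''
    rows = []
    while True:
        position = infile.tell()
        line = infile.readline()
        if line:
            rows.append({
                # filename:offset keeps restarts idempotent within
                # the deduplication window.
                'insertId': '%s:%d' % (filename, position),
                # Assumed single-column schema; real parsing depends
                # on the destination table.
                'json': {'line': line.rstrip('\n')}})
        # Flush when 10 rows are buffered or when no more data is
        # immediately available.
        if rows and (len(rows) >= 10 or not line):
            tabledata.insertAll(
                projectId=project_id,
                datasetId=dataset_id,
                tableId=table_id,
                body={'rows': rows}).execute()
            rows = []
        if not line:
            # At end of file: wait briefly before polling again.
            time.sleep(1.0)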