Bulk loads are often crucial when loading data because they make it simple
to ensure that queries operate on valid data.
So far we have been qualifying load with the term “bulk.” In the BigQuery
API this corresponds to a job with a load configuration, and the bulk nature is
implied. For simplicity, from here on, the operation of loading a batch of
records is referred to as a load job. It is this job that has ACID semantics in
BigQuery, particularly with respect to other jobs in the system. As described
in Chapter 5, “Talking to the BigQuery API,” a load job, like every BigQuery
job, goes through the same life cycle of pending, running, and done.
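For example, you can observe these states from the command line; the
following command lists your ten most recent jobs along with their current
state:
bq ls -j -n 10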
NOTE
The code in this chapter assumes a dataset named ch06 exists in the
project you use for trying the sample code. You can create this dataset
by running:
bq mk ch06
And when you finish this chapter, you can clean up by running:
bq rm -f -r ch06
Listing 6.1 (a and b) is the skeleton Python code for executing a load job in
BigQuery and monitoring it over its life cycle. This involves inserting the job,
polling to detect completion, and inspecting the final status of the job. The
main feature to note is the polling loop. It is not generally necessary to
poll load jobs frequently because they usually take at least 30 seconds. The
code provided uses a 10-second wait between Jobs.get() operations, which
is a reasonable value. You may want to tune the wait depending on the
nature of the load jobs you need to run; if you run large loads, a longer
wait would be more appropriate. Also observe that the code in Listing
6.1b has comments indicating where you would add code to control the
configuration of the job and manage the transfer of data to the service. The
following sections contain code snippets to place in these locations to enable
a particular configuration.
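As a rough sketch of the pattern used in Listing 6.1, the following code
inserts a load job, polls Jobs.get() every 10 seconds, and inspects the final
status. It assumes the google-api-python-client library with application
default credentials; the project ID, Cloud Storage URI, and table name are
placeholders you would replace with your own.
import time

from googleapiclient.discovery import build

# Placeholder names; substitute your own project and Cloud Storage path.
PROJECT_ID = 'your-project-id'
SOURCE_URI = 'gs://your-bucket/sample.csv'

service = build('bigquery', 'v2')

# Load job configuration: read a CSV file from Cloud Storage into ch06.sample.
# Schema, format options, and so on would be added to the load section.
job_body = {
    'configuration': {
        'load': {
            'sourceUris': [SOURCE_URI],
            'destinationTable': {
                'projectId': PROJECT_ID,
                'datasetId': 'ch06',
                'tableId': 'sample',
            },
        }
    }
}

# Insert the job; it enters the life cycle in the PENDING state.
job = service.jobs().insert(projectId=PROJECT_ID, body=job_body).execute()
job_id = job['jobReference']['jobId']

# Poll Jobs.get() every 10 seconds until the job reaches the DONE state.
while True:
    job = service.jobs().get(projectId=PROJECT_ID, jobId=job_id).execute()
    if job['status']['state'] == 'DONE':
        break
    time.sleep(10)

# A DONE job may still have failed, so inspect the final status.
if 'errorResult' in job['status']:
    print('Load failed:', job['status']['errorResult'])
else:
    print('Load succeeded')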
The configuration of a load job has three distinct components:
 
 