Loading Data - Google BigQuery Analytics

Database Reference

In-Depth Information

existing contents of a table. This was something of a white lie. It is true that

they have no access to the data already present in a table. However, they do

depend on the existing schema of the table and as mentioned can modify the

existing schema by adding columns. In this respect isolation is relevant and

load jobs are isolated from other load jobs because they capture a copy of the

destination table's schema at the time they are created and do not observe

changes to the schema that may occur while they are running. This can lead

to the job eventually failing because the schema of the table changes while it

is in progress.

If all this seems too complicated, follow this simple rule of thumb: Do not

modify the schema of an existing table by specifying a schema on a job.

Instead modify the schema by updating the table, and do this only when

there are no pending or running jobs. However, if you are interested in how

this unpredictable behavior can arise, here is a concrete example. Assume

that the table being updated has columns A and B. Load job 1 adds column

C, and load job 2 adds columns C and D. Both jobs can be legally started

at the same time because they are both performing compatible (with the

existing table) schema updates. If job 1 completes before job 2, both jobs

can succeed because the schema update of job 2 is adding a column (D)

in addition to the column (C) added by job 1. However, if job 2 completes

before job 1, job 1 will fail because the schema it specified removes a column

(D) with respect to the schema specified in job 2. Because the completion

order of concurrently executing jobs is not guaranteed, this implies that job

1 may fail randomly, which is unwanted. Also after job 2 succeeds it will be

impossible to retry job 1.

In addition to specifying where the load job should put the data, you can also

tune how the load job behaves if the existing destination table does or does

not exist, and what it does with any data that might already be present in the

destination table. This behavior is tuned by controlling the create and write

dispositions of the job. These properties were briefly described in Chapter

4 in the overview of the Jobs collection. You can control the dispositions by

modifying the job configuration:

load['createDisposition'] = 'CREATE_IF_NEEDED'

load['writeDisposition'] = 'WRITE_TRUNCATE'

Table 6.2 shows the supported dispositions and how they affect the

operation of the job:

Search WWH ::

Custom Search

Home