This code causes the first six lines of every file to be ignored.
allowJaggedRows
When encoding records with a lot of fields (columns) that are frequently
null-valued, some tools choose to leave out trailing fields that are null. When
reading this data all columns after the last column present in a row must be
treated as null or absent. Making this data conform to the requirements of
the basic CSV format would mean padding each row with trailing commas
(the field delimiter) to represent the null columns. Again, BigQuery has a
feature that can handle this data.
load['allowJaggedRows'] = True
In this mode BigQuery accepts a row with fewer columns than the number
of fields in the schema, as long as every schema field that is missing from
the row is marked NULLABLE. Note that any null column that appears
before a non-null column must still be explicitly encoded as a blank field in
the row.
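The padding behavior described above can be sketched in a few lines of Python. This is an illustrative client-side model only (BigQuery performs this server-side during the load); the helper name and schema are hypothetical.

```python
import csv
import io

# Hypothetical schema: (field name, mode) pairs, mirroring BigQuery's
# REQUIRED/NULLABLE field modes.
SCHEMA = [("name", "REQUIRED"), ("city", "NULLABLE"), ("zip", "NULLABLE")]

def pad_jagged_row(row, schema=SCHEMA):
    """Model of allowJaggedRows: fill missing trailing fields with None,
    but only if every missing schema field is NULLABLE."""
    if len(row) > len(schema):
        raise ValueError("row has more columns than the schema")
    for _, mode in schema[len(row):]:
        if mode != "NULLABLE":
            raise ValueError("missing value for a non-NULLABLE field")
    return row + [None] * (len(schema) - len(row))

data = "alice,seattle,98101\nbob,portland\ncarol\n"
rows = [pad_jagged_row(r) for r in csv.reader(io.StringIO(data))]
# rows[1] -> ['bob', 'portland', None]
# rows[2] -> ['carol', None, None]
```

Note that a row such as `carol` still supplies a value for the REQUIRED first field; only trailing NULLABLE fields may be omitted, which matches the rule that a null column before a non-null column must be encoded explicitly.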
allowQuotedNewlines
This option deserves careful explanation because it affects the one aspect
of CSV parsing in which BigQuery's default behavior differs from the
specification of the format. The CSV format enables the newline character
to appear within quoted fields. This is necessary to let the format encode
values that contain the line separator. However, it turns out that this feature
makes it impossible to safely process chunks of a CSV file in parallel. In
any chunk other than the first chunk, it is impossible to tell if a newline
occurs inside a quoted string or outside a quoted string. This means that
the file can be processed only from beginning to end by a single process
keeping track of whether it is in the middle of a quoted value. However, the
majority of CSV associated with data processing does not contain quoted
newlines, so it would be a shame if the default behavior were to use the slow,
but specification-compatible, serial processing strategy instead of the faster
parallel processing strategy. As a result, BigQuery defaults to assuming
that no quoted newlines are present in the input data, so the file can
be safely divided up for parallel processing. If your data does contain
quoted newlines, you can set the allowQuotedNewlines property:
load['allowQuotedNewlines'] = True
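The parallelization problem described above is easy to demonstrate: naive chunking on newline characters cannot distinguish a record boundary from a newline inside a quoted field, while a spec-compliant serial parser that tracks quoting state can. A small sketch using Python's standard csv module:

```python
import csv
import io

# One record's second field contains a quoted newline, as the CSV
# specification permits.
data = 'id,comment\n1,"line one\nline two"\n2,ok\n'

# Naive chunking: splitting on newlines miscounts the records, because a
# worker starting mid-file cannot tell whether it is inside a quoted field.
naive_records = data.rstrip("\n").split("\n")

# A spec-compliant serial parser tracks quoting state and gets it right.
real_records = list(csv.reader(io.StringIO(data)))

# naive_records has 4 entries; real_records has 3 (header plus 2 rows),
# with the embedded newline preserved inside the quoted field.
```

This is why the quoted-newline-aware mode must process the file serially, and why BigQuery only pays that cost when you opt in with allowQuotedNewlines.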