Loading Data - Google BigQuery Analytics

Modifying the delimiter and quote is useful when fields contain sentences

and paragraphs that usually have punctuation. You still need to be careful

about the handling of newlines and carriage returns and may need to come

up with a scheme for escaping them or transforming them in your data.
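
One possible escaping scheme is sketched below in Python. It is an illustration only, not something BigQuery prescribes: the function names `escape_newlines` and `unescape_newlines` are hypothetical, and the choice of backslash escapes is an assumption. The idea is to turn each newline and carriage return into a visible two-character sequence before writing the file, and to reverse the mapping after reading values back.

```python
def escape_newlines(value):
    """Replace backslashes, newlines, and carriage returns with visible escapes.

    Backslashes are doubled first so the mapping can be reversed unambiguously.
    """
    return (value.replace('\\', '\\\\')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r'))


def unescape_newlines(value):
    """Reverse escape_newlines by scanning for backslash escape pairs."""
    mapping = {'n': '\n', 'r': '\r', '\\': '\\'}
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == '\\' and i + 1 < len(value):
            nxt = value[i + 1]
            # Unknown escapes are passed through unchanged.
            out.append(mapping.get(nxt, '\\' + nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)
```

With this scheme every record occupies exactly one physical line, so the record delimiter never appears inside a field.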

encoding

You just got a taste of the complications that character encoding brings to

the table. If you work with UTF-8, you can basically ignore encoding because

UTF-8 is the encoding used natively by BigQuery and is the encoding used in

the HTTP-based API. If at all possible you want to stick with UTF-8 because

it avoids any difficulty associated with encoding conversions.

Note that even if you use UTF-8 for the values of the field, the lines of data

will not be valid UTF-8 data if you use a field delimiter in the range 128-255.
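
The effect can be reproduced locally with a short Python sketch. The row contents and the delimiter byte 0xFE are assumptions chosen for illustration (0xFE never occurs in valid UTF-8, so it cannot collide with field bytes). The line as a whole fails to decode as UTF-8, while each field decodes cleanly once the line has been split on the delimiter.

```python
# Hypothetical row: two UTF-8 encoded fields joined by the single byte 0xFE.
DELIMITER = b'\xfe'
fields = ['café', 'naïve']
line = DELIMITER.join(f.encode('utf-8') for f in fields)


def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# The delimiter byte makes the full line invalid as UTF-8 ...
line_is_valid = is_valid_utf8(line)

# ... but splitting on the delimiter first leaves valid UTF-8 in each field.
decoded = [part.decode('utf-8') for part in line.split(DELIMITER)]
```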

When BigQuery parses your data, it first splits the data into rows based on

the record delimiter (limited to "\n", "\r", and "\r\n"). Then it splits rows

into fields based on the customizable field delimiter and then checks the

encoding of each individual field. The only alternative encoding supported

is ISO-8859-1, commonly known as Latin1. To request that your values be

treated as Latin1 strings and converted to UTF-8, use the following setting:

load['encoding'] = 'ISO-8859-1'

It is also legal to set this field to UTF-8, but that has no effect because

it is the default encoding. If you set the input encoding to ISO-8859-1,

single-byte characters in the range 128-255 will be converted to the

corresponding multibyte UTF-8 characters.
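
The conversion can be previewed locally with Python's built-in codecs. This is only a sketch of the byte-level effect, not BigQuery's implementation; the helper name `latin1_to_utf8` and the sample value are hypothetical.

```python
def latin1_to_utf8(data):
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return data.decode('iso-8859-1').encode('utf-8')


# 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes 0xC3 0xA9.
latin1_value = b'caf\xe9'
utf8_value = latin1_to_utf8(latin1_value)
```

Bytes below 128 are identical in both encodings, so only the characters in the range 128-255 grow from one byte to two.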

skipLeadingRows

Many tools that produce CSV include one or more header rows describing

the fields present in the data. It is tedious to have to strip this header

because in practice it means regenerating the entire file to just remove

the first few lines. Instead you can set a parameter in the configuration to

indicate to the parser that it should ignore some number of lines at the start

of the file. If your configuration specifies multiple source files (on GCS), the

same number of lines will be skipped at the start of each file.

load['skipLeadingRows'] = 6
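
To make the per-file behavior concrete, the following Python sketch emulates it locally. It is not BigQuery code: the function `read_rows` and the in-memory sample files are hypothetical stand-ins for the parser and for source files on GCS.

```python
import io
from itertools import islice


def read_rows(sources, skip_leading_rows=0):
    """Yield data rows from each source, skipping the leading rows of every one."""
    for source in sources:
        # islice drops the first skip_leading_rows lines of this source only.
        for line in islice(source, skip_leading_rows, None):
            yield line.rstrip('\r\n')


# Two hypothetical source files, each starting with its own header row.
file_a = io.StringIO('name,count\nalpha,1\nbeta,2\n')
file_b = io.StringIO('name,count\ngamma,3\n')

rows = list(read_rows([file_a, file_b], skip_leading_rows=1))
```

Here `rows` contains only the three data rows, because the header is dropped from both files rather than just the first.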
