load['allowQuotedNewlines'] = True
This makes BigQuery process each input file serially (separate files are still
processed in parallel) so that it can correctly handle the values. Beware
that this can be much slower than processing your data in parallel. If each
individual load is small (say, <100MB) this is not a significant issue, but if
your loads are much larger, you should consider alternatives. One simple
workaround is to substitute newlines in strings with some other sequence,
for example, the C escape sequence \n. The exact choice of replacement
characters depends on how you intend to query the field.
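The substitution workaround can be sketched in Python as follows. This is a minimal illustration, not part of any BigQuery client library: the helper names escape_newlines and write_escaped_csv are hypothetical, and it assumes you control the step that generates the CSV. Each line break inside a string becomes the two-character sequence \n, so every record occupies exactly one physical line and the file can be loaded without allowQuotedNewlines.

```python
import csv
import io


def escape_newlines(value):
    """Replace line breaks in a string with the two-character sequence \\n.

    Hypothetical helper for illustration; non-string values pass through.
    """
    if isinstance(value, str):
        # Normalize \r\n and bare \r to \n first, then escape every \n.
        normalized = value.replace("\r\n", "\n").replace("\r", "\n")
        return normalized.replace("\n", "\\n")
    return value


def write_escaped_csv(rows, out):
    """Write rows as CSV so that each record sits on one physical line."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow([escape_newlines(v) for v in row])


# Example: the second field of the first row contains an embedded newline.
rows = [["1", "first line\nsecond line"], ["2", "no newline"]]
buf = io.StringIO()
write_escaped_csv(rows, buf)
csv_text = buf.getvalue()

# Reading the text back shows the escaped form that a query would see.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Queries that display the field then need to turn the escape sequence back into a real line break, which is why the choice of replacement depends on how the field is used.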
Compression
The last option to cover is not quite an option but rather a property of
the data. BigQuery supports GZIP-compressed CSV data. It automatically
detects if the data is compressed in a recognized format, and if so it
decompresses the data before processing it. GZIP compression has the same
consequence for distributed processing as quoted newlines: a single process
must decompress the entire file because it is not possible to resume
decompression in the middle of a file. However,
compression can be critical for transferring data, especially with formats
such as CSV, which compress quite well, so it may still be necessary to
employ compression. Most HTTP client implementations can transparently
compress the request so that both ends deal only with uncompressed data.
If you are instead relying on explicit compression prior to transmitting
your data to BigQuery (or GCS), it is a good idea to generate multiple
compressed files with sizes between 10 MB and 100 MB rather than a single,
large compressed file. Because the processing can still be parallelized over
files, you can benefit from distributed processing.
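One way to produce several smaller compressed files is sketched below using Python's standard gzip module. The function name write_gzip_chunks and its parameters are hypothetical. The sketch assumes that every element of lines is one complete CSV record ending in a newline and that the data has no header row, so each output file is a valid CSV file on its own. The size limit is applied to uncompressed bytes as a simple proxy; the compressed files will be smaller, so the threshold needs tuning against your actual compression ratio to land in the suggested range.

```python
import gzip
import os
import tempfile


def write_gzip_chunks(lines, out_dir, prefix="part", max_bytes=64 * 1024 * 1024):
    """Write records into several .csv.gz files, splitting only between records.

    max_bytes limits the UNCOMPRESSED bytes written to each file (an
    assumption made for simplicity). Returns the list of file paths.
    """
    paths = []
    out = None
    written = 0
    try:
        for line in lines:
            data = line.encode("utf-8")
            # Start a new file at the beginning, or when adding this record
            # would push a non-empty file past the limit.
            if out is None or (written > 0 and written + len(data) > max_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.csv.gz")
                out = gzip.open(path, "wb")
                paths.append(path)
                written = 0
            out.write(data)
            written += len(data)
    finally:
        if out is not None:
            out.close()
    return paths


# Small demonstration with a tiny limit so that several files are produced.
with tempfile.TemporaryDirectory() as tmp:
    sample = [f"{i},value-{i}\n" for i in range(1000)]
    chunk_paths = write_gzip_chunks(sample, tmp, max_bytes=4096)
    chunk_count = len(chunk_paths)

    # Read every chunk back to confirm that no record was lost or split.
    restored = []
    for p in chunk_paths:
        with gzip.open(p, "rt", encoding="utf-8") as f:
            restored.extend(f.readlines())
```

Each resulting file can then be listed as a separate source in the load job, so the files are decompressed and processed in parallel.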
This completes the discussion of CSV, which has the most options for
processing because it is such a loosely implemented format. There is a good
chance that you will use it as an input format because it is so widely adopted.
It is a good idea to be familiar with all the options covered in this section
because they might end up saving you the trouble of reformatting your data.
JSON
The other textual format that BigQuery supports is JSON
(http://www.json.org), which has established itself as the standard for
data exchange between web applications. This is the only textual format that