between records that were in separate tables. In the single table case, it is
likely possible to collect the updates from all 10 independent processes and
combine them into a single load operation. In both cases this would work
around the quota constraint without sacrificing the frequency of updates.
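To make the batching idea concrete, here is a minimal sketch, assuming the google-cloud-bigquery Python client library and a hypothetical destination table my_project.my_dataset.events: records gathered from the independent writers are submitted together as one append-only load job, so the whole batch counts once against the per-table daily quota no matter how many records it contains.

from google.cloud import bigquery

# Hypothetical table name; substitute your own project, dataset, and table.
TABLE_ID = "my_project.my_dataset.events"

def load_batched_updates(collected_rows):
    """Submit records collected from many writers as a single load job.

    collected_rows is a list of dicts, one per record, accumulated from
    the independent processes before one coordinated load.
    """
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # A single load job consumes one unit of the per-table daily quota,
    # regardless of how many records it carries.
    job = client.load_table_from_json(collected_rows, TABLE_ID, job_config=job_config)
    job.result()  # block until the load job completes
    return job.output_rows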
Another point to note is that small load jobs waste the per-table quota. If you load only a handful of records per load job, at the end of the day your table will contain only a few thousand records, which is not exactly Big Data. The usual reason for issuing small, frequent updates is to keep the table fresh for some real-time data source. Considering that even a small load job can take anywhere from tens of seconds to a couple of minutes, this is not an effective way to keep data fresh or to use your daily quota. This issue sets you up nicely for the next major section of this chapter. As previously mentioned, BigQuery supports both a throughput-optimized load operation and a latency-optimized load operation. When you run into the daily table or project limits, it may be a signal that you should switch to the latency-optimized operation. The next section covers this alternative way of loading data into the service.
Streaming Inserts
If you are familiar with traditional databases, you may wonder why so much
machinery is required to load a couple of records into a table. As discussed
in Chapter 2, “BigQuery Fundamentals,” aspects of the service resemble
a relational database, but at its core BigQuery is a distributed processing
framework optimized for dealing with large amounts of data. As a result, its
primary loading mechanism is geared toward ingesting large quantities of
data rather than individual records. Nevertheless, the service does provide
a simple operation for inserting individual records, referred to here as a
streaming insert. Even though it bears a strong resemblance to the SQL
insert statement, do not be fooled; there are substantial differences. The
API gains its simplicity and low latency by forgoing the strong guarantees offered by the job-based load operation. In contrast to the ACID properties of load jobs, this operation might best be described as Eventual-At-Least-Once. This means that one or more copies of a record inserted
via the streaming API are guaranteed to eventually appear in queries over
the destination table. This may seem like an alarmingly weak promise, but it
is sufficient for a variety of applications. In practice, records inserted via this
API are available immediately and exactly once in queries, which means that the behavior most applications observe is far stronger than the guarantee itself.
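As a concrete illustration of the streaming path, here is a minimal sketch, again assuming the google-cloud-bigquery Python client library; the table name and record fields are hypothetical. The client method wraps the tabledata.insertAll API, and supplying a per-row insert ID enables the service's best-effort deduplication, which softens the at-least-once side of the guarantee.

import uuid

from google.cloud import bigquery

# Hypothetical table name; substitute your own project, dataset, and table.
TABLE_ID = "my_project.my_dataset.events"

def stream_record(record):
    """Insert a single record through the streaming (tabledata.insertAll) path."""
    client = bigquery.Client()
    # A client-supplied row id lets the service deduplicate retried inserts
    # on a best-effort basis.
    errors = client.insert_rows_json(TABLE_ID, [record], row_ids=[str(uuid.uuid4())])
    if errors:
        raise RuntimeError("Streaming insert failed: %s" % errors)

# Example usage with an illustrative record.
stream_record({"user": "alice", "action": "login"})

Unlike a load job, the call returns quickly and reports any per-row errors directly in its response rather than through the status of a job.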