work. For the command to load the data into BigQuery, you need to specify
the list of files. The log files generated conform to a naming scheme:
gs://<log bucket>/<log prefix>_usage_YYYY_MM_DD_hh_mm_ss_<id>_v0
The ID following the timestamp on the file is a system-generated string to
avoid name collisions. For example, you might be saving the serving logs of
two different buckets to the same log bucket and prefix. If two log files are
generated in the same second, the ID ensures that one does not clobber the
other.
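For example, to see which log objects were generated for a particular day, you can list them with a wildcard (a quick check, assuming the LOG_BUCKET and LOG_PREFIX variables used in the load command below are set; the date is illustrative):
$ gsutil ls "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_2014_02_01_*"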
A reasonable scheme is to load all the logs for a day after the last log for
the day has been generated. Because logs are generated hourly, you could
set up a cron job that executes a couple of hours after midnight UTC to load
the previous day's logs. Assuming you have a suitable dataset created in
BigQuery, the command would look something like this:
$ LOG_DATASET='ch14'
$ bq mk ${LOG_DATASET}
$ bq load \
    --skip_leading_rows=1 \
    --schema=/tmp/cloud_storage_usage_schema_v0.json \
    ${LOG_DATASET}.gcs_usage \
    "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_2014_02_*"
You could keep the table more up to date by loading files through the day
as they appear, but this requires additional bookkeeping to track which files
have been loaded. One strategy is to use load job IDs that contain the suffix
of the name of the file that is being loaded. A periodic job can then
enumerate the existing load jobs to determine which files have already been
scheduled and start jobs for any files that have no corresponding job ID.
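Here is a minimal sketch of that bookkeeping, assuming LOG_BUCKET, LOG_PREFIX, and LOG_DATASET are set as in the load command above; the "gcs_usage_" job-ID prefix and the overall script are illustrations, not a fixed recipe. Rather than enumerating every job, it probes for each file's derived job ID directly, which amounts to the same check:
#!/bin/bash
# Sketch only: assumes LOG_BUCKET, LOG_PREFIX, and LOG_DATASET are already set
# as in the load command above.
for object in $(gsutil ls "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_*_v0"); do
  # Derive a deterministic job ID from the file name so a file is never
  # scheduled twice, even if the script is re-run.
  job_id="gcs_usage_$(basename "${object}")"
  if bq show -j "${job_id}" >/dev/null 2>&1; then
    continue  # A job with this ID already exists; skip the file.
  fi
  bq --job_id="${job_id}" load \
      --skip_leading_rows=1 \
      --schema=/tmp/cloud_storage_usage_schema_v0.json \
      ${LOG_DATASET}.gcs_usage "${object}"
done
Embedding the file name in the job ID makes the process idempotent: BigQuery rejects a second job submitted with the same ID, so even a race between two runs of the script cannot load a file twice.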
After the data is loaded into BigQuery, you can run all sorts of interesting
queries on your usage data. Here is one to get you started:
SELECT
  HOUR(time) traffic_hour,
  SUM(get_bucket) bucket_gets,
  SUM(put_bucket) bucket_puts
FROM [ch14.gcs_usage]
GROUP BY traffic_hour
ORDER BY traffic_hour