work. For the command to load the data into BigQuery, you need to specify
the list of files. The log files generated conform to a naming scheme:
gs://<log bucket>/<log prefix>_usage_YYYY_MM_DD_hh_mm_ss_<id>_v0
The ID following the timestamp on the file is a system-generated string to
avoid name collisions. For example, you might be saving the serving logs of
two different buckets to the same log bucket and prefix. If two log files are
generated in the same second, the ID ensures that one does not clobber the
other.
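For example, to see which log objects were generated for a particular day, you can list them with a wildcard (a quick check, assuming the LOG_BUCKET and LOG_PREFIX variables used in the load command below are set; the date is illustrative):
$ gsutil ls "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_2014_02_01_*"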
A reasonable scheme is to load all the logs for a day after the last log for
the day has been generated. Because logs are generated hourly, you could
set up a cron job that executes a couple of hours after midnight UTC to load
the previous day's logs. Assuming you have a suitable dataset created in
BigQuery, the command would look something like this:
$ LOG_DATASET='ch14'
$ bq mk ${LOG_DATASET}
$ bq load \
    --skip_leading_rows=1 \
    --schema=/tmp/cloud_storage_usage_schema_v0.json \
    ${LOG_DATASET}.gcs_usage \
    "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_2014_02_*"
You could keep the table more up to date by loading files through the day
as they appear, but this requires additional bookkeeping to track which files
have been loaded. One strategy is to use load job IDs that contain the suffix
of the name of the file that is being loaded. A periodic job can then
enumerate the existing load jobs to determine which files have already been
scheduled and start jobs for any files that have no corresponding job ID.
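Here is a minimal sketch of that bookkeeping, assuming LOG_BUCKET, LOG_PREFIX, and LOG_DATASET are set as in the load command above; the "gcs_usage_" job-ID prefix and the overall script are illustrations, not a fixed recipe. Rather than enumerating every job, it probes for each file's derived job ID directly, which amounts to the same check:
#!/bin/bash
# Sketch only: assumes LOG_BUCKET, LOG_PREFIX, and LOG_DATASET are already set
# as in the load command above.
for object in $(gsutil ls "gs://${LOG_BUCKET}/${LOG_PREFIX}_usage_*_v0"); do
  # Derive a deterministic job ID from the file name so a file is never
  # scheduled twice, even if the script is re-run.
  job_id="gcs_usage_$(basename "${object}")"
  if bq show -j "${job_id}" >/dev/null 2>&1; then
    continue  # A job with this ID already exists; skip the file.
  fi
  bq --job_id="${job_id}" load \
      --skip_leading_rows=1 \
      --schema=/tmp/cloud_storage_usage_schema_v0.json \
      ${LOG_DATASET}.gcs_usage "${object}"
done
Embedding the file name in the job ID makes the process idempotent: BigQuery rejects a second job submitted with the same ID, so even a race between two runs of the script cannot load a file twice.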
After the data is loaded into BigQuery, you can run all sorts of interesting
queries on your usage data. Here is one to get you started:
SELECT
  HOUR(time) traffic_hour,
  SUM(get_bucket) bucket_gets,
  SUM(put_bucket) bucket_puts
FROM [ch14.gcs_usage]
GROUP BY traffic_hour
ORDER BY traffic_hour