This listing reads records from standard input, looks up the ZIP code, and
writes records with the ZIP code added to the standard output. The input
records must be specified as newline-delimited JSON matching your
expected input schema. The output format is similar, with the addition of
the ZIP code field.
On startup, the listing loads the ZIP code database into an index (a k-d tree,
which supports efficient lookup of the nearest points to a given point; see
https://code.google.com/p/python-kdtree/). It then parses each input line as
a JSON object, looks up the nearest ZIP code, and emits the record with that
ZIP code added. You can run the script with the following command line:
$ cd appengine
$ python add_zip.py ../add_zip_sample.json
$ cd ..
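The flow described above can be sketched roughly as follows. This is not the
book's add_zip.py: the field names lat, lng, and zip, the sample ZIP_POINTS
table, and the function names are assumptions made for illustration, and a
plain linear scan over squared coordinate differences stands in for the
k-d tree lookup.

```python
import json

# Hypothetical in-memory ZIP table: (latitude, longitude, zip_code).
# The real script loads a full ZIP code database into a k-d tree instead.
ZIP_POINTS = [
    (47.6062, -122.3321, "98101"),
    (37.7749, -122.4194, "94103"),
    (40.7128, -74.0060, "10007"),
]


def nearest_zip(lat, lng, points=ZIP_POINTS):
    """Return the ZIP code of the closest point (linear scan, not a k-d tree)."""
    def squared_distance(point):
        # Squared difference in degrees: a rough stand-in for real distance.
        return (point[0] - lat) ** 2 + (point[1] - lng) ** 2
    return min(points, key=squared_distance)[2]


def add_zip(record):
    """Core computation: return a copy of the record with a 'zip' field added."""
    result = dict(record)
    result["zip"] = nearest_zip(record["lat"], record["lng"])
    return result


def process(lines, out):
    """I/O wrapper: read newline-delimited JSON, write the enriched records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        out.write(json.dumps(add_zip(json.loads(line))) + "\n")
```

Calling process(sys.stdin, sys.stdout) would connect this sketch to standard
input and output. Only process touches I/O; add_zip is the part that would
carry over unchanged to a different input and output mechanism.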
The code has been structured so that the core computation is independent of
the details of input and output. This enables you to clearly see the parts that
stay the same when it is converted to a MapReduce computation. You might
not be familiar with the Python __new__ method. It is used here so that the
ZipPoint class can be represented as a tuple: tuples are immutable, so their
contents must be supplied when the object is created rather than in __init__.
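As a minimal sketch of that pattern, the class below subclasses tuple and
overrides __new__. The constructor arguments and the zip_code attribute are
assumptions for illustration and may not match the ZipPoint class in the
listing. The argument-free super() call assumes Python 3.

```python
class ZipPoint(tuple):
    """A (lat, lng) tuple that also carries a ZIP code (illustrative only)."""

    def __new__(cls, lat, lng, zip_code):
        # Tuples are immutable, so the coordinates have to be passed to
        # tuple.__new__ at creation time; __init__ would run too late.
        point = super().__new__(cls, (lat, lng))
        # Extra data can still be attached as an ordinary attribute.
        point.zip_code = zip_code
        return point


p = ZipPoint(47.6062, -122.3321, "98101")
lat, lng = p  # unpacks like any two-element tuple
```

Because a ZipPoint is itself a tuple of coordinates, it can be handed to code
that expects plain coordinate tuples while still carrying its ZIP code along.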
Before moving on, it is important to note that the script seems simple only
because most of the complexity is hidden in the k-d tree library it uses.
This illustrates why it is sometimes necessary to perform transformations
outside of BigQuery. There are inevitably specialized algorithms that will be
difficult to implement within the BigQuery query language. Furthermore, it
is likely that implementations exist in some suitable external framework. In
these situations you need to look for the best way to make the data stored in
BigQuery accessible in the appropriate framework.