For example, you might be running a Hadoop job over your data in Google
Compute Engine, or you might be exporting analysis results into your local
MySQL database. In those cases, you don't necessarily want to wait until
every last byte is ready because it is probably going to take you a while to
consume the data. Ideally, you'd like to start processing the data as soon as
possible.
If you have a number of parallel readers (as in the Hadoop case), you can
tell BigQuery to write your data out to multiple patterns immediately. When
the destinationUris field has more than one path, the export goes into
a special “partitioned” mode, where the target file sizes are smaller and the
parallel writers each work on a separate pattern. When a writer finishes its
pattern, it writes a special zero-record file to signal completion.
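For instance, an extract configuration that triggers partitioned mode might look like the following sketch; the project, dataset, table, and bucket names here are placeholders:

extract_config = {
    'configuration': {
        'extract': {
            'sourceTable': {
                'projectId': 'my-project',
                'datasetId': 'my_dataset',
                'tableId': 'my_table',
            },
            # More than one URI pattern puts the export into
            # "partitioned" mode, with one writer per pattern.
            # BigQuery replaces each '*' with a file number as
            # that pattern's writer produces output.
            'destinationUris': [
                'gs://my-bucket/part-0-*.json',
                'gs://my-bucket/part-1-*.json',
                'gs://my-bucket/part-2-*.json',
            ],
            'destinationFormat': 'NEWLINE_DELIMITED_JSON',
        }
    }
}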
This mode can be extremely useful when you export data to use as input to
a Hadoop job. In this case, each Hadoop worker looks for a single pattern
and continues to poll for new data until it finds the 0-byte completion file.
Because GCS makes objects visible only after they have been written
completely, a reader never sees a partially written file.
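As a sketch of that per-worker polling loop, assuming a hypothetical list_partition_files() helper that returns (name, size) pairs for the objects currently matching one pattern:

import time

def read_partition(list_partition_files, handle_file, poll_seconds=5):
    '''Poll one export pattern, handing each finished file to
    handle_file, until the 0-byte completion marker appears.'''
    seen = set()
    while True:
        done = False
        for name, size in list_partition_files():
            if name in seen:
                continue
            seen.add(name)
            if size == 0:
                done = True  # 0-byte sentinel: this pattern is complete.
            else:
                handle_file(name)
        if done:
            return
        time.sleep(poll_seconds)  # No sentinel yet; poll again.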
Listing 12.3 demonstrates how you would use a partitioned export.

Listing 12.3: Parallel export readers (extract_and_partitioned_read.py)
import sys
import threading
import time
from apiclient.errors import HttpError
# Imports from local files in this directory:
from gcs_reader import GcsReader
from job_runner import JobRunner
class PartitionReader(threading.Thread):
    '''Reads output files from a partitioned BigQuery
    extract job.'''

    def __init__(self, job_runner, gcs_reader, partition_id):
        threading.Thread.__init__(self)
        # Save the extract job poller, the GCS reader, and the index
        # of the partition (URI pattern) this thread is responsible for.
        self.job_runner = job_runner
        self.gcs_reader = gcs_reader
        self.partition_id = partition_id