{and, 2}
{creeps, 1}
{day, 2}
{from, 1}
{in, 1}
{last, 1}
{of, 1}
{pace, 1}
{petty, 1}
{recorded, 1}
{syllable, 1}
{the, 1}
{this, 1}
{time, 1}
{to, 2}
{tomorrow, 3}
At first, it might seem a bit awkward to decompose your computation into
Map and Reduce phases. However, a very wide range of problems can be
broken into a series of Map and Reduce operations, and there are a number
of tools built on top of Hadoop that can help.
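As a concrete illustration, the word counts listed above can be produced by a minimal word-count job. This is a hypothetical single-process sketch, not Hadoop API code: the map phase emits a (word, 1) pair per word, the shuffle groups pairs by key, and the reduce phase sums the counts for each word.

```python
from collections import Counter
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.lower().replace(",", "").split()]

def reduce_phase(pairs):
    """Reduce (after an implicit shuffle): sum the counts per distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The Macbeth passage that yields the counts shown above.
lines = ["Tomorrow, and tomorrow, and tomorrow, creeps in this petty pace "
         "from day to day, to the last syllable of recorded time"]

mapped = chain.from_iterable(map_phase(line) for line in lines)
result = reduce_phase(mapped)
print(result["tomorrow"])  # 3
print(result["and"])       # 2
```

In a real MapReduce run the map calls execute on many machines, and the framework performs the shuffle between the two phases, but the per-record logic is exactly this simple.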
Storage System
MapReduce doesn't rely on a particular storage system or storage format the
way relational databases do. However, in practice, a standard filesystem is
not well suited to MapReduce workloads. If every worker reads from a
single disk, throughput is capped by that one disk, so processing the
data in parallel doesn't actually help.
Apache Hadoop uses a custom distributed filesystem called HDFS that is
in many ways similar to the Google File System (GFS) or the Colossus File
System used by BigQuery. Because HDFS is distributed among many nodes,
it can read in parallel and hopefully keep up with the Map and Reduce
workers.
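The benefit of a distributed filesystem can be sketched with a toy model (this is an illustration, not HDFS code): the input is pre-divided into splits stored on different nodes, each map worker reads and counts only its own split in parallel, and a merge step combines the partial results. The split boundaries and worker pool here are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Pretend each string is a file split living on a different storage node.
splits = [
    "tomorrow and tomorrow and tomorrow",
    "creeps in this petty pace from day to day",
    "to the last syllable of recorded time",
]

def map_worker(split):
    """Each worker counts words in its local split only."""
    counts = {}
    for word in split.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Workers run concurrently, each reading its own split.
with ThreadPoolExecutor(max_workers=len(splits)) as pool:
    partials = list(pool.map(map_worker, splits))

# The reduce step merges the per-split partial counts.
totals = {}
for partial in partials:
    for word, n in partial.items():
        totals[word] = totals.get(word, 0) + n
```

Because no two workers contend for the same disk, aggregate read bandwidth scales with the number of splits, which is the property HDFS and GFS are designed to provide.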
Worker Management
The final key piece of the MapReduce architecture is the worker manager,
called the master in Google's MapReduce or the JobTracker in Hadoop
(Hadoop's NameNode, by contrast, manages HDFS metadata). When
dealing with large numbers of workers (many MapReduces use hundreds or