Replication across nine disks does not make your data completely immune to loss. A buggy software release could cause data to be
inadvertently deleted from all nine of those disks. If you have critical data,
make sure to back it up.
Many organizations are understandably reluctant to move their data into the
cloud. It can be difficult to have your data in a place where you don't control
it. If there is data loss, or an outage, all you can do is take your business
elsewhere—there is no one except support staff to yell at and little you can
do to prevent the problem from happening in the future.
That said, the specialized knowledge and operational overhead required to
run your own hardware are substantial and only growing. The advantages of
scale that companies like Google and Amazon enjoy only get bigger as they
get better at managing their datacenters and improving their data-warehousing
techniques. It seems likely that the days when most companies run their own
IT hardware are numbered.
Multitenancy and Parallel Execution
When you run a query on MySQL that takes one second, you get to occupy
a single processor core for one second. If you have eight processors, you can
run eight queries at once. Amazon Redshift lets you run a single query in
parallel, but on a fixed number of cores that are all yours for the entire time
you are renting the Redshift instance.
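To make the fixed-capacity arithmetic concrete, here is a toy sketch in Python; it assumes nothing about MySQL or Redshift internals. With a pool of eight workers standing in for eight cores, only eight one-second queries make progress at once, so sixteen of them take roughly two seconds of wall time. The fake_query helper and the one-second duration are invented for illustration.

```python
# Toy model of the fixed-capacity claim above: eight workers stand in for
# eight processor cores, and time.sleep(1.0) stands in for a query that
# occupies a core for one second. Sixteen queries on eight workers finish
# in about two seconds of wall time, not sixteen.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(i: int) -> int:
    time.sleep(1.0)  # stand-in for a one-second query
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:       # "eight processors"
    results = list(pool.map(fake_query, range(16)))   # sixteen queries
print(f"wall time: {time.monotonic() - start:.1f}s")  # prints roughly 2.0s
```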
BigQuery operates on a fundamentally different model; your query will run
on thousands of cores in parallel. If you have eight queries, those may all
run on a thousand cores in parallel. The query engine will time-slice the
operations and make progress on some queries while others are waiting
for disk or network I/O. All queries perform a mix of I/O and processing;
if the engine simply waited whenever a query needed I/O, the processor
would sit idle. The Dremel
engine underlying BigQuery can maximize the throughput of the system by
pipelining queries so that as some queries are waiting for I/O operations,
other queries will use the processor.
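As a rough illustration of that pipelining idea, here is a toy, single-processor sketch in Python; it is not how Dremel is implemented. The scheduler (asyncio here) lets the two simulated queries' I/O waits overlap and hands the processor to one query's compute while the other is still waiting. The run_query and crunch names, and the one-second I/O and half-second CPU split, are invented for illustration.

```python
# Toy model of pipelining on a single processor: asyncio.sleep() stands in
# for a disk/network read, and a busy loop stands in for processing.
import asyncio
import time

def crunch(seconds: float) -> None:
    """Simulate CPU work by spinning for roughly `seconds`."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass

async def run_query(name: str, io_s: float, cpu_s: float) -> None:
    await asyncio.sleep(io_s)  # "read" phase: yields the processor while waiting on I/O
    crunch(cpu_s)              # "process" phase: this query now uses the processor
    print(f"{name} finished at t={time.monotonic() - start:.1f}s")

async def main() -> None:
    # Each query needs 1.0s of I/O and 0.5s of CPU. Run one after the other,
    # they would take about 3.0s; interleaved, about 2.0s, because the two
    # I/O waits overlap while the CPU work is still serialized on one core.
    await asyncio.gather(
        run_query("query A", io_s=1.0, cpu_s=0.5),
        run_query("query B", io_s=1.0, cpu_s=0.5),
    )

start = time.monotonic()
asyncio.run(main())
```

At the scale the text describes, the same effect plays out across thousands of cores and many concurrent queries: whenever one query stalls on I/O, another query's work fills the gap.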
By allowing your queries to run on all the hardware in a compute cluster,
you get performance far beyond what you could otherwise achieve unless
you were willing to pay for a dedicated cluster of similar size. There just isn't
another way to process hundreds of gigabytes per second without a
massive amount of hardware, and building and maintaining that yourself
would cost millions of dollars. (Licenses for on-premises solutions
like Netezza generally run in the seven figures.)