Building a Data Dashboard with Google BigQuery - Data Just Right: Introduction to Large-Scale Data and Analytics

examples of applications in the data-science domain, from machine learning to data collection, that are being developed entirely as services in the cloud. At the same time, desktop and mobile applications are increasingly being developed as Web applications; the aggregate data generated by these applications is already in the cloud. This combination of trends will result in more fully hosted and managed cloud-based analytics services.

Summary

Gaining insight from massive and growing datasets, such as those generated by large organizations, requires specialized technologies for each step in the data analysis process. Once organizational data is cleaned, merged, and shaped into the desired form, the process of asking questions about data is often an iterative one. MapReduce frameworks, such as the open-source Apache Hadoop project, are flexible platforms for the economical processing of large amounts of data using a collection of commodity machines. Although it is often the best choice for large batch-processing operations, MapReduce is not always the ideal solution for quickly running iterative queries over large datasets. MapReduce can require a great deal of disk I/O, a great deal of administration, and multiple steps to return the result of a single query. Waiting for each job to complete makes iterative, ad hoc analysis difficult.
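To make the "multiple steps" point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases for one simple question (page views per country). The record fields and sample data are hypothetical, and a real framework such as Hadoop would distribute each phase across machines and write intermediate results to disk between them.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (country, 1) pair per page-view record."""
    for record in records:
        yield record["country"], 1


def shuffle_phase(pairs):
    """Group values by key. On a real cluster this step moves data
    between machines, which is where much of the disk and network
    cost comes from."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical sample data standing in for a large log file.
page_views = [
    {"country": "US", "page": "/home"},
    {"country": "DE", "page": "/home"},
    {"country": "US", "page": "/pricing"},
]

counts = reduce_phase(shuffle_phase(map_phase(page_views)))
print(counts)
```

Even this one aggregate passes through three distinct phases; a follow-up question would mean running the whole pipeline again.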

Analytical databases are a specialized class of technologies designed for ad hoc querying over large datasets. These systems often have features meant for raw query speed, such as storing data in columnar formats, using in-memory processing, and providing access via SQL-like query languages.

Google BigQuery is an analytical database designed to run SQL-like queries over very large datasets with results returned in seconds. Unlike some analytical databases, BigQuery is completely hosted and accessed via a REST API. This enables developers to focus on building applications that ask questions about data, rather than building the infrastructure itself. Although this model presents some unique challenges, including the process of loading data into the cloud, it removes the overhead of administering a cluster of computers.
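As a rough illustration of what "accessed via a REST API" can look like in practice, the sketch below builds (but does not send) an HTTP request for a synchronous query and decodes a hand-written sample response. The endpoint path and the `schema`/`rows` response shape are assumptions based on the public BigQuery v2 REST interface; the project ID, table name, and access token are hypothetical placeholders.

```python
import json
import urllib.request


def build_query_request(project_id, sql, access_token):
    """Build a POST request for a synchronous query. Nothing is sent."""
    # Assumed endpoint layout for the BigQuery v2 REST API.
    url = f"https://www.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql, "useLegacySql": False, "timeoutMs": 10000})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Obtaining the token (via OAuth) is outside this sketch.
            "Authorization": f"Bearer {access_token}",
        },
    )


def rows_to_dicts(response):
    """Pair each cell value with its column name from the schema."""
    names = [field["name"] for field in response["schema"]["fields"]]
    return [
        dict(zip(names, (cell["v"] for cell in row["f"])))
        for row in response.get("rows", [])
    ]


# Hand-written sample in the assumed response shape; values arrive as strings.
sample_response = {
    "jobComplete": True,
    "schema": {"fields": [{"name": "country", "type": "STRING"},
                          {"name": "views", "type": "INTEGER"}]},
    "rows": [{"f": [{"v": "US"}, {"v": "2"}]},
             {"f": [{"v": "DE"}, {"v": "1"}]}],
}

request = build_query_request(
    "my-project",  # hypothetical project ID
    "SELECT country, COUNT(*) AS views FROM my_dataset.page_views GROUP BY country",
    "placeholder-token",  # hypothetical access token
)
print(request.get_method(), request.full_url)
print(rows_to_dicts(sample_response))
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without credentials or network access.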

The design of BigQuery differs fundamentally from that of relational databases and MapReduce frameworks. Data is stored not in rows but in a columnar format, so a query reads only the columns it references. BigQuery supports both flat and nested data structures. Query results are returned as JSON objects, and very large results can be materialized into a new table.
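The benefit of a columnar layout can be sketched in a few lines. This is a toy in-memory illustration with hypothetical field names, not a description of BigQuery's actual storage engine: once values are grouped by column, an aggregate over one field touches only that field's list.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user": "ann", "country": "US", "latency_ms": 120},
    {"user": "bob", "country": "DE", "latency_ms": 80},
    {"user": "cy", "country": "US", "latency_ms": 100},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def average(column_store, name):
    """Aggregate a single column without reading any other column."""
    values = column_store[name]
    return sum(values) / len(values)


columns = to_columns(rows)
avg_latency = average(columns, "latency_ms")
print(avg_latency)
```

In the row layout, the same average would require visiting every record and skipping past the `user` and `country` fields; with many wide rows, that extra reading dominates the cost of the query.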

Because BigQuery uses a REST-based API, it is useful for building interactive tools such as online dashboards. Applications built with the BigQuery API require users to authorize access to their data using a protocol called OAuth, which spares users from having to share their passwords with the application.
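The first step of an OAuth flow is sending the user to a consent page, which is why the application never sees a password. The sketch below only constructs that URL. The endpoint and scope strings are assumptions taken from Google's public OAuth 2.0 documentation rather than from this text, and the client ID, redirect URI, and state value are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed values; check them against the current Google documentation.
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"
BIGQUERY_SCOPE = "https://www.googleapis.com/auth/bigquery"


def build_consent_url(client_id, redirect_uri, state):
    """Return the URL the user visits to grant the application access."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",  # ask for an authorization code
        "scope": BIGQUERY_SCOPE,
        "state": state,  # opaque value echoed back to guard against CSRF
    }
    return f"{AUTH_ENDPOINT}?{urlencode(params)}"


consent_url = build_consent_url(
    client_id="hypothetical-client-id.apps.googleusercontent.com",
    redirect_uri="https://dashboard.example.com/oauth2callback",
    state="random-state-123",
)
print(consent_url)

# The query string round-trips, so the parameters are encoded correctly.
parsed = parse_qs(urlparse(consent_url).query)
print(parsed["scope"])
```

After the user approves, the provider redirects back with a short-lived code that the application exchanges for an access token; that exchange is omitted here.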

Although a hosted, in-memory analytical database such as BigQuery is an excellent choice for building or incorporating large-scale data analysis into applications, it's not the best tool for every data processing need. For long-running batch-processing tasks
