Database Reference
examples of applications in the data-science domain, from machine learning to data
collection, that are being developed entirely as services in the cloud. At the same time,
desktop and mobile applications are increasingly being developed as Web applications;
the aggregate data generated by these applications is already in the cloud. This
combination of trends will result in more fully hosted and managed cloud-based analytics
services.
Summary
Gaining insight from massive and growing datasets, such as those generated by large
organizations, requires specialized technologies for each step in the data analysis
process. Once organizational data is cleaned, merged, and shaped into the form desired,
the process of asking questions about data is often an iterative one. MapReduce
frameworks, such as the open-source Apache Hadoop project, are flexible platforms for
the economical processing of large amounts of data using a collection of commodity
machines. Although it is often the best choice for large batch-processing operations,
MapReduce is not always the ideal solution for quickly running iterative queries over
large datasets. MapReduce can require a great deal of disk I/O, a great deal of
administration, and multiple steps to return the result of a single query. Waiting for
each query to finish makes iterative, ad hoc analysis difficult.
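To make the multi-step nature of MapReduce concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for this example, not part of Hadoop's API, and a real framework would distribute each phase across many machines and write intermediate results to disk.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (word, 1) pair for every word in every input record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    # Three separate stages are chained to answer one question.
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(word_count(logs))
```

Even in this toy form, a single question passes through three stages. Asking a slightly different question means running the whole pipeline again, which is why iterative exploration with this model tends to be slow.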
Analytical databases are a specialized class of technologies designed for ad hoc
querying over large datasets. These systems often have features meant for raw query speed,
such as storing data in columnar formats, using in-memory processing, and providing
access via SQL-like query languages.
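The benefit of a columnar layout can be sketched in a few lines of Python. The example below is illustrative only: the sample table, its field names, and the helper functions are hypothetical, and real analytical databases add compression, encoding, and parallel scans on top of this basic idea.

```python
# Hypothetical sample table in row-oriented form: one dict per record.
ROWS = [
    {"user": "ann", "country": "US", "bytes": 120},
    {"user": "bob", "country": "DE", "bytes": 300},
    {"user": "cy", "country": "US", "bytes": 80},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(rows, field):
    # Row layout: every full record is visited just to read one field.
    return sum(row[field] for row in rows)


def total_from_columns(columns, field):
    # Columnar layout: only the requested column is touched.
    return sum(columns[field])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_from_rows(ROWS, "bytes"), total_from_columns(columns, "bytes"))
```

Both functions return the same total. The difference is how much data each must read to get there, which matters a great deal once the table has hundreds of columns and billions of rows.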
Google BigQuery is an analytical database designed to run SQL-like queries over
very large datasets with results returned in seconds. Unlike some analytical databases,
BigQuery is completely hosted and accessed via a REST API. This enables developers
to focus on building applications that ask questions about data, rather than building the
infrastructure itself. Although this model presents some unique challenges, including
the process of loading data into the cloud, it removes the overhead of administering a
cluster of computers.
The design of BigQuery differs fundamentally from relational databases and
MapReduce frameworks. Data is stored not in rows but in a columnar format, so a
query reads only the columns it references rather than entire rows. BigQuery
supports both flat and nested data structures. Query results are returned as
JSON objects, and very large results can be materialized into a new table.
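As a rough illustration of working with JSON query results, the sketch below turns a response into a list of Python dictionaries. The payload is a hand-written sample, not real query output, trimmed to the `schema` and `rows` fields of the BigQuery v2 query response as best understood here; the helper name `rows_as_dicts` is hypothetical, and the field names should be checked against the current API documentation.

```python
import json

# Hand-written sample payload for illustration (not real query output).
SAMPLE_RESPONSE = """
{
  "jobComplete": true,
  "schema": {"fields": [
    {"name": "word", "type": "STRING"},
    {"name": "total", "type": "INTEGER"}
  ]},
  "rows": [
    {"f": [{"v": "hamlet"}, {"v": "5318"}]},
    {"f": [{"v": "king"}, {"v": "4671"}]}
  ]
}
"""


def rows_as_dicts(response):
    """Pair each cell value with its column name taken from the schema."""
    names = [field["name"] for field in response["schema"]["fields"]]
    return [
        dict(zip(names, (cell["v"] for cell in row["f"])))
        for row in response.get("rows", [])
    ]


if __name__ == "__main__":
    for record in rows_as_dicts(json.loads(SAMPLE_RESPONSE)):
        print(record)
```

In this sample the cell values are strings even for the INTEGER column, so an application would typically convert them according to the declared type in the schema.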
Because BigQuery uses a REST-based API, it is useful for building interactive tools
such as online dashboards. Applications built with the BigQuery API require users to
authorize access to their data using a protocol called OAuth, which spares users from
having to share their passwords.
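The sketch below shows roughly what an authorized query request might look like when built with Python's standard library. It constructs the request but does not send it. The project ID and access token are placeholders, the helper name `build_query_request` is hypothetical, and the endpoint path follows the BigQuery v2 REST API's query method as best understood here, so it should be verified against the current documentation. Obtaining the token through an OAuth flow is outside the scope of this example.

```python
import json
import urllib.request

# Placeholders for illustration only; supply real values in practice.
PROJECT_ID = "my-project"
ACCESS_TOKEN = "PLACEHOLDER_TOKEN"


def build_query_request(project_id, access_token, sql):
    """Build (but do not send) an authorized query request."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/queries"
    )
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # The OAuth access token is sent instead of a user password.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_query_request(PROJECT_ID, ACCESS_TOKEN, "SELECT 1")
    print(request.get_method(), request.full_url)
```

Because the application only ever holds a token that the user has granted, and that the user can later revoke, the user's password never passes through the application.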
Although a hosted, in-memory analytical database such as BigQuery is an excellent
choice for building or incorporating large-scale data analysis into applications, it's not
the best tool for every data processing need. For long-running batch-processing tasks
 
 