existing operational data into a format that makes it fast to query. A common goal of
OLAP systems is to shape data in a way that avoids excessive JOIN queries, which are
typically slow over large datasets in relational databases. To do this, some data is
pulled, or extracted, into a new schema in the OLAP system. This entire process is
complex and time consuming, but once it is done, it enables analysts to run faster
queries over data at the expense of flexibility. When a new type of query is required,
the process of building new OLAP schemas may need to be repeated.
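To make the extraction step concrete, the following is a minimal sketch in Python using SQLite. The table and column names (customers, orders, sales_cube) are hypothetical, and a real OLAP pipeline involves far more transformation than a single pre-joined table; the point is only to show operational data being reshaped so that later analytical queries avoid JOINs.

import sqlite3

# Hypothetical example: extract normalized operational data into a
# denormalized analytical table so analysts can query it without JOINs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Operational (normalized) schema: facts and dimensions in separate tables.
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (1, 1, 120.0, '2012-01-05'),
                              (2, 2, 80.0, '2012-01-06');
""")

# Analytical schema: one wide, pre-joined table built once during extraction.
cur.executescript("""
    CREATE TABLE sales_cube (region TEXT, order_date TEXT, amount REAL);
    INSERT INTO sales_cube
        SELECT c.region, o.order_date, o.amount
        FROM orders o JOIN customers c ON o.customer_id = c.id;
""")

# Analysts now query the flat table directly, with no JOIN required.
for row in cur.execute(
        "SELECT region, SUM(amount) FROM sales_cube GROUP BY region"):
    print(row)

The trade-off described above is visible even in this toy version: the aggregate query is simple and fast, but any question the sales_cube table was not designed to answer requires building and populating a new schema.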
The OLAP concept has proven useful and has spawned an industry of large enter-
prise vendors, including stalwarts such as Oracle and Microsoft. However, some tech-
nologies take a different approach. One class of software, broadly known as analytical
databases, dispenses with the relational model. Using innovations in on-disk storage
and distributed memory, analytical databases combine the flexibility of SQL with the
speed of traditional OLAP systems.
Dremel: Spreading the Wealth
In 2004, Google released a paper describing its MapReduce framework for processing
data. This paper was a milestone in the movement toward greater accessibility of
large-scale data processing. MapReduce is a general programming model for distributing
the processing of data over clusters of readily available commodity machines. The
MapReduce concept, in conjunction with a scalable distributed filesystem called GFS,
helped enable Google to index the public Internet. Constant drops in hardware costs,
along with Google's success in the search industry, inspired computer scientists to
spend more time thinking about using distributed systems of commodity hardware for
data processing. Engineers at Yahoo! added capabilities inspired by the MapReduce
paper to process data collected by the Apache Nutch Web crawler. Ultimately, the work
on Nutch spawned the Hadoop project, which has since become the shining star of the
open-source Big Data world. Hadoop is a horizontally scalable, fault-tolerant
framework that enables developers to write their own MapReduce applications.
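As a rough illustration of the programming model (not Hadoop's actual Java API), here is a minimal word-count sketch in Python: a map step that emits key/value pairs and a reduce step that combines all values sharing a key. On a real cluster the framework shuffles intermediate pairs between machines; the in-process grouping below is only a stand-in for that step.

from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce step: combine all counts emitted for a single word."""
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate values by key (handled by the framework
# in a real MapReduce system such as Hadoop).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))  # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]

Because the map and reduce functions operate on independent pieces of data, the same program can be spread across many commodity machines, which is what makes the model attractive for batch processing at scale.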
One of the biggest advantages of the MapReduce concept is that a variety of large
data-processing tasks can be completed in an acceptably short time using relatively
inexpensive hardware. Depending on the type of job, tasks that were previously
impossible to run on single machines could be completed in hours or even minutes.
The Hadoop project has gained quite a lot of media attention, and for good reason.
One drawback of this attention is that some people new to the field of data analytics
may consider Hadoop and MapReduce to be synonymous with Big Data. Despite the success
of the Hadoop community, some algorithms simply don't translate well into distributed
MapReduce jobs. Although Hadoop is the open-source champion of batch processing, the
MapReduce concept is not necessarily the best approach for ad hoc query tasks.
Questioning data is often an iterative process. The answers from one question
might inspire a new question. Although MapReduce, along with tools such as Apache