existing operational data into a format that makes it fast to query. A common goal of
OLAP systems is to shape data in a way that avoids excessive JOIN queries, which are
typically slow over large datasets in relational databases. To do this, some data is
pulled, or extracted, into a new schema in the OLAP system. This entire process is
complex and time consuming, but once it is done, it enables analysts to run faster
queries over data at the expense of flexibility. When a new type of query is required,
the process of building new OLAP schemas may need to be repeated.
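To make the extraction step concrete, the following is a minimal sketch in Python using SQLite. The table and column names (customers, orders, sales_cube) are hypothetical, and a real OLAP pipeline involves far more transformation than a single pre-joined table; the point is only to show operational data being reshaped so that later analytical queries avoid JOINs.

import sqlite3

# Hypothetical example: extract normalized operational data into a
# denormalized analytical table so analysts can query it without JOINs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Operational (normalized) schema: facts and dimensions in separate tables.
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL, order_date TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO orders VALUES (1, 1, 120.0, '2012-01-05'),
                              (2, 2, 80.0, '2012-01-06');
""")

# Analytical schema: one wide, pre-joined table built once during extraction.
cur.executescript("""
    CREATE TABLE sales_cube (region TEXT, order_date TEXT, amount REAL);
    INSERT INTO sales_cube
        SELECT c.region, o.order_date, o.amount
        FROM orders o JOIN customers c ON o.customer_id = c.id;
""")

# Analysts now query the flat table directly, with no JOIN required.
for row in cur.execute(
        "SELECT region, SUM(amount) FROM sales_cube GROUP BY region"):
    print(row)

The trade-off described above is visible even in this toy version: the aggregate query is simple and fast, but any question the sales_cube table was not designed to answer requires building and populating a new schema.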
The OLAP concept has proven useful and has spawned an industry of large enter-
prise vendors, including stalwarts such as Oracle and Microsoft. However, some tech-
nologies take a different approach. One class of software, broadly known as analytical
databases, dispenses with the relational model. Using innovations in on-disk storage
and distributed memory, analytical databases combine the flexibility of SQL with the
speed of traditional OLAP systems.
Dremel: Spreading the Wealth
In 2004, Google released a paper describing its MapReduce framework for processing
data. This paper was a milestone in the movement toward greater accessibility of
large-scale data processing. MapReduce is a general programming model for distributing
the processing of data over clusters of readily available commodity machines. The
MapReduce concept, in conjunction with a scalable distributed filesystem called GFS,
helped enable Google to index the public Internet. Constant drops in hardware costs,
along with Google's success in the search industry, inspired computer scientists to
spend more time thinking about using distributed systems of commodity hardware for
data processing. Engineers at Yahoo! added capabilities inspired by the MapReduce
paper to process data collected by the Apache Nutch Web crawler. Ultimately, the work
on Nutch spawned the Hadoop project, which has since become the shining star of the
open-source Big Data world. Hadoop is a horizontally scalable, fault-tolerant
framework that enables developers to write their own MapReduce applications.
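As a rough illustration of the programming model (not Hadoop's actual Java API), here is a minimal word-count sketch in Python: a map step that emits key/value pairs and a reduce step that combines all values sharing a key. On a real cluster the framework shuffles intermediate pairs between machines; the in-process grouping below is only a stand-in for that step.

from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in a document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce step: combine all counts emitted for a single word."""
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle: group intermediate values by key (handled by the framework
# in a real MapReduce system such as Hadoop).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

results = [reduce_phase(word, counts) for word, counts in grouped.items()]
print(sorted(results))  # e.g. [('brown', 1), ('dog', 1), ('fox', 2), ...]

Because the map and reduce functions operate on independent pieces of data, the same program can be spread across many commodity machines, which is what makes the model attractive for batch processing at scale.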
One of the biggest advantages of the MapReduce concept is that a variety of large
data-processing tasks can be completed in an acceptably short time using relatively
inexpensive hardware. Depending on the type of job, tasks that were previously
impossible to run on single machines could be completed in hours or even minutes.
The Hadoop project has gained quite a lot of media attention, and for good reason.
One drawback of this attention is that some people new to the field of data analytics
may consider Hadoop and MapReduce to be synonymous with Big Data. Despite the success
of the Hadoop community, some algorithms simply don't translate well into distributed
MapReduce jobs. Although Hadoop is the open-source champion of batch processing, the
MapReduce concept is not necessarily the best approach for ad hoc query tasks.
Questioning data is often an iterative process. The answers from one question
might inspire a new question. Although MapReduce, along with tools such as Apache