and processing tasks across a scalable number of separate machines or virtual servers.
The ecosystem of tools built with Hadoop can provide a very favorable value
proposition depending on the use case. For some applications, Hadoop enables great
performance per dollar for data-processing tasks. For others, Hadoop can be the only
way to accomplish some warehousing and querying tasks economically. Others see Hadoop
as a promising technology that lacks the enterprise features necessary to merit
investment.
Consider the use of Hadoop with the open-source Hive package as a data-warehousing
solution. Hadoop excels at flexibility, but sometimes this comes at the cost of
performance for specific applications. Hive enables users to turn SQL-like queries on
data stored in a Hadoop cluster into MapReduce jobs that return the query result.
Although MapReduce can be flexible for expressing many different kinds of
data-transformation and processing tasks, it is not always the most efficient
architecture for running aggregate queries. Furthermore, the Hadoop ecosystem
currently lacks many of the enterprise features found in traditional data warehouse
solutions, such as reliability and failover, automated backups, and interoperability
with existing filesystems.
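To make the idea of running an aggregate query as a MapReduce job concrete, here is a minimal single-process sketch in Python. It imitates the map, shuffle, and reduce phases for a query along the lines of SELECT region, SUM(amount) FROM sales GROUP BY region. The table, the column names, and the function names are all hypothetical, and this is not the execution plan Hive actually generates; it only illustrates the shape of the computation.

```python
from collections import defaultdict

# Hypothetical sample rows; on a real cluster these would be files in HDFS.
SALES = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
    {"region": "west", "amount": 1},
]


def map_phase(rows):
    """Emit one (key, value) pair per input row."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single aggregate."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    """Chain the three phases to answer the GROUP BY / SUM query."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(SALES))
```

Even in this toy version, every row is read in full and every key passes through a grouping step before any sum is produced, which hints at why a general-purpose model is not always the fastest route to a simple aggregate.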
In other words, Hadoop's MapReduce-based processing model has been overloaded
to address data problems that might better be solved in other ways. Even so,
Hadoop-based data tools are also benefitting from the feature convergence that we
will discuss later in this chapter. For example, Facebook, Hortonworks, and other
companies are sponsoring projects to speed up Hive queries.
However, users are starting to look at other data technologies that don't depend
specifically on the Hadoop framework. Consider the growth of new analytical
databases, designed specifically to provide very fast aggregate query results over
large databases. Often these analytical tools use columnar data structures along
with distributed, in-memory processing that completely sidesteps the MapReduce
paradigm. These include projects inspired by Google's Dremel, such as Cloudera's
Impala and MapR's Drill.
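The appeal of a columnar layout for aggregate queries can be sketched in a few lines of Python. The example below stores the same hypothetical records two ways: as a list of rows and as one list per column. It is a toy illustration under assumed names, not how Impala, Drill, or Dremel store data internally; real systems add compression, encoding, and distribution across machines.

```python
# Hypothetical records in a row-oriented layout: one dict per record.
ROWS = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 5},
    {"user": "c", "country": "US", "amount": 7},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


COLUMNS = to_columns(ROWS)


def sum_row_store(rows, field):
    """Row layout: every record is visited to pull out one field."""
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    """Column layout: only the requested column is read."""
    return sum(columns[field])


if __name__ == "__main__":
    print(sum_row_store(ROWS, "amount"))
    print(sum_column_store(COLUMNS, "amount"))
```

Both functions return the same total, but the columnar version reads only the values the query needs, which is the property that analytical databases exploit when a table has many columns and a query aggregates just one or two of them.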
Using the ecosystem of tools built on top of Hadoop as solutions to data challenges
can be both a great enabler and a potential dead end. In any case, one of the amazing
results of the popularity of Hadoop is that it has changed the conversation about the
accessibility of large-scale data processing. Businesses no longer have an excuse for not
being able to store and process massive amounts of data, and entrenched database
vendors are starting to pay attention to the groundswell. In terms of technology culture,
Hadoop has empowered users to gain access to some of the technology previously
available only to large Internet companies or huge organizations with a great deal of
resources.
Everything in the Cloud
I've often met people who consider Internet applications, such as Web-based email,
as just another reincarnation of the days when it was commonplace for users to share
 
 