The Future: Trends in Data Technology - Data Just Right: Introduction to Large-Scale Data and Analytics

and processing tasks across a scalable number of separate machines or virtual servers. The ecosystem of tools built with Hadoop can provide a very favorable value proposition depending on the use case. For some applications, Hadoop enables great performance per dollar for data-processing tasks. For others, Hadoop can be the only way to accomplish some warehousing and querying tasks economically. Others see Hadoop as a promising technology that lacks the enterprise features necessary to merit investment.

Consider the use of Hadoop with the open-source Hive package as a data-warehousing solution. Hadoop excels at flexibility, but sometimes this comes at the cost of performance for specific applications. Hive enables users to turn SQL-like queries on data stored in a Hadoop cluster into MapReduce jobs that return the query result. Although MapReduce can be flexible for expressing many different kinds of data-transformation and processing tasks, it is not always the most efficient architecture for running aggregate queries. Furthermore, the Hadoop ecosystem currently lacks many of the enterprise features found in traditional data warehouse solutions, such as reliability and failover, automated backups, and interoperability with existing filesystems.
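
To make the translation from a SQL-like query to MapReduce concrete, here is a minimal sketch in plain Python (not Hive or Hadoop code) of how an aggregate query such as SELECT country, COUNT(*) FROM visits GROUP BY country can be expressed as map, shuffle, and reduce phases. The table name, the column names, the sample rows, and the function names are all hypothetical, and the query plans Hive actually generates are considerably more involved.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table stored in a cluster.
VISITS = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "DE"},
    {"user": "c", "country": "US"},
    {"user": "d", "country": "JP"},
    {"user": "e", "country": "DE"},
    {"user": "f", "country": "US"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, as a mapper would."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_country(VISITS))
```

Even for this simple count, every row passes through a map step, a grouping step, and a reduce step, which hints at why a general-purpose model is not always the most efficient way to answer aggregate queries.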

In other words, Hadoop's MapReduce-based processing model has been overloaded to address data problems that might better be solved in other ways. This does not mean that Hadoop-based data tools are failing to benefit from the feature convergence that we will discuss later in this chapter. For example, Facebook, Hortonworks, and other companies are sponsoring projects to improve the performance of Hive queries.

However, users are starting to take a look at other data technologies that don't depend specifically on the Hadoop framework. Consider the growth of new analytical databases, designed specifically to provide very fast aggregate query results over large databases. Often these analytical tools use columnar-based data structures along with distributed, in-memory processing that completely sidesteps the MapReduce paradigm. Examples are projects inspired by Google's Dremel, such as Cloudera's Impala and MapR's Drill.
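
As a rough illustration of why columnar data structures suit aggregate queries, the following sketch stores the same hypothetical records in a row-oriented and a column-oriented layout and computes the same total from each. It is a toy, single-machine example in plain Python and does not reflect how Dremel, Impala, or Drill are implemented. The record fields and function names are assumptions made for the example.

```python
# Hypothetical sales records in a row-oriented layout: one dict per record.
ROWS = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.0},
    {"order_id": 3, "region": "east", "amount": 45.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def total_amount_rows(rows):
    """Row layout: visit every record, even though only one field is used."""
    return sum(row["amount"] for row in rows)


def total_amount_columns(columns):
    """Column layout: read only the single column the aggregate needs."""
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_amount_rows(ROWS), total_amount_columns(columns))
```

In the column layout the aggregate reads one contiguous list and never looks at order_id or region. On large tables with many columns, reading only the needed column is the property that analytical databases exploit.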

Using the ecosystem of tools built on top of Hadoop as solutions to data challenges can be both a great enabler and a potential dead end. In any case, one of the amazing results of the popularity of Hadoop is that it has changed the conversation about the accessibility of large-scale data processing. Businesses no longer have an excuse for not being able to store and process massive amounts of data, and entrenched database vendors are starting to pay attention to the groundswell. In terms of technology culture, Hadoop has empowered users to gain access to some of the technology previously available only to large Internet companies or huge organizations with a great deal of resources.

Everything in the Cloud

I've often met people who consider Internet applications, such as Web-based email,

as just another reincarnation of the days when it was commonplace for users to share
