Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets - Data Just Right: Introduction to Large-Scale Data and Analytics

It's not enough to provide just a distributed in-memory environment; we need practical tools like Hive to make the platform more accessible. Of course, a distributed data warehousing solution built using Spark absolutely must be named after an animal; it's almost a hard-and-fast rule of distributed data applications. The codebase of Hive was extended to run on the Spark platform, and the result is called Shark.

Because it is based on Hive, Shark is inherently easy for the data application developer to use. Shark works well with existing Hadoop and Hive instances, and just like Hive it can access HBase tables as well. In fact, it's easy to get a Shark instance running. Some users have even reported success running Shark queries from external applications such as Tableau, using existing tools such as the Hive ODBC driver.

Shark is an excellent choice for many of the ad hoc queries for which Hive would normally be used, but as with other in-memory data technologies, its performance depends on the amount of memory available in the cluster. Even when Shark must read from disk, some users have reported that query performance still beats a comparable Hive installation. A separate drawback of using very new technologies such as Shark in production is that they lack the tool ecosystems and developer communities of more mature projects.

From a practical standpoint, for long-running MapReduce jobs that process more data than fits in available memory, Hadoop is still the right tool. Using Shark and Hadoop in conjunction might provide the best of both worlds: a disk-based batch-processing tool for transforming large amounts of data and an in-memory query engine for analysis.
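
The rule of thumb above can be written down as a small decision function. This is only an illustrative sketch: the `Workload` class, its field names, the `choose_engine` function, and the sizes used are all hypothetical and are not part of Hadoop, Hive, or Shark. It assumes that total cluster memory is a reasonable stand-in for "available memory."

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a job; field names are illustrative only."""
    input_bytes: int     # total data the job must read
    long_running: bool   # True for batch transforms, False for ad hoc queries


def choose_engine(workload: Workload, cluster_memory_bytes: int) -> str:
    """Apply the rule of thumb from the text.

    Long-running jobs whose input does not fit in memory go to the
    disk-based batch tool; everything else goes to the in-memory engine.
    """
    fits_in_memory = workload.input_bytes <= cluster_memory_bytes
    if workload.long_running and not fits_in_memory:
        return "hadoop"
    return "shark"


GB = 1024 ** 3
nightly_transform = Workload(input_bytes=500 * GB, long_running=True)
ad_hoc_query = Workload(input_bytes=20 * GB, long_running=False)

print(choose_engine(nightly_transform, cluster_memory_bytes=64 * GB))  # hadoop
print(choose_engine(ad_hoc_query, cluster_memory_bytes=64 * GB))       # shark
```

In practice the boundary is softer than this sketch suggests, since Shark can still read from disk when data does not fit in memory.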

Another open-source implementation of a fast query engine on top of Hadoop is Impala. Impala is very different from Hive or Shark, but it covers similar use cases. Unlike Hive or Shark, Impala shares many design characteristics with Google BigQuery, which we will discuss in Chapter 6, “Building a Data Dashboard with Google BigQuery.”

Throughout this chapter, we've discussed using Hive in the context of an existing Hadoop installation. Large-scale data analytics and cloud computing have grown together. By necessity, an entire industry of compute clouds and virtual servers for hire has appeared.

Although completely managed Hadoop systems are becoming available, distributed systems inherently require some type of administration. Some products are taking the virtual, distributed data warehouse idea a step further, providing a fully managed solution. One technology that has become popular in this space is Amazon's Redshift. Unlike Hive, Redshift isn't a Hadoop-based product; because it is based on PostgreSQL, it has more in common with relational data warehouses.

From a practical standpoint, there are many good reasons to consider using cloud-based solutions for data warehousing. As more and more applications move to the Web, data is already being hosted in the cloud. More importantly, as there are no upfront
