It's not enough to provide just a distributed in-memory environment; we need
practical tools like Hive to make the platform more accessible. Of course, a distributed
data warehousing solution built using Spark absolutely must be named after an animal;
it's almost a hard and fast rule of distributed data applications. The codebase of Hive
was extended to run on the Spark platform, and the result is called Shark.
Because it is based on Hive, Shark is inherently easy for data application
developers to use. Shark works well with existing Hadoop and Hive instances, and just
like Hive it can access HBase tables as well. In fact, it's easy to get a Shark instance
running. Some users have even reported success running Shark queries from external
applications such as Tableau, using existing tools such as the Hive ODBC driver.
Shark is an excellent choice for many ad hoc queries for which Hive would normally
be used, but as with other in-memory data technologies, performance depends on the
amount of available memory in the cluster. Even when Shark must access a disk, some
users have reported that query performance still beats a similar Hive installation.
Another drawback to using very new technologies such as Shark in production is that
they lack the tool ecosystems and developer communities of more mature projects.
From a practical standpoint, for long-running MapReduce jobs that process more
data than fits in available memory, Hadoop is still the right tool. Using Shark and
Hadoop in conjunction might provide the best of both worlds: a disk-based
batch-processing tool for transforming large amounts of data and an in-memory query
engine for analysis.
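The rule of thumb above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `Job` class, the `choose_engine` function, and the `headroom` safety factor are hypothetical names invented for this sketch and are not part of Shark, Hive, or Hadoop.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical description of a workload (not a Shark or Hadoop type)."""
    name: str
    input_gb: float     # estimated size of the data the job must process
    long_running: bool  # True for batch transforms, False for ad hoc queries


def choose_engine(job: Job, cluster_memory_gb: float, headroom: float = 0.8) -> str:
    """Return "hadoop" or "shark" following the rule of thumb in the text.

    headroom is an assumed safety factor: only that fraction of cluster
    memory is treated as available for holding data.
    """
    usable_gb = cluster_memory_gb * headroom
    # Long-running jobs over more data than fits in memory go to Hadoop.
    if job.long_running and job.input_gb > usable_gb:
        return "hadoop"
    # Everything else is sent to the in-memory query engine; this is a
    # simplification, since the text does not cover every combination.
    return "shark"


if __name__ == "__main__":
    jobs = [
        Job("nightly_transform", input_gb=5000.0, long_running=True),
        Job("ad_hoc_report", input_gb=40.0, long_running=False),
    ]
    for job in jobs:
        print(job.name, "->", choose_engine(job, cluster_memory_gb=256.0))
```

In practice the memory estimate would come from cluster monitoring rather than a fixed number, and the cutoff would be tuned per workload.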
Another open-source implementation of a fast query engine on top of Hadoop is
Impala. Impala is very different from Hive or Shark, but it covers similar use cases.
Unlike Hive or Shark, Impala shares many design characteristics with Google
BigQuery, which we will discuss in Chapter 6, “Building a Data Dashboard
with Google BigQuery.”
Data Warehousing in the Cloud
Throughout this chapter, we've discussed using Hive in the context of an existing
Hadoop installation. Large-scale data analytics and cloud computing have grown
together. By necessity, an entire industry of compute clouds and virtual servers for hire
has appeared.
Although completely managed Hadoop systems are becoming available, distributed
systems inherently require some type of administration. Some products are taking
the virtual, distributed data warehouse idea a step further, providing a fully managed
solution. One technology that has become popular in this space is Amazon's Redshift.
Unlike Hive, Redshift isn't a Hadoop-based product; because it is based on
PostgreSQL, it has more in common with relational data warehouses.
From a practical standpoint, there are a lot of great reasons to consider using
cloud-based solutions for data warehousing. As more and more applications move to the
Web, data is already being hosted in the cloud. More importantly, as there are no upfront
 
 