Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets - Data Just Right: Introduction to Large-Scale Data and Analytics

that can be compiled and used in queries. Because Hive is popular, a large ecosystem of tools is available for it, including command-line tools, the Hive Web interface, and various connectors, such as JDBC drivers, that provide access from external software.

Hive is not the only distributed data warehousing solution. The AMPLab Shark project extends the Hive codebase to operate over data using the Spark distributed processing engine. Shark's in-memory model enables queries to return results far faster than a typical Hive query. Shark can be used in conjunction with existing Hadoop clusters. Although Shark is relatively new, it is becoming more popular as a replacement for Hive when interactive ad hoc querying is necessary.

Hive is a popular choice for users who need to ask questions about datasets that are too large to be handled by relational databases or are relatively unstructured. Hive is also useful for datasets that are constantly growing, because it scales well across many machines, a situation in which other approaches may become costly. In addition, Hive is a good complement to existing Hadoop installations, giving nondeveloper analysts access to data that would otherwise require complicated code to query.
