Using Hadoop, Hive, and Shark to Ask Questions about Large Datasets - Data Just Right: Introduction to Large-Scale Data and Analytics

Database Reference

In-Depth Information

/* Create an external table. Deleting this table will leave the

data intact but remove the table's metadata. */

CREATE EXTERNAL TABLE employee_ids (name STRING, id INT)

LOCATION '/external/employee_ids';

LOAD DATA INPATH '/users/ids.csv' INTO TABLE employee_ids;

Querying Data: HiveQL

Now that we've created tables using Hive to keep track of our input files, we can

start asking questions about the data. Hive provides an SQL-like language known as

HiveQL, which enables functionality (such as GROUP BY , JOIN s, and HAVING) famil-

iar to users of standard SQL.

HiveQL can't express every type of query that is possible with standard SQL. The

differences between standard SQL and HiveQL mostly stem from the differences

between the design of relational databases and the MapReduce framework. As Hive is

not meant to do appends to existing datasets, transactions are not supported, nor are

materialized views. However, Hive is able to take advantage of some of the unique

benefits of MapReduce. One of the most useful is the multiple insertion feature, which

enables Hive queries to produce multiple output tables from a single query (rather than

just one, which is the normal result of an SQL query). This is more efficient than run-

ning separate queries, as Hive will only scan the source table once during a multiple

insertion. An example of a multiple-insertion query can be seen in Listing 5.2.

Listing 5.2 Using a multiple-insertion query in Hive

FROM bookstore

INSERT OVERWRITE TABLE book

SELECT author, SUM(revenue) AS total

WHERE category='book' GROUP BY author

INSERT OVERWRITE TABLE comics

SELECT author, SUM(revenue) AS total

WHERE category='comic' GROUP BY author;

Another difference between the standard SQL of a relational database and HiveQL

is the absence of inequality JOIN s. The MapReduce model is not well-designed to

compare keys that do not match. Therefore, it is only possible with Hive to run JOIN

queries between two tables when a particular key equals another.

These examples merely scratch the surface of what is possible with Hive. For prac-

tical explorations of asking questions about datasets, it's possible to run Hadoop and

Hive on a single machine, using either the local filesystem or a local Hadoop filesys-

tem for testing. Scaling the system up to use an external MySQL database to act as a

remote Hive metastore is slightly more work (mostly due to configuring proper per-

missions on the MySQL database).

The ease of setting up Hive on a small installation makes it a tempting choice for

small datasets, but if your data sizes are such that a single server will suffice, then it is

Search WWH ::

Custom Search

Home