always a better option to use a relational database to ask questions about your data. On
its own, Hive can speed up the process of querying very large datasets by providing an
SQL-like interface on top of the MapReduce paradigm.
Optimizing Hive Query Performance
The MapReduce framework is designed to spread data processing tasks across many
machines, making large-scale data solutions economically feasible through parallelism.
Although MapReduce is a great model for transforming a huge batch of raw files in a
timely manner, query performance can be slow. Using Hive's EXPLAIN statement, you
can see that queries often require multiple MapReduce stages, each of which is further
slowed by many disk-access events.
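As a quick illustration, prefixing any query with EXPLAIN prints the plan Hive generates without running the query (the table and column names here are hypothetical, for illustration only):

```sql
-- Show the stages Hive plans for a query without executing it;
-- the output lists each MapReduce stage and its dependencies.
-- (Table and column names are illustrative.)
EXPLAIN
SELECT name, COUNT(*)
FROM text_table
GROUP BY name;
```

A query whose plan shows several chained stages is a good candidate for the optimizations described below.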
The first step in improving performance is to restrict the amount of data necessary
to provide a query result. An initial way to do this is to use Hive's partitioning
function. When data is partitioned, Hive will only look at the partition requested rather
than all the data in the entire table. It is also possible to build indexes on columns in
Hive, which can further reduce the amount of data scanned by some queries.
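A minimal sketch of partitioning might look like the following; the table, column, and partition names are assumptions chosen for illustration, not taken from the original example:

```sql
-- Declare a table partitioned by date; rows for each date
-- are stored in a separate directory in HDFS.
CREATE TABLE events (
  name string
)
PARTITIONED BY (event_date string);

-- Because the WHERE clause names the partition column,
-- Hive reads only the matching partition, not the whole table.
SELECT name
FROM events
WHERE event_date = '2013-01-15';
```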
Another optimization is to use a file format that is most efficient for the types of
queries that you are interested in running. When Hadoop stores intermediate processing
files in HDFS, it natively uses a format called Sequence Files, which features a
key-value structure in which each key points to a single row of data. When using raw
text files or sequence files with Hive, the entire row of data must be accessed every
time a query is run. These file types are not the best format for running fast queries
that only require a few columns of data. However, a format called RCFile makes it
possible for Hive to only access the columns necessary to provide the query result.
One simple way to create an RCFile from original flat data is to populate a new
RCFile table using a SELECT statement. First, load data into Hive using the original
text format. Then create a new, empty Hive table that is stored in RCFile format.
Finally, run a Hive query that populates the RCFile table. Listing 5.3 demonstrates the
steps involved, as well as the HiveQL query performance difference between these two
file types.
Listing 5.3 Convert a text table to an RCFile table
/* Create a table for the original text format file */
CREATE TABLE text_table(
  name string
);

/* Load data from HDFS into the table */
LOAD DATA INPATH '/names.txt' INTO TABLE text_table;

/* Create a new, empty table stored in RCFile format */
CREATE TABLE rcfile_table(
  name string
) STORED AS RCFILE;

/* SELECT everything and write results into
   the new RCFile Hive table */
INSERT OVERWRITE TABLE rcfile_table
SELECT * FROM text_table;
 