always a better option to use a relational database to ask questions about your data. On
its own, Hive can speed up the process of querying very large datasets by providing an
SQL-like interface on top of the MapReduce paradigm.
Optimizing Hive Query Performance
The MapReduce framework is designed to spread data processing tasks across many
machines, making large-scale data solutions economically feasible through parallelism.
Although MapReduce is a great model for transforming a huge batch of raw files in a
timely manner, query performance can be slow. Using Hive's EXPLAIN statement, you
can see that queries often require multiple MapReduce stages, each of which is further
slowed by many disk-access events.
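As a quick illustration, prefixing any query with EXPLAIN prints the plan Hive generates without running the query (the table and column names here are hypothetical, for illustration only):

```sql
-- Show the stages Hive plans for a query without executing it;
-- the output lists each MapReduce stage and its dependencies.
-- (Table and column names are illustrative.)
EXPLAIN
SELECT name, COUNT(*)
FROM text_table
GROUP BY name;
```

A query whose plan shows several chained stages is a good candidate for the optimizations described below.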
The first step in improving performance is to restrict the amount of data necessary
to provide a query result. An initial way to do this is to use Hive's partitioning
function. When data is partitioned, Hive will only look at the partition requested rather
than all the data in the entire table. It is also possible to build indexes on columns in
Hive, which can further reduce the amount of data scanned by some queries.
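A minimal sketch of partitioning might look like the following; the table, column, and partition names are assumptions chosen for illustration, not taken from the original example:

```sql
-- Declare a table partitioned by date; rows for each date
-- are stored in a separate directory in HDFS.
CREATE TABLE events (
  name string
)
PARTITIONED BY (event_date string);

-- Because the WHERE clause names the partition column,
-- Hive reads only the matching partition, not the whole table.
SELECT name
FROM events
WHERE event_date = '2013-01-15';
```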
Another optimization is to use a file format that is most efficient for the types of
queries that you are interested in running. When Hadoop stores intermediate processing
files in HDFS, it natively uses a format called Sequence Files, which features a
key-value structure in which each key points to a single row of data. When using raw
text files or sequence files with Hive, the entire row of data must be accessed every
time a query is run. These file types are not the best format for running fast queries
that only require a few columns of data. However, a format called RCFile makes it
possible for Hive to only access the columns necessary to provide the query result.
One simple way to create an RCFile from original flat data is to populate a new
RCFile table using a SELECT statement. First, load data into Hive using the original
text format. Then create a new, empty Hive table that is stored in RCFile format.
Finally, run a Hive query that populates the RCFile table. Listing 5.3 demonstrates the
steps involved, as well as the HiveQL query performance difference between these two
file types.
Listing 5.3 Convert a text table to an RCFile table
/* Create a table for the original text format file */
CREATE TABLE text_table(
  name string
);

/* Load data from HDFS into the table */
LOAD DATA INPATH '/names.txt' INTO TABLE text_table;

/* Create a new, empty table stored in RCFile format */
CREATE TABLE rcfile_table(
  name string
) STORED AS RCFILE;

/* SELECT everything and write results into
   the new RCFile Hive table */
INSERT OVERWRITE TABLE rcfile_table
SELECT * FROM text_table;
 