Processing different file and compression types in Impala
Impala reads files stored in HDFS, and these files can be of various types. Some files are written to HDFS directly from their source, while others are the output of MapReduce, Pig, or other applications running on Hadoop.
Impala supports only a subset of the file types found on Hadoop; however, it covers the most popular Big Data file formats, which lets it handle a very wide range of user input. If Impala cannot read an input file type, you can perform the following steps to use a combination of Hive and Impala:
1. Use the CREATE TABLE statement in the Hive shell to create the table over the input data.
2. Run the INVALIDATE METADATA statement in the Impala shell so that it does not generate errors about the unsupported file type.
3. Write your query statements in the Impala shell to achieve your objective (see the sketch after this list).
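As an illustrative sketch of this workflow, assume a comma-delimited data set already sits in HDFS under /user/demo/events; the path, table name, and column names are hypothetical and not taken from the original text:

-- In the Hive shell: define a table over the existing files
-- (path and columns are assumptions for illustration)
CREATE EXTERNAL TABLE events (
  event_id   BIGINT,
  event_name STRING,
  event_ts   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/events';

-- In the Impala shell: refresh the catalog so the new table is visible
INVALIDATE METADATA events;

-- Now query the table from Impala
SELECT event_name, COUNT(*) AS cnt
FROM events
GROUP BY event_name;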
A very important point to note here is that Impala's performance depends largely on the input file format and the compression algorithm used to compress the input files. Compression is used for two main reasons: compressed files require less disk space to store, and smaller files require less disk I/O and fewer CPU resources to load into memory. Once a file is loaded into memory, it is decompressed only when the data in the file is required for processing. The following table shows the Impala-supported compression types along with their usage patterns and properties:
Compression type   Why use it?
Snappy             Very fast; the fastest in both compression and decompression
GZIP               The best option for saving disk space
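As a minimal sketch of how this choice is applied in practice, the COMPRESSION_CODEC query option in the Impala shell controls the codec used when Impala writes Parquet data; the table and source names below are assumptions for illustration:

-- In the Impala shell: write Parquet data with Snappy (fast) compression
SET COMPRESSION_CODEC=snappy;
CREATE TABLE events_snappy STORED AS PARQUET
AS SELECT * FROM events;

-- Switch to GZIP when saving disk space matters more than speed
SET COMPRESSION_CODEC=gzip;
CREATE TABLE events_gzip STORED AS PARQUET
AS SELECT * FROM events;

The trade-off follows the table above: Snappy favors compression and decompression speed, while GZIP produces smaller files at a higher CPU cost.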