Processing different file and compression types in Impala
Impala reads files stored in HDFS, and these files can be of various types. Some files are written to HDFS directly from their source, while others are the output of MapReduce, Pig, or other applications running on Hadoop.
Impala supports only a subset of the file types found on Hadoop; however, it covers the most popular Big Data file formats, which lets it handle a very wide range of user input. If Impala cannot read an input file type, you can perform the following steps to use a combination of Hive and Impala:
1. Use the CREATE TABLE statement in the Hive shell to create the table over the input data.
2. Run the INVALIDATE METADATA statement in the Impala shell so that it does not generate errors about the unsupported file type.
3. Write your query statements in the Impala shell to achieve your objective (see the sketch after this list).
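As an illustrative sketch of this workflow, assume a comma-delimited data set already sits in HDFS under /user/demo/events; the path, table name, and column names are hypothetical and not taken from the original text:

-- In the Hive shell: define a table over the existing files
-- (path and columns are assumptions for illustration)
CREATE EXTERNAL TABLE events (
  event_id   BIGINT,
  event_name STRING,
  event_ts   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/demo/events';

-- In the Impala shell: refresh the catalog so the new table is visible
INVALIDATE METADATA events;

-- Now query the table from Impala
SELECT event_name, COUNT(*) AS cnt
FROM events
GROUP BY event_name;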
A very important point to note here is that Impala's performance depends largely on the input file format and the compression algorithm used to compress the input files. Compression is used for two main reasons: compressed files require less disk space to store, and smaller files require less disk I/O and fewer CPU resources to load into memory. Once a file is loaded into memory, it is decompressed only when the data in the file is required for processing. The following table shows the Impala-supported compression types along with their usage patterns and properties:
Compression type   Why use it?
Snappy             Very fast; the fastest in both compression and decompression
GZIP               The best option for saving disk space
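As a minimal sketch of how this choice is applied in practice, the COMPRESSION_CODEC query option in the Impala shell controls the codec used when Impala writes Parquet data; the table and source names below are assumptions for illustration:

-- In the Impala shell: write Parquet data with Snappy (fast) compression
SET COMPRESSION_CODEC=snappy;
CREATE TABLE events_snappy STORED AS PARQUET
AS SELECT * FROM events;

-- Switch to GZIP when saving disk space matters more than speed
SET COMPRESSION_CODEC=gzip;
CREATE TABLE events_gzip STORED AS PARQUET
AS SELECT * FROM events;

The trade-off follows the table above: Snappy favors compression and decompression speed, while GZIP produces smaller files at a higher CPU cost.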