The Impala execution architecture
Previously we discussed the Impala daemon, statestore, and metastore in detail to
understand how they work together. Essentially, an Impala daemon receives queries
from a variety of sources and distributes the query load to Impala daemons running on
other nodes. While doing so, it interacts with the statestore for node-specific updates
and accesses the metastore, either stored in the centralized database or in the local
cache. To complete the picture of Impala execution, we will now discuss how Impala
interacts with other components, namely Hive, HDFS, and HBase.
Working with Apache Hive
As discussed earlier, the Impala metastore is kept in a centralized database, and
Hive uses the same MySQL or PostgreSQL database for the same kind of data. Impala
provides the same SQL-like query interface used in Apache Hive. Since both Impala and
Hive share the same database as a metastore, Impala can access Hive table definitions,
provided that each definition uses a file format, compression codec, and column data
types that Impala supports.
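The compatibility condition above can be sketched as a simple check. This is a minimal illustration, not Impala's own logic: the class HiveTableDef, the function impala_blockers, and the three sets of supported values are hypothetical names and placeholder contents chosen for the example, and the real lists depend on the Impala release in use.

```python
from dataclasses import dataclass

# Placeholder sets for illustration only (assumption); consult the documentation
# of your Impala release for the formats, codecs, and types it really supports.
SUPPORTED_FORMATS = {"TEXTFILE", "PARQUET", "AVRO", "RCFILE", "SEQUENCEFILE"}
SUPPORTED_CODECS = {"NONE", "SNAPPY", "GZIP"}
SUPPORTED_TYPES = {"INT", "BIGINT", "DOUBLE", "STRING", "BOOLEAN", "TIMESTAMP"}


@dataclass(frozen=True)
class HiveTableDef:
    """Hypothetical, simplified view of a table definition held in the metastore."""
    name: str
    file_format: str
    codec: str
    column_types: dict  # column name -> type name


def impala_blockers(table):
    """Return the reasons the table cannot be queried; an empty list means compatible."""
    problems = []
    if table.file_format.upper() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported file format: {table.file_format}")
    if table.codec.upper() not in SUPPORTED_CODECS:
        problems.append(f"unsupported compression codec: {table.codec}")
    for column, type_name in table.column_types.items():
        if type_name.upper() not in SUPPORTED_TYPES:
            problems.append(f"unsupported type {type_name} in column {column}")
    return problems


if __name__ == "__main__":
    sales = HiveTableDef("sales", "parquet", "snappy", {"id": "BIGINT", "note": "STRING"})
    print(impala_blockers(sales))
```

All three conditions must hold at once: a table in a supported file format still cannot be queried if one of its columns uses a type that Impala does not support.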
Apache Hive also extends the range of file types that Impala can work with. For
formats other than text files, such as RCFile, Avro, and SequenceFile, the data
must be loaded through Hive first; Impala can then query the data in these
file formats. Impala can read more types of data with the SELECT statement than it
can write with the INSERT statement. The ANALYZE TABLE statement in Hive generates
table and column statistics, and Impala uses these statistics to optimize queries.
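As a rough illustration of the statistics step, the sketch below builds the Hive statements that gather table-level and column-level statistics. The helper name hive_stats_statements and the table and column names are hypothetical; the statement forms follow Hive's ANALYZE TABLE ... COMPUTE STATISTICS syntax, and the exact options available may vary with the Hive version.

```python
import re

# Accepts plain identifiers such as "orders" or "sales_db.orders" (assumed naming rule).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?")


def hive_stats_statements(table, columns):
    """Build Hive statements that compute table and column statistics.

    Hypothetical helper: it only formats text and does not connect to Hive.
    """
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"invalid table name: {table!r}")
    if not columns:
        raise ValueError("at least one column is required")
    for column in columns:
        # Column names must be simple identifiers without a database prefix.
        if "." in column or not _IDENTIFIER.fullmatch(column):
            raise ValueError(f"invalid column name: {column!r}")
    column_list = ", ".join(columns)
    return [
        f"ANALYZE TABLE {table} COMPUTE STATISTICS",
        f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}",
    ]


if __name__ == "__main__":
    for statement in hive_stats_statements("sales_db.orders", ["order_id", "amount"]):
        print(statement + ";")
```

The statements would be run in Hive; once the statistics are stored in the shared metastore, Impala can take them into account when planning queries against the same table.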
Working with HDFS
Impala table data consists of regular data files stored in HDFS, which Impala uses
as its primary data storage medium. As soon as a data file or a collection of
files is available in the directory of a new table, Impala reads all of the files
regardless of their names, and new data is written to files whose names are controlled
by Impala. HDFS provides data redundancy through the replication factor, and Impala
relies on such redundancy to access data on other DataNodes in case it is not available
on a specific DataNode. We have already learned that Impala also maintains the in-