Appendix A. Technology Behind Impala and Integration with Third-party Applications
In the last seven chapters, I described the various traits of Impala, and I believe you now have a good grasp of those details. It is time to finish the topic by adding a few more details that will help you understand the true potential of Impala.
Technology behind Impala
The technology behind Impala is revolutionary and was inspired by a Google research project named Dremel. Dremel is a scalable, ad hoc query-based analysis system for read-only nested data. Dremel-based implementations can run aggregation queries over trillions of rows in seconds by combining multilevel execution trees with a columnar data layout. Dremel does not use MapReduce at its core; instead, it complements MapReduce. Impala is considered a native massively parallel processing (MPP) query engine running on Apache Hadoop. Depending on the type of query and the configuration, Impala outperforms traditional database applications on Hadoop, such as Hive, and processing frameworks, such as MapReduce, for the following key reasons:
• Distributed, scalable aggregation algorithms.
• Specialized hardware configuration, such as reduced CPU load, which increases the aggregate I/O bandwidth.
• Use of a columnar binary storage format on Hadoop, which speeds up query processing. Impala achieves this by taking advantage of Parquet files as an input source, as shown in the sketch after this list.
• Impala extends its reach beyond Dremel by supporting various other popular file formats, making it available to a much wider range of users than Parquet alone.
• Impala uses the available memory on a machine as a table cache, which means queries process data that is already held in memory whenever possible, speeding up execution by as much as 90 times compared with conventional processing that reads data from disk.
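To make the Parquet point above concrete, the following is a minimal sketch in Impala SQL of creating a Parquet-backed table and loading it from an existing text-format table. The table and column names (sales_parquet, sales_text, sale_id, region, amount) are hypothetical and used only for illustration.

-- Create a table whose data files are stored in the columnar Parquet format.
CREATE TABLE sales_parquet (
  sale_id BIGINT,
  region  STRING,
  amount  DOUBLE
)
STORED AS PARQUET;

-- Copy rows from a hypothetical existing text-format table;
-- Impala writes the data out as Parquet files.
INSERT INTO sales_parquet
SELECT sale_id, region, amount
FROM sales_text;

-- Aggregation queries benefit from the columnar layout, because
-- only the referenced columns are read from disk.
SELECT region, SUM(amount) AS total_amount
FROM sales_parquet
GROUP BY region;

Because Parquet stores each column separately, the final aggregation query scans only the region and amount columns rather than entire rows, which is one of the reasons the columnar layout speeds up this kind of workload.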