Appendix A. Technology Behind Impala and Integration with Third-party Applications
In the last seven chapters, I described the various traits of Impala, and I believe you now have a good grasp of those details. It is time to finish the topic by adding a few more details that will help you understand the true potential of Impala.
Technology behind Impala
The technology behind Impala is revolutionary and was inspired by a Google research project named Dremel. Dremel is a scalable, ad hoc query-based analysis system for read-only nested data. Dremel-based implementations can run aggregation queries over trillions of rows in seconds by combining multilevel execution trees with a columnar data layout. Dremel does not use MapReduce at its core; instead, it complements MapReduce. Impala is considered a native massively parallel processing (MPP) query engine running on Apache Hadoop. Depending on the type of query and the configuration, Impala outperforms traditional database applications on Hadoop, such as Hive, and processing frameworks, such as MapReduce, for the following key reasons:
• Distributed, scalable aggregation algorithms.
• Specialized hardware configuration, such as reduced CPU load, which increases the aggregate I/O bandwidth.
• Use of a columnar binary storage format on Hadoop, which speeds up query processing. Impala achieves this by taking advantage of Parquet files as an input source, as shown in the sketch after this list.
• Impala extends its reach beyond Dremel by supporting various other popular file formats, making it available to a much wider range of users than Parquet alone.
• Impala uses the available memory on a machine as a table cache, which means queries process data that is already held in memory whenever possible, speeding up execution by as much as 90 times compared with conventional processing that reads data from disk.
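To make the Parquet point above concrete, the following is a minimal sketch in Impala SQL of creating a Parquet-backed table and loading it from an existing text-format table. The table and column names (sales_parquet, sales_text, sale_id, region, amount) are hypothetical and used only for illustration.

-- Create a table whose data files are stored in the columnar Parquet format.
CREATE TABLE sales_parquet (
  sale_id BIGINT,
  region  STRING,
  amount  DOUBLE
)
STORED AS PARQUET;

-- Copy rows from a hypothetical existing text-format table;
-- Impala writes the data out as Parquet files.
INSERT INTO sales_parquet
SELECT sale_id, region, amount
FROM sales_text;

-- Aggregation queries benefit from the columnar layout, because
-- only the referenced columns are read from disk.
SELECT region, SUM(amount) AS total_amount
FROM sales_parquet
GROUP BY region;

Because Parquet stores each column separately, the final aggregation query scans only the region and amount columns rather than entire rows, which is one of the reasons the columnar layout speeds up this kind of workload.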