Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

where indexes can dramatically speed up the query processing. For exam-

ple, for interactive exploration, one might select a few million records from

a dataset with billions of records. Going through the whole dataset to find a

few million is likely to take longer than using an effective index. Another ex-

ample is in query estimation. Often, before users commit to evaluate subsets

of data records, they may want to know the size of the result and the pos-

sible processing time. This task is usually better accomplished with indexes.

Additionally, indexes may provide a faster way to gather some statistics. To

illustrate the usefulness of an index in accomplishing such tasks, we next give

a specific example of answering count queries from the Set Query Benchmark.

The Set Query Benchmark is a benchmark designed to evaluate the perfor-

mance of OLAP-type applications. 63 , 67 The test data contains 12 columns of

uniform random numbers plus a sequence number that serves as a row identi-

fier. Our test uses queries of the form “ Select count(*) From Bench Where ... ,”

which we call count queries. 63 We use a test dataset with 100 million rows in-

stead of the original specification of a million rows. In the test dataset, we also

adjust the column cardinality of the two random columns to be 25 million and

50 million instead of 250,000 and 500,000 as the benchmark specification indi-

cated. We note that such uniform random values with extremely high column

cardinalities are the worst type of data for the compressed bitmap indexes.

The table in Figure 6.19 contains the names of these count queries, Q1, Q2A,

Q2B, and so on. Each of these queries has a number of different instances that

involve different columns or different query conditions. Figure 6.19a shows the

total time to complete all instances of a query, and Figure 6.19b shows the

query response time for each instance of Q1.

The query response times are gathered from two systems: one with a com-

pressed bitmap index and the other with compressed columns but without

any index. The bitmap index is from FastBit and the indexless system is a

commercial product that reorders and compresses the base data. This index-

less system is advertised as the fastest data processing system. During our

testing, we consulted with the vendor to obtain the best ordering and com-

pression options available in early 2007. The two systems are run on the same

computer with a 2.8 GHz Intel Pentium CPU and a small hardware RAID

that is capable of supporting about 60 MB/s throughput.

The time results in Figure 6.19 clearly indicate that indexes are useful for

these queries. Overall, one can answer all the queries about 11 times faster

using bitmap indexes than using the compressed base data. Figure 6.19b shows

some performance details that help to explain the observed differences. The

horizontal axis in Figure 6.19b is the column cardinality. The best reordering

strategy that minimizes the overall query response time is to order the lowest

cardinality column first, then the next lowest cardinality column, and so on.

More specifically, the column with cardinality 2 (the lowest cardinality) is

completely sorted; when the lowest cardinality column values are the same,

the rows are ordered according to the column with cardinality 4; when the first

two columns have the same value, the rows are ordered according to the next

Scientific Data Management

Search WWH ::

Custom Search

Home