pages accessed. The speed-up is not simply the inverse of the selectivity,
because the selected events are typically scattered randomly in the data
files, the data files are compressed in blocks [18], and the analysis jobs
often involve a significant amount of computation. However, as the selectivity
decreases, the average speed-up value increases. When one out of 1,000 events is selected,
the speed-up values are observed to be between 20 and 50. Even if 1 in 10
events is used in an analysis job, the observed speed-up is more than 2. STAR
has hundreds of users at its various analysis facilities, and improving these
facilities' overall throughput by a factor of 2 is a great benefit to the whole
community.
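The reasoning above can be made concrete with a toy cost model. The sketch below is not taken from the Grid Collector measurements. It assumes that events are selected independently at random, that a compressed block must be read and decompressed in full if it holds at least one selected event, and that computation is spent only on selected events. The function names and the parameters events_per_block, read_cost, and compute_cost are hypothetical.

```python
def touched_block_fraction(selectivity: float, events_per_block: int) -> float:
    """Fraction of blocks holding at least one selected event.

    Assumes each event is selected independently with probability
    `selectivity` (an assumption of this sketch, not a measured property).
    """
    return 1.0 - (1.0 - selectivity) ** events_per_block


def modeled_speedup(selectivity: float, events_per_block: int,
                    read_cost: float, compute_cost: float) -> float:
    """Ratio of baseline time to index-assisted time in the toy model.

    read_cost:    cost of reading and decompressing all blocks (hypothetical unit)
    compute_cost: cost of analyzing all events, scaled by the selected fraction
    """
    if not 0.0 < selectivity <= 1.0:
        raise ValueError("selectivity must be in (0, 1]")
    # Baseline: every block is read, computation runs on selected events only.
    baseline = read_cost + selectivity * compute_cost
    # Index-assisted: only blocks with at least one selected event are read.
    assisted = (touched_block_fraction(selectivity, events_per_block) * read_cost
                + selectivity * compute_cost)
    return baseline / assisted


if __name__ == "__main__":
    for s in (0.1, 0.01, 0.001):
        print(s, 1.0 / s, modeled_speedup(s, events_per_block=10,
                                          read_cost=1.0, compute_cost=1.0))
```

In this model the speed-up stays below the inverse of the selectivity for two reasons named in the text: a block with a single selected event is still read in full, and the computation on selected events is the same with or without the index.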
As mentioned before, other parallel data management systems such as
Hadoop currently iterate through all data records. A smart iterator
similar to that of Grid Collector could benefit such systems as well.
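As a minimal sketch of the difference between the two iteration styles, the following contrasts an iterator that reads every record with one that reads only the positions an index has already identified. It does not use the Grid Collector or Hadoop APIs. CountingStore, naive_iterator, and smart_iterator are hypothetical names, and the list of selected positions stands in for the result of an index lookup.

```python
from typing import Callable, Iterable, Iterator, Sequence


class CountingStore:
    """Hypothetical record store that counts how many records are read."""

    def __init__(self, records: Sequence[dict]) -> None:
        self._records = list(records)
        self.reads = 0

    def __len__(self) -> int:
        return len(self._records)

    def read(self, position: int) -> dict:
        self.reads += 1
        return self._records[position]


def naive_iterator(store: CountingStore,
                   predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Read every record and test it against the selection condition."""
    for position in range(len(store)):
        record = store.read(position)
        if predicate(record):
            yield record


def smart_iterator(store: CountingStore,
                   selected_positions: Iterable[int]) -> Iterator[dict]:
    """Read only the records whose positions were supplied by an index."""
    for position in selected_positions:
        yield store.read(position)


def is_selected(record: dict) -> bool:
    return record["energy"] >= 99


# Synthetic records; the "energy" attribute is made up for illustration.
records = [{"id": i, "energy": (i * 37) % 100} for i in range(1000)]

# Stand-in for an index lookup: positions of the records that qualify.
selected_positions = [i for i, r in enumerate(records) if is_selected(r)]

naive_store = CountingStore(records)
smart_store = CountingStore(records)
naive_result = list(naive_iterator(naive_store, is_selected))
smart_result = list(smart_iterator(smart_store, selected_positions))
print(naive_store.reads, smart_store.reads)
```

Both iterators return the same records, but the second touches only the selected ones, which is where the savings in pages accessed come from.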
6.8 Summary and Future Trends
In this chapter, we discussed two basic issues for accelerating queries on large
scientific datasets, namely indexing and data organization. Because data
organization is typically tied to an individual data management system,
we also briefly touched on a number of different systems with distinct data
organization schemes.
The bulk of this chapter discussed different types of index methods, most
of which are better suited for secondary storage. Applications that use
scientific data do not require simultaneous read and write accesses of the same
dataset. This allows the data and indexes to be packed more tightly than in
transactional applications. Furthermore, the indexes can be designed to focus
more on query processing speed and less on updating individual records. In
general, scientific data tend to have a large number of searchable attributes
and require indexes on every searchable attribute, whereas a database for a
banking application may require only one index for the primary key.
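To illustrate an index that is built once per searchable attribute over read-only data and then combined at query time, here is a minimal sketch of an equality-encoded bitmap index using Python integers as bit vectors. It is an illustration only, not FastBit's implementation or API. The attribute names energy and ntracks and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def build_bitmap_index(values: Sequence[int]) -> Dict[int, int]:
    """Build one bitmap per distinct value; bit i is set if row i has that value."""
    index: Dict[int, int] = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_where(index: Dict[int, int],
               predicate: Callable[[int], bool]) -> int:
    """OR together the bitmaps of all distinct values satisfying the predicate."""
    result = 0
    for value, bitmap in index.items():
        if predicate(value):
            result |= bitmap
    return result


def bitmap_to_rows(bitmap: int) -> List[int]:
    """Convert a bitmap back to a sorted list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Two searchable attributes of the same five rows (made-up values).
energy = [3, 7, 7, 1, 9]
ntracks = [2, 5, 1, 5, 5]

# One index per attribute, built once because the data are not updated.
energy_index = build_bitmap_index(energy)
ntracks_index = build_bitmap_index(ntracks)

# Query: energy >= 7 AND ntracks == 5, answered with a bitwise AND.
hits = (rows_where(energy_index, lambda v: v >= 7)
        & rows_where(ntracks_index, lambda v: v == 5))
print(bitmap_to_rows(hits))
```

Because each attribute has its own index, conditions on any combination of attributes can be answered by combining the per-attribute results with bitwise operations, without a separate index for each combination.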
After reviewing many of the well-known multidimensional indexes, we
concluded that bitmap indexes are the most appropriate indexing scheme for
scientific data. We reviewed some recent advances in bitmap index research
and discussed their uses in two examples. These examples use an open-source
bitmap index package called FastBit. The first example demonstrated the
usefulness of indexes by measuring the time needed to answer a set of queries from
the Set Query Benchmark. We saw that FastBit outperforms the best available
indexless system by an order of magnitude. This demonstrates that there are
situations where the use of an index significantly improves the performance of an
application. The second example demonstrated the use of FastBit indexes to
implement a smart iterator for a distributed data analysis framework. Since