Database Reference
In-Depth Information
and multiattribute indexing when we discuss both the direct access and the hy-
brid index methods. There is a further level of classification that distinguishes
index schemes into those that are more suitable for static data and those that
more are suitable for dynamic data. Most large-scale scientific datasets are
append-only and as such are not typically subjected to concurrent reads and
writes in a manner that requires models of transactional accessing. As such, we
do not consider index methods under further classification of dynamic versus
static methods.
In this chapter, we focus primarily on disk-based datasets that are stored on
external memory and are paged in and out of memory during processing us-
ing memory buffers or cache pools . Consequently, we address access methods,
denoted as the highlighted ovals of the leaf nodes of our classification hierar-
chy shown in Figure 6.1, where we also show some examples of the indexing
methods that have been used in practice. In the next sections we describe
methods of accessing massively large datasets by (1) simple scans, (2) hash-
ing, (3) single-attribute indexes, (4) multiattribute indexes, and (5) hybrid
schemes that are comprised of one or more combinations of hashing and scan-
ning, hashing and tree indexes, or hashing, tree index, and scanning. Note also
that methods that generate a single key by combining the multiple attribute
values of the data, either by concatenation or bit interleaving, and then using
the generated key in a single-attribute index scheme, are considered under the
hybrid methods.
6.4 Single-Attribute Index Schemes
6.4.1 Sequential Scan Access
A natural approach to searching unordered datasets without an index is by
a simple sequential scan. This gives an O
search time for datasets of N
items. Although seemingly naive and simplistic, it has the advantage that in-
sertions are fast since new data items are simply appended at the end of the
existing file. If order is maintained on either a single or combined attributes,
searches can be performed in O
(
N
)
time, using a binary search. Of course,
this requires a preprocessing cost of sorting the items. As new records are
added, one needs to ensure that the sorted order is preserved.
For very large high-dimensional datasets, where searches are allowed on any
combination of attributes, it has been shown that such simple sequential scans
of processing queries can be just as effective, if not more ecient, as building
and maintaining multidimensional indexes, 14 , 82 , 97 particularly when nearest
neighbor queries are predominant. With parallel computing and parallel file
systems, a sequential scan can achieve near-linear speed-up with a balanced
data partition. We revisit the case of sequential scans for high-dimensional
datasets in Section 6.5.4.
(
log N
)
Search WWH ::




Custom Search