Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

and multiattribute indexing when we discuss both the direct access and the hy-

brid index methods. There is a further level of classification that distinguishes

index schemes into those that are more suitable for static data and those that

more are suitable for dynamic data. Most large-scale scientific datasets are

append-only and as such are not typically subjected to concurrent reads and

writes in a manner that requires models of transactional accessing. As such, we

do not consider index methods under further classification of dynamic versus

static methods.

In this chapter, we focus primarily on disk-based datasets that are stored on

external memory and are paged in and out of memory during processing us-

ing memory buffers or cache pools . Consequently, we address access methods,

denoted as the highlighted ovals of the leaf nodes of our classification hierar-

chy shown in Figure 6.1, where we also show some examples of the indexing

methods that have been used in practice. In the next sections we describe

methods of accessing massively large datasets by (1) simple scans, (2) hash-

ing, (3) single-attribute indexes, (4) multiattribute indexes, and (5) hybrid

schemes that are comprised of one or more combinations of hashing and scan-

ning, hashing and tree indexes, or hashing, tree index, and scanning. Note also

that methods that generate a single key by combining the multiple attribute

values of the data, either by concatenation or bit interleaving, and then using

the generated key in a single-attribute index scheme, are considered under the

hybrid methods.

6.4 Single-Attribute Index Schemes

6.4.1 Sequential Scan Access

A natural approach to searching unordered datasets without an index is by

a simple sequential scan. This gives an O

search time for datasets of N

items. Although seemingly naive and simplistic, it has the advantage that in-

sertions are fast since new data items are simply appended at the end of the

existing file. If order is maintained on either a single or combined attributes,

searches can be performed in O

(

N

)

time, using a binary search. Of course,

this requires a preprocessing cost of sorting the items. As new records are

added, one needs to ensure that the sorted order is preserved.

For very large high-dimensional datasets, where searches are allowed on any

combination of attributes, it has been shown that such simple sequential scans

of processing queries can be just as effective, if not more ecient, as building

and maintaining multidimensional indexes, 14 , 82 , 97 particularly when nearest

neighbor queries are predominant. With parallel computing and parallel file

systems, a sequential scan can achieve near-linear speed-up with a balanced

data partition. We revisit the case of sequential scans for high-dimensional

datasets in Section 6.5.4.

(

log N

)

Scientific Data Management

Search WWH ::

Custom Search

Home