Emerging Database Systems in Support of Scientific Data - Scientific Data Management

Database Reference

In-Depth Information

state: “Partitioning in database design is the process of assigning a logical

object (relation) from the logical schema of the database to several physical

objects (files) in a stored database. Vertical partitioning subdivides attributes

into groups and assigns each group to a physical object.” That paper, however,

was not concerned with such analytical applications in which ad hoc queries

are dominant. Instead, it discusses how tables may be partitioned in order

to exploit known correlations between attribute hit rates, to obtain better

average query performance. This is worthwhile mainly in routinely repeating

processes where access patterns that display such correlations dominate and

change slowly.

The term [ fully ] transposed file was used in early papers, 3 , 14 - 16 to de-

note what is today called “vertically fragmented” or “vertically decomposed”

data structures, 17 “vertical partitioning,” 9 “column-oriented” databases, 18 or

“column store” databases. 8

Copeland and Khoshafian 12 is the first published paper on transposed files

and related structures that is widely referenced in recent database literature.

While the authors note that some early database systems used a fully trans-

posed storage model, for example, RM, 19 TOD, 10 RAPID, 14 ALDS, 20 Delta 21

and Tanaka, 22 they described in that paper the advantages of a fully decom-

posed storage model (DSM). A DSM is a “[fully] transposed storage model

with surrogates * included.” In a DSM each column of a relational table is

stored in a separate binary association table (BAT), as an array of fixed-size

two-field records (TID, value), where TID refers to Tuple ID. According to

Khoshafian et al., 23 the DSM further assumes that two copies of each binary

relation are stored, one clustered (i.e., sorted or hashed with an index) on each

of the two attributes (TID, value). Copeland and Khoshafian 12 conclude that

there seems to be a general consensus among the database community that

the conventional N-ary Storage Model (NSM) is better. They suggest that the

consensus opinion is not well founded and that neither is clearly better until

a closer analysis is made.

In Khoshafian et al., 23 a parallel query-processing strategy for the DSM is

presented, called the pivot algorithm . The algorithm makes use of the join in-

dex concept. 24 An informal description of a generic select-join-project query

is given in the paper, where all selects are assumed to be range restriction

operations and all joins are equi-joins. The initial select phase executes a

select operation for every predicate binding in the query, using the appropri-

ate value-clustered BAT as index. The output of this phase is a collection

of temporary index lists, each containing the TIDs of selected tuples from

conceptual relation tables. All these operations are done in parallel. Dur-

ing the pivot phase of the algorithm, the main m -way join operation of the

query is executed. A “pivot” TID column is chosen from those attributes that

* The term surrogate is used by some researchers to denote object identifiers, or as here in

relational databases, tuple identifiers or TIDs . We will use the latter term unless we make a

direct quotation.

Scientific Data Management

Search WWH ::

Custom Search

Home