large data warehouses), improving query workload performance by reducing
the I/O costs, and reducing manageability costs by decreasing the time and
storage needed by applications like backup/restore.
While data compression yields significant benefits in the form of reduced
storage costs and reduced I/O, there is a substantial central processing unit
(CPU) cost to be paid in decompressing the data during query processing.
Thus, the decision on when to use compression needs to be made carefully.
Given that compression increases the space of physical design options, there
is a natural motivation to extend automated physical design tools to handle
compression. In order to meet the storage bound as well as to reason about
I/O costs during query execution, it is necessary to perform a quantitative
analysis of the effects of compression. Specifically, given an index, we need to
understand how much space will be saved by compressing it and how the work-
load performance is impacted by compressing it. One of the key challenges in
answering these questions is to estimate the size of an index if it were to
be compressed. With that functionality, we can reuse the what-if approach of
Chapter 5 by appropriately modifying the hypothetical index metadata. Since
the space of physical design options is large, it is important to be able to per-
form this estimation accurately and efficiently. The naïve method of actually
building and compressing the index in order to estimate its size is obviously
not feasible. Thus, an important challenge to incorporate compression as an-
other physical design dimension is to accurately estimate the compressed size
of an index without incurring the cost of actually compressing it. This problem
is challenging because the size of the compressed index significantly depends
on both the data distribution and the specific compression algorithm. This
is in contrast to the size estimation of an uncompressed index, which can be
derived from the schema and table cardinality.
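To make the contrast concrete, the following is a minimal sketch (not the method of any particular system) of the two estimation problems: the uncompressed size follows from cardinality and key width alone, while the compressed size is estimated here by compressing a small random sample of the column and scaling the observed ratio. The overhead factor, sample fraction, and the use of zlib as the compressor are illustrative assumptions.

```python
import random
import zlib

def uncompressed_index_size(row_count, key_width_bytes, overhead=1.3):
    # Uncompressed size follows directly from schema and cardinality:
    # rows x key width, inflated by an assumed structural overhead factor.
    return int(row_count * key_width_bytes * overhead)

def estimated_compressed_size(column_values, row_count, sample_frac=0.01):
    # Sample-based estimate: compress a random sample of the key column
    # and scale the achieved ratio to the full index. The ratio depends
    # on the data distribution and on the compressor (zlib here), which
    # is exactly why no fixed formula suffices for compressed indexes.
    k = max(1, int(len(column_values) * sample_frac))
    sample = random.sample(column_values, k)
    raw = "".join(str(v) for v in sample).encode()
    ratio = len(zlib.compress(raw)) / len(raw)
    return int(uncompressed_index_size(row_count, 8) * ratio)
```

A highly repetitive column compresses to a small fraction of its raw size, while a column of unique random values barely shrinks, so the two calls above can return very different answers for the same row count.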
In addition to the question of which indexes to materialize, an important
decision is to determine how the resulting indexes would be assigned to the
available disk drives (i.e., designing the database layout). A good layout should
result in balanced loads across the storage subsystem; otherwise, the most loaded
disk would quickly become the bottleneck while others remain underutilized.
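The balancing goal can be sketched with a simple greedy heuristic (an illustration, not the procedure used by any specific tool): place the next-heaviest index on the currently least-loaded disk. The index names and per-index load estimates are hypothetical inputs, standing in for the I/O costs a workload analysis would supply.

```python
import heapq

def assign_indexes_to_disks(index_loads, num_disks):
    # Greedy load balancing: repeatedly place the heaviest remaining
    # index on the least-loaded disk. index_loads maps an index name to
    # its estimated I/O load under the workload (hypothetical values).
    heap = [(0.0, d) for d in range(num_disks)]  # (accumulated load, disk id)
    heapq.heapify(heap)
    layout = {}
    for name, load in sorted(index_loads.items(), key=lambda kv: -kv[1]):
        disk_load, disk = heapq.heappop(heap)
        layout[name] = disk
        heapq.heappush(heap, (disk_load + load, disk))
    return layout
```

With loads {idx_a: 10, idx_b: 9, idx_c: 2, idx_d: 1} and two disks, the two heavy indexes land on different disks and the totals come out even, which is precisely the balance a uniform striping scheme does not guarantee per object.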
Traditionally, DBMSs rely on solutions that spread out tables and indexes uni-
formly over all available disk drives (e.g., by relying on redundant arrays of in-
expensive disks, or RAID configurations). Such alternatives are relatively easy
to manage because DBAs do not need to be concerned about specific object
placement. At the same time, when multiple large indexes are accessed con-
currently during execution (e.g., in complex decision support queries), these
solutions may perform suboptimally. The reason is interference between requests
for different objects that are laid out on the same disk. For instance, if
two indexes are laid out on the same disk and need to be read sequentially