SELECT SUM(SALES)
FROM MYSCHEMA.SALESDATA
WHERE SALESDATE > '2004-01-01'
  AND SALESDATE < '2006-11-17'
If there is an index on {SALESDATE, SALES}, the database will likely scan the index and
sum the sales for every entry whose date falls within the specified range. For the sake of
argument, let's assume there are 3 million entries in the database for that date range. To
compute the result, the database would need to access 3 million keys and sum the sales for
each. If the sales data were distributed across 10 nodes, each server would only need to
access and sum roughly 300,000 index keys. After each of the 10 nodes has computed its
partial sum, the 10 summations are passed back to the coordinator, which adds them together.
The total sum is therefore calculated in nearly one-tenth of the time.
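To make the arithmetic concrete, the following is a minimal sketch in Python that simulates this scatter-gather pattern outside the database: ten worker processes stand in for the ten nodes, each summing its own 300,000-entry shard, and the coordinator simply adds the ten partial sums. The shard data, values, and function names are fabricated for illustration; a real MPP engine performs the same steps inside its query executor.

# Minimal sketch of the scatter-gather summation described above.
# The shards, values, and names are fabricated for illustration only.
from concurrent.futures import ProcessPoolExecutor
import random

NUM_NODES = 10
KEYS_PER_NODE = 300_000

random.seed(42)
# Stand-in for the {SALESDATE, SALES} index entries held on each node.
shards = [[random.uniform(1.0, 500.0) for _ in range(KEYS_PER_NODE)]
          for _ in range(NUM_NODES)]

def node_partial_sum(shard):
    """Each 'node' scans only its own keys and returns a single number."""
    return sum(shard)

if __name__ == "__main__":
    # Scatter: every node scans its 300,000 keys in parallel.
    with ProcessPoolExecutor(max_workers=NUM_NODES) as pool:
        partial_sums = list(pool.map(node_partial_sum, shards))
    # Gather: the coordinator receives just 10 numbers and adds them,
    # so the expensive scan work is divided roughly 10 ways.
    print(f"Total sales: {sum(partial_sums):,.2f}")

The key design point is that only 10 small partial results cross the network; the 3 million key accesses never leave the nodes that own the data.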
In real systems, shared-nothing MPP configurations with several hundred nodes are
not uncommon. This impressive scale-out has not been achieved with other multiprocessor
architectures to date.
6.1.2 Why Shared Nothing Scales So Well
A casual observer might reject this argument by pointing out that the benefits achieved are
simply the result of applying 10 times the processing power, rather than anything specific
to a shared-nothing server topology. Perhaps a single server with 10 times the CPU capacity
and memory would scale just as well. Indeed, there are classes of problems where that would
be the case, but there are several problems with trying to scale performance linearly by
simply buying larger and larger servers. These problems have kept shared-nothing architectures
the dominant topology for highly scalable decision support systems.
First, it may not be possible to purchase a server that has 10 times the CPU power, 10 times
the memory, 10 times the bus bandwidth, and so on. Even the largest and most powerful NUMA
systems are constrained by their bus architectures. And what if the number of nodes were 100
or 200 instead of 10? Would it even be possible to purchase a single server 100 or 200 times
faster and larger than the largest single server available?
Second, it turns out to be quite difficult to design algorithms that scale linearly when
performing parallel processing on shared data. For example, returning to the example above,
if we could increase the system resources and number of CPUs by a factor of 10, what algorithm
would we use to scan subsets of the index in parallel? Such algorithms exist, but they are hard
for a generic query processing engine to determine adaptively at runtime, and they are prone to
conflicts, such as latch contention, cache coherency traffic, and false sharing [4] across the
CPUs' L1 caches, that limit scalability.
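As a rough illustration of the partitioned approach such an algorithm takes, the following sketch divides an in-memory list (standing in for the shared index on a single large server) into contiguous ranges and gives each worker a private accumulator, merging only at the end; a single lock-protected shared total would instead funnel every key through one point of contention. The list, names, and worker count are invented for illustration, and Python's global interpreter lock means the snippet demonstrates the data-access pattern rather than real CPU parallelism.

# Sketch of a partitioned parallel scan on a single shared-memory server.
# Data and names are fabricated; real engines scan index pages, not lists.
from concurrent.futures import ThreadPoolExecutor
import random

random.seed(7)
index_entries = [random.uniform(1.0, 500.0) for _ in range(1_000_000)]

def scan_range(bounds):
    lo, hi = bounds
    acc = 0.0                     # private accumulator: no hot shared counter,
    for i in range(lo, hi):       # so no per-key latching and no false sharing
        acc += index_entries[i]   # between workers on the accumulator
    return acc

def parallel_sum(workers=10):
    n = len(index_entries)
    step = -(-n // workers)       # ceiling division so every entry is covered
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The merge happens once, over `workers` partial sums, not per key.
        return sum(pool.map(scan_range, bounds))

if __name__ == "__main__":
    print(f"Parallel sum: {parallel_sum():,.2f}")

Choosing the partition boundaries is trivial for a flat list; doing it adaptively at runtime for arbitrary indexes and predicates is the hard part the text refers to.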