persisted there. Rather, the Control node directs the loading and retrieval of user data to the appropriate Compute node and distribution. This is the first time we've introduced the term distribution, so if you're not yet familiar with it, don't worry. We'll cover distributions in the next section.
Shared-Nothing Architecture
At the core of PDW is the concept of a shared-nothing architecture. In a shared-nothing architecture, a single logical table is broken up into numerous smaller physical pieces. The exact number of pieces depends on the number of Compute nodes in the PDW region. Within a single Compute node, each data piece is then split across eight (8) distributions. The number of distributions per Compute node cannot be configured and is consistent across all hardware vendors.
A distribution is the most granular physical level within PDW. Each distribution contains its own dedicated CPU, memory, and storage (LUNs), which it uses to store and retrieve data. Because each distribution has its own dedicated hardware, it can perform load and retrieval operations in parallel with other distributions. This is what we mean by "shared-nothing." A shared-nothing architecture enables numerous benefits, such as near-linear scalability. But perhaps PDW's greatest power is its ability to scan data at incredible speeds.
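To make the idea concrete, here is a minimal sketch of how rows of a logical table could be mapped onto independent distributions. The hash function, key choice, and node counts below are illustrative assumptions for demonstration only; PDW's internal hashing is not exposed, and this is not its actual algorithm.

```python
# Illustrative sketch: mapping each row of a logical table to exactly one
# distribution in a shared-nothing appliance. All constants and the hash
# function are assumptions for demonstration, not PDW internals.
import hashlib

COMPUTE_NODES = 9           # base rack used in the example that follows
DISTS_PER_NODE = 8          # fixed at eight per Compute node in PDW
TOTAL_DISTRIBUTIONS = COMPUTE_NODES * DISTS_PER_NODE  # 72

def distribution_for(key: str) -> int:
    """Map a distribution-column value to one of the distributions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % TOTAL_DISTRIBUTIONS

# Because each row lands on exactly one distribution, and each distribution
# owns its own CPU, memory, and disks, all 72 slices can be scanned at once.
print(distribution_for("customer-42"))
```

The key property is that the mapping is deterministic, so any node can compute where a row lives without consulting shared state.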
Let's do some math. Assume you have a PDW appliance with a base rack containing 9 Compute nodes, and you need to store a table with 1 billion rows. The data will be split across all 9 Compute nodes, and each Compute node will split its data across 8 distributions. Thus, the 1-billion-row table will be split into 72 distributions (9 Compute nodes × 8 distributions per Compute node). That means each distribution will store roughly 13,900,000 rows.
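The arithmetic above can be worked out directly:

```python
# Worked version of the arithmetic above: 1 billion rows spread across a
# base rack of 9 Compute nodes with 8 distributions each.
compute_nodes = 9
distributions_per_node = 8          # fixed in PDW
total_rows = 1_000_000_000

total_distributions = compute_nodes * distributions_per_node
rows_per_distribution = total_rows // total_distributions

print(total_distributions)      # 72
print(rows_per_distribution)    # 13888888, i.e. roughly 13.9 million
```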
But what does this mean from the end user's standpoint? Let's look at a hypothetical situation. You are a user at a retail company, and you have a query that joins two tables: a Sales table with 1 billion rows and a Customer table with 50 million rows. And, as luck would have it, no indexes are available that cover your query. This means you will need to scan, or read, every row in each table.
In an SMP system, where memory, storage, and CPU are shared, this query could take hours or days to run. On some systems, it might not even be feasible to attempt this query, depending on factors such as the server hardware and the amount of activity on the server. Suffice it to say, the query will take a considerable amount of time to return and will most likely have a negative impact on other activity on the server.
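A back-of-the-envelope calculation shows why 72 parallel scanners change the picture. The per-scanner rate below is a made-up illustrative figure, and the model assumes a perfectly even split with no coordination overhead:

```python
# Rough model: one scanner reading the whole table versus 72 distributions
# each reading their own slice in parallel. The scan rate is an assumption
# for illustration, not a measured PDW figure.
rows = 1_000_000_000
scan_rate = 2_000_000               # rows/sec per scanner (assumed)
distributions = 72

smp_seconds = rows / scan_rate                      # single scanner
pdw_seconds = rows / (distributions * scan_rate)    # 72 scanners in parallel

print(round(smp_seconds))   # 500
print(round(pdw_seconds))   # 7
```

Real systems fall short of this ideal speedup because of skew and coordination costs, but the scan-time advantage of dividing the work across dedicated hardware is the essential point.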