Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions

Database Reference

In-Depth Information

ON t.object_id =

tp.object_id

The abridged results for this query are shown in Figure 10.6 . Notice that

the TableDistributionPolicyDesc column states HASH as it is a

distributed table and that the SalesOrderNumber row has a 1 for the

IsDistributionColumn .

Figure 10.6 Extended table and column properties for distributed tables

This query works just as well for showing off additional data about

replicated tables too. Let's look at them now.

Replicated Tables

Replicated tables tend to be smaller tables, typically dimensions, which

represent whole copies of data. A replicated table has its entire data set

copied to all the compute nodes. Why do this? Well, it really helps when

reading the data. At the end of the day, each compute node is a highly tuned

SQL Server SMP instance with one buffer pool of memory. To facilitate a

join, PDW needs to ensure that all the data is available. By replicating a

table,weguaranteethatthistableatleasthasallitsdataavailableforalocal,

co-located join.

However, if the join was between two distributed tables, we may have an

issue. It's still possible, of course, that both tables may be distributed in

the same way; this is often a primary design goal. However, it's not always

possible. Under some circumstances, one of the tables may have to be

redistributed to make the join compatible. The worst case is realized when

either table is distributed on a joining key. In this case both tables need to

be redistributed. This is called a double shuffle.

It is important to note that data movement can even happen with a

replicated table. When the join is an outer join, this triggers movement

because we have to handle the nulls generated by the join.

Search WWH ::

Custom Search

Home