Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions

Database Reference

In-Depth Information

With this information, we can see that PDW handles the data file in the

primary filegroup differently from all other files and filegroups. In short,

the primary always grows irrespective of the CREATE DATABASE setting for

AUTOGROW . This makes sense; it's not something we've actually specified in

the DDL, and the space consumed is tiny. We certainly wouldn't want a DDL

request such as CREATE TABLE to fail for lack of storage. I have included

this here just for completeness of information.

More important is that the other database files and log files are handled

differently. Furthermore, these allocations are fixed by the release and do

not change (at the time of this writing) based on size of the database; that is,

itdoesnotmatterwhetheryourdatabase is1TBor1PB,the AUTOGROW value

is the same. It also doesn't matter whether the filegroup is for distributed or

replicated tables, the autogrow works in exactly the same way.

Finally, it should also be clear that PDW is not simply offering up SQL

Server functionality. AUTOGROW in PDW is very different than AUTOGROW

in SQL Server. Each filegroup type has been evaluated for the AUTOGROW

function, and a predesigned, tested configuration is applied on your behalf.

This is classic appliance behavior.

Therefore, in conclusion, the AUTOGROW function is really there to prevent

failures and 3:00-a.m. alarm support calls! We should definitely be

monitoring for consumption of our pre-sized database and log files and be

proactive with our management of storage capacity. You can consider it a

good practice to preallocate your distributed, replicated, and log sizes for

your anticipated growth. A typical projection would be roughly six months

to a year. That gives you enough space to fine-tune future allocations

without over-allocating in the first instance.

Distributed Tables

Distributed tables are the heart of PDW and are key to understanding its

parallelism. In PDW, a table is defined as being distributed when it is

created.Wepickacolumn(yes,onlyone,justlikepartitioning) andhashthe

distinct values of this column, allocating each distinct value into one of our

buckets. To be a bit more precise, the hash is performed by the DMS as the

data is written to the table. Let's look at an example:

CREATE TABLE [dbo].[FactInternetSales]

(

Search WWH ::

Custom Search

Home