Data Warehouses and Hadoop Integration - Microsoft Big Data Solutions


Table 10.5 HDFS File Naming Convention

Attribute       Format      Description

QUERYID         QIDn*       PDW reuses its internally generated, appliance unique, formatted numeric request_id that we typically see in sys.dm_pdw_exec_requests to mark the export. This enables us to tie the files in HDFS to the query in PDW that exported it. Therefore, n* represents the query in this context.

DATE            YYYYMMDD    Date query started to export data to HDFS. (Note that this is the date in PDW, not the date in Hadoop.)

TIME            HHMMSS      Time query started to export data to HDFS. (Note that this is the time in PDW, not the time in Hadoop.)

DISTRIBUTION    n*          Zero-based number for representing each distribution in the appliance. There are 8 distributions in the appliance per compute node. Therefore, if you have 6 compute nodes, you have 48 distributions. However, the range of these distribution numbers would be 0 to 47 in this instance. The n* in this context would therefore be 0-47.

QUERYID_DATE_TIME_DISTRIBUTION
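As an illustration of the convention in Table 10.5, the following Python sketch splits an exported file name back into its four attributes and computes the zero-based distribution range for a given number of compute nodes. The names parse_export_file_name, ExportFile, and distribution_range are hypothetical helpers, not part of PDW or Hadoop. The sketch assumes the four parts are joined by underscores exactly as shown above and that the file may carry an optional extension, which the table does not specify.

```python
import re
from datetime import datetime
from typing import NamedTuple

# From Table 10.5: each compute node holds 8 distributions.
DISTRIBUTIONS_PER_COMPUTE_NODE = 8


class ExportFile(NamedTuple):
    """Hypothetical container for the parts of an exported file name."""
    request_id: str      # e.g. "QID776", matches sys.dm_pdw_exec_requests
    started: datetime    # date and time in PDW, not in Hadoop
    distribution: int    # zero-based distribution number


# QUERYID_DATE_TIME_DISTRIBUTION, with an optional extension (an assumption).
_PATTERN = re.compile(r"^QID(\d+)_(\d{8})_(\d{6})_(\d+)(?:\.\w+)?$")


def parse_export_file_name(name: str) -> ExportFile:
    """Split a file name into request id, start timestamp, and distribution."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised export file name: {name!r}")
    query_number, date_part, time_part, distribution = match.groups()
    started = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return ExportFile("QID" + query_number, started, int(distribution))


def distribution_range(compute_nodes: int) -> range:
    """Return the zero-based distribution numbers for an appliance."""
    return range(compute_nodes * DISTRIBUTIONS_PER_COMPUTE_NODE)
```

For a made-up name such as QID776_20130502_132306_0, the parser yields the request_id to look up in sys.dm_pdw_exec_requests, the PDW start timestamp, and distribution 0. Calling distribution_range(6) covers the 48 distributions numbered 0 to 47 described in the table.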

We can conclude that PDW is creating one file per distribution and that this is how PDW is able to write the data out in parallel. Consequently, we can also conclude that if the data is skewed in PDW, the files in HDFS will be similarly skewed. By default, there will be three copies of the data, and the files will be constituted from 64MB blocks (because this is the block size). We can see the actual size and the number of replicas with our -ls command. The value 524139 shown in Figure 10.14 is the size in bytes for the first distribution (that is, distribution 0 or compute node 1 distribution A in PDW parlance). We can also see how many copies of this file we have
