Table 10.5 HDFS File Naming Convention

Attribute     Format    Description
QUERYID       QIDn*     PDW reuses its internally generated, appliance-unique,
                        formatted numeric request_id, which we typically see
                        in sys.dm_pdw_exec_requests, to mark the export. This
                        enables us to tie the files in HDFS to the query in
                        PDW that exported them. Therefore, n* identifies the
                        query in this context.
DATE          YYYYMMDD  Date the query started to export data to HDFS. (Note
                        that this is the date in PDW, not the date in Hadoop.)
TIME          HHMMSS    Time the query started to export data to HDFS. (Note
                        that this is the time in PDW, not the time in Hadoop.)
DISTRIBUTION  n*        Zero-based number representing each distribution in
                        the appliance. There are 8 distributions per compute
                        node, so with 6 compute nodes you have 48
                        distributions. Because the numbering is zero-based,
                        the distribution numbers in this instance range from
                        0 to 47, and n* in this context would therefore be
                        0-47.

File name pattern: QUERYID_DATE_TIME_DISTRIBUTION
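
The naming convention can be illustrated with a short Python sketch that splits an exported file name into its four attributes and computes the expected range of distribution numbers. This is only an illustration under stated assumptions: the underscore separator is inferred from the QUERYID_DATE_TIME_DISTRIBUTION pattern, the optional file extension is a guess, and the file name and helper names (parse_export_name, distribution_range) are hypothetical rather than part of PDW or Hadoop.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Stated in Table 10.5: 8 distributions per compute node.
DISTRIBUTIONS_PER_NODE = 8


class ExportFile(NamedTuple):
    request_id: int      # the n* in QIDn*
    started: datetime    # PDW-side date and time the export started
    distribution: int    # zero-based distribution number


# Assumption: attributes are joined with underscores and an optional
# extension may follow; adjust the pattern to match your actual files.
_NAME = re.compile(r"^QID(\d+)_(\d{8})_(\d{6})_(\d+)(?:\.\w+)?$")


def parse_export_name(name: str) -> ExportFile:
    """Split a file name into request id, start timestamp, distribution."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised export file name: {name!r}")
    qid, date, time, dist = match.groups()
    started = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return ExportFile(int(qid), started, int(dist))


def distribution_range(compute_nodes: int) -> range:
    """Zero-based distribution numbers for an appliance of this size."""
    return range(compute_nodes * DISTRIBUTIONS_PER_NODE)


if __name__ == "__main__":
    # Hypothetical file name, made up for illustration only.
    info = parse_export_name("QID776_20130523_063420_0")
    print(info.request_id, info.started.isoformat(), info.distribution)
    dists = distribution_range(6)
    print(len(dists), dists[0], dists[-1])
```

The parsed request_id is the value you would look up in sys.dm_pdw_exec_requests to find the exporting query, and a distribution number outside distribution_range for your appliance would suggest the name was not parsed as intended.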
We can conclude that PDW creates one file per distribution, which is how
it is able to write the data out in parallel. It follows that if the data is
skewed in PDW, the files in HDFS will be similarly skewed. By default, there
will be three copies of the data, and the files will be made up of 64MB
blocks (because this is the default block size). We can see the actual size
and the number of replicas with the -ls command. The value 524139 shown in
Figure 10.14 is the size in bytes for
the first distribution (that is, distribution 0 or compute node 1 distribution
A in PDW parlance). We can also see how many copies of this file we have
 