You can use the SKEWED BY clause to create separate files for the rows in which a specified column holds one of a list of specified values. Rows with values not in the list are stored together in a single other file.
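As a rough illustration of the clause, the following sketch stores a HiveQL statement in a PowerShell here-string so that it can later be submitted as a Hive query. The table name, column names, and ticker values are hypothetical and are not taken from the chapter's dataset; STORED AS DIRECTORIES is the optional Hive keyword that places each listed value in its own location.

```powershell
# Hypothetical table and columns, shown only to illustrate SKEWED BY.
# A single-quoted here-string keeps the text literal (no variable expansion).
$skewedTableQuery = @'
CREATE TABLE stock_skewed (
    stock_symbol STRING,
    stock_date STRING,
    stock_price_close DOUBLE
)
SKEWED BY (stock_symbol) ON ('AAPL', 'MSFT')
STORED AS DIRECTORIES;
'@

# Display the statement; submitting it to the cluster is not shown here.
Write-Output $skewedTableQuery
```

In this sketch, rows for AAPL and MSFT would each be stored separately, and rows for every other symbol would share the remaining location.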
You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders (described as buckets) based on the values of specified columns using a hashing algorithm.
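A minimal sketch of a bucketed table definition might look like the following, again held in a PowerShell here-string. The table name, column names, and the bucket count of 4 are assumptions chosen for illustration only.

```powershell
# Hypothetical table and columns, shown only to illustrate CLUSTERED BY.
# Hive hashes stock_symbol to decide which of the 4 buckets a row belongs to.
$bucketedTableQuery = @'
CREATE TABLE stock_bucketed (
    stock_symbol STRING,
    stock_date STRING,
    stock_price_close DOUBLE
)
CLUSTERED BY (stock_symbol) INTO 4 BUCKETS;
'@

# Display the statement; submitting it to the cluster is not shown here.
Write-Output $bucketedTableQuery
```

Because the bucket is chosen by hashing the column value, all rows for a given symbol land in the same bucket.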
There are a few ways to execute Hive queries against your HDInsight cluster:
• Using the Hadoop Command Line
• Using .NET SDK
• Using Windows Azure PowerShell
In this chapter, we use Windows Azure PowerShell to create, populate, and query Hive tables. The Hive tables are based on demo stock data for the following companies:
• Apple
• Facebook
• Google
• MSFT
• IBM
• Oracle
Let's first load the input files to the WASB storage that our democluster is using by executing the PowerShell script in Listing 8-1. The input files used in this topic are a subset of the stock market dataset available for free at www.infochimps.com and are provided separately.
Listing 8-1. Uploading files to WASB
$subscriptionName = "<YourSubscriptionname>"
$storageAccountName = "democluster"
$containerName = "democlustercontainer"
#This path may vary depending on where you place the source .csv files.
$fileName ="D:\HDIDemoLab\TableFacebook.csv"
$blobName = "Tablefacebook.csv"
# Get the storage account key
Select-AzureSubscription $subscriptionName
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{$_.Primary}
# Create the storage context object
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
# Copy the file from local workstation to the Blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName -Blob $blobName -Context $destContext
■ Note Repeat these steps for the other .csv files in the folder by changing the $fileName and $blobName variables and rerunning Set-AzureStorageBlobContent.
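Instead of editing the variables by hand for each file, the repetition could be scripted. The following is a minimal sketch, assuming that $containerName and $destContext have already been set as in Listing 8-1 and that all source files sit in D:\HDIDemoLab. The $sourceFolder variable is introduced here for illustration, and each blob is named after its local file exactly as it appears on disk.

```powershell
# Assumes $containerName and $destContext are defined as in Listing 8-1.
# $sourceFolder is a hypothetical variable; adjust it to where the .csv files are.
$sourceFolder = "D:\HDIDemoLab"

# Upload every .csv file in the folder, one blob per file.
Get-ChildItem -Path $sourceFolder -Filter "*.csv" | ForEach-Object {
    # Use the local file name, unchanged, as the blob name.
    Set-AzureStorageBlobContent -File $_.FullName -Container $containerName `
        -Blob $_.Name -Context $destContext
}
```

Blob names are case sensitive, so if later scripts expect a specific spelling such as Tablefacebook.csv, the local files should be named to match before uploading.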