Database Reference
In-Depth Information
Listing 8-8.
Querying data from a Hive table
$subscriptionName = "YourSubscriptionName"
$clustername = "democluster"
Select-AzureSubscription -SubscriptionName $subscriptionName
Use-AzureHDInsightCluster $clusterName
-Subscription (Get-AzureSubscription -Current).SubscriptionId
$querystring = "select * from stock_analysis
where stock_symbol LIKE 'MSFT' LIMIT 10;"
Invoke-Hive -Query $querystring
You should see output similar to the following once the job execution completes:
Successfully connected to cluster democluster
Submitting Hive query..
Started Hive query with jobDetails Id : job_201311240635_0014
Hive query completed Successfully
MSFT 2/8/2013 31.69 31.9 31.57 31.89 29121500 31.89 NASDAQ
MSFT 1/8/2013 32.06 32.09 31.6 31.67 42328400 31.67 NASDAQ
MSFT 31/07/2013 31.97 32.05 31.71 31.84 43898400 31.84 NASDAQ
MSFT 30/07/2013 31.78 32.12 31.55 31.85 45799500 31.85 NASDAQ
MSFT 29/07/2013 31.47 31.6 31.4 31.54 28870700 31.54 NASDAQ
MSFT 26/07/2013 31.26 31.62 31.21 31.62 38633600 31.62 NASDAQ
MSFT 25/07/2013 31.62 31.65 31.25 31.39 63213000 31.39 NASDAQ
MSFT 24/07/2013 32.04 32.19 31.89 31.96 52803100 31.96 NASDAQ
MSFT 23/07/2013 31.91 32.04 31.71 31.82 65810400 31.82 NASDAQ
MSFT 22/07/2013 31.7 32.01 31.6 32.01 79040700 32.01 NASDAQ
It is very important to note that Hive queries use minimal caching, statistics, or optimizer tricks. They generally
read the entire data set on each execution, and thus are more suitable for batch processing than for online work. One
of the strongest recommendations I have for you while you are querying Hive is to write
SELECT *
instead of listing
specific column names. Fetching a selective list of columns like in Listing 8-9 is a best practice when the source is a
classic database management system like SQL Server database, but the story is completely different with Hive.
Listing 8-9.
Selecting a partial list of columns
SELECT stock_symbol, stock_volume
FROM stock_analysis;
The general principle of HIVE is to expose Hadoop MapReduce functionality through an SQL-like language.
Thus, when you issue a command like that in Listing 8-9 a MapReduce job will be triggered to remove any columns
from the Hive table data set that aren't being specified in the query, and to send back only the columns
stock_symbol
and
stock_volume
.
On the other hand, the HiveQL in Listing 8-10 does not require any MapReduce job to return its results, because
there is no need to eliminate columns. Hence, there is less processing in the background.