Accessing HDInsight over Hive and ODBC - Pro Microsoft HDInsight: Hadoop on Windows

Database Reference

In-Depth Information

Hive Query Language (HQL)

ODBC

JDBC

Hive Web

Interface (HWI)

Command Line

Interface (CLI)

MetaStore

Thrift Server

Compiler, Optimizer, Executor

Figure 8-2. Hive architecture

In effect, Hive enables you to create an interface layer over MapReduce that can be used in a similar fashion to

a traditional relational database. This layer enables business users to use familiar tools like Excel and SQL Server

Reporting Services to consume data from HDInsight as they would from a database system such as SQL Server,

remotely through an ODBC connection. The rest of this chapter walks you through different Hive operations and

using the Hive ODBC driver to consume the data.

Working with Hive

Hive uses tables to impose schema on data and provides a query interface for client applications. The key difference

between Hive tables and those in traditional database systems, such as SQL Server, is that Hive adopts a schema on

read approach. This approach enables you to be flexible about the specific columns and data types that you want

to project on top of your data. You can create multiple tables with different schema from the same underlying data,

depending on how you want to use that data. The most important point to take away from this approach is that the

table is simply a metadata schema that is imposed on data in underlying files.

Creating Hive Tables

You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar to the

analogous statement in Transact-SQL.

One thing to note about Hive tables is that you can create two types of tables: External and Internal . If you do not

specify a table type, a table is created as Internal .

Be careful! An internal table tells Hive to manage the data by itself. If you drop the table, by default the data is

also dropped and cannot be recovered. If you want to manage the data and data locations, if your data is used outside

Hive, or if you need to retain the data, create an external table. The syntax is pretty much similar, requiring just the

addition of the EXTERNAL keyword.

You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a specified column (for example,

to store a file of daily data for each date in a separate folder). Partitioning can improve query performance because

HDInsight will scan only relevant partitions in a filtered query.

Search WWH ::

Custom Search

Home