Adding Structure with Hive - Microsoft Big Data Solutions

Hive provides several forms of connectivity to Hadoop data through Thrift.

Thrift is a software framework for cross-language network service

communication, and Hive's JDBC and ODBC drivers connect to Hive through it.

Because ODBC is broadly supported by query access tools, this makes it much

easier for business users to access the data in Hadoop using their favorite

analysis tools. Excel is one of the common tools used by end users for

working with data, and it supports ODBC. (Using Excel with Hadoop is

discussed further in Chapter 11, “Visualizing Big Data with Microsoft BI.”)

In addition to providing ODBC data access, Hive also acts as a translator for

SQL. As mentioned previously, many users and developers are familiar

with writing SQL statements to query and transform data. Hive can take that

SQL and translate it into MapReduce jobs. So, rather than the business users

having to learn Java and MapReduce, or learn a new tool for querying data,

they can leverage their existing knowledge and skills.
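
To make the translation idea concrete, the following is a toy Python sketch of how a simple aggregate query could be broken into map, shuffle, and reduce steps. It is an illustration only and not how Hive's planner is implemented. The sales rows, the column names, and every function name here are hypothetical and were chosen for the example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for records stored in Hadoop.
ROWS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 20},
]


def map_phase(rows):
    """Emit one (key, value) pair per input row, as a mapper would."""
    for row in rows:
        yield row["customer"], row["amount"]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single aggregate."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows):
    # Conceptually equivalent to:
    #   SELECT customer, SUM(amount) FROM sales GROUP BY customer
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(run_query(ROWS))
```

The point of the sketch is that the user only writes the one-line query in the comment, while the three phases are the kind of work that would otherwise have to be hand-coded as a MapReduce job.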

Hive manages this SQL translation by providing Hive Query Language

(HQL). HQL provides support for common SQL language operations like

SELECT for retrieving information and INSERT INTO to load data.

Although HQL is not ANSI SQL compliant, it implements enough of the

standard to be familiar to users who have experience working with RDBMS

systems.

Differentiating Hive from Traditional RDBMS Systems

This chapter has discussed several of the ways that Hive emulates a

relational database. It has also covered some of the ways in which it differs,

including the data types and the storage of the data. Those topics are worth

covering in a bit more depth because they do have significant impact on how

Hive functions and what you should expect from it.

In a relational database like SQL Server, the database engine manages the

data storage. That means when you insert data into a table in a relational

database, the server takes that data, converts it into whatever format it

chooses, and stores it in data structures that it manages and controls. At

that point, the server becomes the gatekeeper of the data. To access the data

again, you must request it from the relational database so that the server

can retrieve it from the internal storage and return it to you. Other systems

cannot access or change the data directly without going through the server.
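
The gatekeeper role described above can be sketched in a few lines of Python. This example assumes SQLite, through Python's standard sqlite3 module, as a stand-in for a relational engine. SQLite is an embedded library rather than a server like SQL Server, but it shows the same pattern: the engine decides how rows are stored, and the data goes in and comes back out only through SQL requests to that engine. The table name, column names, and the store_and_fetch helper are hypothetical.

```python
import sqlite3


def store_and_fetch(rows):
    """Insert rows through the engine, then read them back through it."""
    # An in-memory database: the engine owns the storage entirely.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
        # The engine converts these values into its own internal format.
        conn.executemany(
            "INSERT INTO sales (customer, amount) VALUES (?, ?)", rows
        )
        conn.commit()
        # The only way to see the data again is to ask the engine for it.
        cursor = conn.execute(
            "SELECT customer, amount FROM sales ORDER BY customer, amount"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(store_and_fetch([("bob", 15), ("alice", 30)]))
```

Nothing in the calling code knows or controls how the rows are laid out in storage; it only sends statements and receives results.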
