Expanding Your Capability with HBase and HCatalog - Microsoft Big Data Solutions


One caveat to note is that the dynamic partition columns are matched by

position, not by name, and must be the last columns in the SELECT clause.
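
The positional rule can be illustrated with a small Python model. This is only a sketch of the matching behavior, not Hive code; the function name and the column names (temp, building_id, reading_date) are hypothetical.

```python
def assign_dynamic_partitions(select_columns, row, partition_columns):
    """Split one selected row into data values and partition values by position."""
    n = len(partition_columns)
    if n == 0 or n > len(row) or len(row) != len(select_columns):
        raise ValueError("row and column lists do not line up")
    # Everything before the last n values is ordinary column data.
    data = dict(zip(select_columns[:-n], row[:-n]))
    # The last n values feed the partition columns, in order.
    # The names used in the select list play no part in this step.
    partitions = dict(zip(partition_columns, row[-n:]))
    return data, partitions


# Partition column listed last: the date lands in the partition.
data, partitions = assign_dynamic_partitions(
    ["temp", "building_id", "reading_date"], (68, 4, "2013-06-01"), ["reading_date"]
)

# Partition column listed in the middle: the building ID is used as the
# partition value instead, because only position counts.
_, wrong_partitions = assign_dynamic_partitions(
    ["temp", "reading_date", "building_id"], (68, "2013-06-01", 4), ["reading_date"]
)
```

In the second call the select list still contains a column named reading_date, yet the partition receives the building ID, which is the kind of silent mistake the caveat warns about.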

Integrating HCatalog with Pig and Hive

Although originally designed to provide the metadata store for Hive,

HCatalog's role has greatly expanded in the Hadoop ecosystem. It integrates

with other tools and supplies read and write interfaces for Pig and

MapReduce. It also integrates with Sqoop, which is a tool designed to

transfer data back and forth between Hadoop and relational databases such

as SQL Server and Oracle. HCatalog also exposes a REST interface so that

you can create custom tools and applications to interact with Hadoop data

structures. In addition, HCatalog contains a notification service so that it

can notify workflow tools such as Oozie when data has been loaded or

updated.
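
As a rough illustration of the REST interface, the following Python sketch builds the request URLs a custom tool might call. It assumes the interface is the WebHCat (formerly Templeton) service listening on its usual default port, 50111; the host, user name, and table name (hvac_staging) are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table=None, user="hadoop", port=50111):
    """Build a WebHCat DDL URL that lists tables or describes one table."""
    url = (
        f"http://{host}:{port}/templeton/v1/ddl/database/"
        f"{quote(database, safe='')}/table"
    )
    if table is not None:
        # Appending a table name asks for that table's description.
        url += "/" + quote(table, safe="")
    # WebHCat identifies the caller through the user.name query parameter.
    return url + "?" + urlencode({"user.name": user})


list_url = webhcat_table_url("localhost", "default")
describe_url = webhcat_table_url("localhost", "default", table="hvac_staging")
```

Issuing an HTTP GET against either URL on a running cluster should return a JSON document, which is what makes the interface convenient for tools written outside the Hadoop stack.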

Another key feature of HCatalog is that it allows developers to share data

and structures across internal toolsets like Pig and Hive. You do not have to

explicitly declare the data structures in each program. This allows you to use the

right tool for the right job. For example, you can load data into Hadoop using

HCatalog, perform some ETL on the data using Pig, and then aggregate the

data using Hive. After the processing, you could then send the data to your

data warehouse housed in SQL Server using Sqoop. You can even automate

the process using Oozie.
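
The idea of defining a structure once and sharing it between tools can be sketched in plain Python. This is a toy model only: the dictionary stands in for the shared catalog, the two functions stand in for a Pig-style ETL step and a Hive-style aggregation, and the table and column names are hypothetical.

```python
# Toy stand-in for a shared metadata catalog: table name -> ordered column names.
catalog = {"hvac": ["building_id", "target_temp", "actual_temp"]}


def load(table, raw_rows):
    """Attach the catalog's column names to raw tuples."""
    columns = catalog[table]
    return [dict(zip(columns, row)) for row in raw_rows]


def etl_step(table, raw_rows):
    """Pig-like step: keep readings whose actual temperature is off target."""
    return [r for r in load(table, raw_rows) if r["actual_temp"] != r["target_temp"]]


def aggregate_step(rows):
    """Hive-like step: count off-target readings per building."""
    counts = {}
    for r in rows:
        counts[r["building_id"]] = counts.get(r["building_id"], 0) + 1
    return counts


raw = [(1, 70, 70), (1, 70, 75), (2, 68, 66), (2, 68, 64)]
off_target = etl_step("hvac", raw)
per_building = aggregate_step(off_target)
```

Neither step lists the columns itself; both rely on the names registered once in the catalog, so a change to the table definition is made in one place.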

To complete the following exercise, you need to download and install

HDP for Windows from Hortonworks. You can set up HDP for Windows on

a development server to provide a local test environment that supports a

single-node deployment. (For a detailed discussion of installing the Hadoop

development environment on Windows, see

http://hortonworks.com/products/hdp-windows/.)

In this exercise, we analyze sensor data collected from HVAC systems

monitoring the temperatures of buildings. You can download the sensor

data from http://www.wiley.com/go/microsoftbigdatasolutions. There

should be two files: one with sensor data (HVAC.csv) and one containing

building information (building.csv). After extracting the files, load the

data into a staging table using HCatalog and Hive:

1. Open the Hive CLI. Because Hive and HCatalog are so tightly coupled,

you can write HCatalog commands directly in the Hive CLI. As a matter
