paths directly works fine and is perfectly acceptable in many scenarios, it also binds your Hive table or Pig job to a specific data layout within HDFS.
If this data layout were to change, whether during an activity like data maintenance or simply because the size of the data outgrew the initial HDFS organizational structure, your script or job would break. You would then have to revisit every script or job that referenced this data, which in large systems can be decidedly unpleasant.
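To make the problem concrete, here is a minimal sketch of a Pig job bound to a fixed path (the path and field layout are hypothetical, loosely matching the IIS log table defined later in this section):

-- Hard-coded path: if /data/logs is ever reorganized, this LOAD fails
logs = LOAD '/data/logs' USING PigStorage(',')
    AS (date:chararray, time:chararray, username:chararray,
        ip:chararray, port:int, method:chararray, uristem:chararray,
        uriquery:chararray, timetaken:int, useragent:chararray,
        referrer:chararray);

Every script that repeats this LOAD statement embeds the same path and the same field layout, so a single relocation of the data ripples through all of them.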
This scenario is just one of the reasons the Apache HCatalog project was created. HCatalog began as an abstraction of Hive's metadata management functionality (it is now part of the larger Apache Hive project) and is intended to enable shared metadata across the Hadoop ecosystem.
Table definitions and even data type mappings can be created and shared, so users can work with data stored in HDFS without worrying about underlying details such as where or how the data is stored. HCatalog currently works with MapReduce, Hive, and, of course, Pig. Because it is an abstraction of the Hive platform, the syntax for creating tables is identical to Hive's, except that we specify the data location when creating the table:
CREATE EXTERNAL TABLE iislog (
    `date`    STRING,   -- `date` is a reserved word in recent Hive
    time      STRING,   -- versions, so it is quoted with backticks
    username  STRING,
    ip        STRING,
    port      INT,
    method    STRING,
    uristem   STRING,
    uriquery  STRING,
    timetaken INT,
    useragent STRING,
    referrer  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/logs';
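Once the table is registered, other tools can address it by name rather than by path. As a sketch (the HCatLoader class name shown is the one used after HCatalog merged into Hive; earlier standalone releases used org.apache.hcatalog.pig.HCatLoader), a Pig job can read the same data like this:

-- Run with: pig -useHCatalog script.pig
-- The table name replaces the hard-coded HDFS path; the schema and
-- location are resolved from the shared HCatalog metadata.
logs = LOAD 'iislog' USING org.apache.hive.hcatalog.pig.HCatLoader();
slow  = FILTER logs BY timetaken > 1000;

If the data later moves, only the table's LOCATION in the metastore needs to change; the Pig script itself is untouched.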