paths directly works fine and is perfectly acceptable in many scenarios, it also binds your Hive table or Pig job to a specific data layout within HDFS.
If this data layout were to change, whether during an activity like data maintenance or simply because the size of the data outgrew the initial HDFS organizational structure, your script or job would break. You would then have to revisit every script or job that referenced this data, which in large systems can be decidedly unpleasant.
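To make the problem concrete, here is a minimal sketch of a Pig job bound to a fixed path (the path and field layout are hypothetical, loosely matching the IIS log table defined later in this section):

-- Hard-coded path: if /data/logs is ever reorganized, this LOAD fails
logs = LOAD '/data/logs' USING PigStorage(',')
    AS (date:chararray, time:chararray, username:chararray,
        ip:chararray, port:int, method:chararray, uristem:chararray,
        uriquery:chararray, timetaken:int, useragent:chararray,
        referrer:chararray);

Every script that repeats this LOAD statement embeds the same path and the same field layout, so a single relocation of the data ripples through all of them.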
This scenario is just one of the reasons the Apache HCatalog project was created. HCatalog began as an abstraction of Hive's metadata management functionality (it is now part of the larger Apache Hive project) and is intended to enable shared metadata across the Hadoop ecosystem.
Table definitions and even data type mappings can be created and shared, so users can work with data stored in HDFS without worrying about underlying details such as where or how the data is stored. HCatalog currently works with MapReduce, Hive, and, of course, Pig. Because it is an abstraction of the Hive platform, the syntax for creating tables is identical to Hive's, except that we specify the data location when creating the table:
CREATE EXTERNAL TABLE iislog (
    `date`    STRING,   -- `date` is a reserved word in recent Hive
    time      STRING,   -- versions, so it is quoted with backticks
    username  STRING,
    ip        STRING,
    port      INT,
    method    STRING,
    uristem   STRING,
    uriquery  STRING,
    timetaken INT,
    useragent STRING,
    referrer  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/logs';
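Once the table is registered, other tools can address it by name rather than by path. As a sketch (the HCatLoader class name shown is the one used after HCatalog merged into Hive; earlier standalone releases used org.apache.hcatalog.pig.HCatLoader), a Pig job can read the same data like this:

-- Run with: pig -useHCatalog script.pig
-- The table name replaces the hard-coded HDFS path; the schema and
-- location are resolved from the shared HCatalog metadata.
logs = LOAD 'iislog' USING org.apache.hive.hcatalog.pig.HCatLoader();
slow  = FILTER logs BY timetaken > 1000;

If the data later moves, only the table's LOCATION in the metastore needs to change; the Pig script itself is untouched.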