Hive and the Hadoop herd - Hadoop in Action

Databases Reference

In-Depth Information

11.1.3 HiveQL in details

Having seen Hive in action, we're ready to formally look at different aspects and usage

of HiveQL.

DATA MODEL

We've already seen that Hive supports tables as a fundamental data model. Physically,

Hive stores tables as directories under /user/hive/warehouse. For example, the cite ta-

ble we created earlier would have its data under the /user/hive/warehouse/cite direc-

tory. The output table cite_count would be under /user/hive/warehouse/cite_count.

In the most basic setup, the directory hierarchy under a table is only one level deep,

and the table's data are spread out over many files under that one directory.

Relational databases use indexes on columns to speed up queries on those columns.

Hive, instead, uses a concept of partition columns, which are columns whose values

would divide the table into separate partitions. For example, a state column would

partition a table into 50 partitions, one for each state. 6 A date column is a popular

partition column for log data; data for each day would belong to its own partition. Hive

treats partition columns differently than regular data columns, and executes queries

involving partition columns much more efficiently. This is because Hive physically

stores different partitions in different directories. For example, let's say you have a

table named users that has two partition columns date and state (plus the regular data

columns). Hive will have a directory structure like this for that table:

/user/hive/warehouse/users/date=20090901/state=CA

/user/hive/warehouse/users/date=20090901/state=NY

/user/hive/warehouse/users/date=20090901/state=TX

...

/user/hive/warehouse/users/date=20090902/state=CA

/user/hive/warehouse/users/date=20090902/state=NY

/user/hive/warehouse/users/date=20090902/state=TX

...

/user/hive/warehouse/users/date=20090903/state=CA

/user/hive/warehouse/users/date=20090903/state=NY

/user/hive/warehouse/users/date=20090903/state=TX

...

All user data for California ( state=CA ) on September 1, 2009 ( date=20090901 ) re-

sides in one directory, and data for other partitions is in other directories. If a query

comes in asking about California users on September 1, 2009, Hive only has to process

data in that one directory and ignore data in the users table that have been stored in

other partitions. Queries over ranges in the partition columns will involve processing

multiple directories, but Hive will still avoid a full scan of all data in users . In some

sense partitioning brings similar benefits to Hive as indexing provides to a traditional

relational database, although partitioning works at a much less granular level. You'll

want each partition to still be big enough that a MapReduce job on it can be reason-

ably efficient.

6 In practice you'll also have to handle District of Columbia and various territories.

Hadoop in Action

Search WWH ::

Custom Search

Home