Databases Reference
In-Depth Information
11.1.3 HiveQL in details
Having seen Hive in action, we're ready to formally look at different aspects and usage
of HiveQL.
DATA MODEL
We've already seen that Hive supports tables as a fundamental data model. Physically,
Hive stores tables as directories under /user/hive/warehouse. For example, the cite ta-
ble we created earlier would have its data under the /user/hive/warehouse/cite direc-
tory. The output table cite_count would be under /user/hive/warehouse/cite_count.
In the most basic setup, the directory hierarchy under a table is only one level deep,
and the table's data are spread out over many files under that one directory.
Relational databases use indexes on columns to speed up queries on those columns.
Hive, instead, uses a concept of partition columns, which are columns whose values
would divide the table into separate partitions. For example, a state column would
partition a table into 50 partitions, one for each state. 6 A date column is a popular
partition column for log data; data for each day would belong to its own partition. Hive
treats partition columns differently than regular data columns, and executes queries
involving partition columns much more efficiently. This is because Hive physically
stores different partitions in different directories. For example, let's say you have a
table named users that has two partition columns date and state (plus the regular data
columns). Hive will have a directory structure like this for that table:
/user/hive/warehouse/users/date=20090901/state=CA
/user/hive/warehouse/users/date=20090901/state=NY
/user/hive/warehouse/users/date=20090901/state=TX
...
/user/hive/warehouse/users/date=20090902/state=CA
/user/hive/warehouse/users/date=20090902/state=NY
/user/hive/warehouse/users/date=20090902/state=TX
...
/user/hive/warehouse/users/date=20090903/state=CA
/user/hive/warehouse/users/date=20090903/state=NY
/user/hive/warehouse/users/date=20090903/state=TX
...
All user data for California ( state=CA ) on September 1, 2009 ( date=20090901 ) re-
sides in one directory, and data for other partitions is in other directories. If a query
comes in asking about California users on September 1, 2009, Hive only has to process
data in that one directory and ignore data in the users table that have been stored in
other partitions. Queries over ranges in the partition columns will involve processing
multiple directories, but Hive will still avoid a full scan of all data in users . In some
sense partitioning brings similar benefits to Hive as indexing provides to a traditional
relational database, although partitioning works at a much less granular level. You'll
want each partition to still be big enough that a MapReduce job on it can be reason-
ably efficient.
6 In practice you'll also have to handle District of Columbia and various territories.
 
Search WWH ::




Custom Search