3. Columns and column groups (families)
a. In HBase, the columns of a row are grouped into column families.
b. All members of a column family must share a common prefix; for
example, the columns person:name and person:comments are both
members of the person column family, whereas e-mail:identifier
belongs to the e-mail family.
c. A table's column families must be specified up front as part of the
table schema definition.
d. New column family members can be added on demand.
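The rules above can be sketched as a small in-memory model. This is only an illustration in Python and not the HBase client API: the class name ToyTable, its methods, and the sample values are hypothetical, and details of real HBase such as cell timestamps and byte-array keys are omitted.

```python
class ToyTable:
    """Hypothetical in-memory stand-in for an HBase table (not the real API)."""

    def __init__(self, name, column_families):
        # Rule (c): column families are fixed up front as part of the schema.
        self.name = name
        self.column_families = frozenset(column_families)
        self.rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        # Rule (b): every column name carries its family as a prefix.
        family, sep, _qualifier = column.partition(":")
        if not sep:
            raise ValueError(f"column {column!r} lacks a 'family:' prefix")
        if family not in self.column_families:
            raise KeyError(f"unknown column family {family!r}")
        # Rule (d): new members of a known family need no schema change.
        self.rows.setdefault(row_key, {})[column] = value

    def family(self, row_key, family):
        # Rule (a): the columns of one row are grouped by family prefix.
        prefix = family + ":"
        return {c: v for c, v in self.rows.get(row_key, {}).items()
                if c.startswith(prefix)}


table = ToyTable("people", ["person", "e-mail"])
table.put("row1", "person:name", "Ada")
table.put("row1", "person:comments", "first entry")
table.put("row1", "e-mail:identifier", "ada@example.org")
table.put("row1", "person:nickname", "A.")  # new member, added on demand
```

In this sketch, writing to a column such as address:city raises an error because the address family was never declared in the schema, while person:nickname is accepted without any schema change.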
17.3.3 Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop and
originally developed at Facebook. Like Pig, Hive was initially designed as
an in-house solution for large-scale data analysis. As the company expanded,
the parallel RDBMS infrastructure originally deployed at Facebook began to
choke on the amount of data that had to be processed daily. Following the
decision in 2008 to switch to Hadoop to overcome these scalability problems,
the Hive project was developed internally to provide the high-level interface
required for quick adoption of the new warehouse infrastructure inside the
company. Since 2009, Hive has also been available to the general public as an
open-source project under the Apache umbrella. Inside Facebook, Hive runs
thousands of jobs per day on different Hadoop clusters ranging from 300 to
1200 nodes to perform a wide range of tasks, including periodic reporting
of click counts, ad hoc analysis, and training machine learning models for ad
optimization. Other companies working with petabyte-scale data, such as
Netflix, reportedly use Hive to analyze website streaming logs and catalog
metadata.
The fundamental goals of designing Hive were the following:
• Build a system for managing and querying data using structured
techniques on Hadoop
• Use native MapReduce for execution at HDFS and Hadoop layers
• Use HDFS for storage of Hive data
• Store key metadata in an RDBMS
• Extend SQL interfaces, a familiar data warehousing tool in use at
enterprises
• Provide high extensibility: user-defined types, user-defined functions,
formats, and scripts
• Leverage the extreme scalability and performance of Hadoop
• Interoperate with other platforms
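To make the SQL and MapReduce goals concrete, the sketch below shows how a SQL-style aggregation such as SELECT page, COUNT(*) FROM clicks GROUP BY page can be phrased as a map phase followed by a reduce phase. It is a toy, single-process illustration in Python; the table name clicks, the sample records, and the function names are assumptions, and it does not reflect how Hive's query compiler actually generates jobs.

```python
from collections import defaultdict


def map_phase(records):
    # Emit one (key, 1) pair per input record, keyed by the GROUP BY column.
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the values for each key, which implements COUNT(*).
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample rows standing in for a "clicks" table.
clicks = [
    {"page": "/home", "user": "u1"},
    {"page": "/ads", "user": "u2"},
    {"page": "/home", "user": "u3"},
]
counts = reduce_phase(shuffle(map_phase(clicks)))
print(counts)
```

On a real cluster the map and reduce functions would run in parallel over data stored in HDFS; here they run sequentially in one process purely to show the shape of the computation.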