Why do we need HBase when the data is already stored in HDFS, the core data storage layer within
Hadoop? HBase is useful for workloads other than MapReduce execution, for operations that are
awkward to perform directly on HDFS files, and, above all, when you need random access to the data.
HBase satisfies two types of use cases:
It provides a database-style interface to Hadoop, which enables developers to deploy programs
that can quickly read or write specific subsets of data in an extremely voluminous data set,
without having to scan and process the entire data set.
It provides a transactional platform for running high-scale, real-time applications as an ACID-
compliant database (meeting standards for atomicity, consistency, isolation, and durability) while
handling the incredible volume, variety, and complexity of data encountered on the Hadoop
platform. HBase supports the following properties of ACID compliance:
Atomicity: All mutations are atomic within a row. A write that touches several columns of the
same row will either wholly succeed or wholly fail (see the sketch after this list).
Consistency: Any row returned by a read operation consists of a complete row that exists, or
existed at some point, in the table.
Isolation: The isolation level provided corresponds to what a traditional DBMS calls “read committed.”
Durability: All visible data in the system is durable data; in other words, a read will never
return data that has not been made durable on disk.
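To make both use cases concrete, here is a minimal sketch using the HBase Java client (the HBase 2.x API is assumed; the table name people, the column family person, and the row key user-0042 are illustrative only, not part of any real schema). A Put that carries several column mutations for the same row is applied atomically, and a Get retrieves that single row without touching the rest of the data set.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowLevelAccessSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("people"))) {

            // Both column mutations target the same row key, so HBase applies
            // them atomically: a reader sees either both values or neither.
            Put put = new Put(Bytes.toBytes("user-0042"));
            put.addColumn(Bytes.toBytes("person"), Bytes.toBytes("name"), Bytes.toBytes("Jane Doe"));
            put.addColumn(Bytes.toBytes("person"), Bytes.toBytes("comments"), Bytes.toBytes("new hire"));
            table.put(put);

            // Random read of one specific row out of a potentially huge table;
            // no scan of the full data set is required.
            Result result = table.get(new Get(Bytes.toBytes("user-0042")));
            String name = Bytes.toString(
                    result.getValue(Bytes.toBytes("person"), Bytes.toBytes("name")));
            System.out.println("name = " + name);
        }
    }
}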
HBase is different from traditional RDBMS and DBMS platforms and is architected and deployed like
other NoSQL databases.
HBase architecture
Data is organized in HBase as tables, rows, and columns, much like a database; however, that is
where the similarity ends. Let us look at the data model of HBase first and then examine the
implementation architecture.
Tables:
Tables are made of rows and columns.
Table cells are the intersection of row and column coordinates. Each cell is versioned by
default with a timestamp. The contents of a cell are treated as an uninterpreted array of bytes.
A table row has a sortable row key and an arbitrary number of columns.
Rows:
Table row keys are also byte arrays, so almost anything can serve as the row key, as
opposed to the strongly typed data types of a traditional database.
Table rows are kept sorted in byte order by row key, which acts as the table's primary key, and
all table access is via this primary key (illustrated in the range scan sketch after this list).
Columns are grouped into families, and a row can have as many columns as are loaded into it.
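The byte-ordered row key is what makes bounded scans cheap: because rows are physically stored in row-key order, a scan between a start key and a stop key reads only that contiguous slice of the table. The sketch below again assumes the HBase 2.x Java client; the table name people and the user-... row keys are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyRangeScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("people"))) {

            // Rows are stored sorted by the byte order of their row keys, so a
            // scan bounded by a start key (inclusive) and a stop key (exclusive)
            // touches only the rows in that key range.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("user-1000"))
                    .withStopRow(Bytes.toBytes("user-2000"));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}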
Columns and column groups (families):
In HBase, row columns are grouped into column families.
All members of a column family share a common prefix; for example, the columns
person:name and person:comments are both members of the person column family, whereas
email:identifier belongs to the email family.
A table's column families must be specified upfront as part of the table schema definition,
as the sketch below illustrates.
New column family members can be added on demand.
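The upfront declaration of column families is visible in the table-creation API. This sketch (HBase 2.x Java client assumed; the table name people is hypothetical) defines the person and email families at creation time; individual qualifiers such as person:name or email:identifier need no schema change and are created implicitly the first time data is written to them.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Column families ("person" and "email") are declared up front as
            // part of the schema; the columns inside each family are not.
            admin.createTable(
                    TableDescriptorBuilder.newBuilder(TableName.valueOf("people"))
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("person"))
                            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("email"))
                            .build());
        }
    }
}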