The following diagram represents how data is stored physically on the disk:
Column families: CF1 (Col-1, Col-2) and CF2 (Col-1, Col-2)
Row key: cell values
ROW-1: David | 982 765 2345
ROW-2: John | 909 451 4587 | 863 441 4123
ROW-3: Elan | 763 451 4587 | 863 341 4123
ROW-4: Maria
Physical Representation
ROW-1 : CF1 : Col1 : TS1 : David
ROW-2 : CF2 : Col1 : TS1 : John
ROW-1 : CF1 : Col2 : TS1 : 982 765 2345
ROW-3 : CF2 : Col2 : TS1 : 909 451 4587
ROW-3 : CF1 : Col1 : TS1 : Elan
ROW-3 : CF2 : Col2 : TS2 : 823 441 4123
ROW-4 : CF1 : Col1 : TS1 : Maria
ROW-4 : CF1 : Col2 : TS1 : 763 451 4587
ROW-4 : CF1 : Col2 : TS2 : 863 341 4123
In HBase, the entire cell, together with its structural information such as the
row key and timestamp, is called a KeyValue. Each cell therefore stores not only
the column and its data, but also the row key and timestamp.
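The flattening described above can be sketched in a few lines of Python. This is a simplified in-memory model, not HBase's actual on-disk format: the KeyValue tuple and the flatten helper are illustrative names, and the sample cells are taken from the physical representation shown earlier. The sort order (row, column family, qualifier, then newest timestamp first) mirrors how HBase orders cells.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    """Simplified stand-in for an HBase cell; not the real on-disk layout."""
    row: str
    family: str
    qualifier: str
    timestamp: int
    value: str


def flatten(table):
    """Turn a nested logical table into a flat, sorted list of KeyValue cells.

    `table` maps row -> family -> qualifier -> {timestamp: value}.
    """
    cells = [
        KeyValue(row, family, qualifier, ts, value)
        for row, families in table.items()
        for family, columns in families.items()
        for qualifier, versions in columns.items()
        for ts, value in versions.items()
    ]
    # Row, family, and qualifier ascending; newest timestamp first.
    return sorted(cells, key=lambda c: (c.row, c.family, c.qualifier, -c.timestamp))


logical = {
    "ROW-1": {"CF1": {"Col1": {1: "David"}, "Col2": {1: "982 765 2345"}}},
    "ROW-4": {"CF1": {"Col1": {1: "Maria"},
                      "Col2": {1: "763 451 4587", 2: "863 341 4123"}}},
}

for cell in flatten(logical):
    print(" : ".join([cell.row, cell.family, cell.qualifier,
                      f"TS{cell.timestamp}", cell.value]))
```

Every printed line repeats the row key, column family, and qualifier, which is why long row keys and long column names increase storage cost in this kind of layout.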
While designing tables in HBase, we usually have two options to go for:
• Fewer rows with many columns (flat and wide tables)
• Fewer columns with many rows (tall and narrow tables)
Let's consider a use case where we need to store all the tweets made by a user in a
single row. This approach might work for many users, but some users will have a
very large number of tweets in their account. In HBase, regions are split only at
row boundaries, so a single row is never divided, however wide it grows. This
reinforces the recommendation for tall and narrow tables that have fewer columns
with many rows.
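The effect of splitting only between rows can be illustrated with a toy model. The function below is a hypothetical simplification: real HBase splits regions by store size in bytes rather than by cell count, and the row keys and the max_cells threshold are made up for the example. It only shows that a row is never divided.

```python
def split_at_row_boundaries(cells_per_row, max_cells):
    """Group consecutive rows into regions of at most `max_cells` cells.

    A row is never divided: a row larger than `max_cells` still ends up
    whole in a single region.
    """
    regions, current, size = [], [], 0
    for row_key in sorted(cells_per_row):
        n = cells_per_row[row_key]
        if current and size + n > max_cells:
            regions.append(current)
            current, size = [], 0
        current.append(row_key)
        size += n
    if current:
        regions.append(current)
    return regions


# Wide design: one row per user, one column per tweet (cell counts are made up).
wide = {"user1": 3, "user2": 10_000, "user3": 5}
# Tall design: one row per tweet, one cell each (row keys are hypothetical).
tall = {f"user2-{i:05d}": 1 for i in range(10_000)}

wide_regions = split_at_row_boundaries(wide, max_cells=1_000)
tall_regions = split_at_row_boundaries(tall, max_cells=1_000)

print(wide_regions)
print(len(tall_regions), "regions for the tall layout")
```

In the wide layout, the heavy user's row stays in one oversized region no matter what the threshold is, whereas the tall layout spreads the same tweets across several evenly sized regions.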
Hence, a better approach is to store each tweet of a user in a separate row,
where the row key is the combination of the user ID and the tweet ID. Having
fewer columns per row is only a logical representation; physically, at the disk
level, it makes no difference, as all the values are stored in linear sets. Whether
the tweet ID is defined in the column qualifier or in the row key, each cell will
ultimately contain a single tweet message.
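A minimal sketch of the tall-and-narrow design, using a sorted in-memory structure in place of a real HBase table. The ":" separator, the 10-digit zero padding, and the TallTable class are assumptions made for illustration; they are not part of the HBase API. Because rows are kept sorted by key, all tweets of one user sit next to each other and can be read with a prefix scan.

```python
import bisect


def make_row_key(user_id: str, tweet_id: int) -> str:
    """Build a composite row key; zero-padding keeps tweet IDs in numeric order."""
    return f"{user_id}:{tweet_id:010d}"


class TallTable:
    """In-memory stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> tweet text

    def put(self, user_id, tweet_id, text):
        key = make_row_key(user_id, tweet_id)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = text

    def scan_user(self, user_id):
        """Return all (row key, tweet) pairs whose key starts with the user's prefix."""
        prefix = f"{user_id}:"
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


table = TallTable()
table.put("alice", 2, "second tweet")
table.put("bob", 1, "hello")
table.put("alice", 1, "first tweet")

for key, text in table.scan_user("alice"):
    print(key, text)
```

Each row holds exactly one tweet, so a user with many tweets simply occupies many adjacent rows instead of one very wide row.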
 