Understanding the Storage Architecture - Professional NoSQL - page 85


In addition to having the distributed filesystem for storage, an HBase cluster also leverages an
external configuration and coordination utility. In the seminal paper on Bigtable, Google named
this configuration program Chubby. Hadoop, being a Google infrastructure clone, created a
close counterpart and called it ZooKeeper. Hypertable calls the similar infrastructure piece
Hyperspace. A ZooKeeper cluster typically front-ends an HBase cluster for new clients and
manages configuration.

To access HBase the first time, a client accesses two catalogs via ZooKeeper. These catalogs are
named -ROOT- and .META. The catalogs maintain state and location information for all the regions.
-ROOT- keeps information about all .META. tables, and a .META. file keeps records for a user-space
table, that is, the table that holds the data. When a client wants to access a specific row, it first
asks ZooKeeper for the -ROOT- catalog. The -ROOT- catalog locates the .META. catalog relevant
for the row, which in turn provides all the region details for accessing the specific row. Using this
information, the row is accessed. The three-step process of accessing a row is not repeated the
next time the client asks for the row data. Column databases rely heavily on caching all relevant
information from this three-step lookup process. This means clients directly contact the region
servers the next time they need the row data. The long loop of lookups is repeated only if the region
information in the cache is stale or the region is disabled and inaccessible.
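The lookup sequence described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase client code: the Client class, the nested dictionaries standing in for ZooKeeper, -ROOT-, and .META., and the server and table names are all hypothetical.

```python
class Client:
    """Toy client: resolves a row to a region server and caches the answer."""

    def __init__(self, zookeeper):
        self.zookeeper = zookeeper   # hypothetical stand-in for a ZooKeeper ensemble
        self.cache = {}              # (table, row) -> region server name
        self.full_lookups = 0        # how many times the three-step lookup ran

    def locate(self, table, row):
        key = (table, row)
        if key in self.cache:
            # Cached region information: go straight to the region server.
            return self.cache[key]
        self.full_lookups += 1
        root = self.zookeeper["-ROOT-"]   # step 1: ask ZooKeeper for -ROOT-
        meta = root[table]                # step 2: -ROOT- points at the relevant .META.
        server = meta[row]                # step 3: .META. yields the region details
        self.cache[key] = server
        return server

    def invalidate(self, table, row):
        """Drop a stale entry so the next access repeats the full lookup."""
        self.cache.pop((table, row), None)


# For brevity this sketch maps exact rows to servers; real catalogs map key ranges.
zookeeper = {"-ROOT-": {"users": {"row-42": "regionserver-1"}}}
client = Client(zookeeper)
client.locate("users", "row-42")   # runs all three steps
client.locate("users", "row-42")   # served from the cache
print(client.full_lookups)         # prints 1
```

Calling invalidate models the stale-cache case: the next locate call walks the three steps again.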

Each region is often identified by the smallest row-key it stores, so looking up a row is usually as
easy as verifying that the specific row-key is greater than or equal to the region identifier and
smaller than the identifier of the next region.
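As a minimal sketch of this comparison, assume the region identifiers are kept as a sorted list of start keys, with an empty string for the first region so that every row-key has a home. The function name find_region and the sample keys are hypothetical; a binary search then picks the last start key that is not greater than the row-key.

```python
import bisect


def find_region(start_keys, row_key):
    """Return the start key of the region that should hold row_key.

    start_keys must be sorted; each region covers keys from its own start key
    up to (but not including) the next region's start key.
    """
    index = bisect.bisect_right(start_keys, row_key) - 1
    if index < 0:
        raise KeyError(f"no region covers row key {row_key!r}")
    return start_keys[index]


regions = ["", "g", "p"]               # three regions: ["", "g"), ["g", "p"), ["p", end)
print(find_region(regions, "hbase"))   # prints g
print(find_region(regions, "zebra"))   # prints p
```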

So far, the essential conceptual and physical models of column database storage have been
introduced. The behind-the-scenes mechanics of writing data to and reading data from these stores
have also been exposed. Advanced features and detailed nuances of column databases will be picked
up again in the later chapters, but for now I shift focus to document stores.

DOCUMENT STORE INTERNALS

The previous couple of chapters have offered a user's view into a popular document store, MongoDB.
Now take the next step to peel the onion's skin.

MongoDB is a document store, where documents are grouped together into collections. Collections
can be conceptually thought of as relational tables. However, collections don't impose the strict
schema constraints that relational tables do. Arbitrary documents could be grouped together in
a single collection. Documents in a collection should be similar, though, to facilitate effective
indexing. Collections can be segregated using namespaces but down in the guts the representation
isn't hierarchical.
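To make the flat representation concrete, here is a small sketch that models collections as entries in a single dictionary keyed by a dotted namespace string. It is an illustration only and says nothing about MongoDB's actual storage layout; the insert helper, the namespace names, and the sample documents are hypothetical.

```python
# One flat map: namespace string -> list of documents. There is no nesting of
# collections inside collections; a dotted name is just a longer key.
collections = {}


def insert(namespace, document):
    """Append a document to the collection named by namespace."""
    collections.setdefault(namespace, []).append(document)


# Documents in one collection need not share a schema.
insert("blog.posts", {"title": "Hello", "tags": ["intro"]})
insert("blog.posts", {"title": "Untitled draft", "word_count": 0})

# "blog.posts.archive" looks like a child of "blog.posts" but is a sibling key.
insert("blog.posts.archive", {"title": "Old post"})

print(sorted(collections))   # prints ['blog.posts', 'blog.posts.archive']
```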

Each document is stored in BSON format. BSON is a binary-encoded representation of a
JSON-type document format where the structure is close to a nested set of key/value pairs.
BSON is a superset of JSON and supports additional types like regular expression, binary data, and
date. Each document has a unique identifier, which MongoDB generates if one is not explicitly
specified when the data is inserted into a collection. Figure 4-10 depicts such auto-generated
object ids.
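The sketch below illustrates both ideas from this paragraph: filling in an identifier when the caller supplies none, and encoding a document as binary key/value pairs. It covers only a small subset of BSON (strings, booleans, 32-bit integers, and nested documents) and is not how MongoDB or its drivers implement either step. The helper names are hypothetical, and the generated id is just 12 random bytes in hex, a stand-in for a real ObjectId, which has its own structure and its own BSON type.

```python
import secrets
import struct


def with_id(document):
    """Return a copy of document that has an "_id", generating one if missing."""
    if "_id" in document:
        return dict(document)
    # Assumption: random hex stand-in, not MongoDB's ObjectId algorithm.
    return {"_id": secrets.token_hex(12), **document}


def encode_element(name, value):
    """Encode one key/value pair: type byte, NUL-terminated key, then the value."""
    key = name.encode("utf-8") + b"\x00"
    if isinstance(value, bool):          # checked before int: bool is an int subclass
        return b"\x08" + key + (b"\x01" if value else b"\x00")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        return b"\x02" + key + struct.pack("<i", len(data)) + data
    if isinstance(value, int):           # this sketch handles 32-bit integers only
        return b"\x10" + key + struct.pack("<i", value)
    if isinstance(value, dict):          # nested document
        return b"\x03" + key + encode_document(value)
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def encode_document(document):
    """Encode a dict: little-endian int32 total size, the elements, a NUL byte."""
    body = b"".join(encode_element(k, v) for k, v in document.items()) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body


print(encode_document({"hello": "world"}).hex())
stored = with_id({"name": "sample"})
print(len(encode_document(stored)), "bytes, id:", stored["_id"])
```

The first print shows the 22-byte encoding of a one-field document; the second output varies from run to run because the id is random.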
