In addition to the distributed filesystem for storage, an HBase cluster also leverages an
external configuration and coordination utility. In the seminal paper on Bigtable, Google named
this coordination service Chubby. Hadoop, which is modeled on Google's infrastructure, created a
close counterpart and called it ZooKeeper. Hypertable calls the similar infrastructure piece
Hyperspace. A ZooKeeper cluster typically front-ends an HBase cluster for new clients and
manages configuration.
To access HBase the first time, a client accesses two catalogs via ZooKeeper. These catalogs,
named -ROOT- and .META., maintain state and location information for all the regions.
-ROOT- keeps the locations of all .META. regions, and .META. keeps the region records for the
user-space tables, that is, the tables that hold the data. When a client wants to access a
specific row, it first asks ZooKeeper for the location of the -ROOT- catalog. The -ROOT- catalog
locates the .META. region relevant for the row, which in turn provides the region details for
accessing the specific row. Using this information the row is accessed. This three-step process
is not repeated the next time the client asks for the row data. Column databases rely heavily on
caching the information gathered during this three-step lookup. This means clients directly
contact the region servers the next time they need the row data. The long loop of lookups is
repeated only if the region information in the cache is stale or the region is disabled and
inaccessible.
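The caching behavior described above can be sketched in a few lines of Python. This is a toy model, not HBase client code: the names `CachingClient`, `StaleRegionError`, `three_step_lookup`, and `read_from_server` are hypothetical, the catalogs are plain dictionaries, and locations are cached per row for brevity, whereas a real client would cache per region.

```python
class StaleRegionError(Exception):
    """Raised by the toy region server when it no longer hosts the row."""


# Toy catalogs (assumed shapes, for illustration only).
META = {("users", "row-17"): "regionserver-2"}   # .META.: row -> region server
ROOT = {"users": META}                            # -ROOT-: table -> .META.
ZOOKEEPER = {"-ROOT-": ROOT}                      # ZooKeeper: where -ROOT- lives
DATA = {("users", "row-17"): {"name": "Ada"}}


def three_step_lookup(table, row_key):
    root = ZOOKEEPER["-ROOT-"]        # step 1: ask ZooKeeper for -ROOT-
    meta = root[table]                # step 2: -ROOT- points to .META.
    return meta[(table, row_key)]     # step 3: .META. gives the region server


def read_from_server(server, table, row_key):
    # The toy server rejects requests for rows it does not currently host.
    if META[(table, row_key)] != server:
        raise StaleRegionError(server)
    return DATA[(table, row_key)]


class CachingClient:
    def __init__(self):
        self._cache = {}          # (table, row_key) -> region server
        self.full_lookups = 0     # how often the long lookup actually ran

    def _locate(self, table, row_key):
        key = (table, row_key)
        if key not in self._cache:
            self.full_lookups += 1
            self._cache[key] = three_step_lookup(table, row_key)
        return self._cache[key]

    def get(self, table, row_key):
        server = self._locate(table, row_key)
        try:
            return read_from_server(server, table, row_key)
        except StaleRegionError:
            # Cached location is stale: drop it and repeat the long lookup.
            del self._cache[(table, row_key)]
            server = self._locate(table, row_key)
            return read_from_server(server, table, row_key)
```

Calling `get` twice for the same row runs the three-step lookup only once. If the entry in `META` is later changed to another server, the next `get` fails against the cached server, evicts the entry, and repeats the lookup.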
Each region is often identified by the smallest row-key it stores, so looking up a row usually
comes down to finding the region with the largest identifier that is less than or equal to the
specific row-key.
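As a minimal sketch of that start-key rule, assume the region identifiers are held in a sorted list of start keys, with an empty string as the start key of the first region. The function name `find_region` and the sample keys are hypothetical. A binary search then picks the last region whose start key does not exceed the row-key.

```python
import bisect


def find_region(start_keys, row_key):
    """Return the index of the region whose key range contains row_key.

    start_keys must be sorted; region i covers the half-open range
    [start_keys[i], start_keys[i + 1]), and the last region is unbounded.
    """
    # bisect_right counts the start keys that are <= row_key.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    if idx < 0:
        raise KeyError(f"no region covers row key {row_key!r}")
    return idx


# Four regions: ["", "g"), ["g", "n"), ["n", "t"), ["t", end of table)
start_keys = ["", "g", "n", "t"]
```

With these sample keys, `find_region(start_keys, "hello")` selects the second region (index 1), because "g" is the largest start key that is less than or equal to "hello".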
So far, the essential conceptual and physical models of column database storage have been
introduced. The behind-the-scenes mechanics of how data is written to and read from these stores
have also been exposed. Advanced features and detailed nuances of column databases will be picked
up again in the later chapters, but for now I shift focus to document stores.
DOCUMENT STORE INTERNALS
The previous couple of chapters have offered a user's view into a popular document store, MongoDB.
Now take the next step and peel the onion's skin.
MongoDB is a document store, where documents are grouped together into collections. Collections
can be conceptually thought of as relational tables. However, collections don't impose the strict
schema constraints that relational tables do. Arbitrary documents could be grouped together in
a single collection. Documents in a collection should be similar, though, to facilitate effective
indexing. Collections can be segregated using namespaces, but down in the guts the representation
isn't hierarchical.
Each document is stored in BSON format. BSON is a binary-encoded representation of a
JSON-type document format where the structure is close to a nested set of key/value pairs.
BSON supports the JSON data types plus additional types like regular expression, binary data, and
date. Each document has a unique identifier, which MongoDB generates automatically as an object id
if one is not explicitly specified when the data is inserted into a collection, as
depicted in Figure 4-10.
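To make the binary layout concrete, here is a minimal sketch of a BSON encoder for a small subset of types (strings, booleans, 32-bit integers, and nested documents). It is illustrative only and is not how MongoDB or its drivers implement encoding; the function names `encode_element` and `encode_document` are hypothetical. The layout follows the published BSON specification: a little-endian int32 total length, a sequence of typed elements, and a terminating zero byte.

```python
import struct


def encode_element(key, value):
    """Encode one key/value pair as a BSON element (subset of types only)."""
    name = key.encode("utf-8") + b"\x00"           # keys are C-style strings
    if isinstance(value, bool):                    # check bool before int
        return b"\x08" + name + (b"\x01" if value else b"\x00")
    if isinstance(value, str):
        data = value.encode("utf-8") + b"\x00"
        # String length prefix counts the trailing null byte.
        return b"\x02" + name + struct.pack("<i", len(data)) + data
    if isinstance(value, int):
        # Only values that fit in a signed 32-bit integer are handled here.
        return b"\x10" + name + struct.pack("<i", value)
    if isinstance(value, dict):
        return b"\x03" + name + encode_document(value)
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def encode_document(doc):
    """Encode a dict as a BSON document: int32 length, elements, 0x00."""
    body = b"".join(encode_element(k, v) for k, v in doc.items()) + b"\x00"
    # The length prefix includes its own four bytes.
    return struct.pack("<i", len(body) + 4) + body
```

Encoding `{"hello": "world"}` with this sketch yields the 22-byte sequence `\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00`, which matches the example given in the BSON specification. Types such as dates, binary data, regular expressions, and object ids are left out for brevity.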