Hadoop distributed filesystem) and MapReduce transforms for getting data into or
out of the systems. So be sure to consider these factors before you select a column
family implementation.
HIGHER SCALABILITY
The word Big in the title of the original Google paper tells us that Bigtable-inspired
column family systems are designed to scale beyond a single processor. At the core,
column family systems are noted for their scalable nature, which means that as you
add more data to your system, your investment will be in the new nodes added to the
computing cluster. With careful design, you can achieve a linear relationship between
the way data grows and the number of processors you require.
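
The linear relationship can be illustrated with a little arithmetic. The following is a minimal sketch, not a sizing tool: the per-node capacity of 4 TB and the replication factor of three are assumptions chosen for the example, and the function name nodes_required is hypothetical.

```python
import math


def nodes_required(data_tb, node_capacity_tb, replication_factor=3):
    """Estimate how many nodes are needed to hold data_tb terabytes.

    Assumes every cell is stored replication_factor times and that each
    node contributes node_capacity_tb of usable storage (both assumptions).
    """
    raw_tb = data_tb * replication_factor
    # A cluster can never be smaller than the number of replicas it keeps.
    return max(replication_factor, math.ceil(raw_tb / node_capacity_tb))


# Doubling the data roughly doubles the node count.
for data_tb in (10, 20, 40):
    print(data_tb, "TB ->", nodes_required(data_tb, node_capacity_tb=4), "nodes")
```

Under these assumptions the node count grows in step with the data, which is the linear behavior described above.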
The principal reason for this relationship is the simple way that row IDs and column
names are used to identify a cell. By keeping the interface simple, the back-end
system can distribute queries over a large number of processing nodes without
performing any join operations. With careful design of row IDs and columns, you give the
system enough hints to tell it where to find related data and avoid unnecessary network
traffic, which is crucial to system performance.
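
To make the idea concrete, here is a minimal in-memory sketch of how a row ID alone can route a cell lookup to a single node. It assumes range partitioning on sorted row IDs; the class names, node names, split points, and the reversed-domain row IDs are all hypothetical and chosen only for illustration, not taken from any particular product.

```python
import bisect


class RangePartitioner:
    """Maps a row ID to a node using sorted split points (range partitioning)."""

    def __init__(self, split_points, nodes):
        # N split points divide the key space into N + 1 contiguous ranges.
        if len(nodes) != len(split_points) + 1:
            raise ValueError("need exactly one more node than split points")
        self.split_points = sorted(split_points)
        self.nodes = list(nodes)

    def node_for(self, row_id):
        # bisect_right returns the index of the range the row ID falls into.
        return self.nodes[bisect.bisect_right(self.split_points, row_id)]


class Cluster:
    """Toy cluster: each node holds a dict keyed by (row_id, family, column)."""

    def __init__(self, partitioner):
        self.partitioner = partitioner
        self.storage = {node: {} for node in partitioner.nodes}

    def put(self, row_id, family, column, value):
        node = self.partitioner.node_for(row_id)
        self.storage[node][(row_id, family, column)] = value

    def get(self, row_id, family, column):
        # Only one node is consulted; no join or cross-node lookup is needed.
        node = self.partitioner.node_for(row_id)
        return self.storage[node].get((row_id, family, column))


partitioner = RangePartitioner(["g", "p"], ["node-a", "node-b", "node-c"])
cluster = Cluster(partitioner)
cluster.put("com.example/index", "content", "html", "<html>...</html>")
cluster.put("com.example/about", "content", "html", "<html>about</html>")
cluster.put("org.wiki/home", "content", "html", "<html>wiki</html>")
```

Because the two com.example rows share a prefix, they sort into the same range and land on the same node, so a scan over related rows stays local. This is the kind of hint that a well-chosen row ID gives the system.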
HIGHER AVAILABILITY
By building a system that scales on distributed networks, you gain the ability to
replicate data on multiple nodes in a network. Because column family systems use efficient
communication, the cost of replication is lower. In addition, the lack of join
operations allows you to store any portion of a column family matrix on remote computers.
This means that if the server that holds part of the sparse matrix crashes, other
computers are standing by to provide the data service for those cells.
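
The failover behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not how any specific system places replicas: the ReplicatedStore class, its crash method, and the hash-then-next-neighbors placement rule are hypothetical, and real systems also handle repair and consistency, which this sketch ignores.

```python
import zlib


class ReplicatedStore:
    """Toy store that writes each cell to several nodes and reads from any live one."""

    def __init__(self, node_names, replication_factor=3):
        if replication_factor > len(node_names):
            raise ValueError("replication factor exceeds cluster size")
        self.node_names = list(node_names)
        self.replication_factor = replication_factor
        self.data = {name: {} for name in self.node_names}
        self.down = set()

    def replicas_for(self, row_id):
        # Assumed placement rule: hash the row ID to a starting node,
        # then use the next nodes in order as additional replicas.
        start = zlib.crc32(row_id.encode("utf-8")) % len(self.node_names)
        return [
            self.node_names[(start + i) % len(self.node_names)]
            for i in range(self.replication_factor)
        ]

    def put(self, row_id, column, value):
        for node in self.replicas_for(row_id):
            if node not in self.down:
                self.data[node][(row_id, column)] = value

    def get(self, row_id, column):
        # Try each replica in turn and skip nodes that have crashed.
        for node in self.replicas_for(row_id):
            if node not in self.down and (row_id, column) in self.data[node]:
                return self.data[node][(row_id, column)]
        return None

    def crash(self, node):
        self.down.add(node)


store = ReplicatedStore(["n1", "n2", "n3", "n4", "n5"])
store.put("user:42", "prefs:theme", "dark")
primary = store.replicas_for("user:42")[0]
store.crash(primary)
value_after_crash = store.get("user:42", "prefs:theme")
```

Even with the first replica down, the read is served by one of the remaining copies, because no join ties the cell to a particular machine.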
EASY TO ADD NEW DATA
Like the key-value and graph stores, a key feature of the column family store is that
you don't need to fully design your data model before you begin inserting data. But
there are a couple of constraints that you should know before you begin. Your groupings
of column families should be known in advance, but row IDs and column names can
be created at any time.
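
These two constraints can be shown with a small in-memory sketch. The ColumnFamilyTable class and the family and column names below are hypothetical and exist only for illustration; they aren't the API of any real column family store.

```python
class ColumnFamilyTable:
    """Toy table: column families are fixed up front, everything else is open."""

    def __init__(self, families):
        # The set of column families is declared once, at table creation.
        self.families = frozenset(families)
        self.rows = {}  # row_id -> {family -> {column -> value}}

    def put(self, row_id, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        # New row IDs and new column names are created on first write.
        self.rows.setdefault(row_id, {}).setdefault(family, {})[column] = value

    def get(self, row_id, family, column):
        return self.rows.get(row_id, {}).get(family, {}).get(column)


table = ColumnFamilyTable(["profile", "activity"])
table.put("user:1", "profile", "name", "Ann")
table.put("user:2", "activity", "last_login", "2013-05-01")  # new row, new column

try:
    table.put("user:1", "billing", "plan", "pro")  # family was never declared
    rejected = False
except KeyError:
    rejected = True
```

Writes with a new row ID or a new column name succeed without any schema change, while a write to an undeclared column family is rejected.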
For all the good things that you can do with column family systems, be warned that
they're designed to work on distributed clusters of computers and may not be
appropriate for small datasets. You usually need at least five processors to justify a column
family cluster, since many systems are designed to store data on three different nodes
for replication. Column family systems also don't support standard SQL queries for
real-time data access. They may have higher-level query languages, but these
are often used to generate batch MapReduce jobs. For fast data access, you'll use a
custom API written in a procedural language like Java or Python.
In the next three sections, we'll look at how column family implementations have
been efficiently used by companies like Google to manage analytics, maps, and user
preferences.