NoSQL data architecture patterns - Making Sense of NoSQL


Hadoop distributed filesystem) and MapReduce transforms for getting data into or out of the systems. So be sure to consider these factors before you select a column family implementation.

HIGHER SCALABILITY

The word Big in the title of the original Google paper tells us that Bigtable-inspired column family systems are designed to scale beyond a single processor. At the core, column family systems are noted for their scalable nature, which means that as you add more data to your system, your investment will be in the new nodes added to the computing cluster. With careful design, you can achieve a linear relationship between the way data grows and the number of processors you require.
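The linear relationship described above can be illustrated with a little arithmetic. This is a rough sketch only: the function name `nodes_needed`, the 2 TB of usable storage per node, and the replication factor of three are assumptions chosen for illustration, not figures from any particular system.

```python
def nodes_needed(data_tb, usable_tb_per_node, replication_factor=3):
    """Return the node count needed to hold data_tb with every cell replicated."""
    total_tb = data_tb * replication_factor
    # Integer ceiling division: a partly filled node still counts as a whole node.
    return -(-total_tb // usable_tb_per_node)


# Hypothetical sizing: doubling the data doubles the nodes required.
for data_tb in (10, 20, 40):
    print(data_tb, "TB of data ->", nodes_needed(data_tb, usable_tb_per_node=2), "nodes")
```

Under these assumptions, each doubling of the dataset doubles the cluster size, which is the linear growth the text refers to. Real clusters also need headroom for compaction, indexes, and uneven key distribution.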

The principal reason for this relationship is the simple way that row IDs and column names are used to identify a cell. By keeping the interface simple, the back-end system can distribute queries over a large number of processing nodes without performing any join operations. With careful design of row IDs and columns, you give the system enough hints to tell it where to get related data and avoid unnecessary network traffic, which is crucial to system performance.
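To make the role of row IDs more concrete, here is a minimal sketch of a range-partitioned cell store. It is not the implementation of any real product: the class `RangePartitionedStore`, the split points, and the reversed-domain row IDs are all hypothetical. The sketch assumes rows are assigned to nodes by sorted row-ID ranges, so that related rows with a shared prefix tend to land on the same node.

```python
from bisect import bisect_right


class RangePartitionedStore:
    """Toy cell store: each cell is addressed by (row_id, column_name)."""

    def __init__(self, split_points):
        # n sorted row-ID boundaries divide the key space among n + 1 nodes.
        self.split_points = sorted(split_points)
        self.nodes = [dict() for _ in range(len(self.split_points) + 1)]

    def node_for(self, row_id):
        # Routing needs only the row ID, so no join across nodes is required.
        return bisect_right(self.split_points, row_id)

    def put(self, row_id, column, value):
        self.nodes[self.node_for(row_id)].setdefault(row_id, {})[column] = value

    def get(self, row_id, column):
        return self.nodes[self.node_for(row_id)].get(row_id, {}).get(column)

    def scan_prefix(self, prefix):
        # Rows sharing a prefix sort together, so a scan touches few nodes.
        first = self.node_for(prefix)
        last = self.node_for(prefix + "\uffff")
        rows = {}
        for node in self.nodes[first:last + 1]:
            rows.update({r: c for r, c in node.items() if r.startswith(prefix)})
        return rows, last - first + 1  # matching rows, number of nodes contacted


# Hypothetical row IDs: pages of one site share a prefix and stay together.
store = RangePartitionedStore(split_points=["g", "n", "t"])
store.put("com.example/index", "anchor:home", "Example")
store.put("com.example/about", "contents:html", "<html>...</html>")
store.put("org.sample/index", "contents:html", "<html>...</html>")
rows, nodes_contacted = store.scan_prefix("com.example/")
```

Because both `com.example/` rows fall in the same range, the scan contacts a single node. A row ID scheme that scattered those pages across the key space would force the same query to visit several nodes.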

HIGHER AVAILABILITY

By building a system that scales on distributed networks, you gain the ability to replicate data on multiple nodes in a network. Because column family systems use efficient communication, the cost of replication is lower. In addition, the lack of join operations allows you to store any portion of a column family matrix on remote computers. This means that if the server that holds part of the sparse matrix crashes, other computers are standing by to provide the data service for those cells.
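The failover behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not how any specific system places replicas: the class `ReplicatedCluster`, the node names, and the hash-then-next-nodes placement rule are hypothetical, and writes are assumed to reach every replica immediately.

```python
import hashlib

REPLICATION_FACTOR = 3  # assumed here; three copies is a common choice


class ReplicatedCluster:
    """Toy cluster that stores each cell on several nodes."""

    def __init__(self, node_names):
        self.names = list(node_names)
        self.storage = {name: {} for name in self.names}
        self.alive = set(self.names)

    def replicas_for(self, row_id):
        # Hash the row ID to pick a starting node, then take the nodes after it.
        start = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(REPLICATION_FACTOR)]

    def put(self, row_id, column, value):
        # Simplification: every replica receives the write synchronously.
        for name in self.replicas_for(row_id):
            self.storage[name][(row_id, column)] = value

    def get(self, row_id, column):
        # Read from the first replica that is still running.
        for name in self.replicas_for(row_id):
            if name in self.alive:
                return self.storage[name].get((row_id, column))
        raise RuntimeError("all replicas for this row are down")

    def crash(self, name):
        self.alive.discard(name)


cluster = ReplicatedCluster(["n1", "n2", "n3", "n4", "n5"])
cluster.put("user42", "prefs:language", "en")
cluster.crash(cluster.replicas_for("user42")[0])  # lose the first replica
value_after_crash = cluster.get("user42", "prefs:language")
```

After the first replica crashes, the read is served by the next node holding a copy of the cell. Production systems add consistency levels, repair, and rebalancing, which this sketch leaves out.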

EASY TO ADD NEW DATA

Like the key-value and graph stores, a key feature of the column family store is that you don't need to fully design your data model before you begin inserting data. But there are a couple of constraints that you should know before you begin. Your groupings of column families should be known in advance, but row IDs and column names can be created at any time.
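The constraint above, fixed column families but free-form row IDs and column names, can be sketched like this. The class `ColumnFamilyTable` and the family and column names are hypothetical and exist only to illustrate the rule. Real systems enforce it through their own schema definitions.

```python
class ColumnFamilyTable:
    """Toy table: families are declared up front, columns are created on write."""

    def __init__(self, families):
        self.families = frozenset(families)
        self.rows = {}  # row_id -> {family -> {column -> value}}

    def put(self, row_id, family, column, value):
        # Only the column family has to exist before data is inserted.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        self.rows.setdefault(row_id, {}).setdefault(family, {})[column] = value

    def get(self, row_id, family, column):
        return self.rows.get(row_id, {}).get(family, {}).get(column)


table = ColumnFamilyTable(families=["profile", "prefs"])
table.put("user42", "prefs", "language", "en")      # new row ID, new column
table.put("user42", "prefs", "theme", "dark")       # another column, no schema change
table.put("user99", "profile", "nickname", "sam")   # a different row, different columns

try:
    table.put("user42", "billing", "card", "....")  # family was never declared
    rejected = False
except KeyError:
    rejected = True
```

New rows and columns are accepted without any schema change, while the write to the undeclared `billing` family is rejected.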

For all the good things that you can do with column family systems, be warned that they're designed to work on distributed clusters of computers and may not be appropriate for small datasets. You usually need at least five processors to justify a column family cluster, since many systems are designed to store data on three different nodes for replication. Column family systems also don't support standard SQL queries for real-time data access. They may have higher-level query languages, but these systems often are used to generate batch MapReduce jobs. For fast data access, you'll use a custom API written in a procedural language like Java or Python.

In the next three sections, we'll look at how column family implementations have been efficiently used by companies like Google to manage analytics, maps, and user preferences.
