Extreme scale clinical analytics with open source software - Open Source Software in Life Science Research

With all of these advantages, the tradeoff is a modified, but surprisingly

simple, application development architecture. Use-cases become more

modular and web interfaces become mash-ups of many different services

hitting many different database clusters. Fat use-cases begin to disappear

because the complicated joins that produce them cannot be accomplished.

(Note: this is also partially true for the 'CDA Data Items' table in the

RDBMS example.) The methodology also has a major drawback of not

allowing traditional widespread access to the data by non-programmers.

One solution to provide traditional SQL access for non-programmers is

to use Hive on top of the HBase system. It does not fully replace the

capabilities of an RDBMS, but at least it gives a familiar entry point.

To further understand the value of these approaches in this domain, we

will explore two systems that provide good insight into the power of

NoSQL, namely Cassandra [20] and Riak [21]. Cassandra was

contributed to Apache by Facebook, and is an example of a NoSQL

column store database. Like the others it is still driven primarily by

key/value access, but the value is built of a structured but extremely

'de-normalized' schema to be stored under each key, called a

ColumnFamily. The most powerful aspect of this is that the number

of practically usable columns is not fixed; in fact the maximum number

of columns supported is over two billion! Each of these columns can be

created ad hoc, on the fly, per transaction. This far exceeds the level of

flexibility for new data items that might be expected in clinical

documentation. Cassandra takes this one step further and allows for

SuperColumns. A SuperColumn is essentially a column that supports

additional columns within it. For example, this makes it possible to

specify the patient as a SuperColumn, with the first name and the last

name being subcolumns. The key/value, ColumnFamily, SuperColumn

model provides a nice mix of highly scalable, highly flexible storage and

indexable, multidimensional, organized data.
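
The nesting described above (row key, then SuperColumn, then subcolumn) can be pictured with a small in-memory sketch. This is only an illustration of the data model in plain Python, not Cassandra's client API; the class name SuperColumnFamily, the row keys, and the patient values are all made up for the example.

```python
# Hypothetical in-memory stand-in for a ColumnFamily holding SuperColumns.
# Layout: row key -> SuperColumn name -> subcolumn name -> value.
# Nothing here talks to a real Cassandra cluster.
class SuperColumnFamily:
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, super_column, column, value):
        """Store a value, creating the row, SuperColumn, and column on the fly."""
        row = self.rows.setdefault(row_key, {})
        row.setdefault(super_column, {})[column] = value

    def get(self, row_key, super_column):
        """Return a copy of the subcolumns stored under one SuperColumn."""
        return dict(self.rows.get(row_key, {}).get(super_column, {}))


encounters = SuperColumnFamily("Encounters")

# The patient is a SuperColumn; first and last name are its subcolumns.
encounters.insert("encounter-001", "patient", "first_name", "Jane")
encounters.insert("encounter-001", "patient", "last_name", "Doe")

# A second row may carry columns the first row never declared:
# there is no fixed schema shared across keys.
encounters.insert("encounter-002", "patient", "first_name", "John")
encounters.insert("encounter-002", "vitals", "heart_rate_bpm", 72)

print(encounters.get("encounter-001", "patient"))
print(encounters.get("encounter-002", "vitals"))
```

The point of the sketch is that each key owns its own set of columns, so a new clinical data item can be added to one record without altering any other record or any shared table definition.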

Like most if not all NoSQL databases, Cassandra scales very easily.

What sets Cassandra apart is its ability to give the developer control over

the tradeoffs of consistency, availability, and partition tolerance (CAP). The

CAP theorem [22], first proposed by Eric Brewer [23] at Inktomi, submits

that in any single large-scale distributed system, one can pick only two

of the three fundamental goals of highly available, scalable, distributed

data storage. The designers of Cassandra prioritized partition tolerance and

availability, and allowed consistency to be selected by the application

developer at a cost of latency. The design decision to allow these tradeoffs

to be tuned by the developer was ingenious, and more and more

architectures are moving this way. Medical use-cases typically experience
