Additionally, these databases are ready for the massive scalability and
redundancy required to handle an entire region's clinical documentation
en masse. Certainly relational databases can scale, but the total cost of ownership with these next-generation databases is demonstrably lower. In fact, we would argue that NoSQL databases are the best fit both for simple, agile projects and applications and for extreme-scale or highly distributed data stores. Medium-sized platforms with plenty of rich transactional use-cases and rich reporting will probably remain best served by a relational database.
The first major success of the NoSQL storage paradigm came from Google. Google built a scalable distributed storage and processing framework comprising the Google File System, BigTable, and MapReduce [13-16] for storing and indexing the web pages accessible through its search interface. These designs were subsequently published and re-implemented by the open source community as a set of Apache Hadoop [17] frameworks. BigTable is simply an extremely de-normalized, flat, and wide table that allows any type of data to be stored in any column. This schema provides the ability to retrieve all columns by a single key, and each row retrieved may take a different columnar form. This is similar to pivoting the relational model discussed earlier.
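To make that shape concrete, here is a minimal sketch of such a wide, de-normalized row store in Python. The row keys and column names (e.g., patient:1001, demo:name) are hypothetical illustrations, and a plain dictionary stands in for an actual BigTable-style engine:

```python
# Sketch of a BigTable-style wide, de-normalized table: every row is
# reachable by a single key, and each row may carry a different set of
# columns. All keys and column names below are hypothetical.

table = {
    "patient:1001": {                    # row key
        "demo:name": "A. Example",
        "demo:dob": "1970-01-01",
        "obs:bp_systolic": 128,
    },
    "patient:1002": {                    # same table, different columns
        "demo:name": "B. Example",
        "rx:statin": "atorvastatin 20mg",
    },
}

# A single key retrieves all columns for a row, whatever shape it has.
row = table["patient:1002"]
print(row)  # {'demo:name': 'B. Example', 'rx:statin': 'atorvastatin 20mg'}
```

Note how the two rows share no fixed schema; the application, not the database, decides what each row means.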
MapReduce is a powerful framework that combines a master process, a map function, and a reduce function to process huge amounts of data in parallel across distributed server nodes. Because the map functions can contain arbitrary code, they can be used to perform extremely expensive and complicated computations. The framework requires, however, that they return results via an emit function to a data (or value) reduction phase. The reduction phase can likewise process the intermediary data with any logic it wishes, so long as it produces a single (but possibly large) answer. Figure 20.8 shows a simple data flow in which a map function reviews CDA documents for large segments of a population. Using UMLS, one could look for all medical codes that imply the patient has acute heart disease. Each map function would then write its resulting data set to an intermediary location. The master process would then coordinate the hand-off to reduce functions, which combine the intermediary data sets into one final data set.
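As a rough illustration of that flow, the following Python sketch mimics the Figure 20.8 pipeline in a single process. The document structure, the hard-coded code set standing in for a UMLS lookup, and names such as map_cda are all hypothetical assumptions for the example, not the actual implementation behind the figure:

```python
# Single-process sketch of the Figure 20.8 MapReduce flow. In a real
# deployment the map calls run on distributed nodes and a master process
# coordinates the hand-off to the reduce phase.

from collections import defaultdict

# Hypothetical codes implying acute heart disease; a real system would
# resolve these through UMLS rather than a hard-coded set.
ACUTE_HEART_DISEASE_CODES = {"I21.9", "I24.9", "I50.9"}

def map_cda(document):
    """Map: scan one CDA-like document and emit (key, value) pairs for
    patients whose codes imply acute heart disease."""
    for code in document["codes"]:
        if code in ACUTE_HEART_DISEASE_CODES:
            # A real framework would call emit(); here we yield instead.
            yield ("acute_heart_disease", document["patient_id"])

def reduce_patients(key, values):
    """Reduce: combine the intermediary patient IDs into one final set."""
    return key, set(values)

def run(documents):
    """Stand-in for the master process: collect map output into an
    intermediary location, then hand it off to the reduce phase."""
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_cda(doc):
            intermediate[key].append(value)
    return dict(reduce_patients(k, vs) for k, vs in intermediate.items())

if __name__ == "__main__":
    docs = [
        {"patient_id": "p1", "codes": ["I21.9", "E11.9"]},
        {"patient_id": "p2", "codes": ["J45.909"]},
        {"patient_id": "p3", "codes": ["I50.9"]},
    ]
    print(run(docs))  # {'acute_heart_disease': {'p1', 'p3'}}
```

The point of the pattern is that map_cda can be arbitrarily expensive per document, because each invocation is independent and can run on a different node.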
The success of this architectural breakthrough led to many other uses within Google's product suite. Seemingly in parallel, the largest internet sites handling 'Big Data' approached the problem by building or adopting (and subsequently making famous) various products in this realm. Public recognition accelerated when Google published its designs and Facebook [18] donated its inventions to the open source community via Apache. Finally, Amazon [19] paid the methodology a final dose of