From CPUs to Semantic Integration
Cerner has long been focused on applying technology to healthcare, with much of our history emphasizing electronic medical records. However, new problems required a broader approach, which led us to look into Hadoop.
In 2009, we needed to create better search indexes of medical records. This led to processing needs not easily solved with other architectures. The search indexes required expensive processing of clinical documentation: extracting terms from the documentation and resolving their relationships with other terms. For instance, if a user typed "heart disease," we wanted documents discussing a myocardial infarction to be returned. This processing was quite expensive, taking several seconds of CPU time for larger documents, and we wanted to apply it to many millions of documents. In short, we needed to throw a lot of CPUs at the problem, and to be cost-effective in the process.
Among other options, we considered a staged event-driven architecture (SEDA) approach to ingest documents at scale. But Hadoop stood out for one important need: we wanted to reprocess the many millions of documents frequently, in a small number of hours or faster. The logic for knowledge extraction from the clinical documents was rapidly improving, and we needed to roll improvements out to the world quickly. In Hadoop, this simply meant running a new version of a MapReduce job over data already in place. The processed documents were then loaded into a cluster of Apache Solr servers to support application queries.
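To make the pattern concrete, the sketch below shows roughly what the mapper of such a reprocessing job could look like. It is not Cerner's actual code: the whitespace-based extractTerms method is only a placeholder for the real term-extraction logic, and the Text key/value types are assumed for the example.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper for the kind of reprocessing job described above.
// The real knowledge-extraction logic is not shown here; the trivial
// whitespace-based extractTerms() below is only a stand-in for it.
public class TermExtractionMapper extends Mapper<Text, Text, Text, Text> {

  // Placeholder for the expensive clinical term-extraction step.
  private List<String> extractTerms(String document) {
    return Arrays.asList(document.toLowerCase().split("\\s+"));
  }

  @Override
  protected void map(Text docId, Text document, Context context)
      throws IOException, InterruptedException {
    // Extract terms from one clinical document and emit (document ID, term)
    // pairs; the job's output is later indexed into a Solr cluster.
    for (String term : extractTerms(document.toString())) {
      context.write(docId, new Text(term));
    }
  }
}

Rolling out improved extraction logic then amounts to rerunning an updated version of this job over the documents already stored in the cluster.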
These early successes set the stage for more involved projects. This type of system and its data can be used as an empirical basis to help control costs and improve care across entire populations. Since healthcare data is often fragmented across systems and institutions, we first needed to bring all of that data together and make sense of it.
With dozens of data sources and formats, and even standardized data models subject to interpretation, we were facing an enormous semantic integration problem. Our biggest challenge was not the size of the data (we knew Hadoop could scale to our needs) but the sheer complexity of cleaning, managing, and transforming it. We needed higher-level tools to manage that complexity.