Composable Data at Cerner - Hadoop: The Definitive Guide


Integrating Healthcare Data

There are dozens of processing steps between raw data and answers to healthcare-related

questions. Here we look at one: bringing together data for a single person from multiple

sources.

Unfortunately, the lack of a common patient identifier in the United States, combined with

noisy data such as variations in a person's name and demographics between systems,

makes it difficult to accurately unify a person's data across sources. Information spread

across multiple sources might look like Table 22-2.

Table 22-2. Data from multiple sources

Source           Person ID  First name  Last name  Address                 Gender
Doctor's office  12345      Abraham     Lincoln    1600 Pennsylvania Ave.  M
Hospital         98765      Abe         Lincoln    Washington, DC          M
Hospital         45678      Mary Todd   Lincoln    1600 Pennsylvania Ave.  F
Clinic           76543      A.          Lincoln    Springfield, IL         M

This problem is typically resolved in healthcare by a system called an Enterprise Master Patient

Index (EMPI). An EMPI can be fed data from multiple systems and determine which records

are indeed for the same person. This is achieved in a variety of ways, ranging from humans

explicitly stating relationships to sophisticated algorithms that identify commonality.
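
To make the idea of "identifying commonality" concrete, here is a minimal sketch of a naive matching pass over rows like those in Table 22-2. It is not how a real EMPI works and not Cerner's method: the class EmpiSketch, the SourceRecord type, and the matching key (last name, gender, and first initial) are all hypothetical choices made for illustration. Production systems typically use far more careful, often probabilistic, comparisons.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy illustration only; every name here is hypothetical. */
final class EmpiSketch {

    /** One row of the kind shown in Table 22-2. */
    static final class SourceRecord {
        final String source;
        final String personId;
        final String firstName;
        final String lastName;
        final String gender;

        SourceRecord(String source, String personId, String firstName,
                     String lastName, String gender) {
            this.source = source;
            this.personId = personId;
            this.firstName = firstName;
            this.lastName = lastName;
            this.gender = gender;
        }
    }

    /** Assumed key: lower-cased last name, gender, and first initial. */
    static String matchKey(SourceRecord r) {
        String initial = r.firstName.isEmpty() ? "" : r.firstName.substring(0, 1);
        return (r.lastName + "|" + r.gender + "|" + initial).toLowerCase(Locale.ROOT);
    }

    /** Groups records that share a key; each group is a candidate match set. */
    static Map<String, List<SourceRecord>> groupCandidates(List<SourceRecord> records) {
        Map<String, List<SourceRecord>> groups = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            groups.computeIfAbsent(matchKey(r), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }
}
```

Applied to the four rows of Table 22-2, this key places the "Abraham", "Abe", and "A." Lincoln records in one candidate group and Mary Todd Lincoln in another. A key this coarse would also merge unrelated people who happen to share a surname, gender, and initial, which is why address and other demographics usually feed into the decision as well.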

In some cases, we can load EMPI information from external systems, and in others we

compute it within Hadoop. The key is that we can expose this information for use in our

Crunch-based pipelines. The result is a PCollection<EMPIRecord> with the data

structured as follows:

@namespace("com.cerner.example")
protocol EMPIProtocol {
  record PersonRecordId {
    string sourceId;
    string personId;
  }

  /**
   * Represents an EMPI match.
   */
  record EMPIRecord {
    string empiId;
