Integrating Healthcare Data
There are dozens of processing steps between raw data and answers to healthcare-related
questions. Here we look at one: bringing together data for a single person from multiple
sources.
Unfortunately, the lack of a common patient identifier in the United States, combined with
noisy data such as variations in a person's name and demographics between systems,
makes it difficult to accurately unify a person's data across sources. Information spread
across multiple sources might look like Table 22-2.
Table 22-2. Data from multiple sources
Source           Person ID  First name  Last name  Address                 Gender
Doctor's office  12345      Abraham     Lincoln    1600 Pennsylvania Ave.  M
Hospital         98765      Abe         Lincoln    Washington, DC          M
Hospital         45678      Mary Todd   Lincoln    1600 Pennsylvania Ave.  F
Clinic           76543      A.          Lincoln    Springfield, IL         M
In healthcare, this problem is typically resolved by a system called an Enterprise Master
Patient Index (EMPI). An EMPI can be fed data from multiple systems and determine which records
are indeed for the same person. This is achieved in a variety of ways, ranging from humans
explicitly stating relationships to sophisticated algorithms that identify commonality.
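To make the idea of identifying commonality concrete, here is a minimal sketch in Java that groups the rows of Table 22-2 with a deliberately naive rule: two records are treated as the same person when they share a last name, a gender, and a first initial. This rule, the class name NaiveEmpiSketch, and the helpers matchKey and group are assumptions made purely for illustration. They are not how any particular EMPI works, and a real system would weigh far more evidence (addresses, birth dates, human review) before declaring a match.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NaiveEmpiSketch {

    /** One row of Table 22-2; this class and its field names are illustrative only. */
    static final class SourceRecord {
        final String source;
        final String personId;
        final String firstName;
        final String lastName;
        final String gender;

        SourceRecord(String source, String personId, String firstName,
                     String lastName, String gender) {
            this.source = source;
            this.personId = personId;
            this.firstName = firstName;
            this.lastName = lastName;
            this.gender = gender;
        }
    }

    /** Hypothetical match key: last name, gender, and first initial, lowercased. */
    static String matchKey(SourceRecord r) {
        String first = r.firstName.trim();
        String initial = first.isEmpty() ? "" : first.substring(0, 1);
        String key = r.lastName.trim() + "|" + r.gender.trim() + "|" + initial;
        return key.toLowerCase(Locale.ROOT);
    }

    /** Groups "source:personId" pairs by match key, preserving input order. */
    static Map<String, List<String>> group(List<SourceRecord> records) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            groups.computeIfAbsent(matchKey(r), k -> new ArrayList<>())
                  .add(r.source + ":" + r.personId);
        }
        return groups;
    }

    /** The four rows of Table 22-2 (the address column is unused by this naive rule). */
    static List<SourceRecord> table22_2() {
        List<SourceRecord> rows = new ArrayList<>();
        rows.add(new SourceRecord("Doctor's office", "12345", "Abraham", "Lincoln", "M"));
        rows.add(new SourceRecord("Hospital", "98765", "Abe", "Lincoln", "M"));
        rows.add(new SourceRecord("Hospital", "45678", "Mary Todd", "Lincoln", "F"));
        rows.add(new SourceRecord("Clinic", "76543", "A.", "Lincoln", "M"));
        return rows;
    }

    public static void main(String[] args) {
        group(table22_2()).forEach((key, ids) -> System.out.println(key + " -> " + ids));
    }
}
```

With the rows from Table 22-2, the three male Lincoln records land in one group and the Mary Todd Lincoln record in another. The same rule would also merge any two unrelated men named Lincoln whose first names start with "A", which is exactly the kind of false match that pushes production systems toward more sophisticated algorithms or explicit human confirmation.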
In some cases, we can load EMPI information from external systems, and in others we
compute it within Hadoop. The key is that we can expose this information for use in our
Crunch-based pipelines. The result is a PCollection<EMPIRecord> with the data
structured as follows:
@namespace("com.cerner.example")
protocol EMPIProtocol {
  record PersonRecordId {
    string sourceId;
    string personId;
  }
  /**
   * Represents an EMPI match.
   */
  record EMPIRecord {
    string empiId;