Composable Data at Cerner - Hadoop: The Definitive Guide


array<Diagnosis> diagnoses;

. . .

}

Note that a variety of data types are all nested in a common person record rather than in separate datasets. This supports the most common usage pattern for this data (looking at a complete record) without requiring downstream operations to do a number of expensive joins between datasets.
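To make the nesting concrete, here is a minimal Java sketch of a person record that carries its own data elements. Only `Diagnosis` and the `diagnoses` field come from the schema fragment above; `LabResult`, the other field names, and `summarize` are assumptions made for illustration, and the lab values are placeholders rather than real data.

```java
import java.util.List;

// Hypothetical Java mirror of the nested person record. Only `diagnoses`
// and `Diagnosis` appear in the source schema; everything else is assumed.
class NestedRecordSketch {
    record Diagnosis(String description) {}
    record LabResult(String test, String value) {}

    // All data for one person travels together in a single record.
    record PersonRecord(String personId, String name,
                        List<Diagnosis> diagnoses, List<LabResult> labResults) {}

    // Reads the complete record in one pass, with no lookup in another dataset.
    static String summarize(PersonRecord p) {
        StringBuilder sb = new StringBuilder(p.name())
                .append(" (").append(p.personId()).append(")");
        for (Diagnosis d : p.diagnoses()) {
            sb.append("; diagnosis: ").append(d.description());
        }
        for (LabResult r : p.labResults()) {
            sb.append("; lab: ").append(r.test()).append("=").append(r.value());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PersonRecord p = new PersonRecord("12345", "Abraham Lincoln",
                List.of(new Diagnosis("Diabetes")),
                List.of(new LabResult("example test", "example value")));
        System.out.println(summarize(p));
    }
}
```

Had diagnoses and lab results lived in separate datasets keyed by person ID, `summarize` would need one join per nested list before it could produce the same view.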

A series of Crunch pipelines is used to manipulate the data into a PCollection<PersonRecord>, hiding the complexity of each source and providing a simple interface to interact with the raw, normalized record data. Behind the scenes, each PersonRecord can be stored in HDFS or as a row in HBase, with the individual data elements spread throughout column families and qualifiers. The result of the aggregation looks like the data in Table 22-1.

Table 22-1. Aggregated data

Source           Person ID  Person demographics  Data
Doctor's office  12345      Abraham Lincoln ...  Diabetes diagnosis, lab results
Hospital         98765      Abe Lincoln ...      Flu diagnosis
Pharmacy         98765      Abe Lincoln ...      Allergies, medications
Clinic           76543      A. Lincoln ...       Lab results

Consumers wishing to retrieve data from a collection of authorized sources call a “retriever” API that simply produces a Crunch PCollection of requested data:

Set<String> sources = ...;

PCollection<PersonRecord> personRecords =
    RecordRetriever.getData(pipeline, sources);

This retriever pattern allows consumers to load datasets while being insulated from how and where they are physically stored. At the time of this writing, some use of this pattern is being replaced by the emerging Kite SDK for managing data in Hadoop. Each entry in the retrieved PCollection<PersonRecord> represents a person's complete medical record within the context of a single source.
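The shape of the retriever pattern can be sketched in plain Java without a Crunch dependency. This is an illustrative analogue, not Cerner's implementation: a `List` stands in for the `PCollection`, and `RecordStore`, `sampleStore`, and the flat `PersonRecord` fields are hypothetical names introduced here. The sample rows are those of Table 22-1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RetrieverSketch {
    // One person's record within the context of a single source.
    record PersonRecord(String source, String personId, String name, String data) {}

    // Hypothetical storage abstraction; a real one might read HDFS or HBase.
    interface RecordStore {
        List<PersonRecord> load(String source);
    }

    // In-memory stand-in for physical storage, holding the rows of Table 22-1.
    static RecordStore sampleStore() {
        Map<String, List<PersonRecord>> bySource = Map.of(
            "Doctor's office", List.of(new PersonRecord("Doctor's office", "12345",
                    "Abraham Lincoln", "Diabetes diagnosis, lab results")),
            "Hospital", List.of(new PersonRecord("Hospital", "98765",
                    "Abe Lincoln", "Flu diagnosis")),
            "Pharmacy", List.of(new PersonRecord("Pharmacy", "98765",
                    "Abe Lincoln", "Allergies, medications")),
            "Clinic", List.of(new PersonRecord("Clinic", "76543",
                    "A. Lincoln", "Lab results")));
        return source -> bySource.getOrDefault(source, List.of());
    }

    // Analogue of RecordRetriever.getData(pipeline, sources): callers name the
    // sources they want and never see where or how the records are stored.
    static List<PersonRecord> getData(RecordStore store, Set<String> sources) {
        List<PersonRecord> out = new ArrayList<>();
        for (String source : sources) {
            out.addAll(store.load(source));
        }
        return out;
    }

    public static void main(String[] args) {
        for (PersonRecord r : getData(sampleStore(), Set.of("Hospital", "Pharmacy"))) {
            System.out.println(r);
        }
    }
}
```

Because callers depend only on `getData`, the store behind it can be swapped (for example, from files to HBase rows) without changing consumer code, which is the insulation the pattern is meant to provide.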
