array<Diagnosis> diagnoses;
. . .
}
}
Note that a variety of data types are all nested in a common person record rather than in
separate datasets. This supports the most common usage pattern for this data (looking at
a complete record) without requiring downstream operations to do a number of expensive
joins between datasets.
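The benefit of nesting can be illustrated with a minimal plain-Java sketch. Everything here is an assumption made for illustration: the class names `NestedPersonSketch` and `LabResult`, the `description` fields, and the `itemCount` helper are hypothetical, and the real schema (elided in the listing above) is certainly richer. The point is only that a consumer holding one record reaches all of that person's data through field access, with no join across datasets.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for nested types; the actual schema is elided in the text.
final class Diagnosis {
    final String description;
    Diagnosis(String description) { this.description = description; }
}

final class LabResult {
    final String description;
    LabResult(String description) { this.description = description; }
}

// One record carries every data type for a person (an assumed, simplified shape).
final class NestedPersonSketch {
    final String personId;
    final List<Diagnosis> diagnoses;
    final List<LabResult> labResults;

    NestedPersonSketch(String personId, List<Diagnosis> diagnoses, List<LabResult> labResults) {
        this.personId = personId;
        this.diagnoses = Collections.unmodifiableList(diagnoses);
        this.labResults = Collections.unmodifiableList(labResults);
    }

    // Reading the complete record is a walk over nested fields, not a join.
    int itemCount() {
        return diagnoses.size() + labResults.size();
    }

    public static void main(String[] args) {
        NestedPersonSketch person = new NestedPersonSketch(
            "12345",
            Arrays.asList(new Diagnosis("Diabetes diagnosis")),
            Arrays.asList(new LabResult("Lab results")));
        System.out.println(person.personId + " has " + person.itemCount() + " items");
    }
}
```

Had diagnoses and lab results lived in separate datasets keyed by person ID, assembling the same view would have required one join per data type.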
A series of Crunch pipelines is used to manipulate the data into a
PCollection<PersonRecord>, hiding the complexity of each source and providing a simple
interface to interact with the raw, normalized record data. Behind the scenes, each
PersonRecord can be stored in HDFS or as a row in HBase, with the individual data
elements spread throughout column families and qualifiers. The result of the aggregation
looks like the data in Table 22-1.
Table 22-1. Aggregated data
Source           Person ID  Person demographics  Data
Doctor's office  12345      Abraham Lincoln ...  Diabetes diagnosis, lab results
Hospital         98765      Abe Lincoln ...      Flu diagnosis
Pharmacy         98765      Abe Lincoln ...      Allergies, medications
Clinic           76543      A. Lincoln ...       Lab results
Consumers wishing to retrieve data from a collection of authorized sources call a “retriever”
API that simply produces a Crunch PCollection of requested data:
Set<String> sources = ...;
PCollection<PersonRecord> personRecords =
    RecordRetriever.getData(pipeline, sources);
This retriever pattern allows consumers to load datasets while being insulated from how
and where they are physically stored. At the time of this writing, some use of this pattern
is being replaced by the emerging Kite SDK for managing data in Hadoop. Each entry in
the retrieved PCollection<PersonRecord> represents a person's complete medical
record within the context of a single source.
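The insulation the retriever pattern provides can be sketched in plain Java without Crunch on the classpath. This is not the actual RecordRetriever: `RetrieverSketch`, `RecordStore`, `SourceRecord`, and `register` are hypothetical names, and a `List` stands in for a PCollection. The sketch assumes each source is backed by some store whose location and format the caller never sees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical record: one person's data as seen by a single source.
final class SourceRecord {
    final String source;
    final String personId;
    final String data;

    SourceRecord(String source, String personId, String data) {
        this.source = source;
        this.personId = personId;
        this.data = data;
    }
}

// Hypothetical storage abstraction; an implementation might read HDFS files or HBase rows.
interface RecordStore {
    List<SourceRecord> load();
}

final class RetrieverSketch {
    private final Map<String, RecordStore> storesBySource = new LinkedHashMap<>();

    void register(String source, RecordStore store) {
        storesBySource.put(source, store);
    }

    // Callers name the sources they want; how and where records are stored stays hidden.
    List<SourceRecord> getData(Set<String> sources) {
        List<SourceRecord> result = new ArrayList<>();
        for (Map.Entry<String, RecordStore> entry : storesBySource.entrySet()) {
            if (sources.contains(entry.getKey())) {
                result.addAll(entry.getValue().load());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        RetrieverSketch retriever = new RetrieverSketch();
        // In-memory stores stand in for real storage back ends.
        retriever.register("Hospital",
            () -> Arrays.asList(new SourceRecord("Hospital", "98765", "Flu diagnosis")));
        retriever.register("Clinic",
            () -> Arrays.asList(new SourceRecord("Clinic", "76543", "Lab results")));

        Set<String> sources = new TreeSet<>(Arrays.asList("Hospital"));
        for (SourceRecord record : retriever.getData(sources)) {
            System.out.println(record.source + " " + record.personId + " " + record.data);
        }
    }
}
```

Because consumers depend only on `getData`, a store could in principle be moved from one back end to another without changing any calling code, which is the property the text attributes to the pattern.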