Database Reference
In-Depth Information
new
MapFn
<
PersonRecord
,
Pair
<
String
,
String
>>() {
@Override
public
Pair
<
String
,
String
>
map
(
PersonRecord input
) {
return
Pair
.
of
(
input
.
getSourceId
(),
input
.
getPersonId
());
}
},
pairs
(
strings
(),
strings
()));
Joining the two
PTable
objects will return a
PTable<Pair<String, String>,
Pair<EMPIRecord, PersonRecord>>
. In this situation, the keys are no longer
useful, so we change the table to be keyed by the EMPI identifier:
PTable
<
String
,
PersonRecord
>
personRecordKeyedByEMPI
=
keyedPersonRecords
.
join
(
keyedEmpiRecords
)
.
values
()
.
by
(
new
MapFn
<
Pair
<
PersonRecord
,
EMPIRecord
>>() {
@Override
public
String
map
(
Pair
<
PersonRecord
,
EMPIRecord
>
input
) {
return
input
.
second
().
getEmpiId
();
}
},
strings
()));
The final step is to group the table by its key to ensure all of the data is aggregated togeth-
er for processing as a complete collection:
PGroupedTable
<
String
,
PersonRecord
>
groupedPersonRecords
=
personRecordKeyedByEMPI
.
groupByKey
();
This logic to unify data sources is the first step of a larger execution flow. Other Crunch
functions downstream build on these steps to meet many client needs. In a common use
case, a number of problems are solved by loading the contents of the unified
PersonRe-
cord
s into a rules-based processing model to emit new clinical knowledge. For instance,
we may run rules over those records to determine if a diabetic is receiving recommended
care, and to indicate areas that can be improved. Similar rule sets exist for a variety of
needs, ranging from general wellness to managing complicated conditions. The logic can
be complicated and with a lot of variance between use cases, but it is all hosted in func-
tions composed in a Crunch pipeline.