CHAPTER 3
Test-Driven Development
Example 5: TF-IDF Implementation
In the previous example, we looked at extending pipe assemblies in Cascading workflows. Functionally, “Example 4: Replicated Joins” is only a few changes away from implementing an algorithm called term frequency-inverse document frequency (TF-IDF). This is the basis for many search indexing metrics, such as in the popular open [...] discussion of the algorithm and its use.
For this example, let's show how to implement TF-IDF in Cascading, which is a useful subassembly to reuse in a variety of apps. Figure 3-1 shows a conceptual diagram for this. Based on having a more complex app to work with, we'll begin to examine Cascading features for testing at scale.
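
Before moving to the Cascading version, it may help to see the arithmetic behind TF-IDF on its own. The following is a minimal sketch in plain Java, not Cascading code. It assumes one common smoothed variant of the metric, tfidf = tf * log(n_docs / (1 + df)), where tf is the count of a term in one document, df is the number of documents containing that term, and n_docs is the total number of documents. The class name `TfIdfSketch`, its method names, and the sample documents are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Cascading.
class TfIdfSketch {

    // Term frequency: how many times each token occurs in a single document.
    static Map<String, Integer> termCounts(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Document frequency: in how many documents each token occurs at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so a token counts once per document.
            for (String token : new HashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed variant: tfidf = tf * log(nDocs / (1 + df)).
    // Returns a score for each distinct token in the document at docIndex.
    static Map<String, Double> tfIdf(List<List<String>> docs, int docIndex) {
        Map<String, Integer> df = documentFrequency(docs);
        int nDocs = docs.size();
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts(docs.get(docIndex)).entrySet()) {
            double idf = Math.log((double) nDocs / (1.0 + df.get(e.getKey())));
            scores.put(e.getKey(), e.getValue() * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        // Four tiny, already-tokenized documents (made-up sample data).
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("rain", "shadow", "rain"),
            Arrays.asList("rain", "dry"),
            Arrays.asList("dry", "air"),
            Arrays.asList("air", "mass"));
        System.out.println(tfIdf(docs, 0));
    }
}
```

In the sample data, "shadow" appears once in the first document and nowhere else, while "rain" appears twice there but also in a second document, so the rarer term receives the larger per-occurrence weight. Note that under this variant a term present in every document gets a slightly negative score, which is one reason other formulations of the metric exist.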