CHAPTER 3
Test-Driven Development
Example 5: TF-IDF Implementation
In the previous example, we looked at extending pipe assemblies in Cascading workflows.
Functionally, “Example 4: Replicated Joins” is only a few changes away from
implementing an algorithm called term frequency-inverse document frequency (TF-IDF).
This is the basis for many search indexing metrics, such as in the popular open
source search engine Apache Lucene. See the Similarity class in Lucene for a great
discussion of the algorithm and its use.
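
To make the metric concrete before turning to Cascading, here is a minimal sketch in plain Python rather than the Cascading API. It assumes the textbook form of the weighting: raw term counts for TF, and the natural log of (number of documents / number of documents containing the term) for IDF. Lucene and other implementations use smoothed or normalized variants, so treat the exact formula, the function name tf_idf, and the sample documents as illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term per document.

    docs maps a document ID to its list of tokens.
    Returns {doc_id: {term: score}}.
    """
    n_docs = len(docs)

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)  # raw count of each term within this document
        scores[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores


# Hypothetical sample data: "area" appears in every document.
docs = {
    "doc1": ["rain", "shadow", "rain", "area"],
    "doc2": ["rain", "forest", "area"],
    "doc3": ["dry", "forest", "area"],
}
scores = tf_idf(docs)
```

In this sample, "shadow" occurs in only one document, so it outscores "rain" within doc1 even though "rain" appears there twice. "area" occurs in every document, so its score is zero everywhere. That is the behavior a search index relies on: terms that distinguish a document are weighted up, and ubiquitous terms are weighted down.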
For this example, let's show how to implement TF-IDF in Cascading. The result is a
useful subassembly to reuse in a variety of apps. Figure 3-1 shows a conceptual diagram for
this. Now that we have a more complex app to work with, we'll also begin to examine
Cascading features for testing at scale.
 