By the way, did you notice what the TF-IDF weights for the tokens rain and shadow
were? Those represent what these documents all have in common. How do those
compare with weights for the other tokens? Conversely, consider the weight for
australia (high weight) or area (different weights).
TF-IDF calculates a metric for each token, which indicates how “important” that token
is to a document within the context of a collection of documents. The metric is calculated
based on relative frequencies. On one hand, tokens that appear in most documents tend
to have very low TF-IDF weights. On the other hand, tokens that are less common but
appear multiple times in a few documents tend to have very high TF-IDF weights.
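To make the relative-frequency idea concrete, here is a minimal sketch in plain Java, outside of Cascading. It assumes one common variant of the formula, the raw term count multiplied by ln(N / (1 + df)), where N is the number of documents and df is the number of documents containing the token. The class name TfIdfSketch, its method names, and the four tiny sample documents are hypothetical and exist only for illustration; they are not the documents used in the example app.

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    // Number of times the token occurs in a single document.
    public static long termCount(String token, List<String> doc) {
        return doc.stream().filter(token::equals).count();
    }

    // Number of documents in the collection that contain the token at least once.
    public static long docFrequency(String token, List<List<String>> corpus) {
        return corpus.stream().filter(d -> d.contains(token)).count();
    }

    // Assumed variant: tfidf = tf * ln(N / (1 + df)).
    public static double tfidf(String token, List<String> doc, List<List<String>> corpus) {
        double tf = termCount(token, doc);
        double df = docFrequency(token, corpus);
        return tf * Math.log(corpus.size() / (1.0 + df));
    }

    // Hypothetical, already-tokenized documents (not the book's data set).
    public static List<List<String>> sampleCorpus() {
        return Arrays.asList(
            Arrays.asList("rain", "shadow", "area", "area"),
            Arrays.asList("rain", "shadow", "australia", "australia"),
            Arrays.asList("rain", "shadow", "area", "lake"),
            Arrays.asList("rain", "shadow", "valley", "dry"));
    }

    public static void main(String[] args) {
        List<List<String>> corpus = sampleCorpus();
        // Print the weight of a few tokens in each document that contains them.
        for (String token : Arrays.asList("rain", "shadow", "area", "australia")) {
            for (int i = 0; i < corpus.size(); i++) {
                List<String> doc = corpus.get(i);
                if (doc.contains(token)) {
                    System.out.printf("doc%d %-10s %.4f%n", i + 1, token, tfidf(token, doc, corpus));
                }
            }
        }
    }
}
```

In this sketch, rain and shadow appear in every document, so their weights fall below those of every other token (with this particular variant they come out slightly negative). The token australia appears twice in a single document and gets the highest weight, while area gets a different weight in each of the two documents that contain it because its count differs.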
Note that information retrieval papers use token and term almost interchangeably in
some cases. More advanced text analytics might calculate metrics for phrases, in which
case a term becomes a more complex structure. However, we're only looking at single
words.
Example 6: TF-IDF with Testing
Now that we have a more complex workflow for TF-IDF, let's consider best practices for
test-driven development (TDD) at scale. We'll add unit tests into the build, then show
how to leverage TDD features that are unique to Cascading: checkpoints, traps, assertions,
etc. Figure 3-6 shows a conceptual diagram for this app.
Generally speaking, TDD starts off with a failing test, and then you code until the test
passes. We'll start with a working app, with tests that pass—followed by discussion of
how to use assertions for the test/code cycle.
Starting from the source code directory that you cloned in Git, change into the part6
subdirectory. As a first step toward better testing, let's add a unit test and show how it
fits into this example. First we need to add support for testing to our build. In the Gradle
build script build.gradle, we modify the dependencies block to include JUnit and the other
testing dependencies:
dependencies {
    compile ( 'cascading:cascading-core:2.1.+' ) { exclude group: 'log4j' }
    compile ( 'cascading:cascading-hadoop:2.1.+' ) { transitive = true }
    testCompile ( 'cascading:cascading-test:2.1.+' )
    testCompile ( 'org.apache.hadoop:hadoop-test:1.0.+' )
    testCompile ( 'junit:junit:4.8.+' )
}