Figure 3-4. Document frequency branch
Notice that the results are named lhs_join, again preparing for the subsequent join.
Now we have all the components needed to calculate TF-IDF weights. To finish the calculations in parallel, we'll use two different kinds of joins to bring the Cascading branches together:
```java
// join to bring together all the components for calculating TF-IDF
// the D side of the join is smaller, so it goes on the RHS
Pipe idfPipe = new HashJoin( dfPipe, lhs_join, dPipe, rhs_join );

// the IDF side of the join is smaller, so it goes on the RHS
Pipe tfidfPipe = new CoGroup( tfPipe, tf_token, idfPipe, df_token );
```
We used HashJoin previously for a replicated join. In this case we know that the document count will not be a large amount of data, so it fits on the RHS. The other join, CoGroup, handles the more general case where the RHS cannot be kept entirely in memory. In those cases a threshold can be adjusted for "spill," where RHS tuples get moved to disk.
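To make the trade-off concrete, the core idea behind a replicated join can be sketched in plain Java, independent of the Cascading API: the smaller right-hand side is loaded into an in-memory map, and the larger left-hand side is streamed against it. All names here (ReplicatedJoinSketch, its fields, the sample data) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: the replicated (map-side hash) join idea behind HashJoin.
// The small RHS lives entirely in memory; the large LHS streams past it.
public class ReplicatedJoinSketch {
  // joins each (key, value) LHS row against the in-memory RHS map
  public static List<String[]> join( List<String[]> lhs, Map<String, String> rhs ) {
    List<String[]> results = new ArrayList<>();
    for ( String[] row : lhs ) {
      String match = rhs.get( row[ 0 ] );  // constant-time lookup per LHS tuple
      if ( match != null )
        results.add( new String[] { row[ 0 ], row[ 1 ], match } );
    }
    return results;
  }

  public static void main( String[] args ) {
    Map<String, String> rhs = new HashMap<>();  // small side, held in memory
    rhs.put( "corpus", "n_docs=5" );

    List<String[]> lhs = new ArrayList<>();     // large side, streamed
    lhs.add( new String[] { "corpus", "df_count=2" } );

    for ( String[] row : join( lhs, rhs ) )
      System.out.println( String.join( " ", row ) );
  }
}
```

Because every LHS tuple costs only one hash lookup, no sorting or shuffling of the large side is required, which is exactly why HashJoin is cheap when one side is small and why CoGroup must instead spill to disk when neither side fits in memory.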
Then we calculate TF-IDF weights using an ExpressionFunction in Cascading:
```java
// calculate the TF-IDF weights, per token, per document
Fields tfidf = new Fields( "tfidf" );
String expression =
  "(double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )";
ExpressionFunction tfidfExpression =
  new ExpressionFunction( tfidf, expression, Double.class );
Fields tfidfArguments = new Fields( "tf_count", "df_count", "n_docs" );
tfidfPipe =
  new Each( tfidfPipe, tfidfArguments, tfidfExpression, Fields.ALL );

fieldSelector = new Fields( "tf_token", "doc_id", "tfidf" );
tfidfPipe = new Retain( tfidfPipe, fieldSelector );
tfidfPipe = new Rename( tfidfPipe, tf_token, token );
```
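The expression string is ordinary Java arithmetic, evaluated once per tuple against the tf_count, df_count, and n_docs argument fields. A standalone check of the same formula (the sample counts below are made up for illustration, not taken from the example corpus) shows what each output tuple receives:

```java
// Standalone check of the TF-IDF formula used in the expression string above;
// the sample counts are invented for illustration.
public class TfIdfCheck {
  // same arithmetic as the Janino expression:
  // (double) tf_count * Math.log( (double) n_docs / ( 1.0 + df_count ) )
  public static double tfidf( int tfCount, int dfCount, int nDocs ) {
    return (double) tfCount * Math.log( (double) nDocs / ( 1.0 + dfCount ) );
  }

  public static void main( String[] args ) {
    // a token appearing 3 times in one document, in 4 of 100 documents
    System.out.println( tfidf( 3, 4, 100 ) );  // 3 * ln(20) ≈ 8.987
  }
}
```

Note the 1.0 + df_count in the denominator: it smooths the inverse document frequency so that a token appearing in every document still yields a finite (and near-zero) weight rather than log(1) exactly or a division artifact.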