Test-Driven Development - Enterprise Data Workflows with Cascading


Fields doc_id = new Fields("doc_id");
Fields tally = new Fields("tally");
Fields rhs_join = new Fields("rhs_join");
Fields n_docs = new Fields("n_docs");

Pipe dPipe = new Unique("D", tokenPipe, doc_id);
dPipe = new Each(dPipe, new Insert(tally, 1), Fields.ALL);
dPipe = new Each(dPipe, new Insert(rhs_join, 1), Fields.ALL);
dPipe = new SumBy(dPipe, rhs_join, tally, n_docs, long.class);

Figure 3-3. Document counts branch

This branch filters for the unique doc_id values and then uses another built-in partial aggregate operation called SumBy, which sums the values associated with duplicate keys in a tuple stream. Every tuple carries the same constant key (rhs_join is 1), so summing the tally values into n_docs yields a single total. Great, now we've got the document count. Notice that the key field is named rhs_join, preparing for the subsequent join.
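
To make the data flow concrete, here is a minimal plain-Java sketch of what the D branch computes. It does not use Cascading at all; the class name DocCountSketch, the method countDocs, and the sample doc_id and token values are hypothetical, and each tuple is assumed to be a two-element array of doc_id and token.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration of the "D" branch logic; not Cascading code.
class DocCountSketch {

    // Each input tuple is assumed to be {doc_id, token}.
    static Map<Integer, Long> countDocs(List<String[]> tokenStream) {
        // Step 1 (Unique on doc_id): keep one entry per document.
        Set<String> uniqueDocIds = new LinkedHashSet<>();
        for (String[] tuple : tokenStream) {
            uniqueDocIds.add(tuple[0]);
        }

        // Steps 2-4 (Insert tally, Insert rhs_join, SumBy):
        // every document contributes tally = 1 under the constant key rhs_join = 1.
        Map<Integer, Long> nDocsByJoinKey = new LinkedHashMap<>();
        for (String docId : uniqueDocIds) {
            int rhsJoin = 1;
            long tally = 1L;
            nDocsByJoinKey.merge(rhsJoin, tally, Long::sum);
        }
        return nDocsByJoinKey;
    }

    public static void main(String[] args) {
        List<String[]> stream = Arrays.asList(
            new String[] {"doc01", "rain"},
            new String[] {"doc01", "shadow"},
            new String[] {"doc02", "rain"},
            new String[] {"doc03", "dry"});
        System.out.println(countDocs(stream)); // prints {1=3}
    }
}
```

The result has exactly one entry: the constant join key mapped to the number of distinct documents. That single-row shape is what lets the count be attached to every row of another branch in a later join.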

The third branch calculates document frequency for each token. We'll call that pipe assembly dfPipe, with a branch name DF, as shown in Figure 3-4:

// one branch tallies the token counts for document frequency (DF)
Pipe dfPipe = new Unique("DF", tokenPipe, Fields.ALL);
Fields df_count = new Fields("df_count");
dfPipe = new CountBy(dfPipe, token, df_count);

Fields df_token = new Fields("df_token");
Fields lhs_join = new Fields("lhs_join");
dfPipe = new Rename(dfPipe, token, df_token);
dfPipe = new Each(dfPipe, new Insert(lhs_join, 1), Fields.ALL);
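
The DF branch can be sketched the same way in plain Java, without Cascading. This is only an illustration under assumptions: the token stream is taken to carry exactly two fields, doc_id and token, and the class name DocFreqSketch, the method documentFrequency, and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration of the "DF" branch logic; not Cascading code.
class DocFreqSketch {

    // Each input tuple is assumed to be {doc_id, token}.
    static Map<String, Long> documentFrequency(List<String[]> tokenStream) {
        // Step 1 (Unique on all fields): drop repeated (doc_id, token) pairs,
        // so a token is counted at most once per document.
        Set<List<String>> uniquePairs = new LinkedHashSet<>();
        for (String[] tuple : tokenStream) {
            uniquePairs.add(Arrays.asList(tuple[0], tuple[1]));
        }

        // Step 2 (CountBy on token): count the documents containing each token.
        Map<String, Long> dfCount = new TreeMap<>();
        for (List<String> pair : uniquePairs) {
            dfCount.merge(pair.get(1), 1L, Long::sum);
        }
        return dfCount;
    }

    public static void main(String[] args) {
        List<String[]> stream = Arrays.asList(
            new String[] {"doc01", "rain"},
            new String[] {"doc01", "rain"},   // repeat within the same document
            new String[] {"doc01", "shadow"},
            new String[] {"doc02", "rain"},
            new String[] {"doc03", "dry"});
        System.out.println(documentFrequency(stream)); // prints {dry=1, rain=2, shadow=1}
    }
}
```

The sketch stops at the counts. In the listing above, the pipe additionally renames token to df_token and inserts the constant lhs_join field, which only affects field naming and the later join, not the counts themselves.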
