Fields doc_id = new Fields( "doc_id" );
Fields tally = new Fields( "tally" );
Fields rhs_join = new Fields( "rhs_join" );
Fields n_docs = new Fields( "n_docs" );
Pipe dPipe = new Unique( "D", tokenPipe, doc_id );
dPipe = new Each( dPipe, new Insert( tally, 1 ), Fields.ALL );
dPipe = new Each( dPipe, new Insert( rhs_join, 1 ), Fields.ALL );
dPipe = new SumBy( dPipe, rhs_join, tally, n_docs, long.class );
Figure 3-3. Document counts branch
This filters for the unique doc_id values and then uses another built-in partial aggregate
operation called SumBy, which sums values associated with duplicate keys in a tuple
stream. Because every tuple carries the same rhs_join value of 1, the tally values
collapse into a single total named n_docs. Great, now we've got the document count.
Notice that the result is keyed on the constant rhs_join field, preparing for the
subsequent join.
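To make the arithmetic of this branch concrete, here is a minimal plain-Java sketch of what it computes. It does not use the Cascading API at all: the class name DocCountSketch, the method countDocs, and the sample tuples are all hypothetical, and each tuple is assumed to be a two-element array of (doc_id, token).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Cascading.
final class DocCountSketch {

    // Each tuple is assumed to be { doc_id, token }.
    static long countDocs(List<String[]> tokenStream) {
        // Step 1: keep one entry per distinct doc_id (the role Unique plays).
        Set<String> uniqueDocIds = new LinkedHashSet<>();
        for (String[] tuple : tokenStream) {
            uniqueDocIds.add(tuple[0]);
        }

        // Step 2: every surviving doc_id carries tally = 1 under the same
        // constant key, so summing the tallies yields the document count
        // (the role Insert plus SumBy play).
        long nDocs = 0L;
        for (String docId : uniqueDocIds) {
            nDocs += 1L;
        }
        return nDocs;
    }

    public static void main(String[] args) {
        List<String[]> stream = Arrays.asList(
            new String[] { "doc01", "rain" },
            new String[] { "doc01", "shadow" },
            new String[] { "doc02", "rain" },
            new String[] { "doc03", "lake" });
        System.out.println("n_docs = " + countDocs(stream));
    }
}
```

With the four sample tuples above there are three distinct doc_id values, so the sketch prints n_docs = 3. In the real flow the same total travels in a single tuple next to the constant rhs_join key.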
The third branch calculates document frequency for each token. We'll call that pipe
assembly dfPipe, with a branch name DF, as shown in Figure 3-4:
// one branch tallies the token counts for document frequency (DF)
Pipe dfPipe = new Unique( "DF", tokenPipe, Fields.ALL );
Fields df_count = new Fields( "df_count" );
dfPipe = new CountBy( dfPipe, token, df_count );
Fields df_token = new Fields( "df_token" );
Fields lhs_join = new Fields( "lhs_join" );
dfPipe = new Rename( dfPipe, token, df_token );
dfPipe = new Each( dfPipe, new Insert( lhs_join, 1 ), Fields.ALL );
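The Unique and CountBy steps of the DF branch can be sketched in plain Java as follows. This is only an illustration of the logic, not Cascading code: DocFrequencySketch, documentFrequency, and the sample data are hypothetical names, and each tuple is again assumed to be a two-element array of (doc_id, token).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Cascading.
final class DocFrequencySketch {

    // Each tuple is assumed to be { doc_id, token }.
    static Map<String, Long> documentFrequency(List<String[]> tokenStream) {
        Set<String> seenPairs = new HashSet<>();
        Map<String, Long> dfCount = new TreeMap<>();
        for (String[] tuple : tokenStream) {
            // Deduplicate on the whole tuple (the role of Unique on Fields.ALL),
            // so a token repeated inside one document is counted once.
            String pairKey = tuple[0] + '\u0000' + tuple[1];
            if (seenPairs.add(pairKey)) {
                // Count the remaining tuples per token (the role of CountBy).
                dfCount.merge(tuple[1], 1L, Long::sum);
            }
        }
        return dfCount;
    }

    public static void main(String[] args) {
        List<String[]> stream = Arrays.asList(
            new String[] { "doc01", "rain" },
            new String[] { "doc01", "rain" },
            new String[] { "doc02", "rain" },
            new String[] { "doc01", "lake" });
        System.out.println(documentFrequency(stream));
    }
}
```

For the sample data this prints {lake=1, rain=2}: the token rain appears twice in doc01 but contributes only once per document. Deduplicating before counting is what separates document frequency from a raw term count.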
 