Fields doc_id = new Fields( "doc_id" );
Fields tally = new Fields( "tally" );
Fields rhs_join = new Fields( "rhs_join" );
Fields n_docs = new Fields( "n_docs" );
Pipe dPipe = new Unique( "D", tokenPipe, doc_id );
dPipe = new Each( dPipe, new Insert( tally, 1 ), Fields.ALL );
dPipe = new Each( dPipe, new Insert( rhs_join, 1 ), Fields.ALL );
dPipe = new SumBy( dPipe, rhs_join, tally, n_docs, long.class );
Figure 3-3. Document counts branch
This filters for the unique doc_id values and then uses another built-in partial aggregate operation called SumBy, which sums values associated with duplicate keys in a tuple stream. Great, now we've got the document count. Notice that the results are keyed by rhs_join, with the sum itself in n_docs, preparing for the subsequent join.
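To make the data flow of this branch concrete, here is a minimal plain-Java sketch of the same three steps: keep one entry per doc_id, attach a tally of 1 to each, and sum the tallies under a single shared key. It does not use Cascading; the class DocCountSketch, the TokenTuple type, and the countDocs method are hypothetical names introduced only for illustration, and the input is assumed to be an in-memory list of (doc_id, token) pairs.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of the "D" branch; this is not Cascading code.
final class DocCountSketch {

    // Hypothetical stand-in for one (doc_id, token) tuple from tokenPipe.
    static final class TokenTuple {
        final String docId;
        final String token;

        TokenTuple(String docId, String token) {
            this.docId = docId;
            this.token = token;
        }
    }

    // Returns the equivalent of n_docs: the number of distinct doc_id values.
    static long countDocs(List<TokenTuple> tokenStream) {
        // Step 1 (Unique on doc_id): keep one entry per document.
        Set<String> uniqueDocIds = new LinkedHashSet<>();
        for (TokenTuple t : tokenStream) {
            uniqueDocIds.add(t.docId);
        }

        // Steps 2 and 3 (Insert tally = 1, then SumBy): every document
        // carries the same constant join key, so all tallies fall into
        // one group and the sum is the document count.
        long nDocs = 0L;
        for (String ignored : uniqueDocIds) {
            long tally = 1L;
            nDocs += tally;
        }
        return nDocs;
    }

    public static void main(String[] args) {
        List<TokenTuple> stream = Arrays.asList(
            new TokenTuple("doc01", "rain"),
            new TokenTuple("doc01", "shadow"),
            new TokenTuple("doc02", "rain"),
            new TokenTuple("doc03", "dry"));
        System.out.println("n_docs = " + countDocs(stream));
    }
}
```

The constant key is what makes the later join possible: because every tuple in this branch has the same rhs_join value, the single n_docs result can be attached to every row on the other side of the join.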
The third branch calculates document frequency for each token. We'll call that pipe assembly dfPipe, with a branch name DF, as shown in Figure 3-4:
// one branch tallies the token counts for document frequency (DF)
Pipe dfPipe = new Unique( "DF", tokenPipe, Fields.ALL );
Fields df_count = new Fields( "df_count" );
dfPipe = new CountBy( dfPipe, token, df_count );
Fields df_token = new Fields( "df_token" );
Fields lhs_join = new Fields( "lhs_join" );
dfPipe = new Rename( dfPipe, token, df_token );
dfPipe = new Each( dfPipe, new Insert( lhs_join, 1 ), Fields.ALL );
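The listing above can be read as "deduplicate (doc_id, token) pairs, then count how many documents each token appears in." The following plain-Java sketch illustrates that logic under stated assumptions: it does not use Cascading, the names DocFrequencySketch and documentFrequency are hypothetical, and the input is assumed to be an in-memory list of two-element arrays holding doc_id and token. The Rename and Insert steps are not modeled, since they only relabel the token field and add a constant join key.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java illustration of the "DF" branch; this is not Cascading code.
final class DocFrequencySketch {

    // Each input element is assumed to be {doc_id, token}.
    // Returns a map from token (df_token) to its document frequency (df_count).
    static Map<String, Long> documentFrequency(List<String[]> tokenStream) {
        // Unique on all fields: a token repeated within one document
        // must count only once toward that document.
        Set<List<String>> uniquePairs = new LinkedHashSet<>();
        for (String[] t : tokenStream) {
            uniquePairs.add(Arrays.asList(t[0], t[1]));
        }

        // CountBy token: number of distinct documents containing each token.
        Map<String, Long> dfCount = new TreeMap<>();
        for (List<String> pair : uniquePairs) {
            dfCount.merge(pair.get(1), 1L, Long::sum);
        }
        return dfCount;
    }

    public static void main(String[] args) {
        List<String[]> stream = Arrays.asList(
            new String[] {"doc01", "rain"},
            new String[] {"doc01", "rain"},   // duplicate within doc01
            new String[] {"doc02", "rain"},
            new String[] {"doc02", "shadow"},
            new String[] {"doc03", "rain"});
        System.out.println(documentFrequency(stream));
    }
}
```

Deduplicating before counting is the important design point: without it, the count would measure how often a token occurs overall rather than how many documents contain it.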