| Class | Method name(s) | Description |
|---|---|---|
| PTables | asPTable() | Converts a PCollection<Pair<K, V>> to a PTable<K, V>. |
| PTables | keys() | Returns a PTable's keys as a PCollection. |
| PTables | values() | Returns a PTable's values as a PCollection. |
| PTables | mapKeys() | Applies a map function to all the keys in a PTable, leaving the values unchanged. |
| PTables | mapValues() | Applies a map function to all the values in a PTable or PGroupedTable, leaving the keys unchanged. |
| Sample | sample() | Creates a sample of a PCollection by choosing each element independently with a specified probability. |
| Sample | reservoirSample() | Creates a sample of a PCollection of a specified size, where each element is equally likely to be included. |
| SecondarySort | sortAndApply() | Sorts a PTable<K, Pair<V1, V2>> by K then V1, then applies a function to give an output PCollection or PTable. |
| Set | difference() | Returns a PCollection that is the set difference of two PCollections. |
| Set | intersection() | Returns a PCollection that is the set intersection of two PCollections. |
| Set | comm() | Returns a PCollection of triples that classifies each element from two PCollections by whether it is only in the first collection, only in the second collection, or in both collections. (Similar to the Unix comm command.) |
| Shard | shard() | Creates a PCollection that contains exactly the same elements as the input PCollection, but is partitioned (sharded) across a specified number of files. |
| Sort | sort() | Performs a total sort on a PCollection in the natural order of its elements in ascending (the default) or descending order. There are also methods to sort PTables by key, and collections of Pairs or tuples by a subset of their columns in a specified order. |
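The classification that comm() is described as performing can be hard to picture from a one-line summary. The following is a minimal sketch in plain Java, using in-memory lists rather than Crunch's PCollection, that shows the same idea: each distinct element is labeled as belonging only to the first input, only to the second, or to both. The class name CommSketch, the Side enum, and the map-based return type are assumptions made for illustration; they are not part of the Crunch API, which returns triples instead.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration of comm()-style classification. This is NOT Crunch code.
final class CommSketch {

    // Hypothetical labels standing in for the three positions of a triple.
    enum Side { FIRST_ONLY, SECOND_ONLY, BOTH }

    // Labels every distinct element of the two inputs with the side(s) it appears on.
    static Map<String, Side> comm(List<String> first, List<String> second) {
        Set<String> a = new LinkedHashSet<>(first);   // de-duplicate, keep input order
        Set<String> b = new LinkedHashSet<>(second);
        Map<String, Side> result = new LinkedHashMap<>();
        for (String s : a) {
            result.put(s, b.contains(s) ? Side.BOTH : Side.FIRST_ONLY);
        }
        for (String s : b) {
            if (!a.contains(s)) {
                result.put(s, Side.SECOND_ONLY);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Side> labels =
            comm(Arrays.asList("a", "b", "c"), Arrays.asList("b", "c", "d"));
        System.out.println(labels);
        // prints {a=FIRST_ONLY, b=BOTH, c=BOTH, d=SECOND_ONLY}
    }
}
```

In this sketch, the elements labeled FIRST_ONLY are the set difference of the first input and the second, and the elements labeled BOTH are the set intersection, which is how the three Set methods in the table relate to one another.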
One of the most powerful things about Crunch is that if the function you need is not provided, then it is simple to write it yourself, typically in a few lines of Java. For an example of a general-purpose function (for finding the unique values in a PTable), see