The simplest and most common operation that returns data to our driver program is
collect(), which returns the entire RDD's contents. collect() is commonly used in
unit tests where the entire contents of the RDD are expected to fit in memory, as that
makes it easy to compare the value of our RDD with our expected result. collect()
suffers from the restriction that all of your data must fit on a single machine, as it all
needs to be copied to the driver.
take(n) returns n elements from the RDD and attempts to minimize the number of
partitions it accesses, so the result may be a biased selection of the data. Note
that these operations do not necessarily return the elements in the order you might
expect. They are useful for unit tests and quick debugging, but may introduce
bottlenecks when you're dealing with large amounts of data.
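
To see why take(n) can be biased, here is a minimal plain-Python sketch of the idea. It is not Spark code: take_model and the two-partition layout are hypothetical stand-ins, assuming only that partitions are scanned in order and that scanning stops as soon as n elements have been gathered.

```python
from typing import List, TypeVar

T = TypeVar("T")


def take_model(partitions: List[List[T]], n: int) -> List[T]:
    """Hypothetical stand-in for take(n): scan partitions in order, stop early."""
    taken: List[T] = []
    for partition in partitions:
        if len(taken) >= n:
            break  # enough elements gathered, so later partitions are never read
        taken.extend(partition[: n - len(taken)])
    return taken


# Assumed layout: the data {1, 2, 3, 3} split across two partitions.
partitions = [[3, 1], [2, 3]]
first_two = take_model(partitions, 2)
print(first_two)  # [3, 1]: both elements came from the first partition
```

In this model the second partition is never touched when n is 2, which is what makes the result cheap to compute but unrepresentative of the whole dataset.
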
If there is an ordering defined on our data, we can also extract the top elements from
an RDD using top(). top() will use the default ordering on the data, but we can supply
our own comparison function to extract the top elements.
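
The difference between the default ordering and a custom one can be sketched locally with Python's heapq module. This is only a model of the behavior described above, not Spark itself: top_model is a hypothetical helper, and the custom ordering is expressed as a key function, which is an assumption about how the comparison is supplied.

```python
import heapq
from typing import Callable, List, Optional

data = [1, 2, 3, 3]


def top_model(values: List[int], num: int,
              key: Optional[Callable[[int], int]] = None) -> List[int]:
    """Hypothetical stand-in for top(): the num largest elements, largest first."""
    return heapq.nlargest(num, values, key=key)


# Default ordering: the two largest values.
largest_two = top_model(data, 2)
print(largest_two)  # [3, 3]

# Custom ordering (assumed example): negate each value so the smallest rank highest.
smallest_two = top_model(data, 2, key=lambda x: -x)
print(smallest_two)  # [1, 2]
```
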
Sometimes we need a sample of our data in our driver program. The
takeSample(withReplacement, num, seed) function allows us to take a sample of our data
either with or without replacement.
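
The distinction between sampling with and without replacement can be illustrated with a small plain-Python model. take_sample_model is a hypothetical helper that mirrors the three parameters named above; capping the result at the dataset size when sampling without replacement is an assumption of this sketch.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")


def take_sample_model(data: List[T], with_replacement: bool, num: int,
                      seed: Optional[int] = None) -> List[T]:
    """Hypothetical stand-in for takeSample(withReplacement, num, seed)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    if with_replacement:
        # Each draw is independent, so an element can be picked repeatedly.
        return [rng.choice(data) for _ in range(num)]
    # Without replacement, each position in the data is used at most once.
    return rng.sample(data, min(num, len(data)))


data = [1, 2, 3, 3]
with_repl = take_sample_model(data, True, 6, seed=42)
without_repl = take_sample_model(data, False, 6, seed=42)
print(len(with_repl))     # 6
print(len(without_repl))  # 4
```

With replacement, the sample can be larger than the dataset itself; without replacement, it can never contain more elements than the data holds.
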
Sometimes it is useful to perform an action on all of the elements in the RDD, but
without returning any result to the driver program. A good example of this would be
posting JSON to a webserver or inserting records into a database. In either case, the
foreach() action lets us perform computations on each element in the RDD without
bringing it back locally.
The remaining standard operations on a basic RDD behave much as their names
suggest. count() returns the number of elements, and countByValue() returns a map
of each unique value to its count. Table 3-4 summarizes these and other actions.
Table 3-4. Basic actions on an RDD containing {1, 2, 3, 3}
Function name    Purpose                                           Example              Result
---------------  ------------------------------------------------  -------------------  ------------------------
collect()        Return all elements from the RDD.                 rdd.collect()        {1, 2, 3, 3}
count()          Number of elements in the RDD.                    rdd.count()          4
countByValue()   Number of times each element occurs in the RDD.   rdd.countByValue()   {(1, 1), (2, 1), (3, 2)}
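
The count() and countByValue() results for {1, 2, 3, 3} can be checked with a short local sketch. This uses a plain Python list and collections.Counter as stand-ins for the RDD, so it only models what the two actions return rather than how Spark computes them.

```python
from collections import Counter

# Plain list standing in for an RDD containing {1, 2, 3, 3}.
data = [1, 2, 3, 3]

# count(): the total number of elements.
total = len(data)
print(total)  # 4

# countByValue(): a map from each unique value to its number of occurrences.
value_counts = dict(Counter(data))
print(value_counts)  # {1: 1, 2: 1, 3: 2}
```
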