Numeric RDD Operations
Spark provides several descriptive statistics operations on RDDs containing numeric
data. These are in addition to the more complex statistical and machine learning
methods we will describe later in Chapter 11.
Spark's numeric operations are implemented with a streaming algorithm that allows
for building up our model one element at a time. The descriptive statistics are all
computed in a single pass over the data and returned as a StatsCounter object by
calling stats(). Table 6-2 lists the methods available on the StatsCounter object.
Table 6-2. Summary statistics available from StatsCounter

Method             Meaning
count()            Number of elements in the RDD
mean()             Average of the elements
sum()              Total
max()              Maximum value
min()              Minimum value
variance()         Variance of the elements
sampleVariance()   Variance of the elements, computed for a sample
stdev()            Standard deviation
sampleStdev()      Sample standard deviation
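To make the single-pass idea concrete, here is a minimal plain-Python sketch of one common streaming approach: Welford-style updates, plus a merge step for combining partial results. It is not Spark's own implementation. The class name RunningStats and its method names are hypothetical and only loosely mirror the methods in Table 6-2.

```python
import math


class RunningStats:
    """Hypothetical single-pass accumulator; an illustration, not Spark's code."""

    def __init__(self):
        self.n = 0          # number of elements seen so far
        self.mu = 0.0       # running mean
        self.m2 = 0.0       # running sum of squared deviations from the mean
        self.max_value = -math.inf
        self.min_value = math.inf

    def add(self, x):
        """Fold one element into the running statistics."""
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)
        self.max_value = max(self.max_value, x)
        self.min_value = min(self.min_value, x)
        return self

    def merge(self, other):
        """Combine with statistics computed independently on other data."""
        if other.n == 0:
            return self
        if self.n == 0:
            self.n, self.mu, self.m2 = other.n, other.mu, other.m2
        else:
            total = self.n + other.n
            delta = other.mu - self.mu
            self.mu += delta * other.n / total
            self.m2 += other.m2 + delta * delta * self.n * other.n / total
            self.n = total
        self.max_value = max(self.max_value, other.max_value)
        self.min_value = min(self.min_value, other.min_value)
        return self

    def count(self):
        return self.n

    def mean(self):
        return self.mu

    def sum(self):
        return self.n * self.mu

    def variance(self):
        # Population variance: divide by n.
        return self.m2 / self.n if self.n > 0 else float("nan")

    def sample_variance(self):
        # Sample variance: divide by n - 1.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

    def stdev(self):
        return math.sqrt(self.variance())

    def sample_stdev(self):
        return math.sqrt(self.sample_variance())


def stats_of(values):
    """Build a RunningStats from an iterable in one pass."""
    acc = RunningStats()
    for v in values:
        acc.add(v)
    return acc
```

Because merge() combines two partial results without revisiting any elements, each partition of a dataset can be summarized independently and the summaries combined afterward, for example stats_of(part_a).merge(stats_of(part_b)).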
If you want to compute only one of these statistics, you can also call the corresponding
method directly on an RDD; for example, rdd.mean() or rdd.sum().
In Examples 6-19 through 6-21, we will use summary statistics to remove some outliers
from our data. Since we will be going over the same RDD twice (once to compute
the summary statistics and once to remove the outliers), we may wish to cache the
RDD. Going back to our call log example, we can remove the contact points that are
too far away from our call log.
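The two-pass logic can be sketched in plain Python, outside Spark. The three-standard-deviation cutoff, the function name remove_outliers, and the sample distances are all assumptions made for illustration; the text above says only that distant points are removed. On an RDD, the first pass would correspond to a call to stats() and the second to a filter() using the same predicate.

```python
import statistics


def remove_outliers(values, num_stdevs=3.0):
    """Keep values within num_stdevs population standard deviations of the mean.

    The default cutoff of 3.0 is an assumed, illustrative threshold.
    """
    # First pass: compute the summary statistics.
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    # Second pass: keep only the values close enough to the mean.
    return [x for x in values if abs(x - mean) <= num_stdevs * stdev]


# Hypothetical contact distances: twenty nearby points and one far away.
distances = [1.0] * 20 + [1000.0]
reasonable = remove_outliers(distances)
```

Both passes read the same input, which is why caching the RDD can pay off in the Spark version. Note that with very few data points a single extreme value can never lie more than three standard deviations from the mean, so a cutoff like this only works on reasonably sized datasets.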
 