Advanced Spark Programming - Learning Spark


Numeric RDD Operations

Spark provides several descriptive statistics operations on RDDs containing numeric data. These are in addition to the more complex statistical and machine learning methods we will describe later in Chapter 11.

Spark's numeric operations are implemented with a streaming algorithm that allows for building up our model one element at a time. The descriptive statistics are all computed in a single pass over the data and returned as a StatsCounter object by calling stats(). Table 6-2 lists the methods available on the StatsCounter object.

Table 6-2. Summary statistics available from StatsCounter

Method              Meaning

count()             Number of elements in the RDD
mean()              Average of the elements
sum()               Total
max()               Maximum value
min()               Minimum value
variance()          Variance of the elements
sampleVariance()    Variance of the elements, computed for a sample
stdev()             Standard deviation
sampleStdev()       Sample standard deviation
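
To make the single-pass idea concrete, here is a minimal pure-Python sketch of a running accumulator that produces the same kinds of statistics one element at a time. It is not Spark's implementation: the class name RunningStats and its add and merge methods are hypothetical, and the update rule used is Welford's online algorithm, chosen here as one common way to do this. The merge step is an assumption about how partial results from separate partitions could be combined.

```python
import math


class RunningStats:
    """Hypothetical one-pass accumulator (illustrative only, not Spark's class)."""

    def __init__(self):
        self.n = 0                  # number of elements seen so far
        self.mu = 0.0               # running mean
        self.m2 = 0.0               # sum of squared deviations from the mean
        self.max_value = -math.inf
        self.min_value = math.inf

    def add(self, x):
        """Fold one element into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mu
        self.mu += delta / self.n
        self.m2 += delta * (x - self.mu)
        self.max_value = max(self.max_value, x)
        self.min_value = min(self.min_value, x)
        return self

    def merge(self, other):
        """Combine with a summary built over a different chunk of the data."""
        if other.n == 0:
            return self
        total = self.n + other.n
        delta = other.mu - self.mu
        # Both formulas use the counts from before the merge.
        self.m2 += other.m2 + delta * delta * self.n * other.n / total
        self.mu = (self.mu * self.n + other.mu * other.n) / total
        self.n = total
        self.max_value = max(self.max_value, other.max_value)
        self.min_value = min(self.min_value, other.min_value)
        return self

    def count(self):
        return self.n

    def mean(self):
        return self.mu

    def sum(self):
        return self.n * self.mu

    def variance(self):
        # Population variance: divide by n.
        return self.m2 / self.n if self.n > 0 else math.nan

    def sample_variance(self):
        # Sample variance: divide by n - 1.
        return self.m2 / (self.n - 1) if self.n > 1 else math.nan

    def stdev(self):
        return math.sqrt(self.variance())

    def sample_stdev(self):
        return math.sqrt(self.sample_variance())


def summarize(values):
    """Build a summary by visiting each value exactly once."""
    stats = RunningStats()
    for v in values:
        stats.add(v)
    return stats
```

Calling summarize on each chunk of the data and then merging the results should give the same statistics as one pass over everything, which is why this style of algorithm suits data split across partitions. Note the only difference between the two variance methods: the plain variance divides by the number of elements, while the sample variance divides by one less than that.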

If you want to compute only one of these statistics, you can also call the corresponding method directly on an RDD, for example rdd.mean() or rdd.sum().

In Examples 6-19 through 6-21, we will use summary statistics to remove some outliers from our data. Since we will be going over the same RDD twice (once to compute the summary statistics and once to remove the outliers), we may wish to cache the RDD. Going back to our call log example, we can remove the contact points from our call log that are too far away.
