Example 6-19. Removing outliers in Python
import math

# Convert our RDD of strings to numeric data so we can compute stats and
# remove the outliers.
distanceNumerics = distances.map(lambda string: float(string))
# stats() computes count, mean, and standard deviation in a single pass.
stats = distanceNumerics.stats()
stddev = stats.stdev()
mean = stats.mean()
# Keep only the distances within three standard deviations of the mean.
reasonableDistances = distanceNumerics.filter(
    lambda x: math.fabs(x - mean) < 3 * stddev)
print(reasonableDistances.collect())
Example 6-20. Removing outliers in Scala
// Now we can go ahead and remove outliers, since those may have misreported locations.
// First we need to take our RDD of strings and turn it into doubles.
val distanceDoubles = distances.map(string => string.toDouble)
val stats = distanceDoubles.stats()
val stddev = stats.stdev
val mean = stats.mean
// Keep only the distances within three standard deviations of the mean.
val reasonableDistances = distanceDoubles.filter(x => math.abs(x - mean) < 3 * stddev)
println(reasonableDistances.collect().toList)
Example 6-21. Removing outliers in Java
// First we need to convert our RDD of String to a DoubleRDD so we can
// access the stats function
JavaDoubleRDD distanceDoubles = distances.mapToDouble(new DoubleFunction<String>() {
    public double call(String value) {
      return Double.parseDouble(value);
    }});
final StatCounter stats = distanceDoubles.stats();
final Double stddev = stats.stdev();
final Double mean = stats.mean();
// Keep only the distances within three standard deviations of the mean.
JavaDoubleRDD reasonableDistances =
  distanceDoubles.filter(new Function<Double, Boolean>() {
    public Boolean call(Double x) {
      return (Math.abs(x - mean) < 3 * stddev);}});
System.out.println(StringUtils.join(reasonableDistances.collect(), ","));
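All three listings apply the same rule: compute the mean and standard deviation once, then keep only the values that lie within three standard deviations of the mean. The following is a minimal sketch of that rule in plain Python, without Spark, so the arithmetic can be checked on a small list. The function name remove_outliers, the num_stddevs parameter, and the sample distances are hypothetical and chosen only for illustration. The sketch uses the population standard deviation, on the understanding that this is what stats().stdev() reports in Spark.

```python
import math
import statistics


def remove_outliers(values, num_stddevs=3.0):
    """Keep values strictly within num_stddevs standard deviations of the mean.

    Hypothetical helper that mirrors the filter step in the Spark examples.
    """
    mean = statistics.mean(values)
    # Population standard deviation (divides by n, not n - 1).
    stddev = statistics.pstdev(values)
    return [x for x in values if math.fabs(x - mean) < num_stddevs * stddev]


# Made-up sample: nineteen plausible distances and one misreported one.
distances = [10.0] * 19 + [1000.0]
reasonable = remove_outliers(distances)
print(len(distances), "values in,", len(reasonable), "values kept")
```

Because the comparison is strict, an input whose values are all identical has a standard deviation of zero and every element is dropped. A small sample can also hide an outlier, since a single extreme value inflates the standard deviation it is measured against.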
With that final piece we have completed our sample application, which uses accumulators and broadcast variables, per-partition processing, interfaces with external programs, and summary statistics. The entire source code for Python, Scala, and Java is available in src/python/ChapterSixExample.py, src/main/scala/com/oreilly/learningsparkexamples/scala/ChapterSixExample.scala, and src/main/java/com/oreilly/learningsparkexamples/java/ChapterSixExample.java, respectively.