Database Reference
In-Depth Information
RDD element from standard input as a String , manipulates that String however
you like, and then writes the result(s) as String s to standard output. The interface
and programming model is restrictive and limited, but sometimes it's just what you
need to do something like make use of a native code function within a map or filter
operation.
Most likely, you'd want to pipe an RDD's content through some external program or
script because you've already got complicated software built and tested that you'd like
to reuse with Spark. A lot of data scientists have code in R, 1 and we can interact with
R programs using pipe() .
In Example 6-15 we use an R library to compute the distance for all of the contacts.
Each element in our RDD is written out by our program with newlines as separators,
and every line that the program outputs is a string element in the resulting RDD. To
make it easy for our R program to parse the input we will reformat our data to be
mylat, mylon, theirlat, theirlon . Here we have a comma as the separator.
Example 6-15. R distance program
#!/usr/bin/env Rscript
library ( "Imap" )
f <- file ( "stdin" )
open ( f )
while ( length ( line <- readLines ( f , n = 1 )) > 0 ) {
# process line
contents <- Map ( as.numeric , strsplit ( line , "," ))
mydist <- gdist ( contents [[ 1 ]][ 1 ], contents [[ 1 ]][ 2 ],
contents [[ 1 ]][ 3 ], contents [[ 1 ]][ 4 ],
units = "m" , a = 6378137.0 , b = 6356752.3142 , verbose = FALSE )
write ( mydist , stdout ())
}
If that is written to an executable file named ./src/R/finddistance.R , then it looks like
this in use:
$ ./src/R/finddistance.R
37.75889318222431,-122.42683635321838,37.7614213,-122.4240097
349.2602
coffee
NA
ctrl-d
So far, so good—we've now got a way to transform every line from stdin into output
on stdout . Now we need to make finddistance.R available to each of our worker
1 The SparkR project also provides a lightweight frontend to use Spark from within R.
 
Search WWH ::




Custom Search