>>> pythonLines.first()
u'## Interactive Python Shell'
Example 2-5. Scala filtering example
scala> val lines = sc.textFile("README.md") // Create an RDD called lines
lines: spark.RDD[String] = MappedRDD[...]
scala> val pythonLines = lines.filter(line => line.contains("Python"))
pythonLines: spark.RDD[String] = FilteredRDD[...]
scala> pythonLines.first()
res0: String = ## Interactive Python Shell
Passing Functions to Spark
If you are unfamiliar with the lambda or => syntax in Examples 2-4 and 2-5, it is a shorthand way to define functions inline in Python and Scala. When using Spark in these languages, you can also define a function separately and then pass its name to Spark. For example, in Python:
def hasPython(line):
    return "Python" in line

pythonLines = lines.filter(hasPython)
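The same pattern applies in Scala. The snippet below is a small sketch (not from the text) that mirrors the Python version, reusing the name hasPython and the lines RDD from Example 2-5:

def hasPython(line: String): Boolean = line.contains("Python")  // named predicate

val pythonLines = lines.filter(hasPython)  // pass the function by name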
Passing functions to Spark is also possible in Java, but in this case they are defined as classes, implementing an interface called Function. For example:
JavaRDD<String> pythonLines = lines.filter(
    new Function<String, Boolean>() {
        public Boolean call(String line) { return line.contains("Python"); }
    }
);
Java 8 introduces shorthand syntax called lambdas that looks similar to Python and
Scala. Here is how the code would look with this syntax:
JavaRDD<String> pythonLines = lines.filter(line -> line.contains("Python"));
We discuss passing functions further in “Passing Functions to Spark” on page 30.
While we will cover the Spark API in more detail later, a lot of its magic is that function-based operations like filter also parallelize across the cluster. That is, Spark automatically takes your function (e.g., line.contains("Python")) and ships it to executor nodes. Thus, you can write code in a single driver program and automatically have parts of it run on multiple nodes. Chapter 3 covers the RDD API in detail.
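To make the driver/executor split concrete, here is a minimal sketch of a standalone Scala application that runs the same filter as a distributed job. The object name FilterExample and the SparkConf/SparkContext setup are illustrative assumptions, not code from this chapter:

import org.apache.spark.{SparkConf, SparkContext}

object FilterExample {
  def main(args: Array[String]): Unit = {
    // The driver program defines the RDDs and transformations...
    val conf = new SparkConf().setAppName("FilterExample")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("README.md")                  // distributed collection of lines
    val pythonLines = lines.filter(_.contains("Python"))  // predicate shipped to executor nodes

    // ...and an action such as count() triggers parallel execution on the cluster.
    println(pythonLines.count())
    sc.stop()
  }
}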
 