The simplest way to create RDDs is to take an existing collection in your program
and pass it to SparkContext's parallelize() method, as shown in Examples 3-5
through 3-7. This approach is very useful when you are learning Spark, since you can
quickly create your own RDDs in the shell and perform operations on them. Keep in
mind, however, that outside of prototyping and testing, this is not widely used, since
it requires that you have your entire dataset in memory on one machine.
Example 3-5. parallelize() method in Python
lines = sc.parallelize(["pandas", "i like pandas"])
Example 3-6. parallelize() method in Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))
Example 3-7. parallelize() method in Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));
A more common way to create RDDs is to load data from external storage. Loading
external datasets is covered in detail in Chapter 5. However, we already saw one
method that loads a text file as an RDD of strings, SparkContext.textFile(), which
is shown in Examples 3-8 through 3-10.
Example 3-8. textFile() method in Python
lines = sc.textFile("/path/to/README.md")
Example 3-9. textFile() method in Scala
val lines = sc.textFile("/path/to/README.md")
Example 3-10. textFile() method in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
RDD Operations
As we've discussed, RDDs support two types of operations: transformations and
actions. Transformations are operations on RDDs that return a new RDD, such as
map() and filter(). Actions are operations that return a result to the driver
program or write it to storage, and kick off a computation, such as count() and first().
Spark treats transformations and actions very differently, so understanding which
type of operation you are performing will be important. If you are ever confused