The simplest way to create RDDs is to take an existing collection in your program
and pass it to SparkContext's parallelize() method, as shown in Examples 3-5
through 3-7. This approach is very useful when you are learning Spark, since you can
quickly create your own RDDs in the shell and perform operations on them. Keep in
mind, however, that outside of prototyping and testing, this is not widely used, since
it requires that you have your entire dataset in memory on one machine.
Example 3-5. parallelize() method in Python
lines = sc.parallelize(["pandas", "i like pandas"])
Example 3-6. parallelize() method in Scala
val lines = sc.parallelize(List("pandas", "i like pandas"))
Example 3-7. parallelize() method in Java
JavaRDD<String> lines = sc.parallelize(Arrays.asList("pandas", "i like pandas"));
A more common way to create RDDs is to load data from external storage. Loading
external datasets is covered in detail in Chapter 5. However, we already saw one
method that loads a text file as an RDD of strings, SparkContext.textFile(), which
is shown in Examples 3-8 through 3-10.
Example 3-8. textFile() method in Python
lines = sc.textFile("/path/to/README.md")
Example 3-9. textFile() method in Scala
val lines = sc.textFile("/path/to/README.md")
Example 3-10. textFile() method in Java
JavaRDD<String> lines = sc.textFile("/path/to/README.md");
RDD Operations
As we've discussed, RDDs support two types of operations: transformations and
actions. Transformations are operations on RDDs that return a new RDD, such as
map() and filter(). Actions are operations that return a result to the driver
program or write it to storage, and kick off a computation, such as count() and first().
Spark treats transformations and actions very differently, so understanding which
type of operation you are performing will be important. If you are ever confused