There are also multiple other actions on pair RDDs that save the RDD, which we will
describe in Chapter 5.
Data Partitioning (Advanced)
The final Spark feature we will discuss in this chapter is how to control datasets' partitioning across nodes. In a distributed program, communication is very expensive,
so laying out data to minimize network traffic can greatly improve performance.
Much like how a single-node program needs to choose the right data structure for a
collection of records, Spark programs can choose to control their RDDs' partitioning
to reduce communication. Partitioning will not be helpful in all applications—for
example, if a given RDD is scanned only once, there is no point in partitioning it in
advance. It is useful only when a dataset is reused multiple times in key-oriented
operations such as joins. We will give some examples shortly.
Spark's partitioning is available on all RDDs of key/value pairs, and causes the system
to group elements based on a function of each key. Although Spark does not give
explicit control of which worker node each key goes to (partly because the system is
designed to work even if specific nodes fail), it lets the program ensure that a set of
keys will appear together on some node. For example, you might choose to hash-
partition an RDD into 100 partitions so that keys that have the same hash value modulo 100 appear on the same node. Or you might range-partition the RDD into sorted
ranges of keys so that elements with keys in the same range appear on the same node.
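To make this concrete, here is a minimal sketch (not one of this chapter's numbered examples) that hash-partitions a small pair RDD into 100 partitions using partitionBy() and Spark's HashPartitioner; the application name, local master, and toy data are placeholders:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch").setMaster("local[*]"))

// Toy pair RDD: keys 1 and 101 are equal modulo 100, so HashPartitioner(100)
// places them in the same partition.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (101, "c")))
val hashed = pairs.partitionBy(new HashPartitioner(100)).persist()

println(hashed.partitioner) // Some(org.apache.spark.HashPartitioner@...)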
As a simple example, consider an application that keeps a large table of user information in memory—say, an RDD of (UserID, UserInfo) pairs, where UserInfo contains a list of topics the user is subscribed to. The application periodically combines
this table with a smaller file representing events that happened in the past five
minutes—say, a table of (UserID, LinkInfo) pairs for users who have clicked a link
on a website in those five minutes. For example, we may wish to count how many
users visited a link that was not to one of their subscribed topics. We can perform this
combination with Spark's join() operation, which can be used to group the UserInfo and LinkInfo pairs for each UserID by key. Our application would look like Example 4-22.
Example 4-22. Scala simple application
// Initialization code; we load the user info from a Hadoop SequenceFile on HDFS.
// This distributes elements of userData by the HDFS block where they are found,
// and doesn't provide Spark with any way of knowing in which partition a
// particular UserID is located.
val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()
// Function called periodically to process a logfile of events in the past 5 minutes;
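// we assume this is a SequenceFile containing (UserID, LinkInfo) pairs.
// (The remainder of the example is sketched here under the assumption that
// UserInfo carries a `topics` collection and LinkInfo a `topic` field.)
def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}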