There are also multiple other actions on pair RDDs that save the RDD, which we will
describe in Chapter 5.
Data Partitioning (Advanced)
The final Spark feature we will discuss in this chapter is how to control datasets' partitioning across nodes. In a distributed program, communication is very expensive,
so laying out data to minimize network traffic can greatly improve performance.
Much like how a single-node program needs to choose the right data structure for a
collection of records, Spark programs can choose to control their RDDs' partitioning
to reduce communication. Partitioning will not be helpful in all applications—for
example, if a given RDD is scanned only once, there is no point in partitioning it in
advance. It is useful only when a dataset is reused multiple times in key-oriented
operations such as joins. We will give some examples shortly.
Spark's partitioning is available on all RDDs of key/value pairs, and causes the system
to group elements based on a function of each key. Although Spark does not give
explicit control of which worker node each key goes to (partly because the system is
designed to work even if specific nodes fail), it lets the program ensure that a set of
keys will appear together on some node. For example, you might choose to hash-
partition an RDD into 100 partitions so that keys that have the same hash value modulo 100 appear on the same node. Or you might range-partition the RDD into sorted
ranges of keys so that elements with keys in the same range appear on the same node.
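To make this concrete, here is a minimal sketch (not one of this chapter's numbered examples) that hash-partitions a small pair RDD into 100 partitions using partitionBy() and Spark's HashPartitioner; the application name, local master, and toy data are placeholders:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PartitionSketch").setMaster("local[*]"))

// Toy pair RDD: keys 1 and 101 are equal modulo 100, so HashPartitioner(100)
// places them in the same partition.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (101, "c")))
val hashed = pairs.partitionBy(new HashPartitioner(100)).persist()

println(hashed.partitioner) // Some(org.apache.spark.HashPartitioner@...)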
As a simple example, consider an application that keeps a large table of user information in memory—say, an RDD of (UserID, UserInfo) pairs, where UserInfo contains a list of topics the user is subscribed to. The application periodically combines
this table with a smaller file representing events that happened in the past five
minutes—say, a table of (UserID, LinkInfo) pairs for users who have clicked a link
on a website in those five minutes. For example, we may wish to count how many
users visited a link that was not to one of their subscribed topics. We can perform this
combination with Spark's join() operation, which can be used to group the UserInfo and LinkInfo pairs for each UserID by key. Our application would look like Example 4-22.
Example 4-22. Scala simple application
// Initialization code; we load the user info from a Hadoop SequenceFile on HDFS.
// This distributes elements of userData by the HDFS block where they are found,
// and doesn't provide Spark with any way of knowing in which partition a
// particular UserID is located.
val sc = new SparkContext(...)
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...").persist()
// Function called periodically to process a logfile of events in the past 5 minutes;
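// we assume this is a SequenceFile containing (UserID, LinkInfo) pairs.
// (The remainder of the example is sketched here under the assumption that
// UserInfo carries a `topics` collection and LinkInfo a `topic` field.)
def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events) // RDD of (UserID, (UserInfo, LinkInfo)) pairs
  val offTopicVisits = joined.filter {
    case (userId, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)
  }.count()
  println("Number of visits to non-subscribed topics: " + offTopicVisits)
}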