To demonstrate Spark's phases of execution, we'll walk through an example application and see how user code compiles down to a lower-level execution plan. The application we'll consider is a simple bit of log analysis in the Spark shell. For input data, we'll use a text file that consists of log messages of varying degrees of severity, along with some blank lines interspersed (Example 8-6).
Example 8-6. input.txt, the source file for our example
## input.txt ##
INFO This is a message with content
INFO This is some other content
(empty line)
INFO Here are more messages
WARN This is a warning
(empty line)
ERROR Something bad happened
WARN More details on the bad thing
INFO back to normal messages
We want to open this file in the Spark shell and compute how many log messages appear at each level of severity. First let's create a few RDDs that will help us answer this question, as shown in Example 8-7.
Example 8-7. Processing text data in the Scala Spark shell
// Read input file
scala> val input = sc.textFile("input.txt")
// Split into words and drop empty lines
// (note: "".split(" ") yields Array(""), so we test the first token
// rather than the array's size)
scala> val tokenized = input.
     |   map(line => line.split(" ")).
     |   filter(words => words(0).nonEmpty)
// Extract the first word from each line (the log level) and do a count
scala> val counts = tokenized.
     |   map(words => (words(0), 1)).
     |   reduceByKey { (a, b) => a + b }
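To see what the map + reduceByKey pipeline computes, here is a sketch of the same logic on a plain local Scala collection, with no Spark involved; `groupBy` plus a per-key sum stands in for `reduceByKey`, and the input lines mirror Example 8-6:

```scala
val lines = List(
  "INFO This is a message with content",
  "INFO This is some other content",
  "",
  "INFO Here are more messages",
  "WARN This is a warning",
  "",
  "ERROR Something bad happened",
  "WARN More details on the bad thing",
  "INFO back to normal messages")

val counts = lines
  .map(_.split(" "))
  .filter(words => words(0).nonEmpty)   // drops the empty lines
  .map(words => (words(0), 1))
  .groupBy(_._1)                        // plays the role of reduceByKey's grouping
  .map { case (level, pairs) => (level, pairs.map(_._2).sum) }
// counts == Map("INFO" -> 4, "WARN" -> 2, "ERROR" -> 1)
```

The difference in Spark is that each step is recorded lazily as a new RDD rather than computed immediately, and `reduceByKey` merges values per key across partitions instead of materializing the groups.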
This sequence of commands results in an RDD, counts, that will contain the number of log entries at each level of severity. After executing these lines in the shell, the program has not performed any actions. Instead, it has implicitly defined a directed acyclic graph (DAG) of RDD objects that will be used later once an action occurs. Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship they have. For instance, when you call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a. These pointers allow an RDD to be traced to all of its ancestors.
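The parent-pointer idea can be illustrated with a minimal sketch in plain Scala; these are hypothetical classes for illustration, not Spark's actual internals:

```scala
// Each node records its name and an optional parent, so the full
// ancestry can be recovered by walking the parent pointers.
case class Node(name: String, parent: Option[Node]) {
  def lineage: List[String] =
    name :: parent.map(_.lineage).getOrElse(Nil)
}

val input     = Node("input", None)
val tokenized = Node("map", Some(input))
val counts    = Node("reduceByKey", Some(tokenized))
// counts.lineage == List("reduceByKey", "map", "input")
```

Spark's real RDDs store richer metadata alongside each parent pointer (for example, whether the dependency is narrow or wide), but the tracing mechanism is the same: follow the pointers back to the source.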
To display the lineage of an RDD, Spark provides a toDebugString() method. In Example 8-8, we'll look at some of the RDDs we created in the preceding example.