To demonstrate Spark's phases of execution, we'll walk through an example application and see how user code compiles down to a lower-level execution plan. The application we'll consider is a simple bit of log analysis in the Spark shell. For input data, we'll use a text file that consists of log messages of varying degrees of severity, along with some blank lines interspersed (Example 8-6).
Example 8-6. input.txt, the source file for our example
## input.txt ##
INFO This is a message with content
INFO This is some other content
(empty line)
INFO Here are more messages
WARN This is a warning
(empty line)
ERROR Something bad happened
WARN More details on the bad thing
INFO back to normal messages
We want to open this file in the Spark shell and compute how many log messages appear at each level of severity. First let's create a few RDDs that will help us answer this question, as shown in Example 8-7.
Example 8-7. Processing text data in the Scala Spark shell
// Read input file
scala> val input = sc.textFile("input.txt")
// Split into words and drop empty lines
// (note: "".split(" ") yields Array(""), so we test the first token
// rather than the array's size)
scala> val tokenized = input.
     |   map(line => line.split(" ")).
     |   filter(words => words(0).nonEmpty)
// Extract the first word from each line (the log level) and do a count
scala> val counts = tokenized.
     |   map(words => (words(0), 1)).
     |   reduceByKey { (a, b) => a + b }
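To see what the map + reduceByKey pipeline computes, here is a sketch of the same logic on a plain local Scala collection, with no Spark involved; `groupBy` plus a per-key sum stands in for `reduceByKey`, and the input lines mirror Example 8-6:

```scala
val lines = List(
  "INFO This is a message with content",
  "INFO This is some other content",
  "",
  "INFO Here are more messages",
  "WARN This is a warning",
  "",
  "ERROR Something bad happened",
  "WARN More details on the bad thing",
  "INFO back to normal messages")

val counts = lines
  .map(_.split(" "))
  .filter(words => words(0).nonEmpty)   // drops the empty lines
  .map(words => (words(0), 1))
  .groupBy(_._1)                        // plays the role of reduceByKey's grouping
  .map { case (level, pairs) => (level, pairs.map(_._2).sum) }
// counts == Map("INFO" -> 4, "WARN" -> 2, "ERROR" -> 1)
```

The difference in Spark is that each step is recorded lazily as a new RDD rather than computed immediately, and `reduceByKey` merges values per key across partitions instead of materializing the groups.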
This sequence of commands results in an RDD, counts, that will contain the number of log entries at each level of severity. After executing these lines in the shell, the program has not performed any actions. Instead, it has implicitly defined a directed acyclic graph (DAG) of RDD objects that will be used later once an action occurs. Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship they have. For instance, when you call val b = a.map() on an RDD, the RDD b keeps a reference to its parent a. These pointers allow an RDD to be traced to all of its ancestors.
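The parent-pointer idea can be illustrated with a minimal sketch in plain Scala; these are hypothetical classes for illustration, not Spark's actual internals:

```scala
// Each node records its name and an optional parent, so the full
// ancestry can be recovered by walking the parent pointers.
case class Node(name: String, parent: Option[Node]) {
  def lineage: List[String] =
    name :: parent.map(_.lineage).getOrElse(Nil)
}

val input     = Node("input", None)
val tokenized = Node("map", Some(input))
val counts    = Node("reduceByKey", Some(tokenized))
// counts.lineage == List("reduceByKey", "map", "input")
```

Spark's real RDDs store richer metadata alongside each parent pointer (for example, whether the dependency is narrow or wide), but the tracing mechanism is the same: follow the pointers back to the source.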
To display the lineage of an RDD, Spark provides a toDebugString() method. In Example 8-8, we'll look at some of the RDDs we created in the preceding example.