An Example
To introduce Spark, let's run an interactive session using spark-shell, which is a Scala REPL with a few Spark additions. Start up the shell with the following:

% spark-shell
Spark context available as sc.

scala>
From the console output, we can see that the shell has created a Scala variable, sc, to store the SparkContext instance. This is our entry point to Spark, and it allows us to load a text file as follows:

scala> val lines = sc.textFile("input/ncdc/micro-tab/sample.txt")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
The lines variable is a reference to a Resilient Distributed Dataset, abbreviated to RDD, which is the central abstraction in Spark: a read-only collection of objects that is partitioned across multiple machines in a cluster. In a typical Spark program, one or more RDDs are loaded as input and, through a series of transformations, are turned into a set of target RDDs, which have an action performed on them (such as computing a result or writing them to persistent storage). The term "resilient" in "Resilient Distributed Dataset" refers to the fact that Spark can automatically reconstruct a lost partition by recomputing it from the RDDs it was computed from.
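The load, transform, action shape of a typical Spark program can be sketched with plain Scala collections standing in for RDDs. This is only a loose analogy (real RDDs are partitioned across a cluster, and the sample line contents here are made up for illustration), but the pipeline structure is the same:

```scala
object PipelineShape {
  def main(args: Array[String]): Unit = {
    // "Load": an input dataset (in Spark this would come from sc.textFile(...)).
    val input = List("1950\t22\t1", "1949\t111\t1", "1949\t78\t1")
    // "Transformations": each step derives a new dataset from an existing one.
    val records = input.map(_.split("\t"))
    val temps = records.map(r => r(1).toInt)
    // "Action": produce a concrete result, ending the pipeline.
    val max = temps.max
    println(max) // prints 111
  }
}
```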
NOTE
Loading an RDD or performing a transformation on one does not trigger any data processing; it merely creates a plan for performing a computation. The computation is only triggered when an action (like foreach()) is performed on an RDD.
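This kind of laziness can be observed without Spark at all, using Scala's LazyList (Scala 2.13+) as a rough analogy. In the hypothetical sketch below, the map transformation records how often its function runs: the count stays at zero until foreach, the "action", forces the pipeline.

```scala
object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    val lines = LazyList("1950\t22\t1", "1949\t111\t1") // made-up sample lines
    // A "transformation": builds a plan, computes nothing yet.
    val fields = lines.map { line => evaluations += 1; line.split("\t") }
    assert(evaluations == 0)       // the map function has not run
    // An "action": forces the whole pipeline.
    fields.foreach(f => println(f(0)))
    assert(evaluations == 2)       // now every element was processed
  }
}
```

Spark's RDDs behave analogously: transformations such as map() only extend the lineage graph, and actions such as foreach() trigger the actual computation.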
Let's continue with the program. The first transformation we want to perform is to split the lines into fields:

scala> val records = lines.map(_.split("\t"))
records: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
This uses the map() method on RDD to apply a function to every element in the RDD. In this case, we split each line (a String) into a Scala Array of Strings.
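The function passed to map() here is ordinary Scala, so the splitting step can be tried outside Spark. A minimal sketch, using a hypothetical tab-separated line (the field values are illustrative, not taken from the real sample file):

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    val line = "1950\t22\t1" // one hypothetical tab-separated record
    // The same function used in lines.map(_.split("\t")) above:
    val fields: Array[String] = line.split("\t")
    assert(fields.sameElements(Array("1950", "22", "1")))
    println(fields.mkString(",")) // prints 1950,22,1
  }
}
```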