Spark operates with three core ideas:
Resilient Distributed Dataset (RDD)
RDDs contain the data you want to transform or analyze. They can either be read
from an external source, such as a file or a database, or they can be created by a
transformation.
Transformation
A transformation builds a new RDD from an existing one. For example, a filter
that pulls ERROR messages out of a log file would be a transformation.
Action
An action analyzes an RDD and returns a single result. For example, an action would
count the number of results identified by our ERROR filter, as in the sketch below.
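To make these three ideas concrete, here is a minimal sketch in the Spark shell, using the same textFile/filter/count calls as the example later in this section; the log path hdfs://logs.txt is hypothetical:
// Read a log file into an RDD (the path here is invented for illustration)
scala> val logs = spark.textFile("hdfs://logs.txt")
// Transformation: derive a new RDD holding only the ERROR lines
scala> val errors = logs.filter(line => line.contains("ERROR"))
// Action: count those lines, returning a single result
scala> errors.count()
Nothing is computed until the action runs; Spark evaluates transformations lazily and only does the work when an action asks for a result.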
If you want to do any significant work in Spark, you would be wise to learn Scala, a
language that combines object orientation with functional programming. Because Lisp
is an older functional programming language, Scala might be called "Lisp joins the
21st century." This is not to say that Scala is the only way to work with Spark. The
project also has strong support for Java and Python, but when new APIs or features
are added, they appear first in Scala.
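As a tiny illustration of the functional style Spark's Scala API leans on, functions are values that can be passed straight into methods; the names below are invented for the example:
// Pass a function to a collection method, much as Spark's filter and map expect
val lengths = List("spark", "scala", "rdd").map(word => word.length)
// lengths: List[Int] = List(5, 5, 3)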
Tutorial Links
A quick start for Spark can be found on the project home page.
Example Code
We'll start by opening the Spark shell, running ./bin/spark-shell from the directory
where we installed Spark.
In this example, we're going to count the number of Dune reviews in our review file:
// Read the CSV file containing our reviews
scala> val reviews = spark.textFile("hdfs://reviews.csv")
reviews: spark.RDD[String] = spark.MappedRDD@3d7e837f
// This is a two-part operation:
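// first we'll filter down to the lines that mention Dune,
// then we'll count those lines (a sketch of the likely completion;
// the name duneCount is invented here)
scala> val duneCount = reviews.filter(line => line.contains("Dune")).count()
Here the filter is a transformation and count is the action that triggers it; until count runs, Spark hasn't actually scanned the file.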