Spark operates with three core ideas:
Resilient Distributed Dataset (RDD)
RDDs contain the data you want to transform or analyze. They can either be read
from an external source, such as a file or a database, or they can be created by a
transformation.
Transformation
A transformation builds a new RDD from an existing one. For example, a filter
that pulls ERROR messages out of a log file would be a transformation.
Action
An action analyzes an RDD and returns a single result. For example, an action would
count the number of results identified by our ERROR filter, as in the sketch below.
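To make these three ideas concrete, here is a minimal sketch in the Spark shell, using the same textFile/filter/count calls as the example later in this section; the log path hdfs://logs.txt is hypothetical:
// Read a log file into an RDD (the path here is invented for illustration)
scala> val logs = spark.textFile("hdfs://logs.txt")
// Transformation: derive a new RDD holding only the ERROR lines
scala> val errors = logs.filter(line => line.contains("ERROR"))
// Action: count those lines, returning a single result
scala> errors.count()
Nothing is computed until the action runs; Spark evaluates transformations lazily and only does the work when an action asks for a result.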
If you want to do any significant work in Spark, you would be wise to learn Scala, a
language that combines object orientation with functional programming. Because Lisp
is an older functional programming language, Scala might be called "Lisp joins the
21st century." This is not to say that Scala is the only way to work with Spark. The
project also has strong support for Java and Python, but when new APIs or features
are added, they appear first in Scala.
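As a tiny illustration of the functional style Spark's Scala API leans on, functions are values that can be passed straight into methods; the names below are invented for the example:
// Pass a function to a collection method, much as Spark's filter and map expect
val lengths = List("spark", "scala", "rdd").map(word => word.length)
// lengths: List[Int] = List(5, 5, 3)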
Tutorial Links
A quick start for Spark can be found on the project home page.
Example Code
We'll start by opening the Spark shell, running ./bin/spark-shell from the directory
where we installed Spark.
In this example, we're going to count the number of Dune reviews in our review file:
// Read the CSV file containing our reviews
scala> val reviews = spark.textFile("hdfs://reviews.csv")
reviews: spark.RDD[String] = spark.MappedRDD@3d7e837f
// This is a two-part operation:
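// first we'll filter down to the lines that mention Dune,
// then we'll count those lines (a sketch of the likely completion;
// the name duneCount is invented here)
scala> val duneCount = reviews.filter(line => line.contains("Dune")).count()
Here the filter is a transformation and count is the action that triggers it; until count runs, Spark hasn't actually scanned the file.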