    val files: RDD[(String, String)] = sc.wholeTextFiles(inputPath)
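As a rough illustration of the pairs that wholeTextFiles() produces, the following sketch creates two small files and counts the lines in each. The temporary directory, the file names, and the lineCounts value are hypothetical and chosen only for this example; it assumes spark-core is on the classpath and uses a local master.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("whole-text-files-sketch").setMaster("local[2]"))

// Hypothetical input: two small files in a fresh temporary directory.
val dir = Files.createTempDirectory("small-files")
Files.write(dir.resolve("a.txt"), "first\nfile\n".getBytes(StandardCharsets.UTF_8))
Files.write(dir.resolve("b.txt"), "second\n".getBytes(StandardCharsets.UTF_8))

// Each element is a pair of (file path, entire file contents).
val files = sc.wholeTextFiles(dir.toUri.toString)

// Count the lines in each file, keyed by the file's base name.
val lineCounts = files
  .map { case (path, content) =>
    (path.substring(path.lastIndexOf('/') + 1), content.split("\n").length)
  }
  .collect()
  .toMap
```

Because every file becomes a single record, this approach is only suitable when each file fits comfortably in memory.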
Spark can work with other file formats besides text. For example, sequence files can be
read with:
    sc.sequenceFile[IntWritable, Text](inputPath)
Notice how the sequence file's key and value Writable types have been specified. For common Writable types, Spark can map them to the Java equivalents, so we could use the equivalent form:
    sc.sequenceFile[Int, String](inputPath)
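To see the two forms side by side, here is a minimal sketch that writes a small sequence file and reads it back both ways. The temporary path and the sample pairs are hypothetical; the sketch assumes spark-core is on the classpath and runs with a local master.

```scala
import java.nio.file.Files
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sequence-file-sketch").setMaster("local[2]"))

// Hypothetical output location; the target directory must not exist before saving.
val seqPath = Files.createTempDirectory("seq-sketch").resolve("numbers").toUri.toString

// Write (Int, String) pairs; Spark converts them to IntWritable and Text.
sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three"))).saveAsSequenceFile(seqPath)

// Form 1: explicit Writable types. Hadoop reuses Writable instances while
// reading, so copy the values out before collecting them.
val viaWritables = sc.sequenceFile[IntWritable, Text](seqPath)
  .map { case (k, v) => (k.get, v.toString) }
  .collect()
  .sorted

// Form 2: plain types; Spark performs the Writable conversion itself.
val viaNativeTypes = sc.sequenceFile[Int, String](seqPath).collect().sorted
```

Both reads should yield the same pairs; the second form simply saves the explicit conversion step.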
There are two methods for creating RDDs from an arbitrary Hadoop InputFormat: hadoopFile() for file-based formats that expect a path, and hadoopRDD() for those that don't, such as HBase's TableInputFormat. These methods are for the old MapReduce API; for the new one, use newAPIHadoopFile() and newAPIHadoopRDD(). Here is an example of reading an Avro datafile using the Specific API with a WeatherRecord class:
    val job = new Job()
    AvroJob.setInputKeySchema(job, WeatherRecord.getClassSchema)
    val data = sc.newAPIHadoopFile(inputPath,
        classOf[AvroKeyInputFormat[WeatherRecord]],
        classOf[AvroKey[WeatherRecord]], classOf[NullWritable],
        job.getConfiguration)
In addition to the path, the newAPIHadoopFile() method expects the InputFormat type, the key type, and the value type, plus the Hadoop configuration. The configuration carries the Avro schema, which we set in the second line using the AvroJob helper class.
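The same call shape applies to other input formats. As a sketch that avoids the Avro dependency, the following reads a plain text file through both the old-API hadoopFile() and the new-API newAPIHadoopFile(), using Hadoop's two TextInputFormat classes. The temporary file and its contents are hypothetical, and the import aliases are introduced here only to tell the two classes apart.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("hadoop-input-format-sketch").setMaster("local[2]"))

// Hypothetical input file with two lines.
val textFile = Files.createTempFile("lines", ".txt")
Files.write(textFile, "alpha\nbeta\n".getBytes(StandardCharsets.UTF_8))
val textPath = textFile.toUri.toString

// Old MapReduce API (org.apache.hadoop.mapred): path, InputFormat, key type, value type.
val oldApiLines = sc
  .hadoopFile(textPath, classOf[OldTextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, line) => line.toString } // copy out of the reused Text object
  .collect()

// New MapReduce API (org.apache.hadoop.mapreduce): same arguments plus a configuration.
val newApiLines = sc
  .newAPIHadoopFile(textPath, classOf[NewTextInputFormat], classOf[LongWritable],
    classOf[Text], sc.hadoopConfiguration)
  .map { case (_, line) => line.toString }
  .collect()
```

In both cases the key is the byte offset of the line and the value is the line itself, so the sketch discards the key and converts the Text value to a String before collecting.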
The third way of creating an RDD is by transforming an existing RDD. We look at transformations next.
Transformations and Actions
Spark provides two categories of operations on RDDs: transformations and actions. A transformation generates a new RDD from an existing one, while an action triggers a computation on an RDD and does something with the results: it either returns them to the user or saves them to external storage.
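The distinction between the two categories can be sketched on a small in-memory dataset. The RDD names below are hypothetical; the sketch assumes spark-core is on the classpath and uses a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local context for experimentation; in spark-shell this returns the existing context.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("transformations-and-actions-sketch").setMaster("local[2]"))

val numbers = sc.parallelize(1 to 5)

// Transformations: each one returns a new RDD derived from an existing one.
val squares = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// Actions: each one runs a computation and returns a result to the driver.
val total = squares.reduce(_ + _) // 1 + 4 + 9 + 16 + 25 = 55
val evens = evenSquares.collect() // Array(4, 16)
```

A quick way to tell the two apart is the return type: map() and filter() return RDDs, whereas reduce() and collect() return ordinary values.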
Actions have an immediate effect, but transformations do not — they are lazy, in the sense
that they don't perform any work until an action is performed on the transformed RDD.
For example, the following lowercases lines in a text file: