We will use the wholeTextFiles method to access the location of each file. Using these file paths, we will write custom code to load and process the images. In the following example code, we will use PATH to refer to the directory in which you extracted the lfw subdirectory.
We can use a wildcard path specification (the * character at the end of the path in the following code snippet) to tell Spark to look in each directory under the lfw directory for files:
val path = "/PATH/lfw/*"
val rdd = sc.wholeTextFiles(path)
val first = rdd.first
println(first)
Running the first command might take a little time, as Spark must first scan the specified directory structure for all available files. Once it completes, you should see output similar to the following:
first: (String, String) = (file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg,����??JFIF????? ...
You will see that wholeTextFiles returns an RDD of key-value pairs, where the key is the file location and the value is the content of the entire file, read as text. For our purposes, we only care about the file path, as we cannot work directly with the image data as a string (notice that it is displayed as "binary nonsense" in the shell output).
Let's extract the file paths from the RDD. Note that in the earlier output, the file path starts with the file: prefix. Spark uses this prefix when reading files in order to differentiate between filesystems (for example, file:// for the local filesystem, hdfs:// for HDFS, s3n:// for Amazon S3, and so on).
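To make the role of the prefix concrete, the following is a minimal sketch in plain Scala (no Spark required) that uses the standard java.net.URI class to split a location into its scheme and its path. The object and helper names (UriSchemeDemo, schemeOf, pathOf) and the HDFS host and port are hypothetical, chosen only for illustration. This is an alternative way to inspect the prefix, not the approach this example uses.

```scala
import java.net.URI

object UriSchemeDemo {
  // Returns the URI scheme (for example "file" or "hdfs"), or None if absent.
  def schemeOf(location: String): Option[String] =
    Option(new URI(location).getScheme)

  // Returns only the path component, without the scheme or any host and port.
  // Note: new URI(...) throws URISyntaxException on illegal characters such as spaces.
  def pathOf(location: String): String =
    new URI(location).getPath

  def main(args: Array[String]): Unit = {
    // Hypothetical locations; "namenode:8020" is a placeholder host and port.
    val locations = Seq(
      "file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg",
      "hdfs://namenode:8020/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg"
    )
    locations.foreach { loc =>
      val scheme = schemeOf(loc).getOrElse("(none)")
      val path = pathOf(loc)
      println(scheme + " -> " + path)
    }
  }
}
```

Both locations yield the same path component here. Only the scheme, and for HDFS the host and port, identifies which filesystem the file lives on.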
In our case, we will be using custom code to read the images, so we don't need the file: prefix. Thus, we will remove it with the following map function:
// Keep only the key (the file location) and drop the "file:" prefix from it
val files = rdd.map { case (fileName, content) =>
  fileName.replace("file:", "")
}
println(files.first)
This should display the file location with the file: prefix removed: