We will use the wholeTextFiles method to access the location of each file. Using these file paths, we will write custom code to load and process the images. In the following example code, we will use PATH to refer to the directory in which you extracted the lfw subdirectory.
We can use a wildcard path specification (the * character in the following code snippet) to tell Spark to look in each directory under the lfw directory for files:
val path = "/PATH/lfw/*"
val rdd = sc.wholeTextFiles(path)
val first = rdd.first
println(first)
Running the first command might take a little time, as Spark first scans the specified directory structure for all available files. Once it completes, you should see output similar to the following:
first: (String, String) = (file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg, ����JFIF... )
You will see that wholeTextFiles returns an RDD that contains key-value pairs, where the key is the file location and the value is the content of the entire text file. For our purposes, we only care about the file path, as we cannot work directly with the image data as a string (notice that it is displayed as "binary nonsense" in the shell output).
Let's extract the file paths from the RDD. Note that, in the earlier output, each file path starts with the file: prefix. This is used by Spark when reading files in order to differentiate between different filesystems (for example, file:// for the local filesystem, hdfs:// for HDFS, s3n:// for Amazon S3, and so on).
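As a purely illustrative aside (the HDFS path used here is hypothetical and not part of the original example), the same method can read files from HDFS simply by using the appropriate scheme in the path:
// Illustrative only: wholeTextFiles works the same way against HDFS,
// assuming the lfw data has been copied there first (path is hypothetical).
val hdfsRdd = sc.wholeTextFiles("hdfs:///user/someuser/lfw/*")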
In our case, we will be using custom code to read the images, so we don't need this part of
the path. Thus, we will remove it with the following map function:
val files = rdd.map { case (fileName, content) =>
  fileName.replace("file:", "") }
println(files.first)
This should display the file location with the file: prefix removed; based on the earlier output, it will look like this:
/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg
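Since we only need the keys of the pair RDD, an equivalent formulation (shown here purely as an optional variation, not as part of the original listing) uses the keys method and strips the prefix only where it appears at the start of the string:
// Optional variation: take just the keys (the file paths) of the pair RDD and
// use stripPrefix, which only removes "file:" when it occurs at the beginning.
val filePaths = rdd.keys.map(_.stripPrefix("file:"))
println(filePaths.first)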