Dimensionality Reduction with Spark - Machine Learning with Spark


We will use the wholeTextFiles method to access the location of each file. Using these file paths, we will write custom code to load and process the images. In the following example code, we will use PATH to refer to the directory in which you extracted the lfw subdirectory.

We can use a wildcard path specification (using the * character in the following code snippet) to tell Spark to look in each directory under the lfw directory for files:

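// The * wildcard matches every directory under lfw
val path = "/PATH/lfw/*"
// Each record is a (file location, file content) pair
val rdd = sc.wholeTextFiles(path)
val first = rdd.first
println(first)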

Running the first command (rdd.first) might take a little time, because Spark must first scan the specified directory structure for all available files. Once it completes, you should see output similar to the following:

first: (String, String) = (file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg,
??JFIF????? ...

You will see that wholeTextFiles returns an RDD of key-value pairs, where the key is the file location and the value is the content of the entire file, read as text. For our purposes, we only care about the file path, as we cannot work directly with the image data as a string (notice that it is displayed as "binary nonsense" in the shell output).
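To see why image data cannot be used reliably once it has been read as a string, the following minimal sketch decodes the first four bytes of a JFIF-style JPEG file as text and encodes them back. It assumes UTF-8 decoding; the names jpegHeader, asText, and roundTripped are illustrative and not part of the original example. It is written in shell style, so it can be pasted into the Scala or Spark shell.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// First four bytes of a JFIF-style JPEG file (SOI marker, then APP0 marker).
val jpegHeader: Array[Byte] = Array(0xFF, 0xD8, 0xFF, 0xE0).map(_.toByte)

// Decode the bytes as text (assumption: UTF-8), then encode the text back.
val asText: String = new String(jpegHeader, UTF_8)
val roundTripped: Array[Byte] = asText.getBytes(UTF_8)

// Byte sequences that are not valid UTF-8 are replaced with U+FFFD while
// decoding, so the original bytes cannot be recovered from the string.
println(asText.indexOf('\uFFFD') >= 0)
println(roundTripped.sameElements(jpegHeader))
```

The replacement characters introduced during decoding are what appear as unreadable symbols in the shell output.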

Let's extract the file paths from the RDD. Note that in the earlier output, the file path starts with the file: prefix. Spark uses this prefix when reading files to differentiate between filesystems (for example, file:// for the local filesystem, hdfs:// for HDFS, s3n:// for Amazon S3, and so on).
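As a rough illustration of how the prefix identifies the filesystem, the sketch below uses the standard java.net.URI class to extract the scheme from a few locations. The helper schemeOf is a hypothetical name, and the host namenode:8020 and bucket example-bucket are made-up values used only for illustration. This is not how Spark resolves filesystems internally, only a way to inspect the prefix.

```scala
import java.net.URI

// Return the scheme of a location, or None when the path carries no scheme.
def schemeOf(location: String): Option[String] =
  Option(new URI(location).getScheme)

// The first location comes from the shell output shown earlier;
// the host and bucket names in the others are hypothetical.
val locations = Seq(
  "file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg",
  "hdfs://namenode:8020/PATH/lfw",
  "s3n://example-bucket/lfw",
  "/PATH/lfw"
)

// Print each location next to the scheme that was found for it.
locations.foreach(loc => println(s"$loc -> ${schemeOf(loc)}"))
```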

In our case, we will be using custom code to read the images, so we don't need this part of

the path. Thus, we will remove it with the following map function:

val files = rdd.map { case (fileName, content) =>
  fileName.replace("file:", "")
}
println(files.first)

This should display the file location with the file: prefix removed:
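The same pattern match can also be tried without a cluster. The following minimal sketch runs it on an ordinary Scala collection that stands in for the RDD. The names pairs and localFiles and the placeholder content are assumptions made for illustration. It uses stripPrefix instead of replace, a small variation that removes the text only when it appears at the start of the path.

```scala
// Stand-in for the RDD of (file location, content) pairs; the content
// string is a placeholder, not real image data.
val pairs = Seq(
  ("file:/PATH/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg", "<binary content>")
)

// Same pattern match as the map function in the text; stripPrefix removes
// "file:" only from the start of the string and leaves the rest untouched.
val localFiles = pairs.map { case (fileName, _) => fileName.stripPrefix("file:") }

println(localFiles.head)
```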
