Extracting features from the LFW dataset
To avoid having to download and process a very large dataset, we will work with a subset of the images: those of people whose names start with an "A". This dataset can be downloaded from http://vis-www.cs.umass.edu/lfw/lfw-a.tgz.
Note
For more details and other variants of the data, visit http://vis-www.cs.umass.edu/lfw/.
The original research paper reference is:
Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
It can be downloaded from http://vis-www.cs.umass.edu/lfw/lfw.pdf.
Unzip the data using the following command:
>tar xfvz lfw-a.tgz
This will create a folder called lfw, which contains a number of subfolders, one for each person.
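As a quick sanity check on the extracted layout, the following is a minimal sketch in plain Scala (no Spark required) that counts the image files in each person's subfolder. The helper name imagesPerPerson is hypothetical, and the code assumes that the lfw folder sits in the current working directory and that the images use the .jpg extension.

```scala
import java.io.File

// Hypothetical helper: maps each person's folder name to the number of .jpg files in it.
def imagesPerPerson(root: File): Map[String, Int] = {
  // listFiles returns null when `root` is missing or is not a directory
  val personDirs = Option(root.listFiles)
    .getOrElse(Array.empty[File])
    .filter(_.isDirectory)
  personDirs.map { dir =>
    val imageCount = Option(dir.listFiles)
      .getOrElse(Array.empty[File])
      .count(f => f.isFile && f.getName.toLowerCase.endsWith(".jpg"))
    dir.getName -> imageCount
  }.toMap
}

// Assumption: "lfw" is relative to the directory in which tar was run
val counts = imagesPerPerson(new File("lfw"))
println(s"people: ${counts.size}, images: ${counts.values.sum}")
```

If the folder is missing, the helper returns an empty map rather than throwing, so the printed totals will simply be zero.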
Exploring the face data
Start up your Spark Scala console, making sure that you allocate sufficient memory, as dimensionality reduction methods can be quite computationally expensive:
>./SPARK_HOME/bin/spark-shell --driver-memory 2g
Now that we've unzipped the data, we face a small challenge. Spark provides us with a way to read text files and custom Hadoop input data sources. However, there is no built-in functionality to allow us to read images.
Spark provides a method called wholeTextFiles, which allows us to operate on entire files at once. This is in contrast to the textFile method that we have been using so far, which operates on the individual lines within a text file (or multiple files).
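The difference between the two methods can be illustrated with a small sketch. It is meant to be run inside spark-shell, where sc (the SparkContext) is already defined. The temporary directory and the two tiny files are illustrative stand-ins, not part of the LFW data.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is predefined.
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Illustrative data only: a throwaway directory holding two small text files.
val demoDir = Files.createTempDirectory("whole-text-files-demo")
Files.write(demoDir.resolve("a.txt"), "line 1\nline 2\n".getBytes(StandardCharsets.UTF_8))
Files.write(demoDir.resolve("b.txt"), "line 3\n".getBytes(StandardCharsets.UTF_8))

// Use an explicit file: URI so the path resolves against the local filesystem.
val demoPath = demoDir.toUri.toString

// textFile: one record per line, across all files under the path.
val lines = sc.textFile(demoPath)
val lineCount = lines.count()

// wholeTextFiles: one (file path, file content) pair per file.
val files = sc.wholeTextFiles(demoPath)
val fileCount = files.count()
val paths = files.keys.collect().sorted

println(s"lines: $lineCount, files: $fileCount")
paths.foreach(println)
```

Because each record returned by wholeTextFiles carries the file's path as its key, pointing the same call at the lfw folder gives one record per file, along with that file's location.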