Dimensionality Reduction with Spark - Machine Learning with Spark

Extracting features from the LFW dataset

To avoid having to download and process a very large dataset, we will work with a subset of the images: those of people whose names start with "A". This dataset can be downloaded from http://vis-www.cs.umass.edu/lfw/lfw-a.tgz.

Note

For more details and other variants of the data, visit http://vis-www.cs.umass.edu/lfw/.

The original research paper reference is:

Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October 2007.

It can be downloaded from http://vis-www.cs.umass.edu/lfw/lfw.pdf.

Unzip the data using the following command:

>tar xfvz lfw-a.tgz

This will create a folder called lfw, which contains a number of subfolders, one for each person.

Exploring the face data

Start your Spark Scala console, making sure that you allocate sufficient memory, as dimensionality reduction methods can be quite computationally expensive:

>./SPARK_HOME/bin/spark-shell --driver-memory 2g

Now that we've unzipped the data, we face a small challenge. Spark provides us with a way to read text files and custom Hadoop input data sources. However, there is no built-in functionality to allow us to read images.

Spark provides a method called wholeTextFiles, which allows us to operate on entire files at once, in contrast to the textFile method that we have been using so far, which operates on the individual lines within a text file (or multiple files).
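To make the difference between the two methods concrete, here is a minimal sketch intended to be pasted into the Spark Scala console, where sc is already defined. It does not use the face data: the scratch directory and the two small files (a.txt and b.txt) are hypothetical stand-ins created only for illustration.

```scala
// Paste into spark-shell, where `sc` (the SparkContext) is already defined.
import java.nio.file.Files
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical scratch directory with two small text files, so that the
// sketch does not depend on where the LFW archive was extracted.
val dir = Files.createTempDirectory("whole-text-demo")
Files.write(dir.resolve("a.txt"), "line 1\nline 2\n".getBytes(UTF_8))
Files.write(dir.resolve("b.txt"), "line 3\n".getBytes(UTF_8))

// textFile: one record per line, across every file in the directory.
val lines = sc.textFile(dir.toString)
val lineCount = lines.count()

// wholeTextFiles: one (path, content) record per file.
val files = sc.wholeTextFiles(dir.toString)
val fileCount = files.count()

// The key of each record is the full path of the file; keep only the name.
val fileNames = files.keys.map(_.split("/").last).collect().sorted
```

With these two files, textFile should yield three records (one per line) and wholeTextFiles two records (one per file). For the face data, the same wholeTextFiles call would be pointed at the unzipped lfw folder; the exact path depends on where you extracted the archive.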
