Modeling Data - Data Science at the Command Line


Dimensionality reduction is often regarded as being part of the exploring step. It's useful when there are too many features to plot. You could draw a scatter plot matrix, but that only shows you two features at a time. It's also useful as a preprocessing step for other machine-learning algorithms.
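To get a feel for how quickly a scatter plot matrix grows: n features produce n(n-1)/2 distinct pairwise panels. The following is a minimal sketch, assuming a comma-separated file with a header row; the function names `pair_count` and `feature_count` are ours and purely illustrative.

```shell
# Number of distinct feature pairs for n features: n * (n - 1) / 2.
pair_count() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# Count the comma-separated columns in the header line of a CSV file.
# Assumes no quoted fields containing commas.
feature_count() {
    head -n 1 "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# Twelve features already mean dozens of panels to inspect.
pair_count 12
```

For a real file, `pair_count "$(feature_count data.csv)"` would give the number of panels, where `data.csv` is a placeholder name.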

Most dimensionality reduction algorithms are unsupervised. This means that they don't employ the labels of the data points in order to construct the lower-dimensional mapping.
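In practice, this means the label column can be set aside before the features are handed to the algorithm. Here is a minimal sketch, assuming the label is the last comma-separated field on every line; the sample data, the column names, and the helper `drop_label` are made up for illustration.

```shell
# Strip the last comma-separated field (assumed to be the label) from each line.
drop_label() {
    sed 's/,[^,]*$//' "$1"
}

# Hypothetical sample data: three features followed by a "type" label.
workdir=$(mktemp -d)
printf 'alcohol,ph,sulphates,type\n9.4,3.51,0.56,red\n10.1,3.00,0.45,white\n' \
    > "$workdir/sample.csv"

# Write the features only; the labels stay behind in the original file.
drop_label "$workdir/sample.csv" > "$workdir/features.csv"
cat "$workdir/features.csv"
```

The labels remain available in the original file, so they can still be used afterwards, for example to color the points in a plot of the reduced data.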

In this section, we'll look at two techniques: PCA, which stands for Principal Components Analysis (Pearson, 1901), and t-SNE, which stands for t-distributed Stochastic Neighbor Embedding (van der Maaten & Hinton, 2008).

Introducing Tapkee

Tapkee is a C++ template library for dimensionality reduction (Lisitsyn, Widmer, & Garcia, 2013). The library contains implementations of many dimensionality reduction algorithms, including:

• Locally Linear Embedding

• Isomap

• Multidimensional scaling

• PCA

• t-SNE

Tapkee's website contains more information about these algorithms. Although Tapkee is mainly a library that can be included in other applications, it also offers a command-line tool. We'll use this tool to perform dimensionality reduction on our wine data set.

Installing Tapkee

If you aren't running the Data Science Toolbox, you'll need to download and compile Tapkee yourself. First, make sure that you have CMake installed. On Ubuntu, you simply run:

$ sudo apt-get install cmake
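Before compiling, it can help to confirm that CMake is actually on your PATH. This is a minimal sketch; the helper name `have_command` is ours and not part of any tool.

```shell
# Succeed if the given command can be found on the PATH.
have_command() {
    command -v "$1" > /dev/null 2>&1
}

if have_command cmake; then
    # Print only the first line of the version banner.
    cmake --version | head -n 1
else
    echo "cmake not found; install it before compiling Tapkee" >&2
fi
```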

Consult Tapkee's website for instructions for other operating systems. Then execute the following commands to download the source and compile it:

$ curl -sL https://github.com/lisitsyn/tapkee/archive/master.tar.gz > \
> tapkee-master.tar.gz
$ tar -xzf tapkee-master.tar.gz
$ cd tapkee-master
$ mkdir build && cd build
