Modeling Data - Data Science at the Command Line

by definition supervised algorithms, which means that they also incorporate the labels into the model.

This chapter is by no means an introduction to machine learning, so we must skim over many details. We strongly advise that you become familiar with an algorithm before applying it blindly to your data.

Overview

In this chapter, you'll learn how to:

• Reduce the dimensionality of your data set

• Identify groups of data points with three clustering algorithms

• Predict the quality of white wine using regression

• Classify wine as red or white via a prediction API

More Wine, Please!

In this chapter, we'll use a data set of wine tastings, specifically red and white Portuguese Vinho Verde wine. Each data point represents a wine, and consists of 11 physicochemical properties: (1) fixed acidity, (2) volatile acidity, (3) citric acid, (4) residual sugar, (5) chlorides, (6) free sulfur dioxide, (7) total sulfur dioxide, (8) density, (9) pH, (10) sulphates, and (11) alcohol. There is also a quality score. This score lies between 0 (very bad) and 10 (excellent) and is the median of at least three evaluations by wine experts. More information about this data set is available at the Wine Quality Data Set web page.

There are two data sets: one for white wine and one for red wine. The very first step is to obtain the two data sets using curl (and of course parallel because we haven't got all day):

$ cd ~/book/ch09/data
$ parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases"\
> "/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white

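For readers without GNU parallel at hand, the two downloads above can be sketched as a plain shell loop. This is only an illustration under stated assumptions, not the book's method: `base_url` and `wine_url` are hypothetical names introduced here, the loop prints the curl commands instead of running them (a dry run), and it would work through the files one after the other rather than in parallel.

```shell
# Hypothetical variable: the common prefix of both download URLs.
base_url="http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality"

# Hypothetical helper: print the download URL for one wine colour.
wine_url() {
  printf '%s/winequality-%s.csv\n' "$base_url" "$1"
}

# Print (do not execute) the command for each colour; the loop variable
# plays the role of the {} placeholder in the parallel invocation.
for color in red white; do
  echo "curl -sL $(wine_url "$color") > wine-${color}.csv"
done
```

Piping the printed lines to sh would run the two downloads sequentially.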
(The triple colon is another way to pass data to parallel.) Let's inspect both data sets using head and count the number of rows using wc -l:

$ head -n 5 wine-{red,white}.csv | fold

==> wine-red.csv <==

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f

ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";

"quality"
