by definition supervised algorithms, which means that they also incorporate the
labels into the model.
This chapter is by no means an introduction to machine learning, so we
must skim over many details. We strongly advise that you become familiar
with an algorithm before applying it to your data, rather than applying
it blindly.
Overview
In this chapter, you'll learn how to:
• Reduce the dimensionality of your data set
• Identify groups of data points with three clustering algorithms
• Predict the quality of white wine using regression
• Classify wine as red or white via a prediction API
More Wine, Please!
In this chapter, we'll use a data set of wine tastings, specifically red and white
Portuguese Vinho Verde wine. Each data point represents a wine and consists of 11
physicochemical properties: (1) fixed acidity, (2) volatile acidity, (3) citric acid, (4) residual
sugar, (5) chlorides, (6) free sulfur dioxide, (7) total sulfur dioxide, (8) density, (9)
pH, (10) sulphates, and (11) alcohol. There is also a quality score. This score lies
between 0 (very bad) and 10 (excellent) and is the median of at least three evaluations
by wine experts. More information about this data set is available at the Wine Quality
Data Set web page.
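As a quick sanity check of this description, the following sketch counts the columns in a header line. It assumes the files are semicolon-delimited with quoted column names, as in the header printed by head later in this section. The file name header.csv is made up for illustration, and the header is typed in here rather than downloaded.

```shell
# Hypothetical file holding only the header line of a wine data set
# (11 physicochemical properties plus the quality score).
cat > header.csv <<'EOF'
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
EOF

# Put each column name on its own line, strip the quotes, and number them.
head -n 1 header.csv | tr ';' '\n' | tr -d '"' | nl

# Count the columns: one line per column after splitting on semicolons.
# The trailing tr removes any padding that some wc implementations add.
num_columns=$(head -n 1 header.csv | tr ';' '\n' | wc -l | tr -d ' ')
echo "columns: ${num_columns}"
```

If the count is anything other than 12, the delimiter is probably not what this sketch assumes.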
There are two data sets: one for white wine and one for red wine. The very first step is
to obtain the two data sets using curl (and of course parallel because we haven't got
all day):
$ cd ~/book/ch09/data
$ parallel "curl -sL http://archive.ics.uci.edu/ml/machine-learning-databases"\
> "/wine-quality/winequality-{}.csv > wine-{}.csv" ::: red white
(The triple colon is another way to pass data to parallel.) Let's inspect both data sets
using head and count the number of rows using wc -l:
$ head -n 5 wine-{red,white}.csv | fold
==> wine-red.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"f
ree sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";
"quality"