7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
==> wine-white.csv <==
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
$ wc -l wine-{red,white}.csv
1600 wine-red.csv
4899 wine-white.csv
6499 total
At first sight, this data appears to be quite clean already. Still, let's scrub it a little
so that it conforms more closely to what most command-line tools expect.
Specifically, we'll:
• Convert the header to lowercase
• Convert the semicolons to commas
• Convert spaces to underscores
• Remove unnecessary quotes
These things can all be taken care of by tr. Let's use a for loop this time (for old
times' sake) to process both data sets:
$ for T in red white; do
> < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" > wine-${T}-clean.csv
> done
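To see what the two tr calls do without touching the wine files, here is a minimal sketch that runs the same pipeline on a single line. The variable names sample and cleaned and the shortened header are assumptions made up for illustration; they are not part of the original data.

```shell
# Hypothetical sample: a shortened header in the style of the wine files
sample='"fixed acidity";"volatile acidity";"pH";"quality"'

# First tr: uppercase -> lowercase, ';' -> ',', ' ' -> '_'
# Second tr: delete the double quotes
cleaned=$(printf '%s\n' "$sample" | tr '[A-Z]; ' '[a-z],_' | tr -d '"')

echo "$cleaned"
# prints: fixed_acidity,volatile_acidity,ph,quality
```

Note that tr works on every line, not just the header. That is harmless here because the data rows contain only digits, periods, and semicolons.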
Let's combine the two data sets. We'll use csvstack to add a column named type which
will be red for rows of the first file, and white for rows of the second file:
$ HEADER="$(head -n 1 wine-red-clean.csv),type"
$ csvstack -g red,white -n type wine-{red,white}-clean.csv |
> csvcut -c $HEADER > wine-both-clean.csv
The new column type is added to the beginning of the table. Because some of the
command-line tools that we'll use in this chapter assume that the class label is the last
column, we'll rearrange the columns by using csvcut. Instead of typing all 13 columns,
we temporarily store the desired header into a variable HEADER before we call csvstack.
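The shape of the combined table, with the label as the last column, can be sketched without csvkit. The example below is not the method used in the text: it swaps in head, tail, and awk, and it assumes simple CSV with no quoted commas or embedded newlines. The two miniature files and the variable names tmp and combined are made up for illustration.

```shell
# Hypothetical miniature stand-ins for the cleaned wine files
tmp=$(mktemp -d)
printf 'ph,quality\n3.51,5\n3.2,5\n' > "$tmp/wine-red-clean.csv"
printf 'ph,quality\n3,6\n' > "$tmp/wine-white-clean.csv"

# Emit the header once with ",type" appended, then tag each data row
# with the name of the file it came from as the last column
combined=$(
  echo "$(head -n 1 "$tmp/wine-red-clean.csv"),type"
  for T in red white; do
    tail -n +2 "$tmp/wine-$T-clean.csv" | awk -v t="$T" '{ print $0 "," t }'
  done
)

echo "$combined"
rm -r "$tmp"
```

With these inputs the result has one header line (ph,quality,type) followed by two red rows and one white row. For real-world CSV with quoting, a CSV-aware tool such as csvstack remains the safer choice.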
It's good to check whether there are any missing values in this data set: