Modeling Data - Data Science at the Command Line

7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5

7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5

11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6

==> wine-white.csv <==

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"

7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6

6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6

8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6

7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6

$ wc -l wine-{red,white}.csv

1600 wine-red.csv

4899 wine-white.csv

6499 total

At first sight, this data appears to be very clean already. Still, let's scrub it a little

bit so that it conforms more closely to what most command-line tools expect.

Specifically, we'll:

• Convert the header to lowercase

• Convert the semicolons to commas

• Convert spaces to underscores

• Remove unnecessary quotes

These things can all be taken care of by tr. Let's use a for loop this time, for old

times' sake, to process both data sets:

$ for T in red white; do

> < wine-$T.csv tr '[A-Z]; ' '[a-z],_' | tr -d \" > wine-${T}-clean.csv

> done

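To see what the two tr calls do, it can help to run them on a tiny sample first. The sketch below uses a made-up two-line sample (three of the twelve columns, stored in a hypothetical variable named sample) rather than the real files, so the names and values are illustrative only:

```shell
# A made-up sample in the same format as the raw wine files:
# quoted header, semicolon-separated, spaces inside column names.
sample='"fixed acidity";"pH";"quality"
7.4;3.51;5'

# First tr: uppercase -> lowercase, ';' -> ',', ' ' -> '_'.
# Second tr: delete every double quote.
cleaned=$(printf '%s\n' "$sample" | tr '[A-Z]; ' '[a-z],_' | tr -d '"')

printf '%s\n' "$cleaned"
```

The header comes out as fixed_acidity,ph,quality and the data row as 7.4,3.51,5. Because tr works character by character, the same translation is applied to every line, which is safe here only because the data rows contain no spaces, quotes, or uppercase letters that need to be preserved.
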
Let's combine the two data sets. We'll use csvstack to add a column named type

which will be red for rows of the first file, and white for rows of the second file:

$ HEADER="$(head -n 1 wine-red-clean.csv),type"

$ csvstack -g red,white -n type wine-{red,white}-clean.csv |

> csvcut -c $HEADER > wine-both-clean.csv

The new column type is added to the beginning of the table. Because some of the

command-line tools that we'll use in this chapter assume that the class label is the last

column, we'll rearrange the columns by using csvcut. Instead of typing all 13

columns, we temporarily store the desired header in a variable HEADER before we call

csvstack.
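For readers who want to see the shape of the combined file without csvkit installed, the following is a rough sketch of the same end result using plain awk. It is not how csvstack or csvcut work internally, it assumes simple CSV with no quoted commas, and the two miniature files and their names are invented for illustration:

```shell
# Made-up miniature stand-ins for the two cleaned files.
workdir=$(mktemp -d)
printf 'ph,quality\n3.51,5\n3.2,5\n' > "$workdir/red.csv"
printf 'ph,quality\n3,6\n' > "$workdir/white.csv"

# Print the header once with ",type" appended, then every data row
# followed by the label of the file it came from.
# FNR == 1 is true on the first (header) line of each input file.
combined=$(awk -v OFS=, '
    FNR == 1 && NR == 1 { print $0, "type" }
    FNR == 1 { label = (NR == 1) ? "red" : "white"; next }
    { print $0, label }
' "$workdir/red.csv" "$workdir/white.csv")

printf '%s\n' "$combined"
rm -r "$workdir"
```

Here the type column is appended directly as the last column, so no reordering step is needed. With csvstack the grouping column is placed first, which is why the pipeline in the text follows it with csvcut.
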

It's good to check whether there are any missing values in this data set:
