The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which selects only the header and the first 1599 rows. Finally, we reorder the columns of this data set using csvcut because, by default, bigmler assumes that the last column is the label.
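The header/shuf/head step can be tried in isolation. A minimal sketch with toy data (hypothetical file and values standing in for wine-white-clean.csv), sampling two rows while keeping the header intact:

```shell
# Toy white-wine file (hypothetical rows standing in for wine-white-clean.csv).
printf '%s\n' 'alcohol,quality' '8.8,6' '9.5,6' '10.1,6' '9.9,6' > white.csv

# Keep the header, randomly permute the body, and take the first 2 rows --
# the same idea used above to cut the whites down to the first 1599 rows.
{ head -n 1 white.csv; tail -n +2 white.csv | shuf | head -n 2; } > white-sample.csv
```

The grouping with braces ensures the header is written before the sampled body rows, so the result is still a valid CSV file.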
Let's verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:

$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red 1599
white 1599
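Note that grep -c counts matching lines, which works here because each class name appears exactly once per row; a more explicit count keys on the type column itself. A sketch with toy data (hypothetical rows; the real file has 1,599 of each class):

```shell
# Toy balanced file (hypothetical rows; the type column comes last).
printf '%s\n' 'alcohol,type' '9.4,red' '9.8,red' '8.8,white' '9.5,white' > balanced.csv

# Count instances per class by the last field, which stays correct even if
# a class name also happened to appear inside another column.
awk -F, 'NR > 1 { n[$NF]++ } END { for (c in n) print c, n[c] }' balanced.csv
```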
As you can see, the data set wine-balanced.csv contains both 1,599 red and 1,599 white wines. Next, we split the data set into train and test data sets using split (Granlund & Stallman, 2012):

$ < wine-balanced.csv header > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \
> ::: 0 1 ::: train test
This is another long command that deserves to be broken down:

1. Get the header using header and save it to a temporary file named wine-header.csv.

2. Mix up the red and white wines using tail and shuf, and split the result into two files named x00 and x01 using a round-robin distribution.

3. Use cat to combine the header saved in wine-header.csv and the rows stored in x00, and save the result as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply option tells parallel to loop over the two input sources in tandem.
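Because --xapply pairs the two input sources instead of crossing them, the parallel call expands to just two plain cat commands. A runnable sketch of the whole split with toy data (hypothetical values; file names shortened):

```shell
# Toy balanced file: header plus four rows (hypothetical values).
printf '%s\n' 'alcohol,type' '9.4,red' '9.8,red' '8.8,white' '9.5,white' > balanced.csv

# Same recipe as above: save the header, shuffle the body, and deal the
# rows round-robin into two chunks named x00 and x01 (GNU split).
head -n 1 balanced.csv > header.csv
tail -n +2 balanced.csv | shuf | split -d -n r/2

# What parallel --xapply "cat header.csv x0{1} > {2}.csv" ::: 0 1 ::: train test
# expands to: the sources are consumed in tandem, so 0 pairs with train and
# 1 pairs with test -- two jobs, not a 2x2 cross product.
cat header.csv x00 > train.csv
cat header.csv x01 > test.csv
```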
Let's check again the number of instances per class in both wine-train.csv and wine-test.csv:

$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red 821
train white 778
test white 821
test red 778
It looks like our data sets are well balanced. We're now ready to call the prediction API using bigmler.