The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which generates a random permutation of wine-white-clean.csv, and head, which selects only the header and the first 1,599 rows. Finally, we reorder the columns of this data set using csvcut, because by default bigmler assumes that the last column is the label.
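The subsampling idea behind that command can be sketched with plain coreutils. This is only a sketch: the text's actual command uses csvstack and csvcut from csvkit, and the file names and contents below (red.csv, white.csv) are made up for illustration.

```shell
# A minimal sketch of class balancing using only coreutils; the actual
# command in the text uses csvstack and csvcut. The files below are
# made-up toy data.
printf 'type,quality\nred,5\nred,6\nred,7\n' > red.csv
printf 'type,quality\nwhite,4\nwhite,5\nwhite,6\nwhite,7\nwhite,8\n' > white.csv

# Keep all red rows, then append an equally sized random sample of
# white rows; tail -n +2 drops the header before shuffling.
n=$(($(wc -l < red.csv) - 1))   # number of red data rows
{ cat red.csv
  tail -n +2 white.csv | shuf | head -n "$n"
} > balanced.csv

grep -c red balanced.csv    # 3
grep -c white balanced.csv  # 3
```

The header is kept exactly once because only the first file contributes it; every subsequent input is stripped with tail -n +2 before sampling.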
Let's verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:
$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red 1599
white 1599
As you can see, wine-balanced.csv contains 1,599 red and 1,599 white wines. Next, we split the data set into training and test sets using split (Granlund & Stallman, 2012):
$ < wine-balanced.csv header > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \
> ::: 0 1 ::: train test
This is another long command that deserves to be broken down:
- Get the header using header and save it to a temporary file named wine-header.csv.
- Mix up the red and white wines using tail and shuf, and split the result into two files named x00 and x01 using a round-robin distribution.
- Use cat to combine the header saved in wine-header.csv with the rows stored in x00, and save the result as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply option tells parallel to loop over the two input sources in tandem.
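The round-robin distribution performed by split -n r/2 (a GNU coreutils feature) is easy to see on a toy input; the default names x00 and x01 come from combining -d (numeric suffixes) with split's x prefix:

```shell
# Toy demonstration of round-robin splitting with GNU split.
# -n r/2 deals input lines out to two files in turn, like dealing cards;
# -d produces numeric suffixes, giving x00 and x01.
printf 'a\nb\nc\nd\ne\n' | split -d -n r/2

paste -sd, x00   # a,c,e  (lines 1, 3, 5)
paste -sd, x01   # b,d    (lines 2, 4)
```

Unlike split -n l/2, the r/ form does not need to know the input size up front, which is why it also works on piped (non-seekable) input, as here.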
Let's check the number of instances per class again, this time in both wine-train.csv and wine-test.csv:
$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red 821
train white 778
test white 821
test red 778
It looks like our data sets are well balanced. We're now ready to call the prediction API using bigmler.