The second file is passed to csvstack using file redirection. This allows us to create a temporary file using shuf, which creates a random permutation of wine-white-clean.csv, and head, which selects only the header and the first 1599 rows. Finally, we reorder the columns of this data set using csvcut because, by default, bigmler assumes that the last column is the label.
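The header/shuf/head step can be tried in isolation. A minimal sketch with toy data (hypothetical file and values standing in for wine-white-clean.csv), sampling two rows while keeping the header intact:

```shell
# Toy white-wine file (hypothetical rows standing in for wine-white-clean.csv).
printf '%s\n' 'alcohol,quality' '8.8,6' '9.5,6' '10.1,6' '9.9,6' > white.csv

# Keep the header, randomly permute the body, and take the first 2 rows --
# the same idea used above to cut the whites down to the first 1599 rows.
{ head -n 1 white.csv; tail -n +2 white.csv | shuf | head -n 2; } > white-sample.csv
```

The grouping with braces ensures the header is written before the sampled body rows, so the result is still a valid CSV file.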
Let's verify that wine-balanced.csv is actually balanced by counting the number of instances per class using parallel and grep:

$ parallel --tag grep -c {} wine-balanced.csv ::: red white
red 1599
white 1599
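Note that grep -c counts matching lines, which works here because each class name appears exactly once per row; a more explicit count keys on the type column itself. A sketch with toy data (hypothetical rows; the real file has 1,599 of each class):

```shell
# Toy balanced file (hypothetical rows; the type column comes last).
printf '%s\n' 'alcohol,type' '9.4,red' '9.8,red' '8.8,white' '9.5,white' > balanced.csv

# Count instances per class by the last field, which stays correct even if
# a class name also happened to appear inside another column.
awk -F, 'NR > 1 { n[$NF]++ } END { for (c in n) print c, n[c] }' balanced.csv
```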
As you can see, the data set wine-balanced.csv contains both 1,599 red and 1,599 white wines. Next, we split the data set into train and test data sets using split (Granlund & Stallman, 2012):

$ < wine-balanced.csv header > wine-header.csv
$ tail -n +2 wine-balanced.csv | shuf | split -d -n r/2
$ parallel --xapply "cat wine-header.csv x0{1} > wine-{2}.csv" \
> ::: 0 1 ::: train test
This is another long command that deserves to be broken down:

1. Get the header using header and save it to a temporary file named wine-header.csv.

2. Mix up the red and white wines using tail and shuf, and split the result into two files named x00 and x01 using a round-robin distribution.

3. Use cat to combine the header saved in wine-header.csv and the rows stored in x00, and save the result as wine-train.csv; similarly for x01 and wine-test.csv. The --xapply option tells parallel to loop over the two input sources in tandem.
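Because --xapply pairs the two input sources instead of crossing them, the parallel call expands to just two plain cat commands. A runnable sketch of the whole split with toy data (hypothetical values; file names shortened):

```shell
# Toy balanced file: header plus four rows (hypothetical values).
printf '%s\n' 'alcohol,type' '9.4,red' '9.8,red' '8.8,white' '9.5,white' > balanced.csv

# Same recipe as above: save the header, shuffle the body, and deal the
# rows round-robin into two chunks named x00 and x01 (GNU split).
head -n 1 balanced.csv > header.csv
tail -n +2 balanced.csv | shuf | split -d -n r/2

# What parallel --xapply "cat header.csv x0{1} > {2}.csv" ::: 0 1 ::: train test
# expands to: the sources are consumed in tandem, so 0 pairs with train and
# 1 pairs with test -- two jobs, not a 2x2 cross product.
cat header.csv x00 > train.csv
cat header.csv x01 > test.csv
```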
Let's check again the number of instances per class in both wine-train.csv and wine-test.csv:

$ parallel --tag grep -c {2} wine-{1}.csv ::: train test ::: red white
train red 821
train white 778
test white 821
test red 778
It looks like our data sets are well balanced. We're now ready to call the prediction API using bigmler.