Converting Between CSV and ARFF
Weka uses ARFF as its file format. This is essentially CSV with additional
information about the columns: a header that declares each attribute's name and
type. We'll use two convenient command-line tools to convert between CSV and
ARFF, namely csv2arff (see Example 9-1) and arff2csv (see Example 9-2).
Example 9-1. Convert CSV to ARFF (csv2arff)
#!/usr/bin/env bash
weka core.converters.CSVLoader /dev/stdin
Example 9-2. Convert ARFF to CSV (arff2csv)
#!/usr/bin/env bash
weka core.converters.CSVSaver -i /dev/stdin
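To make "CSV with additional information about the columns" concrete, the following is a minimal sketch that prints an ARFF version of a small CSV without calling Weka. It is not how csv2arff works internally. The function name csv_to_arff_sketch and the relation name "sketch" are hypothetical, and the sketch assumes a header line and numeric-only columns, whereas Weka's CSVLoader infers attribute types itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Weka: read CSV from standard input and
# print it as ARFF. Assumes the first line is a header and that every
# column is numeric.
csv_to_arff_sketch () {
    awk -F, '
        NR == 1 {
            # The ARFF header: a relation name plus one declaration per column.
            print "@relation sketch"
            print ""
            for (i = 1; i <= NF; i++) print "@attribute " $i " numeric"
            print ""
            print "@data"
            next
        }
        # The data section is plain comma-separated rows, as in the CSV.
        { print }
    '
}

printf 'alcohol,ph\n9.4,3.51\n9.8,3.2\n' | csv_to_arff_sketch
```

The rows after the @data line are identical to the CSV rows; only the header differs.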
Comparing Three Clustering Algorithms
To cluster data using Weka, we need yet another custom command-line tool. The
AddCluster class is needed to assign data points to the learned clusters.
Unfortunately, this class does not accept data from standard input, not even
when we specify -i /dev/stdin, because it expects a file with the .arff
extension. We consider this to be bad design. The source code of weka-cluster
is:
#!/usr/bin/env bash
ALGO="$@"
IN=$(mktemp --tmpdir weka-cluster-XXXXXXXX).arff

finish () {
    rm -f $IN
}
trap finish EXIT

csv2arff > $IN
weka filters.unsupervised.attribute.AddCluster -W "weka.${ALGO}" -i $IN \
-o /dev/stdout | arff2csv
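The workaround in weka-cluster, writing standard input to a temporary file that has the required extension and removing it on exit, can be isolated into a small reusable pattern. The sketch below is one possible variant, not the author's code: the function name with_named_tempfile is hypothetical, and it uses a temporary directory (mktemp -d) so that a single cleanup removes everything that was created.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save standard input to a temporary file with the given
# extension, then run a command with that file's path as its last argument.
with_named_tempfile () {
    local ext="$1"; shift
    (
        dir=$(mktemp -d) || exit 1
        # Remove the directory and its contents when this subshell exits,
        # whether the command succeeded or failed.
        trap 'rm -rf "$dir"' EXIT
        cat > "$dir/input.$ext"
        "$@" "$dir/input.$ext"
    )
}

# Usage: wc stands in for a program that insists on a real file.
printf 'a,b\n1,2\n' | with_named_tempfile arff wc -l
```

Running the body in a subshell keeps the EXIT trap local to this one call, so it does not replace a trap that the calling script may already have set.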
Now we can apply the EM clustering algorithm and save the assignment as follows:
$ cd data
$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.EM -N 5 |
> csvcut -c cluster > data/wine-both-cluster-em.csv
Use the scaled data set, and don't use the features quality and type for the clustering
Apply the algorithm using weka-cluster