Modeling Data - Data Science at the Command Line


Converting Between CSV and ARFF

Weka uses ARFF (Attribute-Relation File Format) as its file format. This is basically CSV with additional information about the columns. We'll use two convenient command-line tools to convert between CSV and ARFF, namely csv2arff (see Example 9-1) and arff2csv (see Example 9-2).

Example 9-1. Convert CSV to ARFF (csv2arff)

#!/usr/bin/env bash
# Read CSV from standard input and write ARFF to standard output
weka core.converters.CSVLoader /dev/stdin

Example 9-2. Convert ARFF to CSV (arff2csv)

#!/usr/bin/env bash
# Read ARFF from standard input and write CSV to standard output
weka core.converters.CSVSaver -i /dev/stdin
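To make "CSV with additional information about the columns" concrete, here is a minimal sketch that builds an ARFF file by hand, without Weka. It is not how csv2arff works. The helper name csv_to_minimal_arff is hypothetical, and the sketch assumes a header row, unquoted comma-separated fields, and that every column is numeric. Weka's CSVLoader infers column types itself.

```shell
#!/usr/bin/env bash
# Sketch only: emit a minimal ARFF file from a CSV on standard input.
# Assumptions: header row present, no quoted fields, every column numeric.
# csv_to_minimal_arff is a hypothetical helper, not part of Weka.
csv_to_minimal_arff () {
  awk -F, '
    NR == 1 {
      # The header row becomes the ARFF column declarations
      print "@relation stdin"
      print ""
      for (i = 1; i <= NF; i++) print "@attribute " $i " numeric"
      print ""
      print "@data"
      next
    }
    # Data rows are copied through unchanged
    { print }
  '
}

# Usage with a two-column stand-in for the wine data
printf 'alcohol,ph\n9.4,3.51\n9.8,3.2\n' | csv_to_minimal_arff
```

The output starts with an @relation line and one @attribute line per column, followed by @data and the original rows. That header is the extra column information that ARFF carries and CSV lacks.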

Comparing Three Clustering Algorithms

To cluster data with Weka, we need yet another custom command-line tool, weka-cluster. The AddCluster class is needed to assign data points to the learned clusters. Unfortunately, this class does not accept data from standard input, not even when we specify -i /dev/stdin, because it expects a file with the .arff extension. We consider this to be bad design. The source code of weka-cluster is:

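#!/usr/bin/env bash
# Cluster CSV from standard input with the given Weka clusterer and options
ALGO="$@"
IN=$(mktemp --tmpdir weka-cluster-XXXXXXXX).arff

# Remove the temporary ARFF file when the script exits
finish () {
    rm -f "$IN"
}
trap finish EXIT

csv2arff > "$IN"
weka filters.unsupervised.attribute.AddCluster -W "weka.${ALGO}" -i "$IN" \
    -o /dev/stdout | arff2csv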
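The workaround in weka-cluster, buffering standard input in a temporary file with the required extension and deleting it on exit, can be sketched independently of Weka. The helper name buffer_stdin_as is hypothetical, and the sketch assumes bash and a mktemp that accepts a full path template. Unlike the script above, it also removes the extensionless file that mktemp itself creates.

```shell
#!/usr/bin/env bash
# Sketch only: give a command that insists on a file name with a certain
# extension a copy of standard input under such a name.
# buffer_stdin_as is a hypothetical helper name.
# The function body is a subshell, so the EXIT trap fires when it finishes.
buffer_stdin_as () (
  suffix="$1"; shift
  base=$(mktemp "${TMPDIR:-/tmp}/buffer-XXXXXXXX") || exit 1
  tmp="${base}${suffix}"
  # Clean up both the mktemp placeholder and the suffixed copy
  trap 'rm -f "$base" "$tmp"' EXIT
  cat > "$tmp"
  # Run the remaining arguments as a command, with the file name appended
  "$@" "$tmp"
)

# Usage: wc receives a file name ending in .arff instead of standard input
printf 'a,b\n1,2\n' | buffer_stdin_as .arff wc -l
```

Running the body in a subshell keeps the trap local to one call, so a script can use the helper several times without one cleanup handler overwriting another.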

Now we can apply the EM clustering algorithm and save the assignment as follows:

$ cd data
$ < wine-both-scaled.csv csvcut -C quality,type |
> weka-cluster clusterers.EM -N 5 |
> csvcut -c cluster > data/wine-both-cluster-em.csv

Use the scaled data set, and don't use the features quality and type for the clustering.

Apply the algorithm using weka-cluster.
