Most of the recipes in this chapter will use the Weka machine learning and data mining
library (http://www.cs.waikato.ac.nz/ml/weka/). This is a full-featured library,
which is used to analyze data using many different procedures and algorithms. It includes
a more complete set of these algorithms than Incanter, which we've been using a lot so far.
We'll start by seeing how to load CSV files into Weka and work with Weka datasets. However,
for most of the chapter, we'll examine how to use this powerful library to perform different
analyses. Weka's interface to the classes implementing these algorithms is very consistent.
For the first recipe in which we use one of these algorithms, Discovering groups of data using
K-Means clustering, we'll define a macro that will facilitate creating wrapper functions for
Weka algorithms. This is a great example of using macros, and of how easy it is to create
a wrapper over an external Java library to make it more natural to use from Clojure.
Loading CSV and ARFF files into Weka
Weka is most comfortable when using its own file format: the Attribute-Relation File Format
(ARFF). This format includes the types of data in the columns and other information that allow
it to be loaded incrementally, and both of these can be important features. Because of this,
Weka can load data more reliably. However, Weka can still import CSV files, and when it does,
it attempts to guess the type of data in the columns.
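To make the difference concrete, a minimal ARFF file might look like the following sketch. The relation name, attribute names, and values here are hypothetical and are not taken from the dataset used in this recipe; the point is only that the header declares each column's type before any data rows appear, so Weka does not have to guess.

```
% A hypothetical ARFF file; lines starting with % are comments.
@relation example-land-use

% Each attribute is declared with a name and a type.
@attribute region  string
@attribute year    numeric
@attribute status  {rural, urban}

% Data rows follow the header, one comma-separated instance per line.
@data
'Region A', 2000, rural
'Region B', 2001, urban
```

A CSV file carrying the same rows would have only an optional header line of column names, which is why Weka has to infer the types when importing CSV.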
In this recipe, we'll see what's necessary to load data from a CSV file and an ARFF file.
Getting ready
First, we'll need to add Weka to the dependencies in our Leiningen project.clj file:
(defproject d-mining "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [nz.ac.waikato.cms.weka/weka-dev "3.7.11"]])
Then we'll import the right classes into our script or REPL:
(import [weka.core.converters ArffLoader CSVLoader]
        [java.io File])
Finally, we'll need to have a CSV file to import. In this recipe, I'll use the dataset of Chinese
land use data that we compiled for the Scaling variables to simplify variable relationships
recipe in Chapter 7, Statistical Data Analysis with Incanter. It's in the file named
data/chn-land.csv. You can also download this file from
http://www.ericrochester.com/clj-data-analysis/data/chn-land.csv.
 