Classifying data with the Naive Bayesian classifier
Bayesian classification is a way of updating your estimate of the probability that an item is
in a given category, depending on what you already know about that item, category, and the
world at large. In the case of a Naive Bayesian system, we assume that all features of the
items are independent of one another, given the category. For example, elevation and average
snowfall are not independent (higher elevations tend to have more snow), but elevation and
median income should be independent. This algorithm has been useful in a number of interesting
areas, for example, spam detection in emails, automatic language detection, and document
classification. In this recipe, we'll apply it to the mushroom dataset that we looked at in the
Classifying data with decision trees recipe.
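To make the independence assumption concrete, here is a minimal sketch of the calculation in plain Clojure, with no Weka involved. The toy rows, the feature names (:odor, :cap), and the function names are all made up for illustration; they are not taken from the mushroom dataset or from Weka's NaiveBayes implementation, and the add-one smoothing is just one common choice.

```clojure
;; Toy training data (hypothetical, NOT the mushroom dataset):
;; each row is [feature-map class-label].
(def training
  [[{:odor :foul   :cap :flat}   :poisonous]
   [{:odor :foul   :cap :convex} :poisonous]
   [{:odor :none   :cap :flat}   :poisonous]
   [{:odor :none   :cap :convex} :edible]
   [{:odor :none   :cap :flat}   :edible]
   [{:odor :almond :cap :convex} :edible]])

(defn prior
  "P(class): the fraction of training rows that carry the given label."
  [rows label]
  (/ (count (filter #(= label (second %)) rows))
     (count rows)))

(defn likelihood
  "P(feature = value | class), with add-one (Laplace) smoothing so that
  an unseen value does not force the whole product to zero."
  [rows label feature value]
  (let [in-class (filter #(= label (second %)) rows)
        matches  (filter #(= value (get (first %) feature)) in-class)
        n-values (count (distinct (map #(get (first %) feature) rows)))]
    (/ (inc (count matches))
       (+ (count in-class) n-values))))

(defn score
  "Unnormalized posterior: P(class) times the product of the per-feature
  likelihoods. Multiplying them is the 'naive' independence assumption."
  [rows label item]
  (reduce * (prior rows label)
          (for [[feature value] item]
            (likelihood rows label feature value))))

(defn classify
  "Returns the label with the highest score for item."
  [rows item]
  (let [labels (distinct (map second rows))]
    (apply max-key #(score rows % item) labels)))
```

With this toy data, (classify training {:odor :foul :cap :flat}) returns :poisonous: the score for :poisonous is 1/2 * 1/2 * 3/5 = 3/20, against 1/2 * 1/6 * 2/5 = 1/30 for :edible. Weka's classifier handles this bookkeeping for us in the rest of the recipe.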
Getting ready
First, we'll need to use the dependencies that we specified in the project.clj file in the
Loading CSV and ARFF files into Weka recipe. We'll also use the defanalysis macro from the
Discovering groups of data using K-Means clustering recipe, and we'll need this import in
our script or REPL:
(import [weka.classifiers.bayes NaiveBayes]
        [weka.core Instances])
For data, we'll use the same mushroom dataset that we used in the Classifying data with
decision trees recipe. You can download it from
http://www.ericrochester.com/clj-data-analysis/data/UCI/mushroom.arff.
We'll also need to ensure that the class attribute is marked, just as we did in that recipe:
(def shrooms (doto (load-arff "data/UCI/mushroom.arff")
               (.setClassIndex 22)))
How to do it…
In order to test the classifier, we'll take a sample of the data and train the classifier on that.
We'll then see how well it classifies the entire dataset:
1. The following function takes a dataset of instances and a sample size, and it returns
a sample of the dataset:
(defn sample-instances [instances size]
  (let [inst-count (.numInstances instances)]
    (if (<= inst-count size)
      instances