Classifying data with the Naive Bayesian classifier
Bayesian classification is a way of updating your estimate of the probability that an item is
in a given category, depending on what you already know about that item, category, and the
world at large. In the case of a Naive Bayesian system, we assume that all features of the
items are independent. For example, elevation and average snowfall are not independent
(higher elevations tend to have more snow), but elevation and median income should be
independent. This algorithm has been useful in a number of interesting areas, for example,
spam detection in emails, automatic language detection, and document classification. In this
recipe, we'll apply it to the mushroom dataset that we looked at in the Classifying data with
decision trees recipe.
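To make the independence assumption concrete, here is a minimal pure-Clojure sketch that does not use Weka. It scores each class as its prior probability multiplied by one likelihood per feature, then normalizes the scores. The toy rows, the feature names, and the helper functions (prior, likelihood, score, posteriors, classify) are all invented for illustration and are not part of the recipe or of Weka's API; add-one smoothing is an assumption chosen to keep unseen values from zeroing out a score.

```clojure
;; Toy training data, made up for illustration (NOT the mushroom dataset).
;; Each row is [feature-map class-label].
(def training
  [[{:odor :foul   :cap :flat}   :poisonous]
   [{:odor :foul   :cap :convex} :poisonous]
   [{:odor :none   :cap :flat}   :poisonous]
   [{:odor :none   :cap :convex} :edible]
   [{:odor :none   :cap :flat}   :edible]
   [{:odor :almond :cap :convex} :edible]])

(defn prior
  "P(class): the fraction of training rows that carry this label."
  [rows label]
  (/ (count (filter #(= label (second %)) rows))
     (count rows)))

(defn likelihood
  "P(feature = value | class), with add-one smoothing so that a value
  never seen with this class does not force the whole score to zero."
  [rows label feature value]
  (let [n-values (count (distinct (map #(get (first %) feature) rows)))
        in-class (filter #(= label (second %)) rows)
        matches  (filter #(= value (get (first %) feature)) in-class)]
    (/ (inc (count matches))
       (+ (count in-class) n-values))))

(defn score
  "Unnormalized posterior: the prior times one likelihood per feature.
  Multiplying the likelihoods together is the 'naive' assumption that
  the features are independent given the class."
  [rows label item]
  (reduce *
          (prior rows label)
          (for [[feature value] item]
            (likelihood rows label feature value))))

(defn posteriors
  "Map of class label to normalized posterior probability."
  [rows item]
  (let [labels (distinct (map second rows))
        scores (into {} (for [l labels] [l (score rows l item)]))
        total  (reduce + (vals scores))]
    (into {} (for [[l s] scores] [l (/ s total)]))))

(defn classify
  "The label with the highest posterior probability."
  [rows item]
  (key (apply max-key val (posteriors rows item))))

(posteriors training {:odor :foul :cap :flat})
;; => {:poisonous 9/11, :edible 2/11}
```

Because Clojure keeps the arithmetic as exact ratios, the result can be checked by hand: the poisonous score is 1/2 × 3/6 × 3/5 = 3/20, the edible score is 1/2 × 1/6 × 2/5 = 1/30, and dividing each by their sum gives 9/11 and 2/11.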
Getting ready
First, we'll need to use the dependencies that we specified in the project.clj file in the
Loading CSV and ARFF files into Weka recipe. We'll also use the defanalysis macro from
the Discovering groups of data using K-Means clustering recipe, and we'll need this import in
our script or REPL:
(import [weka.classifiers.bayes NaiveBayes]
        [weka.core Instances])
For data, we'll use the same mushroom dataset that we used in the Classifying data with decision
trees recipe. You can download it from
http://www.ericrochester.com/clj-data-analysis/data/UCI/mushroom.arff. We'll also need to ensure
that the class attribute is marked, just as we did in that recipe:
(def shrooms (doto (load-arff "data/UCI/mushroom.arff")
               (.setClassIndex 22)))
How to do it…
In order to test the classifier, we'll take a sample of the data and train the classifier on that.
We'll then see how well it classifies the entire dataset:
1. The following function takes a dataset of instances and a sample size, and it returns
a sample of the dataset:
(defn sample-instances [instances size]
  (let [inst-count (.numInstances instances)]
    (if (<= inst-count size)
      instances