Aside: Digit Recognition
Say you want an algorithm to recognize pictures of handwritten digits as shown in Figure 4-2. In this case, k-NN works well.
Figure 4-2. Handwritten digits
To set it up, you take your underlying representation apart pixel by pixel (say, a 16x16 grid of pixels) and measure how bright each pixel is. Unwrap the 16x16 grid and put it into a 256-dimensional space, which has a natural Euclidean metric. That is to say, the distance between two points in this space is the square root of the sum of the squares of the differences between their entries. In other words, it's the length of the vector going from one point to the other. Then you apply the k-NN algorithm.
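The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's code: the data here is synthetic (two noisy clusters standing in for "dark" and "bright" unrolled 16x16 images), and `knn_predict` is a hypothetical helper name.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest
    training points under the Euclidean metric."""
    # Euclidean distance: sqrt of the sum of squared entry-wise differences
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]               # majority label

# Toy stand-ins for unrolled 16x16 images: two clusters in 256-d space
rng = np.random.default_rng(0)
zeros = rng.normal(0.0, 0.1, size=(20, 256))       # "dark" images, label 0
ones  = rng.normal(1.0, 0.1, size=(20, 256))       # "bright" images, label 1
X = np.vstack([zeros, ones])
y = np.array([0] * 20 + [1] * 20)

print(knn_predict(X, y, np.full(256, 0.9), k=5))   # → 1
```

A query vector near the "bright" cluster gets label 1 because all five of its nearest neighbors carry that label.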
If you vary the number of neighbors, it changes the shape of the
boundary, and you can tune k to prevent overfitting. If you're careful,
you can get 97% accuracy with a sufficiently large dataset.
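One common way to tune k is to hold out a validation set and pick the k with the best held-out accuracy. The sketch below assumes synthetic data and a hypothetical `knn_predict` helper; it only illustrates the tuning loop, not the 97% figure, which depends on a real, sufficiently large dataset.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Synthetic stand-in for digit vectors: two noisy clusters in 256-d space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 256)),
               rng.normal(1.0, 0.3, (50, 256))])
y = np.array([0] * 50 + [1] * 50)

# Hold out every other point; pick the k with the best validation accuracy.
# Too-small k can overfit; too-large k oversmooths the decision boundary.
train_X, val_X = X[::2], X[1::2]
train_y, val_y = y[::2], y[1::2]
scores = {}
for k in (1, 3, 5, 9, 15):
    preds = [knn_predict(train_X, train_y, q, k) for q in val_X]
    scores[k] = np.mean(np.array(preds) == val_y)
best_k = max(scores, key=scores.get)
```

On real digit data you would use cross-validation rather than a single split, but the shape of the loop is the same.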
Moreover, the result can be viewed in a "confusion matrix." A confusion matrix is used when you are trying to classify objects into k bins; it is a k × k matrix of actual label versus predicted label, where the (i, j)th element counts the items that were actually labeled i but predicted to have label j. From a confusion matrix, you can get accuracy, the proportion of total predictions that were correct. In the previous chapter, we discussed the misclassification rate. Notice that accuracy = 1 − misclassification rate.
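The definition above translates directly into code. This is a minimal sketch with made-up labels for k = 3 bins; correct predictions land on the diagonal, so accuracy is the diagonal sum over the total.

```python
import numpy as np

def confusion_matrix(actual, predicted, k):
    """k x k matrix whose (i, j) entry counts items with actual
    label i that were predicted to have label j."""
    M = np.zeros((k, k), dtype=int)
    for a, p in zip(actual, predicted):
        M[a, p] += 1
    return M

actual    = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 0]
M = confusion_matrix(actual, predicted, k=3)

accuracy = np.trace(M) / M.sum()     # correct predictions sit on the diagonal
misclassification = 1 - accuracy     # accuracy = 1 - misclassification rate
print(accuracy)                      # → 5/7 ≈ 0.714
```

Here 5 of the 7 toy predictions are correct, so the accuracy is 5/7 and the misclassification rate is 2/7, matching the identity in the text.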