the range of 0 to 100. You could do the same with the number of actors.
You could play around with the 10; maybe 50 is better.
You'll want to justify why you're making these choices. The justification could be that you tried different values and, when you tested the algorithm, this gave the best evaluation metric. Essentially, this 10 is either a second tuning parameter that you've introduced into the algorithm on top of k, or a prior you've put on the model, depending on your point of view and how it's used.
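As a minimal sketch of that choice, assuming income is the column being rescaled (the scale.factor name is illustrative, not from the text):

# a sketch of the rescaling step: the divisor is the tuning
# constant discussed above; try 10, 50, and so on, and keep
# whichever value gives the best evaluation metric when you
# test the algorithm
scale.factor <- 10
data$income <- data$income / scale.factor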
Training and test sets
For any machine learning algorithm, the general approach is to have
a training phase, during which you create a model and “train it”; and
then you have a testing phase, where you use new data to test how good
the model is.
For k-NN, the training phase is straightforward: it's just reading in
your data with the “high” or “low” credit data points marked. In testing,
you pretend you don't know the true label and see how good you are
at guessing using the k-NN algorithm.
To do this, you'll need to hold back some of the data from the training phase so it stays untouched for the testing phase. Usually you want to hold out a randomly selected subset, say 20%.
Your R console might look like this:
> head(data)
  age income credit
1  69      3    low
2  66     57    low
3  49     79    low
4  49     17    low
5  58     26   high
6  44     71   high
n.points <- 1000  # number of rows in the dataset
sampling.rate <- 0.8
# we need the number of points in the test set to calculate
# the misclassification rate
num.test.set.labels <- n.points * (1 - sampling.rate)
# randomly sample which rows will go in the training set
training <- sample(1:n.points, sampling.rate * n.points,
                   replace = FALSE)
# define the training set to be those rows
train <- subset(data[training, ], select = c(age, income))
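From here, a natural way to finish the sketch is to take the remaining rows as the test set, guess their labels with k-NN, and compute the misclassification rate. This continuation is an illustrative sketch rather than the text's own code: knn() comes from R's class package, and the variable names and the choice of k = 5 are assumptions.

# the test set is every row that wasn't sampled into training
testing <- setdiff(1:n.points, training)
test <- subset(data[testing, ], select = c(age, income))
# labels: known for the training rows, held out for the test rows
cl <- data$credit[training]
true.labels <- data$credit[testing]
# guess labels for the test set with k-NN and count the mistakes
library(class)
predicted <- knn(train, test, cl, k = 5)  # k = 5 is an arbitrary illustrative choice
misclassification.rate <- sum(predicted != true.labels) / num.test.set.labels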