the range of 0 to 100. You could do the same with the number of actors.
You could play around with the 10; maybe 50 is better.
You'll want to justify why you're making these choices. The justification could be that you tried different values and, when you tested the algorithm, this gave the best evaluation metric. Essentially, this 10 is either a second tuning parameter that you've introduced into the algorithm on top of k, or a prior you've put on the model, depending on your point of view and how it's used.
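As a minimal sketch of that choice, assuming income is the column being rescaled (the scale.factor name is illustrative, not from the text):

# a sketch of the rescaling step: the divisor is the tuning
# constant discussed above; try 10, 50, and so on, and keep
# whichever value gives the best evaluation metric when you
# test the algorithm
scale.factor <- 10
data$income <- data$income / scale.factor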
Training and test sets
For any machine learning algorithm, the general approach is to have
a training phase, during which you create a model and “train it”; and
then you have a testing phase, where you use new data to test how good
the model is.
For k-NN, the training phase is straightforward: it's just reading in
your data with the “high” or “low” credit data points marked. In testing,
you pretend you don't know the true label and see how good you are
at guessing using the k-NN algorithm.
To do this, you'll need to hold back some of the data from the training phase so it stays untouched for the testing phase. Usually you want to hold out a randomly selected subset, say 20%.
Your R console might look like this:
> head(data)
  age income credit
1  69      3    low
2  66     57    low
3  49     79    low
4  49     17    low
5  58     26   high
6  44     71   high
n.points <- 1000  # number of rows in the dataset
sampling.rate <- 0.8
# we need the number of points in the test set to calculate
# the misclassification rate
num.test.set.labels <- n.points * (1 - sampling.rate)
# randomly sample which rows will go in the training set
training <- sample(1:n.points, sampling.rate * n.points,
                   replace = FALSE)
# define the training set to be those rows
train <- subset(data[training, ], select = c(age, income))
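From here, a natural way to finish the sketch is to take the remaining rows as the test set, guess their labels with k-NN, and compute the misclassification rate. This continuation is an illustrative sketch rather than the text's own code: knn() comes from R's class package, and the variable names and the choice of k = 5 are assumptions.

# the test set is every row that wasn't sampled into training
testing <- setdiff(1:n.points, training)
test <- subset(data[testing, ], select = c(age, income))
# labels: known for the training rows, held out for the test rows
cl <- data$credit[training]
true.labels <- data$credit[testing]
# guess labels for the test set with k-NN and count the mistakes
library(class)
predicted <- knn(train, test, cl, k = 5)  # k = 5 is an arbitrary illustrative choice
misclassification.rate <- sum(predicted != true.labels) / num.test.set.labels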