Take a second and think about whether linear regression would work
to solve problems of this type.
OK, so the answer is: it depends. When you use linear regression, the
output is a continuous variable. Here the output of your algorithm is
going to be a categorical label, so linear regression wouldn't solve the
problem as it's described.
However, it's not impossible to solve it with linear regression plus the
concept of a “threshold.” For example, suppose you're trying to predict
people's credit scores from their ages and incomes. You could pick a
threshold such as 700: if your prediction for a given person
whose age and income you observed was above 700, you'd label their
predicted credit as “high,” or toss them into a bin labeled “high.”
Otherwise, you'd throw them into the bin labeled “low.” With more
thresholds, you could also have more fine-grained categories like “very
low,” “low,” “medium,” “high,” and “very high.”
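The threshold idea can be sketched in a few lines. This is a minimal, hypothetical example: the ages, incomes, and scores below are made up, and the model is a plain least-squares fit, not any particular credit-scoring method.

```python
import numpy as np

# Made-up training data: ages, incomes, and observed credit scores.
ages = np.array([23, 35, 47, 52, 61])
incomes = np.array([30_000, 55_000, 72_000, 90_000, 65_000])
scores = np.array([620, 680, 710, 760, 730])

# Fit ordinary least squares: score ~ intercept + age + income.
X = np.column_stack([np.ones_like(ages), ages, incomes])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)

# Predict a continuous score for a new person, then apply the threshold.
new_person = np.array([1, 40, 60_000])  # intercept term, age, income
predicted = new_person @ coef
label = "high" if predicted > 700 else "low"
```

With more cut points (say 600, 650, 700, 750) the same trick yields the finer-grained bins like “very low” through “very high.”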
In order to do it this way, with linear regression you'd have to establish
the bins as ranges of a continuous outcome. But not everything is on
a continuous scale like a credit score. For example, what if your labels
are “likely Democrat,” “likely Republican,” and “likely independent”?
What do you do now?
The intuition behind k-NN is to consider the most similar other items,
defined in terms of their attributes, look at their labels, and give the
unassigned item the majority vote. If there's a tie, you randomly select
among the labels that have tied for first.
So, for example, if you had a bunch of movies that were labeled
“thumbs up” or “thumbs down,” and you had a movie called “Data
Gone Wild” that hadn't been rated yet—you could look at its attributes:
length of movie, genre, number of sex scenes, number of Oscar-
winning actors in it, and budget. You could then find other movies
with similar attributes, look at their ratings, and then give “Data Gone
Wild” a rating without ever having to watch it.
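Here's what that vote looks like in code. This is a minimal sketch, not the chapter's canonical implementation: the movies, their attributes (length in minutes, number of Oscar-winning actors), and “Data Gone Wild”'s attributes are all made up, and similarity is plain Euclidean distance.

```python
import random
from collections import Counter

def distance(a, b):
    # Euclidean distance between attribute tuples (assumes comparable scales).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def majority_vote(point, labeled, k):
    # labeled: list of (attributes, label) pairs. Take the k most similar
    # items and give `point` the majority label; break ties at random.
    neighbors = sorted(labeled, key=lambda it: distance(it[0], point))[:k]
    votes = Counter(lab for _, lab in neighbors)
    top = max(votes.values())
    return random.choice([lab for lab, c in votes.items() if c == top])

# Made-up rated movies: (length in minutes, Oscar winners in cast), rating.
movies = [((90, 0), "thumbs down"), ((95, 1), "thumbs down"),
          ((120, 2), "thumbs up"), ((130, 3), "thumbs up"),
          ((125, 2), "thumbs up")]

# Rate "Data Gone Wild" (118 minutes, 2 Oscar winners) without watching it.
rating = majority_vote((118, 2), movies, k=3)  # → "thumbs up"
```

Note that attributes on wildly different scales (minutes versus a multimillion-dollar budget) should be normalized first, or the largest-scale attribute will dominate the distance.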
To automate it, two decisions must be made: first, how do you define
similarity or closeness? Once you define it, for a given unrated item,
you can say how similar all the labeled items are to it, and you can take
the most similar items and call them neighbors, who each have a “vote.”
This brings you to the second decision: how many neighbors should
you look at or “let vote”? This value is k, which ultimately you'll choose
as the data scientist, and we'll tell you how.
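One common way to make that choice, sketched below under assumed toy data, is to hold out some labeled items and pick the k that classifies them best. This is just an illustration of the idea; the clusters and candidate k values are made up.

```python
import random
from collections import Counter

def distance(a, b):
    # Euclidean distance between numeric attribute tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(point, train, k):
    # Majority vote among the k nearest labeled items; random tie-break.
    neighbors = sorted(train, key=lambda it: distance(it[0], point))[:k]
    votes = Counter(lab for _, lab in neighbors)
    top = max(votes.values())
    return random.choice([lab for lab, c in votes.items() if c == top])

def accuracy_for_k(train, validation, k):
    # Fraction of held-out items whose predicted label matches the true one.
    correct = sum(knn_predict(attrs, train, k) == lab
                  for attrs, lab in validation)
    return correct / len(validation)

# Toy labeled data (made up): two well-separated clusters.
train = [((0.0, 0.0), "low"), ((0.2, 0.1), "low"), ((0.1, 0.3), "low"),
         ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")]
validation = [((0.1, 0.1), "low"), ((5.1, 5.0), "high")]

# Try several candidate values and keep the one with the best held-out accuracy.
best_k = max([1, 3, 5], key=lambda k: accuracy_for_k(train, validation, k))
```

Odd values of k are popular for two-class problems because they avoid ties in the vote.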