Take a second and think about whether linear regression would work
to solve problems of this type.
OK, so the answer is: it depends. When you use linear regression, the
output is a continuous variable. Here the output of your algorithm is
going to be a categorical label, so linear regression wouldn't solve the
problem as it's described.
However, it's not impossible to solve it with linear regression plus the
concept of a “threshold.” For example, suppose you're trying to predict
people's credit scores from their ages and incomes. You could pick a
threshold such as 700: if your prediction for a given person
whose age and income you observed was above 700, you'd label their
predicted credit as “high,” or toss them into a bin labeled “high.”
Otherwise, you'd throw them into the bin labeled “low.” With more
thresholds, you could also have more fine-grained categories like “very
low,” “low,” “medium,” “high,” and “very high.”
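The threshold idea can be sketched in a few lines. This is a minimal, hypothetical example: the ages, incomes, and scores below are made up, and the model is a plain least-squares fit, not any particular credit-scoring method.

```python
import numpy as np

# Made-up training data: ages, incomes, and observed credit scores.
ages = np.array([23, 35, 47, 52, 61])
incomes = np.array([30_000, 55_000, 72_000, 90_000, 65_000])
scores = np.array([620, 680, 710, 760, 730])

# Fit ordinary least squares: score ~ intercept + age + income.
X = np.column_stack([np.ones_like(ages), ages, incomes])
coef, *_ = np.linalg.lstsq(X, scores, rcond=None)

# Predict a continuous score for a new person, then apply the threshold.
new_person = np.array([1, 40, 60_000])  # intercept term, age, income
predicted = new_person @ coef
label = "high" if predicted > 700 else "low"
```

With more cut points (say 600, 650, 700, 750) the same trick yields the finer-grained bins like “very low” through “very high.”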
In order to do it this way, with linear regression you'd have to establish
the bins as ranges of a continuous outcome. But not everything is on
a continuous scale like a credit score. For example, what if your labels
are “likely Democrat,” “likely Republican,” and “likely independent”?
What do you do now?
The intuition behind k-NN is to consider the most similar other items,
defined in terms of their attributes, look at their labels, and give the
unassigned item the majority vote. If there's a tie, you randomly select
among the labels that have tied for first.
So, for example, if you had a bunch of movies that were labeled
“thumbs up” or “thumbs down,” and you had a movie called “Data
Gone Wild” that hadn't been rated yet—you could look at its attributes:
length of movie, genre, number of sex scenes, number of Oscar-
winning actors in it, and budget. You could then find other movies
with similar attributes, look at their ratings, and then give “Data Gone
Wild” a rating without ever having to watch it.
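Here's what that vote looks like in code. This is a minimal sketch, not the chapter's canonical implementation: the movies, their attributes (length in minutes, number of Oscar-winning actors), and “Data Gone Wild”'s attributes are all made up, and similarity is plain Euclidean distance.

```python
import random
from collections import Counter

def distance(a, b):
    # Euclidean distance between attribute tuples (assumes comparable scales).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def majority_vote(point, labeled, k):
    # labeled: list of (attributes, label) pairs. Take the k most similar
    # items and give `point` the majority label; break ties at random.
    neighbors = sorted(labeled, key=lambda it: distance(it[0], point))[:k]
    votes = Counter(lab for _, lab in neighbors)
    top = max(votes.values())
    return random.choice([lab for lab, c in votes.items() if c == top])

# Made-up rated movies: (length in minutes, Oscar winners in cast), rating.
movies = [((90, 0), "thumbs down"), ((95, 1), "thumbs down"),
          ((120, 2), "thumbs up"), ((130, 3), "thumbs up"),
          ((125, 2), "thumbs up")]

# Rate "Data Gone Wild" (118 minutes, 2 Oscar winners) without watching it.
rating = majority_vote((118, 2), movies, k=3)  # → "thumbs up"
```

Note that attributes on wildly different scales (minutes versus a multimillion-dollar budget) should be normalized first, or the largest-scale attribute will dominate the distance.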
To automate it, two decisions must be made: first, how do you define
similarity or closeness? Once you define it, for a given unrated item,
you can say how similar all the labeled items are to it, and you can take
the most similar items and call them neighbors, who each have a “vote.”
This brings you to the second decision: how many neighbors should
you look at or “let vote”? This value is k, which ultimately you'll choose
as the data scientist, and we'll tell you how.
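One common way to make that choice, sketched below under assumed toy data, is to hold out some labeled items and pick the k that classifies them best. This is just an illustration of the idea; the clusters and candidate k values are made up.

```python
import random
from collections import Counter

def distance(a, b):
    # Euclidean distance between numeric attribute tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(point, train, k):
    # Majority vote among the k nearest labeled items; random tie-break.
    neighbors = sorted(train, key=lambda it: distance(it[0], point))[:k]
    votes = Counter(lab for _, lab in neighbors)
    top = max(votes.values())
    return random.choice([lab for lab, c in votes.items() if c == top])

def accuracy_for_k(train, validation, k):
    # Fraction of held-out items whose predicted label matches the true one.
    correct = sum(knn_predict(attrs, train, k) == lab
                  for attrs, lab in validation)
    return correct / len(validation)

# Toy labeled data (made up): two well-separated clusters.
train = [((0.0, 0.0), "low"), ((0.2, 0.1), "low"), ((0.1, 0.3), "low"),
         ((5.0, 5.0), "high"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")]
validation = [((0.1, 0.1), "low"), ((5.1, 5.0), "high")]

# Try several candidate values and keep the one with the best held-out accuracy.
best_k = max([1, 3, 5], key=lambda k: accuracy_for_k(train, validation, k))
```

Odd values of k are popular for two-class problems because they avoid ties in the vote.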