k-NN rests on a few modeling assumptions:
• Data is in some feature space where a notion of “distance” makes
sense.
• Training data has been labeled or classified into two or more
classes.
• You pick the number of neighbors to use, k.
• You're assuming that the observed features and the labels are
somehow associated. They may not be, but ultimately your eval‐
uation metric will help you determine how good the algorithm is
at labeling. You might want to add more features and check how
that alters the evaluation metric. You'd then be tuning both which
features you were using and k; a minimal sketch of this tuning loop follows the list. But as always, you're in danger here
of overfitting.
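To make this concrete, here's a minimal sketch of k-NN classification using scikit-learn. The data, the values of k tried, and the train/test split are all made up for illustration; the point is just to show the tuning loop described above.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical labeled data: two features per observation (say, age and
# income) and a binary class label.
X = np.array([[25, 40_000], [35, 60_000], [45, 80_000], [20, 20_000],
              [50, 90_000], [30, 50_000], [40, 70_000], [22, 30_000]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Hold out some data so the evaluation metric is computed on unseen points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale the features first so income doesn't dominate the distance
# computation, then try a few values of k and watch the metric respond.
for k in (1, 3, 5):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"k={k}: accuracy={acc:.2f}")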
Both linear regression and k-NN are examples of "supervised learning," where you've observed both x and y, and you want to know the function that maps x to y. Next up, we'll look at an algorithm you can use when you don't know what the right answer is.
k-means
So far we've only seen supervised learning, where we know beforehand what the label (aka the "right answer") is, and we're trying to get our model to be as accurate as possible, as measured by our chosen evaluation metric. k-means is the first unsupervised learning technique we'll look into, where the goal of the algorithm is to define the right answer by finding clusters in the data for you.
Let's say you have some kind of data at the user level, e.g., Google+
data, survey data, medical data, or SAT scores.
Start by adding structure to your data. Namely, assume each row of
your dataset corresponds to a user as follows:
age | gender | income | state | household size
Your goal is to segment the users. This process is known by various names: besides being called segmenting, you could say that you're going to stratify, group, or cluster the data. They all mean finding similar types of users and bunching them together.
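As a rough sketch of what clustering these users might look like, here's k-means run on a made-up numeric version of the table above. The data, the choice of two clusters, and the scikit-learn tooling are assumptions for illustration, not part of the original example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical users: age, income, household size. Categorical columns
# like gender and state would need to be encoded before clustering.
users = np.array([
    [23, 35_000, 1],
    [31, 62_000, 2],
    [47, 94_000, 4],
    [52, 88_000, 3],
    [26, 41_000, 1],
    [38, 71_000, 3],
])

# Standardize so income doesn't swamp the distance computation.
X = StandardScaler().fit_transform(users)

# Ask for two segments; in practice you'd tune the number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster assignment (0 or 1) for each user

Note that no labels went in: the clusters that come out are the segments, which is exactly the "right answer" the algorithm is defining for you.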
Why would you want to do this? Here are a few examples: