k-NN rests on a few modeling assumptions:
• Data is in some feature space where a notion of “distance” makes
sense.
• Training data has been labeled or classified into two or more
classes.
• You pick the number of neighbors to use, k.
• You're assuming that the observed features and the labels are
somehow associated. They may not be, but ultimately your eval‐
uation metric will help you determine how good the algorithm is
at labeling. You might want to add more features and check how
that alters the evaluation metric. You'd then be tuning both which
features you were using and k; a minimal sketch of this tuning loop follows the list. But as always, you're in danger here
of overfitting.
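To make this concrete, here's a minimal sketch of k-NN classification using scikit-learn. The data, the values of k tried, and the train/test split are all made up for illustration; the point is just to show the tuning loop described above.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical labeled data: two features per observation (say, age and
# income) and a binary class label.
X = np.array([[25, 40_000], [35, 60_000], [45, 80_000], [20, 20_000],
              [50, 90_000], [30, 50_000], [40, 70_000], [22, 30_000]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Hold out some data so the evaluation metric is computed on unseen points.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale the features first so income doesn't dominate the distance
# computation, then try a few values of k and watch the metric respond.
for k in (1, 3, 5):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"k={k}: accuracy={acc:.2f}")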
Both linear regression and k-NN are examples of "supervised learning," where you've observed both x and y, and you want to know the function that maps x to y. Next up, we'll look at an algorithm you can use when you don't know what the right answer is.
k-means
So far we've only seen supervised learning, where we know beforehand what the label (aka the "right answer") is, and we're trying to get our model to be as accurate as possible, as measured by our chosen evaluation metric. k-means is the first unsupervised learning technique we'll look into, where the goal of the algorithm is to define the right answer by finding clusters in the data for you.
Let's say you have some kind of data at the user level, e.g., Google+
data, survey data, medical data, or SAT scores.
Start by adding structure to your data. Namely, assume each row of
your dataset corresponds to a user as follows:
age | gender | income | state | household size
Your goal is to segment the users. This process is known by various names: besides being called segmenting, you could say that you're going to stratify, group, or cluster the data. They all mean finding similar types of users and bunching them together.
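As a rough sketch of what clustering these users might look like, here's k-means run on a made-up numeric version of the table above. The data, the choice of two clusters, and the scikit-learn tooling are assumptions for illustration, not part of the original example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical users: age, income, household size. Categorical columns
# like gender and state would need to be encoded before clustering.
users = np.array([
    [23, 35_000, 1],
    [31, 62_000, 2],
    [47, 94_000, 4],
    [52, 88_000, 3],
    [26, 41_000, 1],
    [38, 71_000, 3],
])

# Standardize so income doesn't swamp the distance computation.
X = StandardScaler().fit_transform(users)

# Ask for two segments; in practice you'd tune the number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster assignment (0 or 1) for each user

Note that no labels went in: the clusters that come out are the segments, which is exactly the "right answer" the algorithm is defining for you.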
Why would you want to do this? Here are a few examples: