Algorithms - Doing Data Science

y <- true_beta_0 + true_beta_1 * x_1 + true_error

hist(y)  # plot p(y)

plot(x_1, y, pch = 20, col = "red")  # plot p(x,y)

1. Build a regression model and see that it recovers the true values of the βs.

2. Simulate another fake variable x_2 that has a Gamma distribution with parameters you pick. Now make the truth be that y is a linear combination of both x_1 and x_2. Fit a model that only depends on x_1. Fit a model that only depends on x_2. Fit a model that uses both. Vary the sample size and make a plot of mean square error of the training set and of the test set versus sample size.

3. Create a new variable, z, that is equal to x_2. Include this as one of the predictors in your model. See what happens when you fit a model that depends on x_1 only and then also on z. Vary the sample size and make a plot of mean square error of the training set and of the test set versus sample size.

4. Play around more by (a) changing parameter values (the true βs), (b) changing the distribution of the true error, and (c) including more predictors in the model with other kinds of probability distributions. (rnorm() randomly generates values from a normal distribution, and rbinom() does the same for a binomial. Look up these functions online and try to find more.)

5. Create scatterplots of all pairs of variables and histograms of single variables.
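
One way to approach exercises 1 and 2 is sketched below. The definitions of x_1, the true βs, and the true error are not shown in this excerpt, so every distribution and parameter value here (a normal x_1, a Gamma x_2 with shape 2 and rate 1, coefficients 2, 3, and -1.5, error standard deviation 2) is an illustrative assumption, and simulate_data() and mse_for_n() are hypothetical helper names, not functions from the text.

```r
set.seed(42)  # make the simulation reproducible

# Assumed "truth" (illustrative values, not taken from the text)
true_beta_0 <- 2
true_beta_1 <- 3
true_beta_2 <- -1.5

# Hypothetical helper: simulate n observations from the assumed truth
simulate_data <- function(n) {
  x_1 <- rnorm(n, mean = 5, sd = 2)
  x_2 <- rgamma(n, shape = 2, rate = 1)
  true_error <- rnorm(n, mean = 0, sd = 2)
  y <- true_beta_0 + true_beta_1 * x_1 + true_beta_2 * x_2 + true_error
  data.frame(x_1 = x_1, x_2 = x_2, y = y)
}

# Hypothetical helper: fit a model on n training points and return the
# mean squared error on the training set and on a fresh test set
mse_for_n <- function(n, model_formula, n_test = 1000) {
  train <- simulate_data(n)
  test <- simulate_data(n_test)
  fit <- lm(model_formula, data = train)
  c(n = n,
    train = mean((train$y - predict(fit, newdata = train))^2),
    test = mean((test$y - predict(fit, newdata = test))^2))
}

# Exercise 1: with plenty of data, the fit should land near the true betas
big <- simulate_data(10000)
full_fit <- lm(y ~ x_1 + x_2, data = big)
print(coef(full_fit))

# Exercise 2: training and test MSE versus sample size, for two models
sample_sizes <- c(10, 20, 50, 100, 200, 500, 1000)
mse_x1 <- t(sapply(sample_sizes, mse_for_n, model_formula = y ~ x_1))
mse_both <- t(sapply(sample_sizes, mse_for_n, model_formula = y ~ x_1 + x_2))

plot(mse_x1[, "n"], mse_x1[, "test"], type = "b", log = "x", col = "red",
     ylim = range(mse_x1[, c("train", "test")], mse_both[, c("train", "test")]),
     xlab = "sample size", ylab = "mean squared error")
lines(mse_x1[, "n"], mse_x1[, "train"], type = "b", col = "red", lty = 2)
lines(mse_both[, "n"], mse_both[, "test"], type = "b", col = "blue")
lines(mse_both[, "n"], mse_both[, "train"], type = "b", col = "blue", lty = 2)
legend("topright", bty = "n",
       legend = c("x_1 only, test", "x_1 only, train",
                  "x_1 and x_2, test", "x_1 and x_2, train"),
       col = c("red", "red", "blue", "blue"), lty = c(1, 2, 1, 2))
```

Each sample size is simulated only once here, so the curves will be noisy at small n. Averaging over several repetitions per sample size would smooth them. The model that depends only on x_2 can be added the same way by passing y ~ x_2 to mse_for_n().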

k-Nearest Neighbors (k-NN)

k-NN is an algorithm that can be used when you have a bunch of objects that have been classified or labeled in some way, and other similar objects that haven't been classified or labeled yet, and you want a way to label them automatically.

The objects could be data scientists who have been classified as “sexy” or “not sexy”; or people who have been labeled as “high credit” or “low credit”; or restaurants that have been labeled “five star,” “four star,” “three star,” “two star,” “one star,” or if they really suck, “zero stars.”

More seriously, it could be patients who have been classified as “high cancer risk” or “low cancer risk.”
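
To make the labeling idea concrete, here is a minimal base-R sketch. It assumes that "similar" means close in Euclidean distance and that a new object takes the majority label of its k nearest labeled neighbors. Neither choice is specified in the text above. The function knn_predict() is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: label one new point by majority vote among its
# k nearest labeled points (Euclidean distance is an assumption here)
knn_predict <- function(train_x, train_labels, new_point, k = 3) {
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from every row
  distances <- sqrt(rowSums(diffs^2))     # Euclidean distance to each row
  nearest <- order(distances)[1:k]        # indices of the k closest rows
  votes <- table(train_labels[nearest])   # count labels among the neighbors
  names(votes)[which.max(votes)]          # most common label
}

# Made-up labeled objects, each described by two numeric features
train_x <- rbind(c(1.0, 1.2), c(0.8, 0.9), c(1.1, 0.7),
                 c(3.0, 3.2), c(3.1, 2.8), c(2.9, 3.3))
train_labels <- c("low", "low", "low", "high", "high", "high")

# Label two new, unlabeled objects
knn_predict(train_x, train_labels, c(1.0, 1.0), k = 3)  # "low"
knn_predict(train_x, train_labels, c(3.0, 3.0), k = 3)  # "high"
```

With an even k or more than two labels, votes can tie. This sketch then returns whichever tied label comes first alphabetically, which is an arbitrary choice.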
