y <- true_beta_0 + true_beta_1 * x_1 + true_error
hist(y)  # plot p(y)
plot(x_1, y, pch = 20, col = "red")  # plot p(x,y)
1. Build a regression model and see that it recovers the true values
of the βs.
2. Simulate another fake variable x_2 that has a Gamma distribution
with parameters you pick. Now make the truth be that y is a linear
combination of both x_1 and x_2. Fit a model that only depends on
x_1. Fit a model that only depends on x_2. Fit a model that uses both.
Vary the sample size and make a plot of mean square error of the
training set and of the test set versus sample size.
3. Create a new variable, z, that is equal to x_2. Include this as one of
the predictors in your model. See what happens when you fit a
model that depends on x_1 only and then also on z. Vary the sample
size and make a plot of mean square error of the training set and
of the test set versus sample size.
4. Play around more by (a) changing parameter values (the true βs),
(b) changing the distribution of the true error, and (c) including
more predictors in the model with other kinds of probability
distributions. (rnorm() means randomly generate values from a
normal distribution. rbinom() does the same for binomial. So look
up these functions online and try to find more.)
5. Create scatterplots of all pairs of variables and histograms of single
variables.
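As a starting point for exercises 1 and 2, here is a minimal sketch in R. The parameter values, the seed, the sample sizes, and the helper names simulate() and mse() are all assumptions chosen for illustration; none of them come from the text. It simulates x_1 and a Gamma-distributed x_2, checks that lm() recovers the true βs, and then compares training and test mean square error for the three models across sample sizes.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

# Assumed "true" parameters; pick your own.
true_beta_0 <- 1.0
true_beta_1 <- 2.0
true_beta_2 <- -0.5

# Hypothetical helper: simulate n observations where y is a linear
# combination of x_1 (normal) and x_2 (Gamma) plus normal error.
simulate <- function(n) {
  x_1 <- rnorm(n, mean = 5, sd = 2)
  x_2 <- rgamma(n, shape = 2, rate = 1)
  true_error <- rnorm(n, mean = 0, sd = 1)
  y <- true_beta_0 + true_beta_1 * x_1 + true_beta_2 * x_2 + true_error
  data.frame(x_1 = x_1, x_2 = x_2, y = y)
}

# Hypothetical helper: mean square error of a fitted model on a data set.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Check that the regression recovers the true betas on a large sample.
fit_check <- lm(y ~ x_1 + x_2, data = simulate(5000))
print(coef(fit_check))

# Fit the three models at several sample sizes; record train and test MSE.
formulas <- list(x1_only = y ~ x_1, x2_only = y ~ x_2, both = y ~ x_1 + x_2)
sample_sizes <- c(20, 50, 100, 500, 1000)
results <- do.call(rbind, lapply(sample_sizes, function(n) {
  train <- simulate(n)
  test <- simulate(n)  # fresh draw from the same truth
  do.call(rbind, lapply(names(formulas), function(name) {
    fit <- lm(formulas[[name]], data = train)
    data.frame(n = n, model = name,
               train_mse = mse(fit, train), test_mse = mse(fit, test))
  }))
}))
print(results)

# Plot train and test MSE versus sample size for the model using both.
both <- results[results$model == "both", ]
plot(both$n, both$test_mse, type = "b", log = "x", pch = 20, col = "red",
     xlab = "sample size", ylab = "mean square error",
     ylim = range(c(both$train_mse, both$test_mse)))
lines(both$n, both$train_mse, type = "b", pch = 20, col = "blue")
legend("topright", legend = c("test", "train"),
       col = c("red", "blue"), pch = 20, lty = 1)
```

The same loop can be reused for the other exercises by changing the formulas list or the body of simulate(), for example swapping rnorm() for another error distribution.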
k-Nearest Neighbors (k-NN)
k-NN is an algorithm that can be used when you have a bunch of
objects that have been classified or labeled in some way, and other
similar objects that haven't gotten classified or labeled yet, and you
want a way to automatically label them.
The objects could be data scientists who have been classified as “sexy”
or “not sexy”; or people who have been labeled as “high credit” or “low
credit”; or restaurants that have been labeled “five star,” “four star,”
“three star,” “two star,” “one star,” or if they really suck, “zero stars.”
More seriously, it could be patients who have been classified as “high
cancer risk” or “low cancer risk.”
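To make the labeling idea concrete, here is a rough sketch in R of one common way to do it: label a new object by majority vote among its k closest labeled objects, using Euclidean distance. The function name knn_predict(), the age and income features, and the toy data are all made up for illustration; the distance measure and the value of k are assumptions you would need to choose for your own problem.

```r
# Hypothetical helper: label one new point by majority vote among its
# k nearest labeled neighbors, using Euclidean distance.
knn_predict <- function(train_x, train_labels, new_point, k = 3) {
  # Subtract new_point from every row, then compute row-wise distances.
  diffs <- sweep(as.matrix(train_x), 2, new_point)
  distances <- sqrt(rowSums(diffs^2))
  nearest <- order(distances)[seq_len(k)]   # indices of the k closest rows
  votes <- table(train_labels[nearest])     # count labels among neighbors
  names(votes)[which.max(votes)]            # most common label wins
}

# Made-up toy data: age (years) and income (thousands) with credit labels.
train_x <- data.frame(age    = c(25, 30, 35, 50, 55, 60),
                      income = c(20, 25, 30, 80, 90, 85))
train_labels <- c("low", "low", "low", "high", "high", "high")

# Label two people who have not been classified yet.
label_a <- knn_predict(train_x, train_labels, new_point = c(52, 82), k = 3)
label_b <- knn_predict(train_x, train_labels, new_point = c(28, 22), k = 3)
print(c(label_a, label_b))
```

In this sketch the features sit on roughly comparable scales; with real data you would probably want to rescale them first so that no single feature dominates the distance.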