If we type in summary(model), where model is the name we gave to this fitted model, the output would be:
summary(model)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-121.17  -52.63   -9.72   41.54  356.27

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -32.083     16.623   -1.93   0.0565 .
x             45.918      2.141   21.45   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 77.47 on 98 degrees of freedom
Multiple R-squared:  0.8244,  Adjusted R-squared:  0.8226
F-statistic:   460 on 1 and 98 DF,  p-value: < 2.2e-16
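For context, output like the above comes from fitting a simple linear model with lm; a minimal sketch, assuming x and y are the predictor and response vectors defined earlier:

model <- lm(y ~ x)   # fit the simple linear regression
summary(model)       # prints the table shown above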
R-squared
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

Here \hat{y}_i is the model's prediction for the ith observation and \bar{y} is the mean of the observed values. This can be interpreted as the proportion of variance explained by our model: the sum of squared residuals is in there getting divided by the total sum of squares, which gives the proportion of variance unexplained by our model, and we calculate 1 minus that.
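As a sanity check, this quantity can be computed by hand in R and compared to the value reported by summary(model); a small sketch, assuming model is the fitted lm object and y is the observed response:

y_hat <- fitted(model)        # the model's predicted values
rss   <- sum((y - y_hat)^2)   # residual sum of squares
tss   <- sum((y - mean(y))^2) # total sum of squares
1 - rss / tss                 # matches summary(model)$r.squared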
p-values
Looking at the output, the estimated βs are in the column marked Estimate. To see the p-values, look at the column marked Pr(>|t|). We can interpret the values in this column as follows: we make the null hypothesis that a given β is zero. The p-value for that β is then the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one we actually observed. This means that if we have a low p-value, such a test statistic would be very unlikely under the null hypothesis, and the coefficient is highly likely to be nonzero and therefore significant.
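If we want these numbers programmatically rather than reading them off the printed output, the coefficient table can be pulled out of the summary object; a small sketch, again assuming model is the fitted lm object:

coefs <- summary(model)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "Estimate"]                   # the estimated betas
coefs[, "Pr(>|t|)"]                   # the p-value for each beta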
Cross-validation
Another approach to evaluating the model is as follows. Divide the data into a training set and a test set: 80% in the training set and 20% in the test set. Fit the model on the training set, then look at the mean squared error on the test set and compare it to the mean squared error on the training set; a test error much larger than the training error suggests overfitting. Make this comparison across sample sizes as well.
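A minimal sketch of this train/test comparison in R, assuming the predictor and response are stored in vectors x and y (the names and the 80/20 split are just for illustration):

set.seed(1)                               # make the split reproducible
n   <- length(y)
idx <- sample(1:n, size = floor(0.8 * n)) # indices of the 80% training rows

train <- data.frame(x = x[idx],  y = y[idx])
test  <- data.frame(x = x[-idx], y = y[-idx])

fit <- lm(y ~ x, data = train)            # fit only on the training set

mse_train <- mean((train$y - predict(fit, newdata = train))^2)
mse_test  <- mean((test$y  - predict(fit, newdata = test))^2)
c(train = mse_train, test = mse_test)     # compare the two errors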