technique called simple linear regression. A common method for finding a regression line
is called the least-squares method. Imagine comparing two sets of paired values, one of
which we call x and the other y. We can plot all of these values on a graph, with the
x values on the horizontal axis and the y values on the vertical axis. The least-squares
method produces the regression line that minimizes the sum of the squares of the vertical
distances from each data point to the line. This type of regression line is fairly easy
to calculate. Even better, R provides functions that take care of this and other
regression calculations.
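As a minimal sketch of the least-squares idea, the snippet below fits a line to a handful of made-up (x, y) pairs in two ways: with the textbook closed-form formulas for the slope and intercept, and with R's built-in lm() function. The data values and variable names are illustrative assumptions, not taken from the airline dataset.

```r
# Small made-up dataset (illustrative values only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for the line y = a + b * x
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

# The same fit using R's built-in linear model function
fit <- lm(y ~ x)

# Sum of squared vertical distances from each point to the fitted line;
# this is the quantity that the least-squares method minimizes
sse <- sum((y - (a + b * x))^2)

print(c(intercept = a, slope = b))
print(coef(fit))
```

Both approaches should report the same intercept and slope; nudging either coefficient away from these values increases the sum of squared distances.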
A note of caution: Linear regression is conceptually easy to understand, and there
are plenty of software packages that will provide regression analysis even when
it is not appropriate. For this reason, it's easy to misuse this technique and to
misinterpret the results. There are plenty of cases in which linear regression is not
a good fit for determining relationships between variables. First of all, linear
regression assumes that the scatterplot of the two variables is roughly linear in shape.
The usual significance tests further assume that the residuals (the vertical distances
between the data points and the line) are approximately normally distributed, with few
outliers. Another assumption is that the residuals have a roughly uniform variance
across the range of x and that the data are relatively free of spurious values (such as
those derived from erroneous measurements). In situations in which any of these
assumptions are not met, linear regression may not be a valid way to approach the problem.
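One common way to check assumptions like these is to look at the residuals of a fitted model. The sketch below uses simulated data (the coefficients, sample size, and noise level are arbitrary assumptions) and a few base R diagnostics. It is a starting point for inspection rather than a complete validation procedure.

```r
# Simulated data with a genuinely linear relationship (assumed values)
set.seed(42)
x <- runif(200, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(200, sd = 1)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs. fitted values: a patternless band around zero suggests
# that the relationship is linear and the variance is roughly uniform
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points close to the line suggest that the residuals
# are approximately normally distributed
qqnorm(res)
qqline(res)

# Shapiro-Wilk test of normality on the residuals; a very small
# p-value would be evidence against normality
normality <- shapiro.test(res)
print(normality$p.value)
```

A curved or funnel-shaped pattern in the first plot, or points that stray far from the line in the second, would be a signal to reconsider whether a simple linear model is appropriate.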
Caveats and cautions aside, what happens when you have a legitimate linear regression
problem and the data won't fit into available system memory? This is where the biglm
package comes in. The biglm package enables linear regression analysis over very large
datasets. The biganalytics package, the sibling of bigmemory, contains a useful wrapper
function that helps to build linear regression lines from big.matrix data. Listing 11.5
demonstrates this using the airline on-time data seen in the bigmemory example.
Listing 11.5 Using biglm with big.matrix objects
library(bigmemory)
library(biganalytics)
library(biglm)
# Load our airline on-time data into a file-backed big.matrix
airlinematrix <- read.big.matrix("2000_airline.csv",
                                 type = "integer", header = TRUE,
                                 backingfile = "2000_airline.bin",
                                 descriptorfile = "2000_airline.desc")
# Use the biglm.big.matrix wrapper function to fit a regression
# line that models ArrDelay as a function of DepDelay
delay_lm <- biglm.big.matrix(ArrDelay ~ DepDelay, data = airlinematrix)
# Print the fitted coefficients and related statistics
summary(delay_lm)
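To see why a regression can be fit without holding all of the data in memory at once, the following base R sketch processes simulated delay data in chunks and accumulates only two small summary matrices. Note that this is a simplified illustration using the normal equations; biglm itself is based on an incremental QR decomposition, which is numerically more robust. The data, chunk size, and variable names here are assumptions made for the example.

```r
# Simulated stand-in for the airline delay columns (made-up values)
set.seed(1)
n <- 10000
dep_delay <- rnorm(n, mean = 10, sd = 30)
arr_delay <- 0.9 * dep_delay - 2 + rnorm(n, sd = 10)
delays <- data.frame(DepDelay = dep_delay, ArrDelay = arr_delay)

chunk_size <- 1000          # rows processed at a time (arbitrary choice)
xtx <- matrix(0, 2, 2)      # running total of X'X
xty <- matrix(0, 2, 1)      # running total of X'y

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  X <- cbind(1, delays$DepDelay[rows])   # intercept column plus predictor
  y <- delays$ArrDelay[rows]
  xtx <- xtx + crossprod(X)              # add this chunk's X'X
  xty <- xty + crossprod(X, y)           # add this chunk's X'y
}

# Solve the normal equations (X'X) beta = X'y for intercept and slope
chunked_coef <- drop(solve(xtx, xty))

# Compare with an ordinary in-memory fit over the full data
full_coef <- unname(coef(lm(ArrDelay ~ DepDelay, data = delays)))
print(chunked_coef)
print(full_coef)
```

Only the 2-by-2 and 2-by-1 running totals persist between chunks, so memory use depends on the number of variables rather than the number of rows. The biglm package offers the same pattern through its biglm() function and the corresponding update() method for adding further chunks of data.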
 