The ff package supports the standard R atomic data types, such as integer and double.
It also provides a few package-specific data types, such as single bits, bytes, and
“nibbles” (four bits of information). These compact atomic types offer a space
and performance advantage when working with some kinds of data, such as genomics
information. Listing 11.4 uses ff to find the maximum
and average arrival delay across all U.S. flights in 2000.
Listing 11.4 Using ff to create data frames from large datasets
# Create an "ff" data frame
> library(ff)
> airline_dataframe <- read.csv.ffdf(file="2000_airline.csv",
                                     header=TRUE)
# List the number of records in the dataset
> dim(airline_dataframe)
[1] 5683047 29
# Find the max and mean arrival delay for all U.S. flights in 2000,
# ignoring any NA values
> max(airline_dataframe$ArrDelay, na.rm=TRUE)
[1] 1441
> mean(airline_dataframe$ArrDelay, na.rm=TRUE)
[1] 10.28266
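The package-specific data types mentioned earlier can be tried out on a small vector. The following is a minimal sketch, not part of the original listing. It assumes the ff package is installed, and the coding of DNA bases as the integers 0 to 3 is a hypothetical example chosen to suit the two-bit "quad" mode.

```r
# Minimal sketch: assumes the ff package is installed.
library(ff)

# Hypothetical data: one million DNA bases coded as 0-3 (A, C, G, T).
# This coding is an assumption made for illustration only.
set.seed(42)
bases <- sample(0:3, 1e6, replace = TRUE)

# Store the same values twice: once in the 2-bit "quad" mode and
# once as ordinary 4-byte integers.
compact <- ff(bases, vmode = "quad")
regular <- ff(bases, vmode = "integer")

# Report the storage mode of each ff vector
vmode(compact)
vmode(regular)

# Bytes used per element on disk for a few storage modes,
# taken from the lookup vector that ff exports
.ffbytes[c("boolean", "quad", "nibble", "byte", "integer", "double")]

# Values read back from the compact vector match the originals
all(compact[] == bases)
```

A "nibble" vector is created the same way with vmode = "nibble" and holds values from 0 to 15. The narrower the mode, the smaller the backing file, which is where the space advantage comes from.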
biglm: Linear Regression for Large Datasets
A common data challenge is to understand how variables are related to one another.
Sometimes this type of analysis is used to create models that can help predict unknown
variables under given conditions. Are sales of a product possibly related in some way to
weather, season, or other factors? Can the ratings of a movie help predict what similar
movies viewers might want to watch?
Regression analysis is a common technique for determining how variables might
be related to each other. The term regression is linked to the work of Sir Francis Galton,
who is credited with originating the modern statistical concept of correlation. In
a drearily named publication called “Regression towards mediocrity in hereditary
stature,”³ Galton noted that the children of unusually tall parents tended to be closer
to average height than their parents were. Galton's observation of this tendency,
commonly known as regression to the mean, has been a cornerstone of statistical analysis ever since.
When comparing values from two independent variables (the relationship between
measurements of height and weight is a commonly cited example), you can use a
3. Galton, Francis. “Regression towards mediocrity in hereditary stature.” The Journal of the
Anthropological Institute of Great Britain and Ireland 15 (1886): 246-263.
 
 