industry. These files can be challenging to use with R, as an uncompressed CSV file
of statistics from the year 2000 alone is about 500MB. On a 32-bit machine, memory
is quickly exhausted as data is loaded into R matrices for further analysis.
This is where the bigmemory package comes in. The bigmemory read.big.matrix
function enables R users to read very large datasets from an on-disk source into
big.matrix objects. When using the read.big.matrix function, non-numeric entries
are skipped and replaced with NA. Listing 11.3 compares the use of a big.matrix
object to that of a standard R matrix for loading a single year of airline on-time
data from a CSV file.
Listing 11.3 Using bigmemory to read large amounts of data
# Attempt to read CSV file of airline on-time data from 2000
> airline_data = read.csv("2000_airline.csv", sep=",")
*** error: can't allocate region
# Using the bigmemory big.matrix object
> airlinematrix <- read.big.matrix("2000_airline.csv",
type="integer", header=TRUE,
backingfile="2000_airline.bin",
descriptorfile="2000_airline.desc")
> summary(airlinematrix)
Length Class Mode
164808363 big.matrix S4
Another useful feature of the bigmemory package is the ability to place data objects
in shared memory. This means that multiple instances of R can use the same
bigmemory object, further reducing total system memory use when necessary.
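As a minimal sketch of this workflow (assuming the file-backed matrix and
descriptor file created in Listing 11.3), the describe and attach.big.matrix
functions let a second R session attach to the existing on-disk data rather
than loading its own copy:
# In the first R session, describe() produces a descriptor for sharing
> library(bigmemory)
> airlinedesc <- describe(airlinematrix)
# In a second R session, attach to the same backing file by its
# descriptor file; no additional copy of the data is loaded
> library(bigmemory)
> airlinematrix2 <- attach.big.matrix("2000_airline.desc")
> dim(airlinematrix2)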
ff: Working with Data Frames Larger than Memory
Data frames are the workhorse data structures of the R language and a very productive
way to work with tabular data using named rows and columns. An R data frame
is more than just a collection of addressable cells; R keeps track of a number of
properties internally using object metadata. This is one of the reasons the R interpreter
requires more memory than the size of the dataset for calculations, slicing, and other
operations. Even if you have many gigabytes of memory available, it is easy to find a
massive dataset that does not fit into system RAM. What happens when your dataset is
even larger than the available system memory?
The ff package attempts to overcome the memory limits of R by decoupling the
underlying data from the R interpreter. ff uses the system disk to store large data
objects. When running an R operation over this on-disk data, chunks of it are pulled
into memory for manipulation. In essence, ff tries to keep the same R interface and
data types that are used with smaller, memory-sized datasets. Like bigmemory, ff is
able to store data objects as on-disk images that can be read across R sessions.
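As a hedged sketch of how this looks in practice (the CSV file name is carried
over from Listing 11.3 and is an assumption), the read.csv.ffdf function reads a
CSV into an ffdf, a data-frame-like object whose columns are stored in on-disk
ff files, and ffsave writes that image out so a later session can restore it
with ffload:
# Read the CSV into an ffdf; column data live in on-disk ff files,
# so only small chunks occupy RAM at any one time
> library(ff)
> airline_ffdf <- read.csv.ffdf(file="2000_airline.csv", header=TRUE)
> dim(airline_ffdf)
# Persist the on-disk image so it can be reopened across R sessions
> ffsave(airline_ffdf, file="2000_airline_ff")
# In a later session: ffload("2000_airline_ff")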