industry. These files can be challenging to use with R, as an uncompressed CSV file
of statistics from the year 2000 alone is about 500MB. On a 32-bit machine, memory
is quickly exhausted as data is loaded into R matrices for further analysis.
This is where the bigmemory package comes in. The bigmemory read.big.matrix
function enables R users to read very large datasets from an on-disk source into
big.matrix objects. When using the read.big.matrix function, non-numeric entries
are skipped and replaced with NA. Listing 11.3 compares the use of a big.matrix
object to that of a standard R matrix for loading a single year of airline on-time
data from a CSV file.
Listing 11.3 Using bigmemory to read large amounts of data
# Attempt to read CSV file of airline on-time data from 2000
> airline_data = read.csv("2000_airline.csv", sep=",")
*** error: can't allocate region
# Using the bigmemory big.matrix object
> airlinematrix <- read.big.matrix("2000_airline.csv",
type="integer", header=TRUE,
backingfile="2000_airline.bin",
descriptorfile="2000_airline.desc")
> summary(airlinematrix)
Length Class Mode
164808363 big.matrix S4
Another useful feature of the bigmemory package is the ability to place data objects
in shared memory. This means that multiple instances of R can use the same
bigmemory object, further reducing total system memory use when necessary.
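As a minimal sketch of this workflow (assuming the file-backed matrix and
descriptor file created in Listing 11.3), the describe and attach.big.matrix
functions let a second R session attach to the existing on-disk data rather
than loading its own copy:
# In the first R session, describe() produces a descriptor for sharing
> library(bigmemory)
> airlinedesc <- describe(airlinematrix)
# In a second R session, attach to the same backing file by its
# descriptor file; no additional copy of the data is loaded
> library(bigmemory)
> airlinematrix2 <- attach.big.matrix("2000_airline.desc")
> dim(airlinematrix2)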
ff: Working with Data Frames Larger than Memory
Data frames are the workhorse data structures of the R language and a very productive
way to work with tabular data using named rows and columns. An R data frame
is more than just a collection of addressable cells; R keeps track of a number of
properties internally using object metadata. This is one of the reasons the R interpreter
requires more memory than the size of the dataset for calculations, slicing, and other
operations. Even if you have many gigabytes of memory available, it is easy to find a
massive dataset that does not fit into system RAM. What happens when your dataset is
even larger than the available system memory?
The ff package attempts to overcome the memory limits of R by decoupling the
underlying data from the R interpreter. ff uses the system disk to store large data
objects. When running an R operation over this on-disk data, chunks of it are pulled
into memory for manipulation. In essence, ff tries to keep the same R interface and
data types that are used with smaller, memory-sized datasets. Like bigmemory, ff is
able to store data objects as on-disk images that can be read across R sessions.
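As a hedged sketch of how this looks in practice (the CSV file name is carried
over from Listing 11.3 and is an assumption), the read.csv.ffdf function reads a
CSV into an ffdf, a data-frame-like object whose columns are stored in on-disk
ff files, and ffsave writes that image out so a later session can restore it
with ffload:
# Read the CSV into an ffdf; column data live in on-disk ff files,
# so only small chunks occupy RAM at any one time
> library(ff)
> airline_ffdf <- read.csv.ffdf(file="2000_airline.csv", header=TRUE)
> dim(airline_ffdf)
# Persist the on-disk image so it can be reopened across R sessions
> ffsave(airline_ffdf, file="2000_airline_ff")
# In a later session: ffload("2000_airline_ff")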