Even when using a very powerful workstation, it can be difficult or impossible to fit data into the amount of memory that the system has allocated to R. To work with datasets that are larger than available system memory, we must either use a different strategy or take advantage of an existing R package to overcome these challenges. R installations on 32-bit machines are often limited to much less memory than the overall system has available. The most important first step in addressing memory limitations is to run R on a 64-bit system, which helps ensure that R can use as much of the system's memory as possible.
There are several types of problems that one can encounter when working with large datasets from R. One issue is that the amount of memory available to R may be sufficient to load a dataset, but additional operations can be prohibitively slow or impossible to accomplish. Another type of challenge occurs when the dataset of interest is far larger than the entire amount of memory available to the system.
Before exploring workarounds for these challenges, consider using a method that does not require such large amounts of data. It is often unnecessary to use an entire dataset to obtain a statistically meaningful result. Consider randomly sampling large datasets to scope down the amount of data needed for analysis; many R packages allow reading a subset of data from a database. Consider whether methods such as sampling and tests of significance can provide statistical insight without having to interact with the entire dataset.
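As a minimal sketch of this approach, the following pulls a random sample of rows from a database and runs a significance test on the sample. The DBI and RSQLite packages are one possible choice of database interface, and the database file, table, and column names here are hypothetical.

library(DBI)
library(RSQLite)

# Connect to a (hypothetical) SQLite database holding the full dataset
con <- dbConnect(RSQLite::SQLite(), "large_data.sqlite")

# Pull a random subset of rows rather than the whole table
sample_df <- dbGetQuery(
  con,
  "SELECT * FROM measurements ORDER BY RANDOM() LIMIT 10000"
)

# Run a significance test on the sample instead of the full data
t.test(sample_df$value)

dbDisconnect(con)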
When a dataset is smaller than available system memory, it is possible to use a package such as bigmemory to improve the way R uses the available RAM. The bigmemory package is an R interface to an underlying set of C++ functions that improve the use of available memory. bigmemory provides a new R data type called "big.matrix," which behaves much like a standard R matrix.
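A minimal sketch of creating and indexing a file-backed big.matrix follows; the dimensions and backing file names are illustrative.

library(bigmemory)

# Allocate a file-backed big.matrix so the data lives outside R's usual heap
x <- filebacked.big.matrix(
  nrow = 1e6, ncol = 3, type = "double",
  backingfile = "example.bin",
  descriptorfile = "example.desc"
)

# big.matrix objects support familiar matrix-style indexing
x[1, ] <- c(1.5, 2.5, 3.5)
x[1:5, ]

# Another R session can attach the same data via its descriptor file
y <- attach.big.matrix("example.desc")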
When data sizes are so large that they overwhelm the total amount of system memory, R developers should consider the ff package. ff supports very large datasets through disk-based storage that looks as much like native, in-memory R as possible. Like bigmemory, ff provides specialized data structures, such as the ffdf data frame. Data objects from the ff package can be stored on disk and even used across different R sessions.
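A minimal sketch of loading a CSV file into an ffdf data frame and saving it for a later session; the file names are hypothetical.

library(ff)

# read.csv.ffdf reads the file in chunks and stores columns on disk as ff vectors
big_df <- read.csv.ffdf(file = "big_file.csv", header = TRUE)

dim(big_df)        # dimensions are available without loading the data into RAM
big_df[1:5, ]      # only the requested rows are brought into memory

# Save the ffdf and its backing files so they can be reused in another session
ffsave(big_df, file = "big_df_store")
# ...later, in a new R session:
# ffload("big_df_store")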
Linear regression is a very common strategy for exploring whether numerical variables are related. For very large datasets, the biglm package allows developers to run regression analyses and build generalized linear models on datasets that are larger than the memory available to R.
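A minimal sketch of fitting a biglm model in chunks follows; the simulated data frames here stand in for pieces of a dataset too large to load at once.

library(biglm)

# Simulated chunks standing in for pieces of a much larger dataset
chunk1 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
chunk2 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

# Fit the model on the first chunk, then update it with further chunks;
# the full dataset never has to be in memory at the same time
fit <- biglm(y ~ x1 + x2, data = chunk1)
fit <- update(fit, chunk2)

summary(fit)

# bigglm() follows the same chunked pattern for generalized linear models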
When building data applications, it is common to require a number of different solutions to work in tandem. Hadoop is a widely used open-source framework for data-processing tasks. RHadoop is a collection of packages that provides access to Apache Hadoop's MapReduce functionality as well as to data stored in HDFS and HBase.
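A minimal sketch using the rmr2 package from the RHadoop collection is shown below. It assumes a working, configured Hadoop installation, and the job itself simply squares a vector of integers.

library(rmr2)

# Push a small numeric vector into HDFS-backed storage
ints <- to.dfs(1:1000)

# A MapReduce job whose map step emits (value, value^2) key-value pairs
result <- mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)

# Retrieve the results back into the local R session
out <- from.dfs(result)
head(values(out))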