Even when using a very powerful workstation, it can be difficult or impossible to fit data into the amount of memory that the system has allocated to R. To work with datasets that are larger than available system memory, we must either use a different strategy or take advantage of an existing R package to overcome these challenges. R installations on 32-bit machines are often limited to much less memory than the overall system has available. The most important first step in addressing memory limitations is to run R on a 64-bit system, which helps ensure that R can use as much of the system's memory as possible.
There are several types of problems that one can encounter when working with large datasets from R. One issue is that the amount of memory available to R may be sufficient to load a dataset, but additional operations can be prohibitively slow or impossible to accomplish. Another type of challenge occurs when the dataset of interest is far larger than the entire amount of memory available to the system.
Before exploring workarounds for these challenges, consider using a method that does not require such large amounts of data. It is often unnecessary to use an entire dataset to obtain a statistically meaningful result. Consider randomly sampling large datasets to scope down the amount of data needed for analysis; many R packages allow reading a subset of data from a database. Consider whether methods such as sampling and tests of significance can provide statistical insight without having to interact with the entire dataset.
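As a minimal sketch of this approach, the following pulls a random sample of rows from a database and runs a significance test on the sample. The DBI and RSQLite packages are one possible choice of database interface, and the database file, table, and column names here are hypothetical.

library(DBI)
library(RSQLite)

# Connect to a (hypothetical) SQLite database holding the full dataset
con <- dbConnect(RSQLite::SQLite(), "large_data.sqlite")

# Pull a random subset of rows rather than the whole table
sample_df <- dbGetQuery(
  con,
  "SELECT * FROM measurements ORDER BY RANDOM() LIMIT 10000"
)

# Run a significance test on the sample instead of the full data
t.test(sample_df$value)

dbDisconnect(con)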
When a dataset is smaller than available system memory, it is possible to use a package such as bigmemory to improve the way R uses the available RAM. The bigmemory package is an R interface to an underlying set of C++ functions that improve the use of available memory. bigmemory provides a new R data type called "big.matrix," which behaves much like a standard R matrix.
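A minimal sketch of creating and indexing a file-backed big.matrix follows; the dimensions and backing file names are illustrative.

library(bigmemory)

# Allocate a file-backed big.matrix so the data lives outside R's usual heap
x <- filebacked.big.matrix(
  nrow = 1e6, ncol = 3, type = "double",
  backingfile = "example.bin",
  descriptorfile = "example.desc"
)

# big.matrix objects support familiar matrix-style indexing
x[1, ] <- c(1.5, 2.5, 3.5)
x[1:5, ]

# Another R session can attach the same data via its descriptor file
y <- attach.big.matrix("example.desc")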
When data sizes are so large that they overwhelm the total amount of system memory, R developers should consider the ff package. ff supports very large datasets through disk-based storage that looks as much like native, in-memory R as possible. Like bigmemory, ff provides specialized data structures, such as the ffdf data frame. Data objects from the ff package can be stored on disk and even used across different R sessions.
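A minimal sketch of loading a CSV file into an ffdf data frame and saving it for a later session; the file names are hypothetical.

library(ff)

# read.csv.ffdf reads the file in chunks and stores columns on disk as ff vectors
big_df <- read.csv.ffdf(file = "big_file.csv", header = TRUE)

dim(big_df)        # dimensions are available without loading the data into RAM
big_df[1:5, ]      # only the requested rows are brought into memory

# Save the ffdf and its backing files so they can be reused in another session
ffsave(big_df, file = "big_df_store")
# ...later, in a new R session:
# ffload("big_df_store")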
Linear regression is a very common strategy for exploring whether numerical variables are related. For very large datasets, the biglm package allows developers to run regression analyses and build generalized linear models on datasets that are larger than the memory available to R.
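A minimal sketch of fitting a biglm model in chunks follows; the simulated data frames here stand in for pieces of a dataset too large to load at once.

library(biglm)

# Simulated chunks standing in for pieces of a much larger dataset
chunk1 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))
chunk2 <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

# Fit the model on the first chunk, then update it with further chunks;
# the full dataset never has to be in memory at the same time
fit <- biglm(y ~ x1 + x2, data = chunk1)
fit <- update(fit, chunk2)

summary(fit)

# bigglm() follows the same chunked pattern for generalized linear models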
When building data applications, it is common to require a number of different solutions to work in tandem. Hadoop is a widely used open-source framework for data-processing tasks. RHadoop is a collection of packages that provides access to Apache Hadoop's MapReduce functionality as well as to data stored in HDFS and HBase.
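A minimal sketch using the rmr2 package from the RHadoop collection is shown below. It assumes a working, configured Hadoop installation, and the job itself simply squares a vector of integers.

library(rmr2)

# Push a small numeric vector into HDFS-backed storage
ints <- to.dfs(1:1000)

# A MapReduce job whose map step emits (value, value^2) key-value pairs
result <- mapreduce(
  input = ints,
  map = function(k, v) keyval(v, v^2)
)

# Retrieve the results back into the local R session
out <- from.dfs(result)
head(values(out))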