Limitations of R for Large Datasets
Modern workstations and laptops are incredibly powerful. The average laptop, with its
multicore design and capable processor, puts models just a few years older to shame.
Of course, everyone is familiar with the mythos of Moore's law: The overall density
of transistors on processors tends to double about every two years. The Internet is full
of apples-to-oranges comparisons between archaic computing systems and modern
workstations. The Apollo Guidance Computer, one of the first computers built using
integrated circuits, was responsible for helping humans travel to the moon and back.
But even the average smartphone has orders of magnitude more processing speed and
memory than the Apollo capsule's brain (which makes one wonder why our smartphones
are such underachievers).
Memory capacity has also grown in a similar fashion, and in-memory data systems
are becoming more common as a solution to high-throughput data problems. Keeping
data in memory and away from disk is a useful way to speed up processing tasks. R is
also designed to run completely in memory. This is great news for moderately sized
datasets. A modern laptop may have many gigabytes of memory available in its standard
configuration, but even with all of this power at hand it's not uncommon to run
into data sources that generate gigabytes of data per day. Using R to directly access
this amount of data is nearly impossible on a single workstation.
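To see why, a little arithmetic helps: a double-precision number in R occupies 8 bytes, so a table of 100 million rows and 10 numeric columns needs roughly 8GB of memory before R makes a single copy. A quick sketch from the console (the vector size here is just illustrative):
# Each double occupies 8 bytes, so one million of them take about
# 8MB; scaling up shows how quickly memory disappears.
> x <- numeric(1e6)
> print(object.size(x), units = "Mb")
7.6 Mb
# 100 million rows x 10 numeric columns needs roughly a thousand
# times more, about 8GB, before any copies are made.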
R has a somewhat notorious reputation for confusing newcomers with unintuitive
messages when the interpreter runs out of memory. On a 32-bit architecture, the total
memory available to R is about 4GB, and in practice this number is closer to 2GB.
Every operation, copy, or newly loaded dataset eats further into that budget. Other
restrictions, such as per-user limits on how much memory an application may use,
can reduce the memory available to R even more. Huge datasets may also contain very
large integer values, and R's integer type is a 32-bit signed value that cannot
represent numbers greater than about 2.1 billion. You wouldn't be able to produce an
integer sum expressing values such as the United States national debt with this
limitation.
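You can see this ceiling directly in an R session; something like the following (the exact warning text may vary by version):
# .Machine$integer.max holds the largest representable R integer,
# 2^31 - 1; arithmetic that exceeds it overflows to NA with a warning.
> .Machine$integer.max
[1] 2147483647
> .Machine$integer.max + 1L
[1] NA
# Values beyond this limit must be stored as doubles instead:
> .Machine$integer.max + 1
[1] 2147483648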
With these limitations in mind, one of the most important first steps in avoiding
memory problems is to use a 64-bit machine whenever possible. Like most interpreted
languages, R has a garbage collector that frees up memory when objects are no longer
in use. Another approach is to invoke the garbage collector manually as soon as a
large dataset is no longer needed. The R function gc is primarily used to print
information about the memory available to the R system, but it also triggers a
collection that reclaims memory from objects that no longer have any references.
Listing 11.1 provides a collection of functions useful for interrogating the state of
memory and objects in R.
Listing 11.1 Helpful functions for understanding memory usage in R
# Use the sessionInfo function to report on what build of
# R you are using, as well as which packages are attached.
> sessionInfo()
R version 2.15.1 (2012-06-22)
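In the same spirit, object.size reports how much memory a single object occupies, and gc can be called after removing it (a minimal sketch; the object name big and the exact figure are illustrative):
# Allocate roughly 80MB of doubles, measure the object, then drop it
# and ask the garbage collector to reclaim the memory right away.
> big <- rnorm(1e7)
> print(object.size(big), units = "Mb")
76.3 Mb
> rm(big)
> invisible(gc())   # call gc() directly to see its table of used/free memory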
 