Always Better Data," cites examples in which collecting more data is no panacea for data-quality problems. Consider survey results, for example. Survey data can be notoriously difficult to work with: some respondents lie, some skip questions, and some do everything possible to uncover edge cases in your otherwise expertly designed question list.
A great deal of work in statistical analysis is dedicated to determining the probability that a set of observations is significantly different from a random collection of values. A common fallacy in large-scale data analysis, however, is the belief that you need the entire corpus of data to make decisions or draw conclusions. Indeed, one of the cornerstones of statistical analysis is that a properly collected subset of the data (a random sample) can yield a convincingly valid, probability-based conclusion about the entire corpus.
The corollary to this argument is, of course: if you can access all the data in a dataset, why wouldn't you? For one thing, processing it all can be expensive and time-consuming, and you probably don't need the entire dataset for statistical analysis if a sample will produce the same result. Moreover, as datasets grow larger, increasing numbers of statistically significant relationships may appear even when no real-world relationship exists.
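As a quick illustration of why sampling is often enough (the numbers below are purely illustrative), a modest random sample of a large numeric vector typically estimates the mean about as well as the full vector does:

    # Illustrative only: compare the full-data mean with a sample mean.
    set.seed(42)
    population <- rnorm(1e7, mean = 100, sd = 15)  # the "entire corpus"
    sample_subset <- sample(population, 1e4)       # a 0.1% random sample

    mean(population)     # estimate from all of the data
    mean(sample_subset)  # nearly identical estimate from the sample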
Large Matrix Manipulation: bigmemory and biganalytics
Numbers, numbers everywhere. You've just collected a huge amount of numerical data, and you'd like to run a summary and plot correlations to understand what the data means. In some cases, however, your machine's available memory is much smaller than your dataset, and the obstacles that keep R from using all of the system's available memory can be daunting. bigmemory is an R library that adds an extra level of memory management for large datasets. Under the hood, bigmemory uses a clever interface to a speedy C++ framework that manages the underlying data.
All elements of an R matrix must share the same data type. If you are working with a numerical dataset that can still potentially fit within system memory, the bigmemory package is a good choice. bigmemory provides a data type called big.matrix, which works very much like a standard R matrix; a big.matrix object can contain only numerical data. The package's design also enables multiple R instances to access the underlying C++-managed data simultaneously. Ordinarily, separate R instances cannot access one another's data objects. With bigmemory, you can define on-disk binary files that not only can be shared but can also be quickly loaded into R at the start of a new session.
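As a minimal sketch of that workflow (the file names here are illustrative), you could create a file-backed big.matrix in one session and then reattach it, or share it with a second R process, via its descriptor file:

    library(bigmemory)

    # Create a file-backed big.matrix; the data live on disk, not in R's heap.
    x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                               backingfile = "big_data.bin",
                               descriptorfile = "big_data.desc")
    x[, 1] <- rnorm(1e6)  # fill the first column with values

    # In a separate R session (or after restarting this one), reattach
    # the same on-disk data via the descriptor file:
    y <- attach.big.matrix("big_data.desc")
    mean(y[, 1])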
A very interesting dataset often used to demonstrate the bigmemory package is the U.S. Department of Transportation's airline on-time statistics database, collected and hosted by RITA (the Research and Innovative Technology Administration). This dataset is a blast to work with, not only because it is large and freely available, but because it provides very detailed statistics about an often-frustrating U.S. airline industry.
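To give a flavor of how such a dataset might be loaded (the file and column names below are assumptions, not the exact RITA layout), bigmemory's read.big.matrix can stream a large CSV into a file-backed big.matrix, and biganalytics can then summarize it without pulling everything into memory:

    library(bigmemory)
    library(biganalytics)

    # "airline.csv" stands in for a local copy of the RITA on-time data.
    airline <- read.big.matrix("airline.csv", header = TRUE,
                               type = "integer",
                               backingfile = "airline.bin",
                               descriptorfile = "airline.desc")

    # Column-wise mean departure delay, computed on the big.matrix
    # without copying it into R's memory ("DepDelay" assumed present):
    colmean(airline, cols = "DepDelay", na.rm = TRUE)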
 