Always Better Data," cites examples in which collecting more data is no panacea for data-quality problems. Consider survey results, for example. Survey data can be notoriously difficult to work with: some respondents lie, some skip questions, and some do everything possible to uncover edge cases in your otherwise expertly designed question list.
A great deal of work in statistical analysis is dedicated to determining the probability that a set of observations is significantly different from a random collection of values. A common fallacy in large-scale data analysis, however, is the belief that you need the entire corpus of data to make decisions or draw conclusions. Indeed, one of the cornerstones of statistical analysis is that a properly collected subset of the data (a random sample) can yield a convincingly valid, probability-based conclusion about the entire corpus.
The corollary to this argument is, of course: if you can access all the data in a dataset, why wouldn't you? For one thing, processing it all can be expensive and time-consuming, and you probably don't need the entire dataset for statistical analysis if a sample will produce the same result. Moreover, as datasets grow larger, increasing numbers of statistically significant relationships may appear even when no real-world relationship exists.
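As a quick illustration of why sampling is often enough (the numbers below are purely illustrative), a modest random sample of a large numeric vector typically estimates the mean about as well as the full vector does:

    # Illustrative only: compare the full-data mean with a sample mean.
    set.seed(42)
    population <- rnorm(1e7, mean = 100, sd = 15)  # the "entire corpus"
    sample_subset <- sample(population, 1e4)       # a 0.1% random sample

    mean(population)     # estimate from all of the data
    mean(sample_subset)  # nearly identical estimate from the sample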
Large Matrix Manipulation: bigmemory and biganalytics
Numbers, numbers everywhere. You've just collected a huge amount of numerical data, and you'd like to run a summary and plot correlations to understand what the data means. In some cases, however, your machine's available memory is much smaller than your dataset, and the obstacles that keep R from using all of the system's available memory can be daunting. bigmemory is an R library that adds an extra level of memory management for large datasets. Under the hood, bigmemory uses a clever interface to a speedy C++ framework that manages the underlying data.
All elements of an R matrix must share the same data type. If you are working with a numerical dataset that can still potentially fit within system memory, the bigmemory package is a good choice. bigmemory provides a data type called big.matrix, which works very much like a standard R matrix; a big.matrix object can contain only numerical data. The package's design also enables multiple R instances to access the underlying C++-managed data simultaneously. Ordinarily, separate R instances cannot access one another's data objects. With bigmemory, you can define on-disk binary files that not only can be shared but can also be quickly loaded into R at the start of a new session.
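As a minimal sketch of that workflow (the file names here are illustrative), you could create a file-backed big.matrix in one session and then reattach it, or share it with a second R process, via its descriptor file:

    library(bigmemory)

    # Create a file-backed big.matrix; the data live on disk, not in R's heap.
    x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                               backingfile = "big_data.bin",
                               descriptorfile = "big_data.desc")
    x[, 1] <- rnorm(1e6)  # fill the first column with values

    # In a separate R session (or after restarting this one), reattach
    # the same on-disk data via the descriptor file:
    y <- attach.big.matrix("big_data.desc")
    mean(y[, 1])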
A very interesting dataset often used to demonstrate the bigmemory package is the U.S. Department of Transportation's airline on-time statistics database, collected and hosted by RITA (the Research and Innovative Technology Administration). This dataset is a blast to work with, not only because it is large and freely available, but because it provides very detailed statistics about an often-frustrating U.S. airline industry.
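To give a flavor of how such a dataset might be loaded (the file and column names below are assumptions, not the exact RITA layout), bigmemory's read.big.matrix can stream a large CSV into a file-backed big.matrix, and biganalytics can then summarize it without pulling everything into memory:

    library(bigmemory)
    library(biganalytics)

    # "airline.csv" stands in for a local copy of the RITA on-time data.
    airline <- read.big.matrix("airline.csv", header = TRUE,
                               type = "integer",
                               backingfile = "airline.bin",
                               descriptorfile = "airline.desc")

    # Column-wise mean departure delay, computed on the big.matrix
    # without copying it into R's memory ("DepDelay" assumed present):
    colmean(airline, cols = "DepDelay", na.rm = TRUE)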
 