In the examples in this chapter, the data consist of at most around 160 numbers (only 20 in the
first example), so they could be assigned to a variable within the embedded R code, with any
subsequent modifications recorded in that code. However, in many situations, particularly in
many current GC applications, the data set is much larger than this, and it becomes impractical
to incorporate the data in an Rnw file. One approach may be to incorporate code that reads
from a supplied file, although this still implies that the code and data must be distributed
together to allow reproducible research, which may lead to difficulties if the data file is
very large. An alternative might be to provide details (such as URLs) of where the data were
downloaded from.
A further step to reproducibility could be achieved in the aforementioned situation by noting that
a number of the file-reading commands in R will work with URLs as well as ordinary file names.
Thus, by incorporating code used to access data directly from a URL into an Rnw file, a third party
may obtain the raw data from the same source as the original analysis and apply any cleaning or
other processing operations. This is shown in the examples in Sections 17.3.2 and 17.3.3. Indeed,
the code in Appendix 17B also illustrates how data cleaning may be recorded - in the example, one
downloaded data set records De Kalb as the name of one county, and the other uses DeKalb with
no space in the text. Although the modification is fairly trivial, the recorded steps demonstrate that
it has actually been done. Without this step, the data preparation process needed for these examples
could not have taken place, and so the inclusion of this information is essential for a third party to
reproduce the results.
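The idea of reading from a source and recording a cleaning step can be sketched as follows. This is a minimal illustration, not the code from Appendix 17B: the helper names `read_counties` and `harmonise_names`, the column names, and all data values are assumptions made for the example, and temporary local files stand in for the downloaded data sets so that the sketch runs without network access. Because `read.csv` accepts a URL as well as a file name, the same helper could be pointed at the original download address.

```r
# Hypothetical helper: 'src' may be a URL or an ordinary file name,
# since read.csv() accepts either.
read_counties <- function(src) {
  read.csv(src, stringsAsFactors = FALSE)
}

# Recorded cleaning step: one source spells the county "De Kalb",
# the other "DeKalb"; use the form without a space throughout.
harmonise_names <- function(x) {
  sub("^De Kalb$", "DeKalb", x)
}

# Stand-ins for the two downloaded data sets (all values are made up).
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
write.csv(data.frame(county = c("De Kalb", "Other County"), x = c(1, 2)),
          file_a, row.names = FALSE)
write.csv(data.frame(county = c("DeKalb", "Other County"), y = c(10, 20)),
          file_b, row.names = FALSE)

a <- read_counties(file_a)
b <- read_counties(file_b)

# Without this line the two tables would not match on "DeKalb".
a$county <- harmonise_names(a$county)

merged <- merge(a, b, by = "county")
print(merged)
```

Keeping the renaming in the script, rather than editing a downloaded file by hand, means a third party who fetches the raw data ends up with the same merged table.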
However, in the aforementioned example, it should be understood that reproducibility depends on
the remote data obtainable from the URL not being modified between the original analysis and
the attempt to reproduce it. In particular, care should be taken when obtaining data from social
networking application programming interfaces (APIs), such as those of Twitter or Facebook,
where accessing the URL provides the most recent information on a moving temporal window; if
there is any notable delay between successive queries, results are almost certain to differ. In this
situation, supplying the actual data used is likely to be the only way to ensure reproducibility.
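One way to supply the actual data used is to save a snapshot at the time of retrieval. The sketch below is only an illustration under stated assumptions: `snapshot_data` and `load_snapshot` are hypothetical helpers, the URL is a placeholder, and a small made-up data frame stands in for whatever an API query returned.

```r
# Hypothetical helper: store the retrieved data together with where and
# when it was obtained, and return an MD5 checksum of the saved file.
snapshot_data <- function(data, path, source_url) {
  snapshot <- list(data = data,
                   source = source_url,
                   retrieved = Sys.time())
  saveRDS(snapshot, path)
  unname(tools::md5sum(path))
}

# Hypothetical helper: restore a snapshot written by snapshot_data().
load_snapshot <- function(path) {
  readRDS(path)
}

# Made-up stand-in for the result of an API query.
posts <- data.frame(id = 1:3,
                    text = c("first", "second", "third"),
                    stringsAsFactors = FALSE)

snap_file <- tempfile(fileext = ".rds")
checksum <- snapshot_data(posts, snap_file,
                          "https://example.org/api/recent")  # placeholder URL

# A third party works from the snapshot rather than re-querying the API.
restored <- load_snapshot(snap_file)
print(restored$retrieved)
```

Distributing the snapshot file alongside the Rnw file, and quoting the checksum, lets a reader confirm that they are working from the same data as the original analysis.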
17.4.2 Dealing with Difficult Computational Requirements
The second problem is not so much one of reproducibility, but one of practicality. Some
simulation-based approaches, for example, the Markov chain Monte Carlo approach (Besag and York,
1989; Gelfand and Smith, 1990, 1995), or other methods using large data sets or slowly converging
algorithms, may require code taking several hours to run, and therefore major resources are
required for reproduction. In a sense, this is a less extreme version of the lunar rock example in
the introduction. Reproduction may be difficult and require a large amount of resources, but it is
not impossible; this is simply in the nature of such research. One suggestion here is a two-stage process:
1. Create a cache of results if one does not exist already:
   a. Run the Rnw file containing the full code (this may take a long time) to produce key results.
   b. Store these results in a binary file.
2. Produce publishable output based on the cache:
   a. Read in the binary file created in step 1b to restore the data.
   b. Write out tables of results and draw graphs.
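The two-stage process above can be sketched in R as follows. This is a minimal illustration rather than the chapter's own code: the cache file name, the placeholder function `run_long_computation`, and the use of `saveRDS`/`readRDS` for the binary file are all assumptions (`save` and `load` would serve equally well), and a trivial simulation stands in for the lengthy computation.

```r
# Hypothetical cache location; a real project would use a fixed path
# distributed alongside the Rnw file.
cache_file <- file.path(tempdir(), "results_cache.rds")

# Placeholder for the lengthy computation (e.g. an MCMC run);
# here it is just a quick simulation with a fixed seed.
run_long_computation <- function() {
  set.seed(1)
  draws <- rnorm(1000)
  list(mean = mean(draws), sd = sd(draws))
}

# Step 1: create the cache only if it does not exist already.
if (!file.exists(cache_file)) {
  results <- run_long_computation()  # step 1a: run the full code
  saveRDS(results, cache_file)       # step 1b: store results in a binary file
}

# Step 2: produce output from the cache.
results <- readRDS(cache_file)       # step 2a: restore the stored results
summary_table <- data.frame(statistic = names(results),
                            value = unlist(results),
                            row.names = NULL)
print(summary_table)                 # step 2b: write out a table of results
```

With this arrangement, deleting the cache file forces the full computation to run again the next time the document is processed.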
The idea of this approach is that if the results of the lengthy computation have already been created,
then the code simply reads these in and presents them in a publishable format. On occasion, a full
reproduction of the work is required; to allow for both cases, the code could test for the presence of a
cache of results in Step 1 given earlier. Steps 1a and 1b would only be executed if the result of this test
was negative. This approach has the added advantage that, if for some reason, the cache and the Rnw