TABLE 17.3
Demonstration of Seed Control in R
> set.seed(123, kind = "Mersenne-Twister")
> # Generate some uniform random numbers with seed as above
> runif(4)
[1] 0.2875775 0.7883051 0.4089769 0.8830174
> runif(4)
[1] 0.9404673 0.0455565 0.5281055 0.8924190
> set.seed(123, kind = "Mersenne-Twister")
> # Resetting seed should reproduce the first set of numbers
> runif(4)
[1] 0.2875775 0.7883051 0.4089769 0.8830174
were separated, re-processing the Rnw file would recreate the cache. Another related approach is
cacheSweave (Peng, 2010), an R package providing a number of tools for caching results when
using R in conjunction with Sweave.
Another issue that affects computational reproducibility arises in simulation-based studies: the
use of pseudo-random numbers. Unless the software being used gives explicit control over the
random number generation method and allows a seed to be specified, distinct runs of the same
code will give different results. Fortunately, in R such control is possible via the set.seed
function, which specifies the seed of the pseudo-random number generator and, optionally, the
algorithm used for random number generation. An example is given in Table 17.3. Here, the
numerical seed for the generator is 123, and the algorithm used is the Mersenne twister
(Matsumoto and Nishimura, 1998). After initially setting up the generator, two sets of four
uniform random numbers in the range [0,1] are produced by calling runif(4) twice. The generator
is then re-initialised with the same seed, and a further call to runif(4) returns the same four
numbers as the first call.
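The same check can be made programmatically rather than by comparing printed output. The following is a minimal sketch; the helper name draw_uniform is hypothetical and is not part of R or of the original example.

```r
# Hypothetical helper: seed the generator, then draw n uniform numbers.
draw_uniform <- function(n, seed, kind = "Mersenne-Twister") {
  set.seed(seed, kind = kind)
  runif(n)
}

first  <- draw_uniform(4, seed = 123)
second <- runif(4)                      # continues the stream without re-seeding
again  <- draw_uniform(4, seed = 123)   # re-seeding restarts the stream

identical(first, again)    # TRUE: same seed and algorithm, same numbers
identical(first, second)   # FALSE: the second draw follows on from the first
RNGkind()[1]               # name of the generator currently in use
```

Keeping the seed and the generator name as explicit arguments means both can be reported alongside the results.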
Reproducibility here is important: for example, one may wish to test whether the result of
a simulation-based analysis is an artefact of the choice of random number generator or
of the choice of seed. If this information is embedded in an Rnw file, it is then possible, with
minor edits, to test the stability of the results to such choices. Van Niel and Laffan (2003),
for example, demonstrate the importance of this in a GC context by considering the effect
of changing the random number generator in a study of how random perturbations to a digital
elevation model affect estimates of slope and flow accumulation. They conclude by
outlining the importance of reporting the choice of algorithm and seed values when carrying
out studies of this kind.
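A stability check of this kind might be sketched as follows. The simulation here is a deliberately trivial stand-in (a Monte Carlo estimate of the mean of a uniform variable), and the function name simulate_mean, the seeds and the set of generators are all arbitrary choices made for illustration, not taken from Van Niel and Laffan (2003).

```r
# Hypothetical stand-in for a real simulation-based analysis:
# estimate the mean of a U(0, 1) variable from n random draws.
simulate_mean <- function(n, seed, kind) {
  set.seed(seed, kind = kind)
  mean(runif(n))
}

# Every combination of a few seeds and generators (arbitrary choices).
settings <- expand.grid(
  seed = c(1, 123, 2024),
  kind = c("Mersenne-Twister", "Wichmann-Hill", "L'Ecuyer-CMRG"),
  stringsAsFactors = FALSE
)

# Re-run the simulation once per row of the settings table.
settings$estimate <- mapply(simulate_mean,
                            seed = settings$seed,
                            kind = settings$kind,
                            MoreArgs = list(n = 10000))

print(settings)
range(settings$estimate)   # a narrow range suggests the result is stable

# Restore R's default generator so that later code is unaffected.
RNGkind("default")
```

If the estimates vary widely across rows, the result depends on the seed or the generator and should be reported with that caveat.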
A number of further issues relate to reproducibility when using pseudo-random numbers. One
problem with using, for example, Microsoft Excel 2007 when working with random numbers is
that there is no means of specifying the seed for the random number generator - it is therefore
not possible to exactly reproduce simulations in the way set out in the aforementioned example.
A further issue - and perhaps indicative of a far wider issue - is the availability of the source code
used to implement the pseudo-random number generating algorithm. Again, Excel 2007 has to be
considered here as an example. McCullough (2008) considered this application's random number
generator and found a number of issues. In particular, although it is claimed that the generator used
in Excel 2007 is the Wichmann and Hill (1982) algorithm (see Microsoft Knowledge Base Article
828795), extensive investigations by McCullough suggested that this is not the case - and, quoting
from the article:
… Excel users will have to content themselves with an unknown RNG [Random Number Generator] of
unknown period that is not known to pass any standard battery tests for randomness.
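By contrast, a published algorithm can be written out and checked by anyone. The following is a minimal sketch of the Wichmann and Hill (1982) generator as commonly described (three small linear congruential generators whose outputs are summed modulo 1). It is not Excel's code, which is unavailable; the function name make_wh is hypothetical, and the three seeds are assumed to be integers between 1 and 30000.

```r
# Sketch of the Wichmann and Hill (1982) generator.
# make_wh is a hypothetical name; `state` holds three integer seeds.
make_wh <- function(state) {
  function() {
    # Advance each of the three linear congruential generators.
    state[1] <<- (171 * state[1]) %% 30269
    state[2] <<- (172 * state[2]) %% 30307
    state[3] <<- (170 * state[3]) %% 30323
    # Combine the three streams and keep the fractional part.
    (state[1] / 30269 + state[2] / 30307 + state[3] / 30323) %% 1
  }
}

# Two generators started from the same seeds give the same stream.
wh_a <- make_wh(c(1, 2, 3))
wh_b <- make_wh(c(1, 2, 3))
a <- vapply(1:5, function(i) wh_a(), numeric(1))
b <- vapply(1:5, function(i) wh_b(), numeric(1))
identical(a, b)   # TRUE
```

Because every constant and step is visible, output from any implementation claiming to use this algorithm can be compared against it directly.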
 