CHAPTER 1
Introduction
When I was a graduate student, I had a serious problem: a brand-new dataset, made up
of millions of data points collected painstakingly over a full week on a nationally
recognized plasma research device, that contained values that were much too small.
About 40 orders of magnitude too small.
My advisor and I huddled in his office, in front of the shiny new G5 Power Mac that ran
our visualization suite, and tried to figure out what was wrong. The data had been
acquired correctly from the machine. It looked like the original raw file from the
experiment's digitizer was fine. I had written a (very large) script in the IDL programming
language on my Thinkpad laptop to turn the raw data into files the visualization tool
could use. This in-house format was simplicity itself: just a short fixed-width header
and then a binary dump of the floating-point data. Even so, I spent another hour or so
writing a program to verify and plot the files on my laptop. They were fine. And yet,
when loaded into the visualizer, all the data that looked so beautiful in IDL turned into
a featureless, unstructured mush of values all around 10⁻⁴¹.
Finally it came to us: both the digitizer machines and my Thinkpad used the
“little-endian” format to represent floating-point numbers, in contrast to the “big-endian”
format of the G5 Mac. Raw values written on one machine couldn't be read on the other,
and vice versa. I remember thinking that's so stupid (among other less polite variations).
Learning that this problem was so common that IDL supplied a special routine to deal
with it (SWAP_ENDIAN) did not improve my mood.
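To make the failure concrete, here is a minimal Python sketch of the same mistake. It is not the original IDL code, and it assumes the data were 32-bit IEEE 754 floats, which the text does not state. The helper names (reinterpret_swapped, swap_endian) are hypothetical and chosen for illustration; swap_endian only mimics the idea behind IDL's SWAP_ENDIAN, not its actual interface.

```python
import struct


def reinterpret_swapped(value: float) -> float:
    """Write a 32-bit float as little-endian bytes, then misread
    those same bytes as if they were big-endian."""
    raw = struct.pack("<f", value)       # what the little-endian laptop wrote
    return struct.unpack(">f", raw)[0]   # what the big-endian reader saw


def swap_endian(raw: bytes, itemsize: int = 4) -> bytes:
    """Reverse the byte order of every itemsize-byte element in a buffer
    (hypothetical helper; assumes len(raw) is a multiple of itemsize)."""
    return b"".join(
        raw[i:i + itemsize][::-1] for i in range(0, len(raw), itemsize)
    )


original = 1.0
misread = reinterpret_swapped(original)   # roughly 4.6e-41 instead of 1.0

# Swapping the bytes before the big-endian read recovers the real value.
little_endian_bytes = struct.pack("<f", original)
recovered = struct.unpack(">f", swap_endian(little_endian_bytes))[0]

print(f"original:  {original}")
print(f"misread:   {misread:.3e}")
print(f"recovered: {recovered}")
```

Under that 32-bit assumption, an ordinary value such as 1.0 comes out near 10⁻⁴¹ when its bytes are read in the wrong order, which matches the scale of the values described above.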
At the time, I didn't care that much about the details of how my data was stored. This
incident and others like it changed my mind. As a scientist, I eventually came to
recognize that the choices we make for organizing and storing our data are also choices
about communication. Not only do standard, well-designed formats make life easier
for individuals (and eliminate silly time-wasters like the “endian” problem), but they
make it possible to share data with a global audience.
 