In this case, indexing along the first axis advances us into the buffer in steps ( strides , in
NumPy lingo) of 2, while indexing along the second axis advances us in steps of 1.
For example, the indexing expression a[0,1] is handled as follows:
offset = 2*0 + 1*1 -> 1
buffer[offset] -> value "B"
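This offset arithmetic can be checked directly in NumPy. A minimal sketch, using a hypothetical 2×2 array of single-byte letters as a stand-in for the buffer in the example:

```python
import numpy as np

# Hypothetical 2x2 array matching the example: one byte per element,
# so the strides (in elements) are 2 along axis 0 and 1 along axis 1.
a = np.array([[b"A", b"B"], [b"C", b"D"]], dtype="S1")
buf = a.ravel()                # the flattened, row-major buffer

offset = 2 * 0 + 1 * 1        # the indexing expression a[0, 1]
print(buf[offset])            # b'B'

# NumPy reports strides in *bytes*; for a 1-byte dtype they match
# the element steps described above:
print(a.strides)              # (2, 1)
```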
You might notice that there are two possible conventions here: whether the “fastest-varying” index is the last (as previously shown), or the first. This choice is the difference between row-major and column-major ordering. Python, C, and HDF5 all use row-major ordering, as in the example.
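The difference between the two conventions is easy to see by flattening the same array both ways; NumPy's `ravel` accepts `order="C"` (row-major) and `order="F"` (column-major, after Fortran):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Row-major ("C" order): the last index varies fastest.
print(a.ravel(order="C"))        # [0 1 2 3 4 5]

# Column-major ("F" order): the first index varies fastest.
print(a.ravel(order="F"))        # [0 3 1 4 2 5]
```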
By default, all but the smallest HDF5 datasets use contiguous storage. The data in your
dataset is flattened to disk using the same rules that NumPy (and C, incidentally) uses.
If you think about it, this means that certain operations are much faster than others.
Consider as an example a dataset containing one hundred 640×480 grayscale images.
Let's say the shape of the dataset is (100, 480, 640):
>>> f = h5py.File("imagetest.hdf5")
>>> dset = f.create_dataset("Images", (100, 480, 640), dtype='uint8')
A contiguous dataset would store the image data on disk, one 640-element “scanline”
after another. If we want to read the first image, the slicing code would be:
>>> image = dset[0, :, :]
>>> image.shape
(480, 640)
Figure 4-1 (A) shows how this works. Notice that data is stored in “blocks” of 640 bytes
that correspond to the last axis in the dataset. When we read in the first image, 480 of
these blocks are read from disk, all in one big block.
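Why this matters can be made concrete with the byte offsets each read would touch, assuming contiguous storage (a sketch using plain arithmetic; no HDF5 file is needed):

```python
# Element strides for the (100, 480, 640) uint8 dataset above:
# one image is 480*640 bytes, one scanline is 640 bytes.
strides = (480 * 640, 640, 1)

# dset[0, :, :] -- one image -- touches a single contiguous run:
start = 0 * strides[0]
stop = start + 480 * 640        # 307200 bytes, read in one sweep

# dset[:, 0, 0] -- one pixel from every image -- touches 100 bytes
# scattered 307200 bytes apart, each needing a separate seek:
offsets = [i * strides[0] for i in range(100)]
print(stop)                     # 307200
print(offsets[:3])              # [0, 307200, 614400]
```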
This leads us to the first rule (really, the only one) for dealing with data on disk, locality: reads are generally faster when the data being accessed is all stored together. Keeping data together helps for lots of reasons, not the least of which is taking advantage of caching performed by the operating system and HDF5 itself.