D e n o r m a l i z a t i o n
What's normal, anyway?
—Anon.: Where Bugs Go
I want to say a few words about denormalization. Now, I haven't considered, so far in this topic, any level of
normalization higher than BCNF (at least, not in detail). But denormalization, if it means anything at all, can't apply
just to BCNF specifically; I mean, it can't refer just to dropping back to some level of normalization that's lower
than BCNF specifically. Rather, it has to mean dropping back from any given level of normalization to some lower
That said, however, I need to say too that relvars that are in BCNF and not in some higher normal form are
comparatively unusual (though not completely unknown, I hasten to add). In practice, therefore, denormalization
does usually refer quite specifically to dropping back to some level of normalization below BCNF; hence the
inclusion of this chapter in this part of the topic.
“DENORMALIZE FOR PERFORMANCE”?
Ever since SQL products first came on the market, the claim that it's necessary to “denormalize for performance”
has been widely promulgated. The (specious!) supporting argument goes something like this:
Normalization means lots of relvars.
Lots of relvars means lots of stored files.
Lots of stored files means lots of I/O.
In the case of suppliers and parts, for example, a request to get details for suppliers who supply red parts
involves two dyadic joins—suppliers to shipments first, perhaps, and then the result of that join to parts. And if the
three relvars correspond to three physically separate stored files, then those two joins will require lots of I/O and will
therefore perform badly.
As already noted, this argument is specious, at least in principle. The reason is that the relational model
nowhere stipulates that relvars must map one for one to stored files. In the case of suppliers and parts, for example,
there's no logical reason why we couldn't physically store the join of the three relvars—possibly even
redundantly—as one single stored file on the disk, 1 which could reduce the amount of I/O significantly for the query
under consideration. The point is irrelevant for present purposes, however, because:
1 I'm speaking pretty loosely here, of course. In particular, I'm ignoring the possibility that there might be some suppliers or some parts with no