Figure 1-1. Disk seek versus sequential access
The end result of all this math? The seek takes well over 99% of the time spent reading a row.
When it comes to disk access, random seeks are the enemy. The reason this matters so much in this context is that JOINs typically require random seeks. Given our normalized data model, a likely plan for our query would be something similar to the following Python code:
for number_row in find_by_contact_id(numbers, 3):
    yield (contact_row.name, number_row.number)
So there ends up being at least one disk seek for every contact in our database. Of course, we've glossed over how find_by_contact_id works, assuming that all it needs to do is a single disk seek. Typically, this is actually accomplished by reading an index on numbers that is keyed by contact_id, potentially resulting in even more disk seeks.
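To make the index lookup concrete, here is a minimal sketch of how find_by_contact_id might work against a secondary index. It is a hypothetical illustration, not the book's implementation: the extra index parameter and the in-memory dicts standing in for on-disk pages are assumptions. On a real system, the index probe and each row fetch would each cost a disk seek.

```python
def build_index(rows, key):
    # Map each key value to the positions of the rows holding it
    # (a toy stand-in for a B-tree index on disk).
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[key], []).append(pos)
    return index

def find_by_contact_id(numbers, contact_id, index):
    # One probe into the index, then one fetch per matching row --
    # each of these would be a separate seek on disk.
    for pos in index.get(contact_id, []):
        yield numbers[pos]

numbers = [
    {"contact_id": 3, "number": "555-0100"},
    {"contact_id": 1, "number": "555-0101"},
    {"contact_id": 3, "number": "555-0102"},
]
idx = build_index(numbers, "contact_id")
print([row["number"] for row in find_by_contact_id(numbers, 3, idx)])
# prints ['555-0100', '555-0102']
```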
Of course, modern database systems have evolved structures to mitigate some of this, largely by caching frequently used objects (particularly indexes) in RAM. However, even with such optimizations, joining tables is one of the most expensive operations that relational databases do. Additionally, if you end up needing to scale your database to multiple servers, you introduce the problem of generating a distributed join, a complex and generally slow operation.
Denormalizing for Performance
The dirty little secret (which isn't really so secret) about relational databases is that once we have gone through the data modeling process to generate our nice nth normal form data model, it's often necessary to denormalize the model to reduce the number of JOIN operations required for the queries we execute frequently.
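As a rough sketch of what denormalization buys us, consider storing each contact's numbers inline with the contact row (the field names here are illustrative assumptions, not the book's schema). The frequent "name plus numbers" query then becomes a single sequential pass with no JOIN, and no extra seek per contact:

```python
# Denormalized model: each contact row carries its own numbers,
# so reading a contact reads everything the query needs.
contacts = [
    {"name": "Rick", "numbers": ["555-0100", "555-0102"]},
    {"name": "Jenny", "numbers": ["555-0101"]},
]

def name_number_pairs(contacts):
    # One sequential scan; every (name, number) pair comes from the
    # same row, so there is no random seek into a second table.
    for contact in contacts:
        for number in contact["numbers"]:
            yield (contact["name"], number)

print(list(name_number_pairs(contacts)))
# prints [('Rick', '555-0100'), ('Rick', '555-0102'), ('Jenny', '555-0101')]
```

The trade-off, of course, is redundancy and harder updates, which is why denormalization is usually applied selectively to the queries that run most often.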