Implications of Multiversion Read Consistency
So far, we've seen how multiversioning provides us with nonblocking reads, and I have stressed that this is a good
thing: consistent (correct) answers with a high degree of concurrency. What could be wrong with that? Well, unless
you understand that it exists and what it implies, you are probably doing some of your transactions incorrectly.
Recall from Chapter 1 the scheduling-resources example, whereby we had to employ a manual locking technique
(via SELECT FOR UPDATE) to serialize modifications to the SCHEDULES table by resource. But can it affect us in other
ways? The answer to that is definitely yes. We'll go into the specifics in the sections that follow.
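As a quick refresher, that manual locking pattern looks something like the following (a minimal sketch; the RESOURCES table and the column names here are illustrative assumptions, not the exact schema from Chapter 1):

-- Lock the parent row for the resource we want to schedule. Any other
-- session trying to schedule the same resource blocks here until we
-- commit or roll back, serializing modifications by resource.
select *
  from resources
 where resource_name = :resource_name
   for update;

-- Having serialized, we can safely check for conflicts and then insert.
insert into schedules (resource_name, start_time, end_time)
values (:resource_name, :start_time, :end_time);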
A Common Data Warehousing Technique That Fails
A common data warehousing technique I've seen people employ goes like this:
1. They use a trigger to maintain a LAST_UPDATED column in the source table, much like the
method described in the last chapter in the “Optimistic Locking” section.
2. To initially populate a data warehouse table, they remember what time it is right now by
selecting out SYSDATE on the source system. For example, suppose it is exactly 9:00 a.m.
right now.
3. They then pull all of the rows from the transactional system (a full SELECT * FROM
TABLE) to get the data warehouse initially populated.
4. To refresh the data warehouse, they remember what time it is right now again. For
example, suppose an hour has gone by; it is now 10:00 a.m. on the source system. They
will remember that fact. They then pull all changed records since 9:00 a.m. (the moment
before they started the first pull) and merge them in, as sketched after the note below.
Note: This technique may pull the same record twice in two consecutive refreshes. This is unavoidable due to the
granularity of the clock. A MERGE operation will not be affected by this (i.e., it will either update the existing record
in the data warehouse or insert a new one).
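In code, the moving parts of this technique look roughly like the following (a minimal sketch; SOURCE_TABLE, DW_TABLE, and the ID/PAYLOAD columns are assumptions for illustration):

-- On the source system: a trigger stamps LAST_UPDATED on every change.
create or replace trigger source_table_biu
before insert or update on source_table
for each row
begin
  :new.last_updated := sysdate;
end;
/

-- Initial population (after remembering SYSDATE, say 9:00 a.m.):
insert into dw_table
select * from source_table;

-- Hourly refresh: pull everything stamped since the remembered time
-- (:last_pull holds the 9:00 a.m. timestamp) and merge it in.
merge into dw_table d
using (select * from source_table
        where last_updated >= :last_pull) s
on (d.id = s.id)
when matched then update
  set d.payload = s.payload, d.last_updated = s.last_updated
when not matched then insert (id, payload, last_updated)
  values (s.id, s.payload, s.last_updated);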
They believe that they now have all of the records in the data warehouse that were modified since they did the
initial pull. They may actually have all of the records, but just as likely they may not. This technique does work on
some other databases—ones that employ a locking system whereby reads are blocked by writes and vice versa. But in
a system where you have nonblocking reads, the logic is flawed.
To see the flaw in this example, all we need to do is assume that at 9:00 a.m. there was at least one open,
uncommitted transaction. At 8:59:30 a.m., it had updated a row in the table we were to copy. At 9:00 a.m., when we
started pulling the data and thus reading the data in this table, we would not see the modifications to that row; we
would see the last committed version of it. If it was locked when we got to it in our query, we would read around the
lock. If it was committed by the time we got to it, we would still read around it since read consistency permits us to
read only data that was committed in the database when our statement began. We would not read that new version
of the row during the 9:00 a.m. initial pull, nor would we read the modified row during the 10:00 a.m. refresh. The
reason? The 10:00 a.m. refresh would only pull records modified since 9:00 a.m. that morning, but this record was
modified at 8:59:30 a.m. We would never pull this changed record.
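The failing timeline can be written out as two sessions (the table and column names continue the illustrative sketch above; the timestamps match the narrative):

-- Session A: a transaction on the source system.
-- 8:59:30 a.m.: the trigger stamps LAST_UPDATED with 8:59:30;
-- the transaction remains open (uncommitted).
update source_table set payload = 'new value' where id = 42;

-- Session B: the 9:00 a.m. initial pull. Read consistency lets us read
-- around Session A's lock; we get the last COMMITTED version of row 42.
insert into dw_table select * from source_table;

-- Session A commits, say at 9:00:45 a.m. LAST_UPDATED stays 8:59:30.
commit;

-- Session B: the 10:00 a.m. refresh. Row 42's LAST_UPDATED (8:59:30)
-- is before :last_pull (9:00 a.m.), so this query does not return it,
-- and the change never reaches the warehouse.
select * from source_table where last_updated >= :last_pull;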
In many other databases where reads are blocked by writes and a committed but inconsistent read is
implemented, this refresh process would work perfectly. If, at 9:00 a.m. when we did the initial pull of data, we hit
that row and it was locked, we would have blocked, waited for it, and then read the committed version. If it were not
locked, we would just read whatever was there, committed.
 
 