A Common Data Warehousing Technique That Fails
A common data warehousing technique I've seen people employ goes like this:
1. They use a trigger to maintain a LAST_UPDATED column in the source table, much like the method described in Chapter 2 in the “Optimistic Locking” section.
2. To initially populate a data warehouse table, they remember what time it is right now by selecting out SYSDATE on the source system. For example, suppose it is exactly 9:00 a.m. right now.
3. They then pull all of the rows from the transactional system—a full SELECT * FROM TABLE—to get the data warehouse initially populated.
4. To refresh the data warehouse, they remember what time it is right now again. For example, suppose an hour has gone by—it is now 10:00 a.m. on the source system. They will remember that fact. They then pull all changed records since 9:00 a.m. (the moment before they started the first pull) and merge them in.
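As a sketch, the moving parts of this technique might look like the following. The table names T and DW_T, the key column ID, and the bind variable are my own illustrative assumptions, not names from the original:

```sql
-- Hypothetical trigger keeping LAST_UPDATED current on source table T
create or replace trigger t_touch
before insert or update on t
for each row
begin
    :new.last_updated := sysdate;
end;
/

-- Initial population: remember the time, then pull everything
select sysdate from dual;   -- e.g., 9:00 a.m.; save this value
insert into dw_t select * from t;

-- Hourly refresh: pull rows changed since the previously saved time
select * from t where last_updated >= :last_pull_time;
```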
Note This technique may pull the same record twice in two consecutive refreshes. This is unavoidable due to the granularity of the clock. A MERGE operation will not be affected by this (i.e., it will either update the existing record in the data warehouse or insert a new record).
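That idempotency comes from a standard MERGE. A minimal sketch, assuming a warehouse table DW_T and a staging pull STAGE_T, both keyed on an ID column (these names are assumptions for illustration):

```sql
-- Re-pulling the same row is harmless: the MATCHED branch simply
-- overwrites the warehouse row with the same (or newer) values
merge into dw_t d
using stage_t s
   on (d.id = s.id)
 when matched then
      update set d.payload = s.payload, d.last_updated = s.last_updated
 when not matched then
      insert (id, payload, last_updated)
      values (s.id, s.payload, s.last_updated);
```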
They believe that they now have all of the records in the data warehouse that were modified since they did the
initial pull. They may actually have all of the records, but just as likely they may not. This technique does work on
some other databases—ones that employ a locking system whereby reads are blocked by writes and vice versa. But in
a system where you have nonblocking reads, the logic is flawed.
To see the flaw in this example, all we need to do is assume that at 9:00 a.m. there was at least one open,
uncommitted transaction. At 8:59:30 a.m., it had updated a row in the table we were to copy. At 9:00 a.m., when we
started pulling the data and thus reading the data in this table, we would not see the modifications to that row; we
would see the last committed version of it. If it was locked when we got to it in our query, we would read around the
lock. If it was committed by the time we got to it, we would still read around it since read consistency permits us to
read only data that was committed in the database when our statement began. We would not read that new version
of the row during the 9:00 a.m. initial pull, nor would we read the modified row during the 10:00 a.m. refresh. The
reason? The 10:00 a.m. refresh would only pull records modified since 9:00 a.m. that morning, but this record was
modified at 8:59:30 a.m. We would never pull this changed record.
In many other databases where reads are blocked by writes and a committed but inconsistent read is
implemented, this refresh process would work perfectly. If at 9:00 a.m. when we did the initial pull of data, we hit that
row and it was locked, we would have blocked and waited for it, and read the committed version. If it were not locked,
we would just read whatever was there, committed.
So, does this mean the preceding logic just cannot be used? No, it means that we need to get the “right now” time a little differently. We need to query V$TRANSACTION and find the earlier of the current time and the earliest time recorded in the START_TIME column of that view. That is, we need to pull all records changed since the start time of the oldest active transaction (or since the current SYSDATE value if there are no active transactions):
select nvl( min(to_date(start_time,'mm/dd/rr hh24:mi:ss')),sysdate)
from v$transaction;
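Putting it together, a refresh driven by that low-water mark might be sketched as follows. This is my own PL/SQL outline under the same assumed names (T, DW_T, ID, and a bookkeeping table REFRESH_LOG), not code from the original:

```sql
-- Hypothetical refresh: compute the safe low-water mark BEFORE pulling,
-- merge everything changed since the mark saved by the prior run,
-- then save the new mark for next time
declare
    l_safe_time date;
begin
    select nvl( min(to_date(start_time,'mm/dd/rr hh24:mi:ss')), sysdate )
      into l_safe_time
      from v$transaction;

    merge into dw_t d
    using (select *
             from t
            where last_updated >=
                  (select last_refresh from refresh_log)) s
       on (d.id = s.id)
     when matched then
          update set d.payload = s.payload
     when not matched then
          insert (id, payload) values (s.id, s.payload);

    update refresh_log set last_refresh = l_safe_time;
    commit;
end;
/
```

Because the mark is taken from the oldest open transaction rather than the wall clock, the 8:59:30 a.m. change in the example above would fall on or after the saved mark and be picked up by a later refresh, at the cost of occasionally re-pulling rows the MERGE already handled.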