partition contains summarized data up to the current moment of the week.
When the week closes, the partition is loaded into the fact table.
Finally, accumulating snapshot real-time partitions are used for short
processes, like order handling. The real-time partition accumulates frequent
updates of facts, and the fact table is refreshed with the last version of these
facts. For example, suppose that in the Northwind case study the Sales fact
table is refreshed once a day. This table contains records about order lines,
and their data (e.g., the due date or the quantity) can change during a day.
These updates are performed on the real-time partition, which typically is
small and can fit in main memory. At the end of the day, the records in the
partition are loaded into the fact table.
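The daily refresh described above can be sketched as follows. This is a minimal illustration, not code from any actual system: the class and attribute names are hypothetical. A small in-memory partition keeps only the latest version of each order-line fact, and an end-of-day flush moves the accumulated facts into the persistent fact table.

```python
# Hypothetical sketch of an accumulating snapshot real-time partition:
# frequent updates of order-line facts are kept in a small in-memory
# structure keyed by order line, so only the last version of each fact
# survives; at the end of the day the partition is flushed into the
# fact table.

class RealTimePartition:
    def __init__(self):
        self._facts = {}          # (order_no, line_no) -> fact record

    def upsert(self, order_no, line_no, **attrs):
        """Record the latest version of an order-line fact."""
        key = (order_no, line_no)
        fact = self._facts.setdefault(key, {"order_no": order_no,
                                            "line_no": line_no})
        fact.update(attrs)        # later updates overwrite earlier ones

    def flush(self, fact_table):
        """End-of-day load: move all facts into the fact table."""
        fact_table.extend(self._facts.values())
        self._facts.clear()

fact_table = []                   # stands in for the Sales fact table
rt = RealTimePartition()
rt.upsert(10248, 1, quantity=12, due_date="2024-07-01")
rt.upsert(10248, 1, quantity=15)  # same order line updated during the day
rt.flush(fact_table)              # one record remains, with the last quantity
```

Because all updates during the day hit only the small in-memory partition, the large fact table is written just once per day.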
There are several alternative approaches for achieving real-time data
warehousing, all of which make use of the real-time partitions studied
above. One such approach is called direct trickle feed, where new data
from operational sources are continuously fed into the data warehouse,
either directly into the fact tables or into separate real-time
partitions of the fact tables. A variant of this strategy, which addresses the
mixed workload problem (i.e., updates and queries over the same table),
is called trickle and flip. Here, data are continuously fed into staging
tables that are an exact copy of the warehouse tables. Periodically, feeding
is stopped, and the copy is swapped with the fact table, bringing the data
warehouse up to date. Another strategy, called real-time data caching,
avoids the mixed workload problem: a real-time data cache is a
dedicated database server for loading, storing, and processing real-time data.
In-memory database technologies studied in this chapter could be used when
there are large volumes of real-time data (on the order of hundreds or
thousands of changes per second) or when extremely fast query response is required. In
this case, real-time data are loaded into the cache as they arrive from the
source system. A drawback of this strategy is that, since the real-time and
historical data are separately stored, when a query involves both kinds of
data, the evaluation could be costly.
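The trickle-and-flip mechanism can be illustrated with a short sketch. This is a hypothetical illustration (the class and attribute names are not from the text): the continuous feed writes only to a staging copy, so user queries on the live table never compete with inserts; a periodic flip swaps the two and re-synchronizes the new staging copy.

```python
# Hypothetical sketch of trickle and flip: new facts trickle into a
# staging copy of the fact table; periodically the copy is swapped with
# the live table, making the fresh data visible to queries.

class TrickleAndFlip:
    def __init__(self, initial_facts=()):
        self.live = list(initial_facts)   # queried by users
        self.staging = list(self.live)    # exact copy, receives the feed

    def feed(self, fact):
        """Continuous trickle: only the staging copy is written."""
        self.staging.append(fact)

    def flip(self):
        """Swap staging and live, then re-sync the new staging copy."""
        self.live, self.staging = self.staging, self.live
        self.staging = list(self.live)

tf = TrickleAndFlip([{"order": 10248, "qty": 12}])
tf.feed({"order": 10249, "qty": 5})
assert len(tf.live) == 1    # queries still see the old snapshot
tf.flip()
assert len(tf.live) == 2    # after the flip, fresh data is visible
```

In a relational setting the same effect is typically obtained by swapping table names or partition metadata rather than copying rows, so the flip itself is nearly instantaneous.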
We have commented above that not all applications have the same latency
requirements. In many situations, part of the data must be loaded quickly
after arrival, while other parts can be loaded at regular intervals. However,
there are many situations where we would like data to be loaded when needed,
but not necessarily before that. Right-time data warehousing follows this
approach. Here, right time may vary from right now (i.e., real time) to several
minutes or hours, depending on the required data latency. The key idea is
that data are loaded when needed, avoiding the cost of providing real time
when it is not actually needed.
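The right-time idea can be sketched as follows. This is a hypothetical illustration of the principle only (the class name and latency mechanism are assumptions, not part of any described system): arriving data is held as pending, and a load is triggered only when a query finds the table staler than the latency it tolerates.

```python
# Hypothetical sketch of right-time loading: each table declares the
# data latency it can tolerate, and pending data is loaded only when a
# query arrives and the table is staler than that allowed latency.
import time

class RightTimeLoader:
    def __init__(self, max_latency):
        self.max_latency = max_latency   # seconds of staleness tolerated
        self.pending = []                # data arrived but not yet loaded
        self.table = []
        self.last_load = time.monotonic()

    def arrive(self, row):
        self.pending.append(row)         # arrival alone triggers no load

    def query(self):
        stale = time.monotonic() - self.last_load
        if self.pending and stale >= self.max_latency:
            self.table.extend(self.pending)   # load only when needed
            self.pending.clear()
            self.last_load = time.monotonic()
        return list(self.table)
```

A table with `max_latency=0` behaves like real time (every query sees all arrived data), while a large `max_latency` batches arrivals until a query actually requires fresher data.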
The RiTE (Right-Time ETL) system is a middleware aimed at achieving
right-time data warehousing. In RiTE, a data producer continuously inserts
data into a data warehouse in bulk fashion, and, at the same time, data
warehouse user queries get access to fresh data on demand. The main
component of the RiTE architecture is called the catalyst , a software module