partition contains summarized data up to the current moment of the week.
When the week closes, the partition is loaded into the fact table.
Finally, accumulating snapshot real-time partitions are used for short
processes, like order handling. The real-time partition accumulates frequent
updates of facts, and the fact table is refreshed with the last version of these
facts. For example, suppose that in the Northwind case study the Sales fact
table is refreshed once a day. This table contains records about order lines,
and their data (e.g., the due date or the quantity) can change during a day.
These updates are performed on the real-time partition, which typically is
small and can fit in main memory. At the end of the day, the records in the
partition are loaded into the fact table.
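The daily refresh described above can be sketched as follows. This is a minimal illustration, not code from any actual system: the class and attribute names are hypothetical. A small in-memory partition keeps only the latest version of each order-line fact, and an end-of-day flush moves the accumulated facts into the persistent fact table.

```python
# Hypothetical sketch of an accumulating snapshot real-time partition:
# frequent updates of order-line facts are kept in a small in-memory
# structure keyed by order line, so only the last version of each fact
# survives; at the end of the day the partition is flushed into the
# fact table.

class RealTimePartition:
    def __init__(self):
        self._facts = {}          # (order_no, line_no) -> fact record

    def upsert(self, order_no, line_no, **attrs):
        """Record the latest version of an order-line fact."""
        key = (order_no, line_no)
        fact = self._facts.setdefault(key, {"order_no": order_no,
                                            "line_no": line_no})
        fact.update(attrs)        # later updates overwrite earlier ones

    def flush(self, fact_table):
        """End-of-day load: move all facts into the fact table."""
        fact_table.extend(self._facts.values())
        self._facts.clear()

fact_table = []                   # stands in for the Sales fact table
rt = RealTimePartition()
rt.upsert(10248, 1, quantity=12, due_date="2024-07-01")
rt.upsert(10248, 1, quantity=15)  # same order line updated during the day
rt.flush(fact_table)              # one record remains, with the last quantity
```

Because all updates during the day hit only the small in-memory partition, the large fact table is written just once per day.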
There are several alternative approaches for achieving real-time data
warehousing, all of which make use of the real-time partitions studied
above. One such approach is called direct trickle feed, where new data
from operational sources are continuously fed into the data warehouse,
either directly into the fact tables or into separate real-time
partitions of the fact tables. A variant of this strategy, which addresses the
mixed workload problem (i.e., updates and queries over the same table),
is called trickle and flip. Here, data are continuously fed into staging
tables that are an exact copy of the warehouse tables. Periodically, feeding
is stopped, and the copy is swapped with the fact table, bringing the data
warehouse up to date. Another strategy, called real-time data caching,
avoids the mixed workload problem: a real-time data cache is a
dedicated database server for loading, storing, and processing real-time data.
In-memory database technologies studied in this chapter could be used when
there are large volumes of real-time data (on the order of hundreds or
thousands of changes per second) or when extremely fast query response is required. In
this case, real-time data are loaded into the cache as they arrive from the
source system. A drawback of this strategy is that, since the real-time and
historical data are separately stored, when a query involves both kinds of
data, the evaluation could be costly.
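The trickle-and-flip mechanism can be illustrated with a short sketch. This is a hypothetical illustration (the class and attribute names are not from the text): the continuous feed writes only to a staging copy, so user queries on the live table never compete with inserts; a periodic flip swaps the two and re-synchronizes the new staging copy.

```python
# Hypothetical sketch of trickle and flip: new facts trickle into a
# staging copy of the fact table; periodically the copy is swapped with
# the live table, making the fresh data visible to queries.

class TrickleAndFlip:
    def __init__(self, initial_facts=()):
        self.live = list(initial_facts)   # queried by users
        self.staging = list(self.live)    # exact copy, receives the feed

    def feed(self, fact):
        """Continuous trickle: only the staging copy is written."""
        self.staging.append(fact)

    def flip(self):
        """Swap staging and live, then re-sync the new staging copy."""
        self.live, self.staging = self.staging, self.live
        self.staging = list(self.live)

tf = TrickleAndFlip([{"order": 10248, "qty": 12}])
tf.feed({"order": 10249, "qty": 5})
assert len(tf.live) == 1    # queries still see the old snapshot
tf.flip()
assert len(tf.live) == 2    # after the flip, fresh data is visible
```

In a relational setting the same effect is typically obtained by swapping table names or partition metadata rather than copying rows, so the flip itself is nearly instantaneous.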
We have commented above that not all applications have the same latency
requirements. In many situations, part of the data must be loaded quickly
after arrival, while other parts can be loaded at regular intervals. However,
there are many situations where we would like data to be loaded when needed,
but not necessarily before that. Right-time data warehousing follows this
approach. Here, right time may vary from right now (i.e., real time) to several
minutes or hours, depending on the required data latency. The key idea is
that data are loaded when needed, avoiding the cost of providing real time
when it is not actually needed.
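The right-time idea can be sketched as follows. This is a hypothetical illustration of the principle only (the class name and latency mechanism are assumptions, not part of any described system): arriving data is held as pending, and a load is triggered only when a query finds the table staler than the latency it tolerates.

```python
# Hypothetical sketch of right-time loading: each table declares the
# data latency it can tolerate, and pending data is loaded only when a
# query arrives and the table is staler than that allowed latency.
import time

class RightTimeLoader:
    def __init__(self, max_latency):
        self.max_latency = max_latency   # seconds of staleness tolerated
        self.pending = []                # data arrived but not yet loaded
        self.table = []
        self.last_load = time.monotonic()

    def arrive(self, row):
        self.pending.append(row)         # arrival alone triggers no load

    def query(self):
        stale = time.monotonic() - self.last_load
        if self.pending and stale >= self.max_latency:
            self.table.extend(self.pending)   # load only when needed
            self.pending.clear()
            self.last_load = time.monotonic()
        return list(self.table)
```

A table with `max_latency=0` behaves like real time (every query sees all arrived data), while a large `max_latency` batches arrivals until a query actually requires fresher data.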
The RiTE (Right-Time ETL) system is a middleware aimed at achieving
right-time data warehousing. In RiTE, a data producer continuously inserts
data into a data warehouse in bulk fashion, and, at the same time, data
warehouse user queries get access to fresh data on demand. The main
component of the RiTE architecture is called the catalyst , a software module