New Data Warehouse Technologies - Data Warehouse Systems: Design and Implementation

Note that data latency requirements differ between application scenarios.

For example, collaborative filtering, with queries such as “People who like

X also like Y,” requires a data freshness in the range of hours, while fraud

detection, for instance, in credit card usage, needs a data latency in the

order of minutes or seconds. However, most applications do not require these

stringent latency levels. In these cases, the common strategy in practice

is simply to increase the frequency of ETL operations using so-called

mini-batch ETL processes, for example, loading data every 10 min.
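
The mini-batch strategy can be sketched as follows. This is only a minimal illustration under stated assumptions: the tables `orders` and `sales_fact`, the function `run_mini_batch`, and the use of an increasing `id` column as a watermark are all hypothetical and do not come from the text. In-memory SQLite databases stand in for the source system and the warehouse.

```python
import sqlite3

def run_mini_batch(source, target, last_id):
    """Copy source rows with id > last_id into the target; return the new watermark."""
    rows = source.execute(
        "SELECT id, product, amount FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    if rows:
        target.executemany(
            "INSERT INTO sales_fact (order_id, product, amount) VALUES (?, ?, ?)",
            rows,
        )
        target.commit()
        last_id = rows[-1][0]  # highest id loaded so far
    return last_id

# Hypothetical in-memory databases standing in for the source and the warehouse.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product TEXT, amount REAL)")
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales_fact (order_id INTEGER, product TEXT, amount REAL)")

watermark = 0
source.executemany(
    "INSERT INTO orders (product, amount) VALUES (?, ?)",
    [("A", 10.0), ("B", 5.5)],
)
watermark = run_mini_batch(source, target, watermark)  # first batch loads two rows

source.execute("INSERT INTO orders (product, amount) VALUES (?, ?)", ("C", 7.0))
watermark = run_mini_batch(source, target, watermark)  # second batch loads only the new row

loaded = target.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

In a real deployment, a scheduler would call such a function at the chosen interval (for example, every 10 min); the data latency is then bounded by that interval plus the duration of the load itself.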

Several strategies have been devised to achieve real-time ETL for reducing

data latency. The simplest one, which requires the fewest changes to

existing architectures, is the one called near real-time ETL,

which simply increases the frequency of ETL processes. Most of the research

work in the field follows this approach. However, this is not enough when

data latency must be drastically reduced.

A classic solution for reducing data latency is to define real-time

partitions for fact tables. In this case, real-time and static data are stored

in separate tables. Real-time partitions are subject to special update and

query rules and must have the same schema as the fact tables. Ideally, they

must:

Contain all updates that have occurred since the last refresh of the fact table.

Have the same granularity as the fact table.

Be lightly indexed in order to efficiently handle input data.

Support high-performance querying.

Query tools should be able to distinguish between the two kinds of tables and

know where to find data. That means these tools must formulate a query

over the static fact tables and the real-time partitions. This capability is not

always achieved by commercial tools, however. Note also that this technique

is orthogonal to the database technology used. Thus, real-time partitions

could be stored in traditional RDBMSs, column-store database systems, or

IMDBSs.
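
As a rough sketch of what querying over both kinds of tables involves, the example below combines a static fact table and its real-time partition with `UNION ALL`, and then performs a refresh that moves the partition's rows into the fact table. The table names `sales_fact` and `sales_fact_rt`, their columns, and the helper functions are assumptions made for illustration only; SQLite is used simply because the technique does not depend on the database technology.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_fact (date_key INTEGER, product_key INTEGER, amount REAL);
    CREATE INDEX idx_sales_product ON sales_fact (product_key);
    CREATE TABLE sales_fact_rt (date_key INTEGER, product_key INTEGER, amount REAL);
""")
# sales_fact is the indexed static table; sales_fact_rt is the real-time
# partition, with the same schema and no indexes so that loading stays cheap.

conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(20240101, 1, 100.0), (20240101, 2, 50.0)],
)
conn.executemany(
    "INSERT INTO sales_fact_rt VALUES (?, ?, ?)",
    [(20240102, 1, 25.0)],
)

def total_sales(conn, product_key):
    """Total amount for a product over the static table and the real-time partition."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(amount), 0) FROM (
            SELECT amount FROM sales_fact WHERE product_key = ?
            UNION ALL
            SELECT amount FROM sales_fact_rt WHERE product_key = ?
        )
        """,
        (product_key, product_key),
    ).fetchone()
    return row[0]

def refresh(conn):
    """Move all real-time rows into the static fact table in one transaction."""
    with conn:
        conn.execute("INSERT INTO sales_fact SELECT * FROM sales_fact_rt")
        conn.execute("DELETE FROM sales_fact_rt")

before = total_sales(conn, 1)   # combines both tables
refresh(conn)
after = total_sales(conn, 1)    # same answer, now served by the static table alone
rt_rows = conn.execute("SELECT COUNT(*) FROM sales_fact_rt").fetchone()[0]
```

Because the partition has the same schema and granularity as the fact table, the combined query is a plain `UNION ALL`, and the refresh does not change query results; it only changes which table holds the rows.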

There are three types of real-time partitions depending on their granularity,

which can be transaction, periodic snapshot, or accumulating snapshot

granularity. We explain these types next.

A transaction-granularity real-time partition contains one record for each

individual transaction in the source system since the beginning of the

recording period. The real-time partition has the same structure as its

underlying static fact table, but it just contains the transactions that have

occurred since the last data warehouse refresh. In addition, the real-time

partition should not be indexed, so that it is always ready for loading.

Although the static fact tables are usually big and heavily indexed, real-time

partitions may fit in main memory, and thus, there is no need to index

them. As an example, let us consider a simplified version of the Sales fact

table below.
