Note that data latency requirements differ between application scenarios.
For example, collaborative filtering, with queries such as “People who like
X also like Y,” requires data freshness in the range of hours, while fraud
detection, for instance in credit card usage, needs a data latency on the
order of minutes or seconds. However, most applications do not require such
stringent latency levels. In these cases, the common strategy in practice
is simply to increase the frequency of ETL operations using so-called
mini-batch ETL processes, for example, loading data every 10 minutes.
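
As a rough illustration of the mini-batch idea, the following Python sketch loads, at each cycle, only the source rows that arrived after the last loaded timestamp (a watermark). The names run_mini_batch, visible, source, and warehouse are hypothetical and introduced only for this example; the scheduler is simulated with two cycles 10 minutes apart, and the transformation step is omitted.

```python
from datetime import datetime, timedelta


def run_mini_batch(source_rows, warehouse, last_loaded):
    """Load every source row newer than the watermark; return the new watermark."""
    batch = [row for row in source_rows if row["ts"] > last_loaded]
    batch.sort(key=lambda row: row["ts"])
    warehouse.extend(batch)  # a real process would transform the rows first
    return batch[-1]["ts"] if batch else last_loaded


def visible(rows, now):
    """Rows that already exist in the source system at time `now`."""
    return [row for row in rows if row["ts"] <= now]


start = datetime(2024, 1, 1, 9, 0)
# Hypothetical source: one new transaction every 4 minutes.
source = [{"id": i, "ts": start + timedelta(minutes=4 * i)} for i in range(6)]

warehouse = []
watermark = start - timedelta(seconds=1)

# Simulate two ETL cycles, 10 minutes apart.
for now in (start + timedelta(minutes=10), start + timedelta(minutes=20)):
    watermark = run_mini_batch(visible(source, now), warehouse, watermark)
```

With this scheme, data latency is bounded by the cycle length plus the load time, which is why shortening the cycle is the usual first step before changing the architecture.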
Several strategies have been devised to achieve real-time ETL and thus reduce
data latency. The simplest one, which requires the least effort in terms of
changes to existing architectures, is called near real-time ETL; it
simply increases the frequency of ETL processes. Most of the research
work in the field follows this approach. However, this is not enough when
data latency must be drastically reduced.
A classic solution to reduce data latency is to define real-time
partitions for fact tables. In this case, real-time and static data are stored
in separate tables. Real-time partitions are subject to special update and
query rules and must have the same schema as the fact tables. Ideally, they
must:
• Contain all updates that have occurred since the last refresh of the fact table.
• Have the same granularity as the fact table.
• Be lightly indexed in order to efficiently handle input data.
• Support high-performance querying.
Query tools should be able to distinguish between the two kinds of tables and
know where to find data. That means these tools must formulate a query
over both the static fact tables and the real-time partitions. This capability is not
always provided by commercial tools, however. Note also that this technique
is orthogonal to the database technology used. Thus, real-time partitions
could be stored in traditional RDBMSs, column-store database systems, or
IMDBSs.
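
To make the querying requirement concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a hypothetical static fact table sales and a real-time partition sales_rt with the same schema; these table and column names, and the helper functions total_by_product and refresh, are illustrative only. A query over both tables is expressed with UNION ALL, and a refresh moves the partition's rows into the static table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Static fact table: large and indexed.
    CREATE TABLE sales (product_key INTEGER, date_key INTEGER, sales_amount REAL);
    CREATE INDEX sales_by_product ON sales (product_key);
    -- Real-time partition: same schema, left unindexed here for fast loading.
    CREATE TABLE sales_rt (product_key INTEGER, date_key INTEGER, sales_amount REAL);
""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 20240101, 100.0), (2, 20240101, 50.0)])
conn.executemany("INSERT INTO sales_rt VALUES (?, ?, ?)",
                 [(1, 20240102, 25.0), (3, 20240102, 10.0)])


def total_by_product(conn):
    """Aggregate over the static fact table and the real-time partition together."""
    rows = conn.execute("""
        SELECT product_key, SUM(sales_amount)
        FROM (SELECT product_key, sales_amount FROM sales
              UNION ALL
              SELECT product_key, sales_amount FROM sales_rt)
        GROUP BY product_key
        ORDER BY product_key
    """).fetchall()
    return dict(rows)


def refresh(conn):
    """Move all real-time rows into the static fact table in one transaction."""
    with conn:
        conn.execute("INSERT INTO sales SELECT * FROM sales_rt")
        conn.execute("DELETE FROM sales_rt")


before = total_by_product(conn)
refresh(conn)
after = total_by_product(conn)
```

Because the query always reads both tables, its result should be the same before and after a refresh; only the physical location of the rows changes.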
There are three types of real-time partitions depending on their granularity,
which can be transaction, periodic snapshot, or accumulating snapshot
granularity. We explain these types next.
A transaction-granularity real-time partition contains one record for each
individual transaction in the source system since the beginning of the
recording period. The real-time partition has the same structure as its
underlying static fact table, but it just contains the transactions that have
occurred since the last data warehouse refresh. In addition, the real-time
partition should not be indexed, so that it is always ready for loading.
Although the static fact tables are usually big and heavily indexed, real-time
partitions may fit in main memory, and thus there is no need to index
them. As an example, let us consider a simplified version of the Sales fact
table below.