Databases Reference
In-Depth Information
A 1 ,...,A n , the values for individual attributes A i are analyzed. This analysis
includes determining the distribution and frequency of attribute values, e.g.,
in the form of histograms, and the minimum and maximum values and lengths
for numerical and alpha-numerical attributes, respectively.
Such a simple analysis, which can be done using SQL statements at the
relation level, can reveal quite a lot of useful information, for example, outliers
that do not conform to certain properties the data are assumed to have. The
analysis can also be extended to sets of tuples from one or more relations where
now tuples and combinations thereof are analyzed and correlations among at-
tribute values are determined using standard association analysis techniques
(see, e.g., [57]). Assume, for example, two relations R 1 ( A, C )and R 2 ( B, D )
with a foreign key dependency R 2 .B
R 1 .A . For any two matching tuples
t 1
R 2 , an association can describe that the value of t 1 .C always
determines the range of the attribute value t 2 .D . The association rules discov-
ered are then evaluated and compared to what is expected from the data. In
principle, many of the analysis tasks resemble standard actions employed in
the evaluation of the quality of the data [9, 18]. It should also be noted that
in production databases, most of the above information is typically readily
available. This is in particular the case where statistics for relations are pe-
riodically maintained to provide cost-based query optimizer with information
for choosing ecient query evaluation plans. Statistical information about the
relations is then available in the database's data dictionary.
For data profiling, it is assumed that the above analysis tasks are per-
formed on a snapshot of the database, that is, at a particular point in time.
This can be done in a batch-mode when the workload of the database is low
(e.g., during night), or by using a stand-by or recent backup database. For
large-scale databases with hundreds of relations, clearly not all relations are
analyzed but only those that are relevant to specific security policies or those
that contain sensitive or mission-critical data as indicated in the initial step in
Figure 1. The snapshot profiles for these individual relations or parts thereof
(e.g., sets of tuples that have some specific properties or contain particularly
sensitive information) are managed in separate relations, specifically designed
for access to database security mechanisms (see Section 5). In the most sim-
ple case, a snapshot profile for a relation R
R 1 ,t 2
determined at a particular
point in time t , denoted DataP rof ( R, t ), is a collection of measure-value pairs
that describe properties of attributes of R . The measure is related to the data,
e.g., number or frequencies of different attribute values, and the measure value
denotes what has been computed for a measure.
The above discussion makes one point very clear: developing a misuse de-
tection approach on top of a database for which the quality of the data is
not known or the data is of poor quality is likely to be not practical. That
is, if the normal behavior of users is to be determined, erroneous, missing,
or extra data in the underlying relations can significantly skew the statistical
models describing the normal behavior of users in terms of accesses to the
data. Therefore, it is essential to “clean” the data, i.e., removing (if possi-
∈R
Search WWH ::




Custom Search