Table 4.6 Generalized Relation Obtained by Attribute-Oriented Induction on Table 4.5's Data

gender  major    birth_country  age_range  residence_city  gpa        count
M       Science  Canada         20-25      Richmond        very good  16
F       Science  Foreign        25-30      Burnaby         excellent  22

respect to the attribute generalization threshold. Generalization of birth_date should therefore take place.
6. residence: Suppose that residence is defined by the attributes number, street, residence_city, residence_province_or_state, and residence_country. The number of distinct values for number and street will likely be very high, since these concepts are quite low level. The attributes number and street should therefore be removed so that residence is then generalized to residence_city, which contains fewer distinct values.
7. phone#: As with the name attribute, phone# contains too many distinct values and should therefore be removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade point average into numeric intervals like {3.75-4.0, 3.5-3.75, ...}, which in turn are grouped into descriptive values such as {“excellent”, “very good”, ...}. The attribute can therefore be generalized.
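
The two decisions made in the list above, removing an attribute or climbing its concept hierarchy, can be sketched as follows. This is a minimal illustration and not the book's algorithm: the names GPA_LEVELS, generalize_gpa, and should_remove are hypothetical, only the two highest intervals and labels come from the text, and a boundary value such as 3.75 is assumed to belong to the higher interval.

```python
# Hypothetical concept hierarchy for gpa: numeric value -> interval -> label.
# Only the first two rows are taken from the text; the rest are assumptions.
GPA_LEVELS = [
    (3.75, "3.75-4.0", "excellent"),
    (3.50, "3.5-3.75", "very good"),
    (3.00, "3.0-3.5", "good"),    # assumed
    (0.00, "0.0-3.0", "other"),   # assumed
]


def generalize_gpa(gpa):
    """Climb the hierarchy: return (interval, label) for a numeric gpa."""
    for lower, interval, label in GPA_LEVELS:
        if gpa >= lower:
            return interval, label
    raise ValueError(f"gpa out of range: {gpa}")


def should_remove(values, threshold, has_hierarchy):
    """Remove an attribute that has many distinct values and no hierarchy."""
    return len(set(values)) > threshold and not has_hierarchy


print(generalize_gpa(3.8))
print(should_remove(["555-0001", "555-0002", "555-0003"], 2, False))
```

Under these assumptions, phone# is removed because it has more distinct values than the threshold and no hierarchy to climb, whereas gpa is kept and generalized because a hierarchy exists.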
The generalization process will result in groups of identical tuples. For example, the first two tuples of Table 4.5 both generalize to the same tuple (namely, the first tuple shown in Table 4.6). Such identical tuples are then merged into one, with their counts accumulated. This process leads to the generalized relation shown in Table 4.6.
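
The merging step can be illustrated with a short sketch. The function name merge_generalized and the three input tuples are hypothetical; the tuples are toy data shaped like the rows of Table 4.6, so the resulting counts are not the 16 and 22 reported there.

```python
from collections import Counter


def merge_generalized(tuples):
    """Merge identical generalized tuples, appending the accumulated count."""
    counts = Counter(tuples)
    return [row + (n,) for row, n in counts.items()]


# Toy generalized tuples (assumed data, already generalized per attribute).
generalized = [
    ("M", "Science", "Canada", "20-25", "Richmond", "very good"),
    ("M", "Science", "Canada", "20-25", "Richmond", "very good"),
    ("F", "Science", "Foreign", "25-30", "Burnaby", "excellent"),
]

relation = merge_generalized(generalized)
for row in relation:
    print(row)
```

Because the generalized tuples are hashable, a single pass over the data is enough to accumulate the counts.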
Based on the vocabulary used in OLAP, we may view count( ) as a measure, and the remaining attributes as dimensions. Note that aggregate functions, such as sum( ), may be applied to numeric attributes (e.g., salary and sales). These attributes are referred to as measure attributes.
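
To make the distinction between dimensions and measures concrete, the sketch below groups rows by their dimension values and accumulates both count( ) and sum( ) for a numeric attribute. The function aggregate, the attribute names, and the salary figures are assumptions introduced only for illustration.

```python
from collections import defaultdict


def aggregate(rows, dimensions, numeric_attr):
    """Group rows by dimension values; accumulate count and sum measures."""
    cells = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        cells[key]["count"] += 1
        cells[key]["sum"] += row[numeric_attr]
    return dict(cells)


# Hypothetical rows: major and gpa act as dimensions, salary as a measure attribute.
rows = [
    {"major": "Science", "gpa": "excellent", "salary": 1000},
    {"major": "Science", "gpa": "excellent", "salary": 1500},
    {"major": "Science", "gpa": "very good", "salary": 1200},
]

result = aggregate(rows, ["major", "gpa"], "salary")
for key, measures in result.items():
    print(key, measures)
```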

4.5.2 Efficient Implementation of Attribute-Oriented Induction

“How is attribute-oriented induction actually implemented?” Section 4.5.1 provided an introduction to attribute-oriented induction. The general procedure is summarized in Figure 4.18. The efficiency of this algorithm is analyzed as follows:
Step 1 of the algorithm is essentially a relational query to collect the task-relevant data into the working relation, W. Its processing efficiency depends on the query processing methods used. Given the successful implementation and commercialization of database systems, this step is expected to have good performance.
Step 2 collects statistics on the working relation. This requires scanning the relation at most once. The cost for computing the minimum desired level and determining the mapping pairs, (v, v'), for each attribute is dependent on the number of distinct