Table 4.6 Generalized Relation Obtained by Attribute-Oriented Induction on Table 4.5's Data
gender  major    birth_country  age_range  residence_city  gpa        count
M       Science  Canada         20-25      Richmond        very good  16
F       Science  Foreign        25-30      Burnaby         excellent  22
respect to the attribute generalization threshold. Generalization of birth date should
therefore take place.
6. residence: Suppose that residence is defined by the attributes number, street,
residence_city, residence_province_or_state, and residence_country. The number of
distinct values for number and street will likely be very high, since these concepts
are quite low level. The attributes number and street should therefore be removed so
that residence is then generalized to residence_city, which contains fewer distinct
values.
7. phone#: As with the name attribute, phone# contains too many distinct values and
should therefore be removed in generalization.
8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade
point average into numeric intervals like {3.75-4.0, 3.5-3.75, ...}, which in turn are
grouped into descriptive values such as {"excellent", "very good", ...}. The attribute
can therefore be generalized.
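The remove-or-generalize decisions in the items above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's algorithm: the threshold value, the sample values, and the helper names decide and generalize_gpa are all hypothetical, and pairing the interval 3.75-4.0 with "excellent" and 3.5-3.75 with "very good" is an assumed reading of the concept hierarchy.

```python
# Hypothetical attribute generalization threshold; the text does not fix a value.
ATTRIBUTE_THRESHOLD = 3


def decide(values, has_hierarchy, threshold=ATTRIBUTE_THRESHOLD):
    """Return 'keep', 'generalize', or 'remove' for a single attribute.

    An attribute with too many distinct values is generalized when a
    concept hierarchy (or a higher-level attribute) exists for it, and
    removed otherwise.
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"
    return "generalize" if has_hierarchy else "remove"


def generalize_gpa(gpa):
    """Climb the assumed hierarchy: number -> interval -> descriptive label."""
    if gpa >= 3.75:
        return "excellent"   # assumed label for the interval 3.75-4.0
    if gpa >= 3.5:
        return "very good"   # assumed label for the interval 3.5-3.75
    return "other"           # placeholder for the levels the text elides


# Invented sample values, used only to exercise the functions.
phones = ["555-0101", "555-0102", "555-0103", "555-0104"]
gpas = [3.67, 3.70, 3.83, 3.45]

print(decide(phones, has_hierarchy=False))  # phone#: no hierarchy available
print(decide(gpas, has_hierarchy=True))     # gpa: hierarchy available
print([generalize_gpa(g) for g in gpas])
```

Under these assumptions phone# is removed because it has many distinct values and nothing to generalize to, while gpa is generalized because a hierarchy is available.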
The generalization process will result in groups of identical tuples. For example, the
first two tuples of Table 4.5 both generalize to the same tuple (namely, the first
tuple shown in Table 4.6). Such identical tuples are then merged into one, with their
counts accumulated. This process leads to the generalized relation shown in Table 4.6.
Based on the vocabulary used in OLAP, we may view count() as a measure, and the
remaining attributes as dimensions. Note that aggregate functions, such as sum(), may be
applied to numeric attributes (e.g., salary and sales). These attributes are referred to as
measure attributes.
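The merge step can be read as a group-by over the generalized dimension values. The sketch below is one possible way to express it in Python, not the book's implementation: the sample rows and sales figures are invented for illustration, and the helper names merge_with_counts and merge_with_sum are hypothetical.

```python
from collections import Counter, defaultdict

# Invented generalized tuples (dimension values only), not the data of Table 4.5.
generalized = [
    ("M", "Science", "Canada", "20-25", "Richmond", "very good"),
    ("M", "Science", "Canada", "20-25", "Richmond", "very good"),
    ("F", "Science", "Foreign", "25-30", "Burnaby", "excellent"),
]


def merge_with_counts(rows):
    """Merge identical tuples and append the accumulated count as a measure."""
    counts = Counter(rows)
    return [dims + (n,) for dims, n in counts.items()]


def merge_with_sum(rows):
    """Merge (dimensions, numeric value) pairs, summing the numeric measure."""
    totals = defaultdict(float)
    for dims, value in rows:
        totals[dims] += value
    return dict(totals)


relation = merge_with_counts(generalized)
for row in relation:
    print(row)

# A numeric measure attribute (hypothetical sales figures) aggregated with sum().
sales = [(("Canada",), 100.0), (("Canada",), 50.0), (("Foreign",), 75.0)]
print(merge_with_sum(sales))
```

Here the dimension values act as the grouping key, and count or sum is the measure accumulated for each group.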
4.5.2 Efficient Implementation of Attribute-Oriented Induction
“How is attribute-oriented induction actually implemented?” Section 4.5.1 provided an
introduction to attribute-oriented induction. The general procedure is summarized in
Figure 4.18. The efficiency of this algorithm is analyzed as follows:
Step 1 of the algorithm is essentially a relational query to collect the task-relevant data
into the working relation, W. Its processing efficiency depends on the query
processing methods used. Given the successful implementation and commercialization
of database systems, this step is expected to have good performance.
Step 2 collects statistics on the working relation. This requires scanning the relation
at most once. The cost for computing the minimum desired level and determining
the mapping pairs, (v, v'), for each attribute is dependent on the number of distinct
 