Data Warehousing and Online Analytical Processing - Data Mining: Concepts and Techniques - page 172

Table 4.6 Generalized Relation Obtained by Attribute-Oriented Induction on Table 4.5's Data

gender  major    birth_country  age_range  residence_city  gpa        count
M       Science  Canada         20-25      Richmond        very good  16
F       Science  Foreign        25-30      Burnaby         excellent  22

respect to the attribute generalization threshold. Generalization of birth date should

therefore take place.

6. residence: Suppose that residence is defined by the attributes number, street, residence_city, residence_province_or_state, and residence_country. The number of distinct values for number and street will likely be very high, since these concepts are quite low level. The attributes number and street should therefore be removed so that residence is then generalized to residence_city, which contains fewer distinct values.

7. phone#: As with the name attribute, phone# contains too many distinct values and

should therefore be removed in generalization.

8. gpa: Suppose that a concept hierarchy exists for gpa that groups values for grade point average into numeric intervals like {3.75-4.0, 3.5-3.75, ...}, which in turn are grouped into descriptive values such as {"excellent", "very good", ...}. The attribute can therefore be generalized.
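The per-attribute decisions in the list above follow one pattern: compare the number of distinct values with the attribute generalization threshold, then either keep the attribute, generalize it through a concept hierarchy, or remove it when no hierarchy exists. The following is a minimal sketch of that rule, not the book's algorithm. The function name `plan_attribute`, the sample values, and the threshold of 3 are all hypothetical and chosen only for illustration.

```python
def plan_attribute(values, threshold, has_hierarchy):
    """Decide how to treat one attribute of the working relation.

    Returns "keep" when the attribute already has few distinct values,
    "generalize" when it has too many but a concept hierarchy exists,
    and "remove" when it has too many and cannot be generalized.
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"
    return "generalize" if has_hierarchy else "remove"


# Hypothetical sample columns (not the data of Table 4.5).
columns = {
    "gender": (["M", "F", "M", "M"], False),
    "phone#": (["687-4598", "253-9106", "420-5232", "555-0101"], False),
    "gpa": ([3.67, 3.70, 3.83, 3.91], True),
}

threshold = 3  # assumed attribute generalization threshold
decisions = {
    name: plan_attribute(values, threshold, has_hierarchy)
    for name, (values, has_hierarchy) in columns.items()
}
print(decisions)
```

In this sketch phone# is removed because it has many distinct values and no hierarchy, while gpa is marked for generalization because a hierarchy is assumed to exist.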

The generalization process will result in groups of identical tuples. For example, the first two tuples of Table 4.5 both generalize to the same tuple (namely, the first tuple shown in Table 4.6). Such identical tuples are then merged into one, with their counts accumulated. This process leads to the generalized relation shown in Table 4.6.
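The merge step can be sketched in a few lines: generalize each tuple, then count how many original tuples collapse into each generalized tuple. The rows and hierarchies below are hypothetical stand-ins, not the contents of Table 4.5. In particular, the `MAJOR_TO_FIELD` mapping, the Canada/Foreign rule, and the half-open five-year age ranges are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical concept hierarchy for major.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}


def generalize(row):
    """Map one (gender, major, birth_place, age) tuple to a higher level."""
    gender, major, birth_place, age = row
    field = MAJOR_TO_FIELD.get(major, "Other")
    # Assumed rule: anything outside Canada is labeled "Foreign".
    country = "Canada" if birth_place.endswith("Canada") else "Foreign"
    low = (age // 5) * 5  # assumed five-year ranges, lower bound inclusive
    return (gender, field, country, f"{low}-{low + 5}")


# Illustrative working relation.
rows = [
    ("M", "CS", "Vancouver, BC, Canada", 21),
    ("M", "CS", "Montreal, Que, Canada", 22),
    ("F", "Physics", "Seattle, WA, USA", 27),
]

# Identical generalized tuples merge into one entry with an accumulated count.
generalized_relation = Counter(generalize(row) for row in rows)

for generalized_tuple, count in generalized_relation.items():
    print(generalized_tuple, count)
```

The first two rows differ only in low-level detail, so they merge into a single generalized tuple with count 2.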

Based on the vocabulary used in OLAP, we may view count() as a measure, and the remaining attributes as dimensions. Note that aggregate functions, such as sum(), may be applied to numeric attributes (e.g., salary and sales). These attributes are referred to as measure attributes.
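As a rough illustration of the dimension and measure view, the sketch below treats two generalized attributes as dimensions and accumulates both count() and sum() over a numeric measure attribute. The rows and the `sales` figures are invented for the example, and the dictionary layout is only one possible representation.

```python
from collections import defaultdict

# Hypothetical generalized tuples: (major, birth_country, sales).
rows = [
    ("Science", "Canada", 1200.0),
    ("Science", "Canada", 800.0),
    ("Science", "Foreign", 500.0),
]

# One cell per combination of dimension values; each cell holds the measures.
cells = defaultdict(lambda: {"count": 0, "sum_sales": 0.0})
for major, country, sales in rows:
    cell = cells[(major, country)]
    cell["count"] += 1          # count() measure
    cell["sum_sales"] += sales  # sum() over the measure attribute

for dimensions, measures in cells.items():
    print(dimensions, measures)
```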

4.5.2 Efficient Implementation of Attribute-Oriented Induction

“How is attribute-oriented induction actually implemented?” Section 4.5.1 provided an

introduction to attribute-oriented induction. The general procedure is summarized in

Figure 4.18. The efficiency of this algorithm is analyzed as follows:

Step 1 of the algorithm is essentially a relational query to collect the task-relevant data into the working relation, W. Its processing efficiency depends on the query processing methods used. Given the successful implementation and commercialization of database systems, this step is expected to have good performance.

Step 2 collects statistics on the working relation. This requires scanning the relation at most once. The cost for computing the minimum desired level and determining the mapping pairs, (v, v'), for each attribute is dependent on the number of distinct
