Data Warehousing and Online Analytical Processing - Data Mining: Concepts and Techniques

“What does the 'where status in “graduate”' clause mean?” The where clause implies

that a concept hierarchy exists for the attribute status. Such a concept hierarchy organizes

primitive-level data values for status (e.g., “M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.,” “B.Sc.,”

and “B.A.”) into higher conceptual levels (e.g., “graduate” and “undergraduate”). This

use of concept hierarchies does not appear in traditional relational query languages, yet

is likely to become a common feature in data mining query languages.
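
As a minimal sketch of how such a hierarchy might be represented, the mapping below links each primitive-level value of status to its higher-level concept. The names STATUS_HIERARCHY and primitive_values are hypothetical, introduced only for illustration; the text does not prescribe any particular data structure.

```python
# Hypothetical one-level concept hierarchy for the attribute `status`:
# each primitive-level value maps to its higher-level concept.
STATUS_HIERARCHY = {
    "M.Sc.": "graduate",
    "M.A.": "graduate",
    "M.B.A.": "graduate",
    "Ph.D.": "graduate",
    "B.Sc.": "undergraduate",
    "B.A.": "undergraduate",
}


def primitive_values(concept, hierarchy):
    """Return the primitive-level values that generalize to `concept`."""
    return [value for value, parent in hierarchy.items() if parent == concept]


graduate_values = primitive_values("graduate", STATUS_HIERARCHY)
```

Under this sketch, a clause such as where status in “graduate” can be rewritten over the primitive values that primitive_values returns for “graduate”.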

The data mining query presented in Example 4.11 is transformed into the following

relational query for the collection of the task-relevant data set:

use Big_University_DB

select name, gender, major, birth_place, birth_date, residence, phone#, gpa

from student

where status in {“M.Sc.”, “M.A.”, “M.B.A.”, “Ph.D.”}

The transformed query is executed against the relational database, Big_University_DB,

and returns the data shown earlier in Table 4.5. This table is called the (task-relevant)

initial working relation. It is the data on which induction will be performed. Note that

each tuple is, in fact, a conjunction of attribute-value pairs. Hence, we can think of a

tuple within a relation as a rule of conjuncts, and of induction on the relation as the

generalization of these rules.
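
The collection step and the tuple-as-rule view can be sketched roughly as follows, using Python's built-in sqlite3 module in place of the actual Big_University_DB. The table holds only a subset of the columns, the three rows are invented placeholders (not the contents of Table 4.5), and the helper names collect_working_relation and as_conjunction are assumptions made for this illustration.

```python
import sqlite3

# Primitive-level values that the concept "graduate" expands to (from the text).
GRADUATE_VALUES = ["M.Sc.", "M.A.", "M.B.A.", "Ph.D."]


def collect_working_relation(conn, status_values):
    """Run the transformed relational query and return rows as dicts."""
    placeholders = ", ".join("?" for _ in status_values)
    sql = (
        "SELECT name, major, gpa FROM student "
        f"WHERE status IN ({placeholders}) ORDER BY name"
    )
    cursor = conn.execute(sql, status_values)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def as_conjunction(row):
    """Render a tuple as a conjunction of attribute-value pairs."""
    return " AND ".join(f"{attr} = {value!r}" for attr, value in row.items())


# In-memory stand-in for the database, filled with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, major TEXT, gpa REAL, status TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [
        ("Student A", "CS", 3.7, "M.Sc."),
        ("Student B", "Physics", 3.4, "B.Sc."),
        ("Student C", "Biology", 3.8, "Ph.D."),
    ],
)

working_relation = collect_working_relation(conn, GRADUATE_VALUES)
rules = [as_conjunction(row) for row in working_relation]
```

Each string in rules reads as one conjunctive rule, which is the form that induction on the relation then generalizes.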

“Now that the data are ready for attribute-oriented induction, how is attribute-oriented

induction performed?” The essential operation of attribute-oriented induction is data

generalization, which can be performed in either of two ways on the initial working

relation: attribute removal and attribute generalization.

Attribute removal is based on the following rule: If there is a large set of distinct values

for an attribute of the initial working relation, but either (case 1) there is no generalization

operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or

(case 2) its higher-level concepts are expressed in terms of other attributes, then the attribute

should be removed from the working relation.

Let's examine the reasoning behind this rule. An attribute-value pair represents a

conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a constraint

and thus generalizes the rule. If, as in case 1, there is a large set of distinct values

for an attribute but there is no generalization operator for it, the attribute should be

removed because it cannot be generalized. Preserving it would imply keeping a large

number of disjuncts, which contradicts the goal of generating concise rules. On the

other hand, consider case 2, where the attribute's higher-level concepts are expressed

in terms of other attributes. For example, suppose that the attribute in question is street,

with higher-level concepts that are represented by the attributes ⟨city, province_or_state,

country⟩. The removal of street is equivalent to the application of a generalization operator.

This rule corresponds to the generalization rule known as dropping condition in the

machine learning literature on learning from examples.
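
The attribute removal rule can be sketched as below. This is an illustration under stated assumptions, not a definitive implementation: the passage does not quantify "a large set of distinct values", so the numeric threshold is an assumption, and the function names, the toy rows, and the two sets describing which attributes have generalization operators are all hypothetical.

```python
def should_remove(attribute, relation, hierarchies, expressed_elsewhere, threshold):
    """Apply the attribute-removal rule to one attribute of the working relation."""
    distinct_values = {row[attribute] for row in relation}
    if len(distinct_values) <= threshold:
        return False  # not a "large" set of distinct values, so the rule does not apply
    no_operator = attribute not in hierarchies            # case 1
    covered_by_others = attribute in expressed_elsewhere  # case 2
    return no_operator or covered_by_others


def remove_attributes(relation, hierarchies, expressed_elsewhere, threshold):
    """Drop every attribute that the removal rule selects."""
    attributes = list(relation[0].keys()) if relation else []
    to_drop = {
        attribute
        for attribute in attributes
        if should_remove(attribute, relation, hierarchies, expressed_elsewhere, threshold)
    }
    return [
        {attr: value for attr, value in row.items() if attr not in to_drop}
        for row in relation
    ]


# Placeholder rows, invented for illustration only.
relation = [
    {"name": "Student A", "street": "1 First St", "city": "City X", "major": "CS"},
    {"name": "Student B", "street": "2 Second St", "city": "City X", "major": "CS"},
    {"name": "Student C", "street": "3 Third St", "city": "City Y", "major": "Physics"},
]
hierarchies = {"street", "city", "major"}  # attributes assumed to have a generalization operator
expressed_elsewhere = {"street"}           # higher-level concepts held by other attributes

generalized = remove_attributes(relation, hierarchies, expressed_elsewhere, threshold=2)
```

In this toy run, name is dropped under case 1 (many distinct values, no generalization operator) and street under case 2 (its higher-level concepts are carried by city), while city and major stay because they have too few distinct values for the rule to apply.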

Attribute generalization is based on the following rule: If there is a large set of distinct

values for an attribute in the initial working relation, and there exists a set of generalization

operators on the attribute, then a generalization operator should be selected and applied
