“What does the ‘where status in “graduate”’ clause mean?” The where clause implies
that a concept hierarchy exists for the attribute status. Such a concept hierarchy organizes
primitive-level data values for status (e.g., “M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.,” “B.Sc.,”
and “B.A.”) into higher conceptual levels (e.g., “graduate” and “undergraduate”). This
use of concept hierarchies does not appear in traditional relational query languages, yet
is likely to become a common feature in data mining query languages.
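To make the role of the concept hierarchy concrete, the following minimal Python sketch stores the hierarchy for status as a value-to-concept mapping and expands the high-level condition into a list of primitive values. The names STATUS_HIERARCHY, primitive_values, and expand_in_clause are hypothetical and chosen only for illustration; they do not belong to any data mining query language.

```python
# Hypothetical concept hierarchy for the attribute `status`:
# each primitive-level value maps to its higher-level concept.
STATUS_HIERARCHY = {
    "M.Sc.": "graduate",
    "M.A.": "graduate",
    "M.B.A.": "graduate",
    "Ph.D.": "graduate",
    "B.Sc.": "undergraduate",
    "B.A.": "undergraduate",
}


def primitive_values(concept, hierarchy):
    """Return the primitive values that generalize to `concept`."""
    return [value for value, parent in hierarchy.items() if parent == concept]


def expand_in_clause(attribute, concept, hierarchy):
    """Rewrite `attribute in concept` as a condition over primitive values."""
    quoted = ", ".join('"' + v + '"' for v in primitive_values(concept, hierarchy))
    return f"{attribute} in {{{quoted}}}"


print(expand_in_clause("status", "graduate", STATUS_HIERARCHY))
```

Under these assumptions, the condition on “graduate” becomes a condition over the four graduate-level values, which is the form an ordinary relational query can evaluate.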
The data mining query presented in Example 4.11 is transformed into the following
relational query for the collection of the task-relevant data set:
    use Big_University_DB
    select name, gender, major, birth_place, birth_date, residence, phone#, gpa
    from student
    where status in {“M.Sc.,” “M.A.,” “M.B.A.,” “Ph.D.”}
The transformed query is executed against the relational database, Big_University_DB,
and returns the data shown earlier in Table 4.5. This table is called the (task-relevant)
initial working relation. It is the data on which induction will be performed. Note that
each tuple is, in fact, a conjunction of attribute-value pairs. Hence, we can think of a
tuple within a relation as a rule of conjuncts, and of induction on the relation as the
generalization of these rules.
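The view of a tuple as a rule of conjuncts can be sketched in a few lines of Python. The tuples below are made up for illustration and are not the contents of Table 4.5; the function name covers is likewise an assumption.

```python
def covers(rule, row):
    """A rule (attribute -> value) covers a row if every conjunct holds in it."""
    return all(row.get(attr) == value for attr, value in rule.items())


# Made-up tuples, for illustration only.
relation = [
    {"name": "A", "major": "CS", "status": "M.Sc."},
    {"name": "B", "major": "CS", "status": "Ph.D."},
    {"name": "C", "major": "Physics", "status": "M.Sc."},
    {"name": "D", "major": "CS", "status": "M.Sc."},
]

# Read as a rule, the first tuple says:
#   name = A AND major = CS AND status = M.Sc.
specific_rule = dict(relation[0])
covered_by_specific = [t["name"] for t in relation if covers(specific_rule, t)]

# A rule with fewer conjuncts is more general: it is satisfied by more tuples.
general_rule = {k: v for k, v in specific_rule.items() if k != "name"}
covered_by_general = [t["name"] for t in relation if covers(general_rule, t)]

print(covered_by_specific, covered_by_general)
```

In this toy relation the full rule matches only the tuple it was built from, whereas the rule without the name conjunct also matches tuple D.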
“Now that the data are ready for attribute-oriented induction, how is attribute-oriented
induction performed?” The essential operation of attribute-oriented induction is data
generalization, which can be performed in either of two ways on the initial working
relation: attribute removal and attribute generalization.
Attribute removal is based on the following rule: If there is a large set of distinct values
for an attribute of the initial working relation, but either (case 1) there is no generalization
operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or
(case 2) its higher-level concepts are expressed in terms of other attributes, then the attribute
should be removed from the working relation.
Let's examine the reasoning behind this rule. An attribute-value pair represents a
conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a
constraint and thus generalizes the rule. If, as in case 1, there is a large set of distinct values
for an attribute but there is no generalization operator for it, the attribute should be
removed because it cannot be generalized. Preserving it would imply keeping a large
number of disjuncts, which contradicts the goal of generating concise rules. On the
other hand, consider case 2, where the attribute's higher-level concepts are expressed
in terms of other attributes. For example, suppose that the attribute in question is street,
with higher-level concepts that are represented by the attributes ⟨city, province_or_state,
country⟩. Because those attributes already carry the higher-level information, the removal
of street is equivalent to the application of a generalization operator. This rule corresponds
to the generalization rule known as dropping condition in the machine learning literature
on learning from examples.
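The attribute-removal rule can be sketched as a simple test over the working relation. The text does not say how many distinct values count as “a large set,” so the threshold parameter below is an assumption, as are the function name attributes_to_remove, the two sets describing which attributes have hierarchies, and the sample tuples.

```python
def attributes_to_remove(relation, threshold, has_operator, expressed_by_others):
    """Apply the attribute-removal rule to a relation given as a list of dicts.

    threshold           -- assumed cutoff: more distinct values than this is "large"
    has_operator        -- attributes with a generalization operator (e.g., a hierarchy)
    expressed_by_others -- attributes whose higher-level concepts are other attributes
    """
    removed = []
    attributes = list(relation[0].keys()) if relation else []
    for attr in attributes:
        distinct = {row[attr] for row in relation}
        if len(distinct) <= threshold:
            continue  # not a large set of distinct values; the rule does not apply
        case1 = attr not in has_operator        # cannot be generalized
        case2 = attr in expressed_by_others     # already generalized by other attributes
        if case1 or case2:
            removed.append(attr)
    return removed


# Made-up working relation, for illustration only.
relation = [
    {"name": "A", "street": "1 Main St", "city": "Vancouver", "major": "CS"},
    {"name": "B", "street": "2 Oak Ave", "city": "Vancouver", "major": "Physics"},
    {"name": "C", "street": "3 Elm Rd", "city": "Burnaby", "major": "Math"},
    {"name": "D", "street": "4 Pine St", "city": "Burnaby", "major": "CS"},
]

result = attributes_to_remove(
    relation,
    threshold=2,
    has_operator={"street", "city", "major"},
    expressed_by_others={"street"},
)
print(result)
```

Here name falls under case 1 and street under case 2, so both are dropped. The attribute major also has many distinct values, but it has a generalization operator of its own and is therefore kept.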
Attribute generalization is based on the following rule: If there is a large set of distinct
values for an attribute in the initial working relation, and there exists a set of generalization
operators on the attribute, then a generalization operator should be selected and applied