Database Reference
In-Depth Information
Fig. 16.4 A modified decision tree for recommender systems.
the two groups. However, different splits may result in different group
sizes. Comparing the size of the intersection between splits of different
sub-relation sizes would not be sucient since an even split for example
(half the tuples in group
A
and half in group
B
) would probably have a
larger intersection than a very uneven split (such as one tuple in group
A
and all the rest in group
B
). Instead, we look at the probability of our
split's item-intersection size compared to a random (uniform) split of similar
group sizes. A split which is very likely to occur even in a random split is
considered a bad split (less preferred) since it is similar to a random split
and probably does not distinguish the groups from each other. A split that
is the least probable to occur is assumed to be a better distinction of the
two subgroups and is the split selected by the heuristic.
More formally let us denote:
•
items
(
S
)=
π
ItemID
(
S
), where
π
is the projection operation in relational
algebra, the set of all
ItemIDs
that appear in a set
S
.
•
O
i
(
S
)=
,where
σ
is the select operation in relation
algebra, is the number of occurrences of the
ItemID i
in the set
S
.
|
σ
ItemID
=
i
(
S
)
|
Let
S
q
(
q
denotes the number of ratings) be a random binary partition
of the tuples in
Ratings
into two sub-relations
A
and
B
consisting of
k
and
q
k
tuples, respectively. We are interested in the probability distribution
of
S
q
≡|
−
(
items
(
A
)
items
(
B
)
|
.
Search WWH ::
Custom Search