An Efficient Mining of Dominant Entity Based Association Rules in Multi-databases (WiMoA 2011 and ICCSEA 2011)

Abstract

Today, we have a large collection of data that is organized in the form of a set of relations which is partitioned into several databases. There could be implicit associations among various parts of this data. In this paper, we give a scheme for retrieving these associations using the notion of dominant entity. We propose a scheme for mining for dominant entity based association rules (DEBARs) which is not constrained to look for co-occurrence of values in tuples. We show the importance of such a mining activity by taking a practical example called personalized mining. We introduce a novel structure called multi-database domain link network (MDLN) which can be used to generate DEBARs between the values of attributes belonging to different databases. We show that MDLN structure is compact and this property of MDLN structure permit it to be used for mining vary large size databases. Experimental results reveal the efficiency of the proposed scheme.

Keywords: Multi-databases, Valueset, Personalized mining, Dominant entity, Association rules.

Introduction

In real-life situations, associations among data items are hidden and mining for these associations is further complicated because of possible distribution of the data across several databases [1] [2]. For example, consider person based information system; wherein information about the person is available as customer in SUPERMARKET database, as employee in COMPANY database, as patient in the MEDICAL CENTRE database, and as passenger in TRAVEL database. There could be an implicit association among various parts of this data. For example, there may be an association between salary, region of living, mode of traveling and disease; like people who have salary in the range of US$ 1000 -US$ 2,000, eat frequently at CENTRAL CALCUTTA hotels and travel by air-in executive class have cardiovascular diseases, where CENTRAL CALCUTTA is marked with a high pollution rating. Such examples provide the associations of a person in different contexts. We call such a mining activity person-based associations mining (or personalized mining) and the system under which the mining activity performed is called person warehouse. The primary property of such a warehouse is the possibility to generate a global schema linked by one entity which we call, dominant entity. The associations between the values of attributes of interest using the dominant entity are called dominant entity based associations. In the above example, the dominant entity is person which is characterized by attributes name and address. The set of attributes which characterizes dominant entity is called dominant entity attributes (DEA). The sets of attributes between which we want to find the associations are called characteristic attributes (CA). In person based information system example quoted above, {salary, region-of-living}, {mode-of-travel}, {diseases} are characteristic attributes.

In this paper, we are addressing the problem of ‘dominant entity based association mining activity in multiple databases’. For illustration purposes we use ‘person warehouse’ through out the paper. However, the notion can suit any domain where we have a dominant entity and the data warehouse.

Problem Definition

The conventional association rules are mainly based on togetherness or cooccurrence of values in a tuple [3] [6]. However, in the dominant entity based mining like personalized mining, we want to mine the rules of the following type: income > "Range X" A age < "Range Y" purchase = "Costly goods". Note that in order to generate such rules from multiple relations/databases, support for individual values are important and associations are built using the dominant entity. Such rules are called dominant entity based association rules, (DEBAR). Further, in the above rule, all "Costly goods" need not be purchased together. (The goods purchased is said to be "costly" if the cost is more than "Z Dollars"). We originate a scheme for mining for dominant entity based association rules which is not constrained to look for co-occurrence of values in tuples. It can look at attributes in one more databases which may be located at many places. We discuss this scheme in section 2.

In order to generate DEBARs, there is a need to link the values of attributes of interest to the values of dominant entity attributes. We propose and develop a structure called "multi-database domain link network (MDLN)" for the mining activity that involves several databases. This structure provides link between the values of characteristic attributes and the values of dominant entity attributes in an efficient manner in terms of – space to hold the structure; and access time to generate DEBARs out of it.

The rest of this paper is organized as follows. We discuss MDLN structure in section 3. We show how meaningful associations can be mined using MDLN in section 4. Experimental results are described in section 5. We conclude our study in section 6.

Dominant Entity Based Association Rules (DEBAR)

Letvalues of characteristic attributes,and

which belong to relationsof databases Dx and Dy respectively.

Let d\,d2,… dN be the N distinct values pertaining to DEA.

For illustrative example, please refer [5]. We present an algorithm to generate DEBARs in section 4.

The DEBARs are different from conventional association rules (AR) [3] in the following aspects :

1. In AR mining, the frequency (largeness) of an itemset is based on the number of times it appears in the transaction database. It would account for every transaction made ignoring the fact that the same transaction is made by the same customer many number of times. However, in DEBAR, we are interested in mapping of valuesets on distinct DEA values. For example, in the person based information system, if many people who have two wheelers are suffering from headache, then an interesting DEBAR could be "two-wheeler user has-headaches". Here, we do not care if the same person has many two wheelers. Because in our frame work, the frequency is based on how many DEA values are involved in the mapping process.

2. In AR mining, even if items ia and ib occur many number of times, and do not occur together in any transaction, then (ia, ib) is not a large itemset because of the togetherness property. Note that people may not purchase several costly goods together. Therefore, costly goods may not occur together in the same transaction of a transaction database. So, the rule like "{Air-conditioner, Microwave, Refrigerator, TV} Four wheeler" is not possible, because the possibility of number of transactions which have all the four items in a transaction is less likely. In the case of DEBAR, we do not insist on together occurrence of the values/items; rather, how many DEA values are involved in the association process is of importance. So, the above rule is possible, if we find many persons with those goods bought, perhaps, even at different times.

Multi-database Domain Link Network (MDLN)

As discussed in section 2, we want to find the associations between the values of characteristic attributes which may belong to several databases. These values are mapped to a range of values of DEA. So, there is a need to link the values of characteristic attributes using the DEA values, with the help of which we can find the required associations. We propose a structure called multi-database domain link network (MDLN) to generate such links between the values of characteristic attributes and values of DEA across the databases.

Let Ci,C2,…Cm be m characteristic attributes pertaining to any relations which may belong to different databases. Let the domain of DEA be {d1: d2,.. .dN }.In person warehouse, the values of DEA, d1 ,d2,…dN map to each of <name,address> groups pertaining to N persons. [d1 ,d2,...,dN could be social security numbers (SSN), if they are available]. The linking mechanism among values of m characteristic attributes (CAs) through d1,d2,…, dN is shown in Figure 1. In the figure, B1,B2,…BN are structures to hold the associations.

Fig. 1. Linking mechanism

There are four types of relationships, one to many, one to one, many to one and many to many – that are possible between the values of characteristic attributes and DEA.

We discuss the handling of each of these relationships in the following four sub-sections. In order to realize a compact structure, we treat the relationships one to many (one to one) and many to many (many to one) between the characteristic attributes and dominant entity attributes differently.

One to Many (1:N) Relationship

The relationship between the characteristic attribute, C and the dominant entity attribute, DEA is 1 : N, if each value of Cj maps to one or more values of the DEA and not vice-versa.

For example, consider the relation, EMP.PERMANENTJNFORMATION shown in Figure 2A. Assume that the characteristic attribute is Age and dominant entity attribute is Pid.

Fig. 2. Handling of 1:N relationship for m associations

There is 1:N relationship between Age and Pid. Figure 2B shows the part of MDLN structure which handles the 1:N relationship. The figure shows two types of nodes – vnode1n (value node) and dnode (DEA value node). A vnode1n holds a value of the characteristic attribute, and the number of times the value is mapped to distinct dominant entity values. A dnode holds a value of the DEA. Both vnode1n and dnode also have two pointers, the Link pointer and the Next pointer.

1. Link pointer : This is a pointer which helps in navigation through the actual associated values of the DEA corresponding to a value of the CA in the form of a circular list. In Figure 2B, from the DEA node (called dnode) having value 1, the link pointer points to the dnode having value 3 whose link pointer points to the dnode having the value 6. Note that these three distinct Pid values are associated with a value 30 of the characteristic attribute Age. So, a circular list is constructed involving the vnode1n whose value = 30 and all dnodes which are having the associated values corresponding to Age = 30 using the link pointers. Since there are three Pid values associated with Age = 30, Count field of corresponding vnode1n is set to 3.

Link pointer in the vnode1n is a single pointer. However, dnode contains m link pointers since m characteristic attributes are involved in the relationship with the DEA.

2. Next pointer : This is used to generate the list of nodes pertaining to the values of a characteristic attribute or DEA.

One to One (1:1) Relationship

This is a trivial version of 1:N relationship. So, it can be handled as a specialization of the 1:N relationship. Here, there is no linked list, but only a node to node linking.

Many to Many (M:N) Relationship

The relationship between the characteristic attribute, Cj and the dominant entity attribute, DEA is M : N, if values of both Cj and DEA have more than one mapping from each other.

For example, consider the relation, CUSTOMER_PURCHASE shown in Figure 3A. Assume that the characteristic attribute is Goods-purchased and dominant entity attribute is Pid.

Fig. 3. Handling of M:N relationship for m associations

There is M:N relationship between Goods-purchased and Pid. Figure 3B shows the part of MDLN structure which handles the M:N relationship. The figure shows three types of nodes, dnode (DEA value node), vnodem„ (value node) and cnode (connector node). Similar to vnode1n, a vnodem„ holds a value of the characteristic attribute, and the number of times the value is mapped to the distinct DEA values. A cnode is a dummy node which is used to visualize a M:N relationship as two 1:N relationships. There are two types of pointers in dnode and vnodem„ : they are Link pointer and Next pointer. Link pointer and Next pointer in dnode are similar to that of respectively Link pointer and Next pointer in dnode of 1:N relationship. But Link pointer of vnodem„ points to a cnode and Next pointer of vnodem„ points to a vnodem„. A cnode has two pointers. One is dpointer, pointing to a cnode or a dnode. This pointer generates a circular list involving dnodes and cnodes. Other is a vpointer, pointing to a cnode or a vnodem„. This pointer generates a circular list involving vnodem„s and cnodes. The Count field of vnodem„ has a similar functionality to that of Count field of vnode1n.

Many to One (M:1) Relationship

It can be handled as an M:N relationship which is discussed in the previous section.

Refer [5] for the detailed algorithm on the construction of the MDLN structure and its space and time complexity analysis.

DEBAR Generation Using MDLN Structure

From the definition of DEBARs, it is clear that we are interested in the associations between the valuesets which are mapped to a range of values of DEA such that the cardinality of the range is greater than or equal to the user defined threshold value. Note that repeated mappings between the value of a characteristic attribute and the value of DEA are ignored.

In MDLN structure, we are not storing any information necessary to get the togetherness /co-occurrences of values in a tuple. And also the Count field of the node pertaining to a value of the CA is updated only if it maps to different DEA values. For example, in Figure 3B even though the relation has three instances of coffee, the Count value of the node is set to 2 indicating that two distinct Pids are having association with coffee.

We give a general outline of the DEBAR generating algorithm using the MDLN structure below.

Algorithm

Experimental Results

We conducted experiments to study the storage requirement patterns of the MDLN structure. We compare the size requirements of MDLN with PC-tree structure [4] for multiple databases. "PC-tree is a data structure which is used to store all the patterns occurring in the tuples of a database, where a count field is associated with each value in every pattern." [4]. PC-tree is chosen for this comparison because, navigation across a collection of PC-trees (each corresponds to a database) can be done using the DEA values as in the MDLN structure. It is already shown in [4] that, the size of PC-tree structure is less than that of the database. Here, we examine whether the MDLN structure is more compact than the PC-tree or not. For this, we assumed the disk block size = 2K Bytes and used three parameters : number of tuples (projected w.r.t. CA), domain size of DEA and number of characteristic attributes (linkages). In order to make the PC-tree attribute-order independent, we constructed PC-trees each with values of DEA and values of characteristic attribute. So, <value of DEA, value of CA> constitute a pattern based on which PC-tree is constructed. In order to handle multiple databases, there is a need to link across the PC-trees through the DEA values. So, the parameter number of attributes (linkages) typically indicates number of PC-trees constructed and in the case of MDLN structure, it represents number of linkages. We have two cases.

Case 1 : One to Many relationship between Characteristic and Dominant entity attributes :

Figure 4 shows the size of the MDLN structure with the corresponding PC-tree size. Here, the number of tuples (projected w.r.t. CA) is varied while size of the domain of DEA is fixed at 999 and number of characteristic attributes (linkages) is changed from 5 to 10. It is clear from both the graphs that there is no further increase in the sizes of the structures with increase in the number of tuples. Further, it is also clear from Figure 4 that, since MDLN structure depends a) on the size of the domain of the attributes and not on the number of tuples, and b) for each value of an attribute only one node is created, the size does not increase as that of PC-tree and the storage requirement is about 50% of that of the PC-tree.

Fig. 4. size of domain of DEA = 999

Case 2: Many to Many relationship between Characteristic and Dominant entity attributes :

Figure 5 shows the sizes of MDLN structure and the corresponding PC-tree where the number of tuples (projected w.r.t. CA) is varied while the size of the domain of DEA is fixed at 99 and number of characteristic attributes (linkages) is changed from 5 to 10. Here, we chose a smaller domain to show that as in 1:N relationships, there is no increase in sizes of the structures as the number of tuples increases. In this case, storage requirement for MDLN structure is about 75% of that of the PC-tree.

Fig. 5. size of domain of DEA = 99

From the experimental results it is clear that MDLN structure handles 1:N relationship between CA and DEA in a more compact manner than M:N relationship between CA and DEA.

Conclusions

We introduced the notion of mining for Dominant Entity Based Association Rules (DEBARs). These rules – are dominant entity attributes oriented, do not depend on co-occurrences of values in tuples, capture n-ary relationships and can be generated from values of characteristic attributes across several relations/databases. We showed the importance of such a mining activity by taking a practical example like personalized mining. We proposed a novel structure called Multi-database Domain Link Network (MDLN) which can be used to explore associations between values of various attributes which belong to same or different relations which in turn belong to the same or different databases. We gave a concrete framework for mining for DEBARs in single and multiple relations belonging to one or more databases using the MDLN structure. Our experiments reveal that the storage requirement for MDLN structure varies from 50% to 75% of that of the PC-tree based representation.