Advanced Analytical Theory and Methods: Association Rules - Data Science and Big Data Analytics

5.7 Diagnostics

Although the Apriori algorithm is easy to understand and implement, some of the

rules generated are uninteresting or practically useless. Additionally, some of the

rules may be generated due to coincidental relationships between the variables.

Measures like confidence, lift, and leverage should be used along with human

insights to address this problem.
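
As an illustration of how these measures can be computed for a single rule X -> Y, the following is a minimal sketch using the standard definitions: confidence is support(X and Y) / support(X), lift is support(X and Y) / (support(X) * support(Y)), and leverage is support(X and Y) - support(X) * support(Y). The function name rule_measures and the toy transactions are hypothetical and chosen only for this example.

```python
def rule_measures(transactions, antecedent, consequent):
    """Compute support, confidence, lift, and leverage for the rule X -> Y.

    transactions: list of sets of items (hypothetical toy data below).
    Assumes the antecedent and consequent each occur at least once,
    otherwise the divisions below would fail.
    """
    n = len(transactions)
    x, y = frozenset(antecedent), frozenset(consequent)

    def support(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    s_x, s_y, s_xy = support(x), support(y), support(x | y)
    return {
        "support": s_xy,
        "confidence": s_xy / s_x,
        "lift": s_xy / (s_x * s_y),
        "leverage": s_xy - s_x * s_y,
    }


# Toy data: "a" and "b" each appear in 3 of 4 transactions, together in 2.
toy = [{"a", "b"}, {"a", "b"}, {"a"}, {"b"}]
measures = rule_measures(toy, {"a"}, {"b"})
```

In this toy case the lift is below 1 and the leverage is negative, so the two items co-occur less often than independence would predict, even though the confidence of the rule is about 0.67. This is the kind of rule that confidence alone would not flag as uninteresting.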

Another problem with association rules is that, in Phases 3 and 4 of the Data

Analytics Lifecycle (Chapter 2), the team must specify the minimum support prior

to the model execution, which may lead to too many or too few rules. In related

research, a variant of the algorithm [13] can use a predefined target range for the

number of rules so that the algorithm can adjust the minimum support accordingly.
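
The idea of adjusting the minimum support toward a target number of rules can be sketched with a simple bisection search. This is not the variant cited in [13]; it is only an assumed illustration. The function tune_min_support, the callable count_rules (which stands in for a full mining run at a given minimum support), and the synthetic count function are all hypothetical. The sketch assumes the number of rules never increases as the minimum support rises.

```python
def tune_min_support(count_rules, target_low, target_high,
                     lo=0.0, hi=1.0, max_iter=30):
    """Bisect on minimum support until the rule count is in the target range.

    count_rules(s) is a hypothetical callable that runs the mining step at
    minimum support s and returns the number of rules produced. It is
    assumed to be non-increasing in s.
    Returns (support, count), or None if no value in range was found.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        n = count_rules(mid)
        if n > target_high:
            lo = mid   # too many rules: raise the minimum support
        elif n < target_low:
            hi = mid   # too few rules: lower the minimum support
        else:
            return mid, n
    return None


# Synthetic stand-in for a mining run (not real data).
def fake_count(s):
    return int(1000 * (1 - s) ** 2)


result = tune_min_support(fake_count, target_low=50, target_high=100)
```

Because the rule count changes in jumps, there may be no minimum support that lands inside a narrow target range, in which case the sketch returns None. Each call to count_rules is a full mining run, so the search itself can be costly on a large database.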

Section 5.2 presented the Apriori algorithm, which is one of the earliest and the

most fundamental algorithms for generating association rules. The Apriori

algorithm reduces the computational workload by only examining itemsets that

meet the specified minimum support threshold. However, depending on the size of the

dataset, the Apriori algorithm can be computationally expensive. For each level of

support, the algorithm requires a scan of the entire database to obtain the result.

Accordingly, as the database grows, it takes more time to compute in each run. Here

are some approaches to improve Apriori's efficiency:

• Partitioning: Any itemset that is potentially frequent in a transaction

database must be frequent in at least one of the partitions of the transaction

database.

• Sampling: Extract a subset of the data and perform association rule mining

on the subset, using a lower support threshold than for the full database.

• Transaction reduction: A transaction that does not contain frequent

k-itemsets is useless in subsequent scans and therefore can be ignored.

• Hash-based itemset counting: If the corresponding hashing bucket

count of a k-itemset is below a certain threshold, the k-itemset cannot be

frequent.

• Dynamic itemset counting: Only add new candidate itemsets when all

of their subsets are estimated to be frequent.
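
To make one of these approaches concrete, the following is a minimal sketch of a level-wise Apriori pass combined with transaction reduction. It is an illustration under stated assumptions, not the implementation from Section 5.2: the function name apriori_with_reduction, the use of an absolute count (min_count) instead of a support fraction, and the toy transactions are all hypothetical.

```python
from itertools import combinations


def apriori_with_reduction(transactions, min_count):
    """Return ({frequent itemset: count}, number of passes over the data).

    After each level k, transactions that contain no frequent k-itemset
    are dropped, because they cannot contain a frequent (k+1)-itemset.
    """
    active = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in active for item in t}
    frequent = {}
    scans = 0
    k = 1
    while candidates and active:
        # One pass over the remaining transactions for each level k.
        scans += 1
        counts = dict.fromkeys(candidates, 0)
        for t in active:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)

        # Transaction reduction: keep only transactions that still
        # contain at least one frequent k-itemset.
        active = [t for t in active if any(c <= t for c in level)]

        # Build (k+1)-candidates whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent, scans


# Hypothetical toy transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent, scans = apriori_with_reduction(baskets, min_count=3)
```

With these five baskets and min_count=3, only the two transactions that lack the pair {bread, milk} are dropped after the second level, since that pair is the only frequent 2-itemset. On a large database the saving comes from later passes touching fewer transactions, while the number of passes still grows with the size of the largest frequent itemset.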
