Generality Is Predictive of Prediction Accuracy - Data Mining: Theory, Methodology, Techniques, and Applications

Table 2. Generality relationships between rules

More Specific         More General
most specific rule    combined rule
most specific rule    random most general rule
most specific rule    initial rule
combined rule         random most general rule

different combinations of boundaries from the most specific rule. Fig. 1(d) shows the combined rule, formed from the conjunction of all most general rules. The generality relationships between these rules are presented in Table 2.

Note that it could not be guaranteed that any pair of these rules was strictly more general or more specific than the other, as it was possible for the most specific and random most general rules to be identical (in which case the set of most general rules would contain only a single rule, and the initial and combined rules would both also be identical to the most specific and random most general rules). It was also possible for the initial rule to equal the most specific rule even when there were multiple most general rules. Finally, it was possible for no generality relationship to hold between an initial rule and the combined or random most general rule developed from it.
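The relationships in Table 2, and the ways in which they can fail to be strict, can be illustrated with a minimal sketch. It assumes, purely for illustration, that a rule is a conjunction of numeric interval conditions and that generality is judged by interval containment. The representation, the function names (`at_least_as_general`, `relation`, `conjunction`) and the example boundaries are all hypothetical and are not taken from the software used in the study.

```python
from math import inf

# Assumption: a rule is a conjunction of interval conditions,
# {attribute: (low, high)}; an attribute not listed is unconstrained.

def at_least_as_general(a, b):
    """True if every instance covered by rule b is also covered by rule a."""
    for attr, (a_lo, a_hi) in a.items():
        b_lo, b_hi = b.get(attr, (-inf, inf))
        if b_lo < a_lo or b_hi > a_hi:
            return False
    return True

def relation(a, b):
    """Describe how rule a relates to rule b."""
    a_covers_b = at_least_as_general(a, b)
    b_covers_a = at_least_as_general(b, a)
    if a_covers_b and b_covers_a:
        return "identical"
    if a_covers_b:
        return "more general"
    if b_covers_a:
        return "more specific"
    return "unrelated"

def conjunction(rules):
    """Combine rules by intersecting their intervals attribute by attribute."""
    combined = {}
    for rule in rules:
        for attr, (lo, hi) in rule.items():
            c_lo, c_hi = combined.get(attr, (-inf, inf))
            combined[attr] = (max(c_lo, lo), min(c_hi, hi))
    return combined

# Hypothetical example: two most general rules, each keeping a different
# boundary of the most specific rule.
most_specific = {"x": (2, 4), "y": (1, 3)}
general_1 = {"x": (2, inf)}
general_2 = {"y": (-inf, 3)}
combined = conjunction([general_1, general_2])

print(relation(most_specific, combined))   # more specific
print(relation(combined, general_1))       # more specific
print(relation(general_1, general_2))      # unrelated
# With a single most general rule equal to the most specific rule,
# the combined rule collapses onto it and no strict relationship holds.
print(relation(most_specific, conjunction([most_specific])))  # identical
```

Under these assumptions the "identical" and "unrelated" outcomes correspond to the cases described above in which no strict generality relationship can be guaranteed.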

We wished to evaluate whether the predicted effects held between the rules of differing levels of generality so formed. It was not appropriate to use the normal machine learning experimental method of averaging over multiple runs for each of several data sets, as our prediction is not about relationships between average outcomes, but rather relationships between specific outcomes. Further, it would not be appropriate to perform multiple runs on each of several data sets and then compare the relative frequencies with which the predicted effects held and did not hold, as this would violate the assumption of independence between observations relied on by most statistical tools for assessing such outcomes. Rather, we applied the process once only to each of the following 50 data sets from the UCI repository [11]:

abalone, anneal, audiology, imports-85, balance-scale, breast-cancer,
breast-cancer-wisconsin, bupa, chess, cleveland, crx, dermatology, dis,
echocardiogram, german, glass, heart, hepatitis, horse-colic,
house-votes-84, hungarian, allhypo, ionosphere, iris, kr-vs-kp,
labor-negotiations, lenses, long-beach-va, lung-cancer, lymphography,
new-thyroid, optdigits, page-blocks, pendigits, pima-indians-diabetes,
post-operative, promoters, primary-tumor, sat, segmentation, shuttle,
sick, sonar, soybean-large, splice, switzerland, tic-tac-toe, vehicle,
waveform, wine.

These were all appropriate data sets from the repository to which we had ready access and to which we were able to apply the combination of software tools employed in the research. Note that there is no averaging of results. Statistical analysis of the outcomes over the large number of data sets is used to compensate for random effects in individual results due to the use of a single run.
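One way such an analysis over independent, single-run outcomes could be carried out is a one-tailed binomial sign test on the number of data sets for which a predicted effect did and did not hold. This passage does not name the test used, so the sketch below is an assumption rather than the authors' procedure; the function name `sign_test_p` and the counts in the example are hypothetical and are not results from the study.

```python
from math import comb

def sign_test_p(wins, losses):
    """One-tailed sign test.

    Returns the probability of observing at least `wins` successes in
    wins + losses independent trials if success and failure were equally
    likely. Data sets on which the outcomes tie are assumed to have been
    excluded before calling this function.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Hypothetical counts, for illustration only: the effect holds on
# 5 data sets and fails on none.
print(sign_test_p(5, 0))  # 0.03125
```

Because each data set contributes exactly one outcome, the trials can reasonably be treated as independent, which is the assumption that multiple runs on the same data set would violate.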
