The weight vectors generated during this stage have a direct effect on the quality of
the generated model. Different combinations of weight calculations for the available
fields were tested. The use of flight preference fields and shopping preference fields
was also tested. The best resulting model included the flight preference fields, but
not the shopping preference fields.
5.7 Classification
The classification stage determines whether each pair of records extracted from the blocking stage, and subsequently weighted in the weight generation stage, is a match or not. Three different approaches are evaluated for classifying the record pairs as matches and non-matches. The first is the traditional approach described by Fellegi and Sunter, the second uses a set of declarative if-then rules, and the third uses a supervised SVM classifier built with the libsvm [19] library.
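The decision logic of the Fellegi and Sunter approach can be illustrated with a short sketch. The two cut-off values below are hypothetical placeholders; the actual thresholds are derived from the training set as described later in this section.

```python
# Minimal sketch of the Fellegi-Sunter decision rule, assuming each pair's
# comparison weights have already been summed into one composite weight.
UPPER_THRESHOLD = 7.5   # hypothetical: at or above this, declare a match
LOWER_THRESHOLD = 2.0   # hypothetical: at or below this, declare a non-match

def fellegi_sunter_decision(composite_weight: float) -> str:
    """Classify a record pair from its summed comparison weight."""
    if composite_weight >= UPPER_THRESHOLD:
        return "match"
    if composite_weight <= LOWER_THRESHOLD:
        return "non-match"
    return "possible match"   # falls between the two cut-offs

print(fellegi_sunter_decision(8.3))   # -> match
print(fellegi_sunter_decision(4.1))   # -> possible match
```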
Importantly, the frequent flyer number enabled the testing and evaluation of the entire classification process. Four record sets of randomly selected records containing frequent flyer numbers were extracted from the data set, with no overlap between them, so each record was present in only one set. One of these sets was used to train the SVM classifier; for the rule-based and Fellegi and Sunter approaches the same set was used to empirically adjust the rules and classification thresholds.
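The following sketch shows how such disjoint record sets could be drawn; the field name, set size and random seed are illustrative assumptions, not the values used in the original experiments.

```python
import random

def split_disjoint_sets(records, n_sets=4, set_size=1000, seed=42):
    """Draw n_sets disjoint random samples; the first is used for training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    sets = [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]
    return sets[0], sets[1:]   # one training set, three testing sets

# Example with dummy records carrying a (hypothetical) frequent flyer field:
dummy_records = [{"id": i, "ff_number": f"FF{i:06d}"} for i in range(5000)]
training_set, testing_sets = split_disjoint_sets(dummy_records)
```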
The cut-off thresholds of the Fellegi and Sunter approach were determined on the training set by separating the counts of matches and non-matches into different weight buckets and deriving the match and non-match thresholds from the resulting distribution. The rules for the rule-based approach were encoded according to our understanding of the data set. The rules were applied several times to the training set to determine the best group of rules and the best threshold value for them. Once the thresholds were set, the same rules were applied to the three testing sets.
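One possible realisation of this bucketing step is sketched below, assuming each training pair carries a composite weight and a ground-truth label derived from the frequent flyer number. The bucket width and the threshold-selection heuristic are illustrative assumptions.

```python
from collections import Counter

def weight_buckets(labelled_pairs, bucket_width=1.0):
    """Count ground-truth matches and non-matches per composite-weight bucket."""
    match_counts, nonmatch_counts = Counter(), Counter()
    for weight, is_match in labelled_pairs:          # [(float, bool), ...]
        bucket = int(weight // bucket_width)
        (match_counts if is_match else nonmatch_counts)[bucket] += 1
    return match_counts, nonmatch_counts

def choose_thresholds(match_counts, nonmatch_counts, bucket_width=1.0):
    """Heuristic: the lowest bucket dominated by matches gives the upper cut-off,
    the highest bucket dominated by non-matches gives the lower cut-off."""
    buckets = sorted(set(match_counts) | set(nonmatch_counts))
    upper = min(b for b in buckets if match_counts[b] > nonmatch_counts[b])
    lower = max(b for b in buckets if nonmatch_counts[b] > match_counts[b])
    return lower * bucket_width, upper * bucket_width

# Example with a few labelled training pairs (weight, is_match):
pairs = [(0.5, False), (1.2, False), (2.8, False), (3.1, True),
         (5.4, True), (6.0, True), (2.2, False), (4.9, True)]
print(choose_thresholds(*weight_buckets(pairs)))   # -> (2.0, 3.0)
```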
For the SVM classification the training set was used to train two types of classifiers, a linear classifier and an RBF classifier. For the RBF classifier, 10-fold cross-validation was used to determine the best values for the parameters C and γ. For the linear classifier, three different values of C (0.1, 1, 10) were tried and the best of the three (10) was chosen. The models generated by the SVM training were saved and subsequently applied to the three testing sets. In the training of the SVM the frequent flyer number was used only to label a pair of records as a match or a non-match; it did not form part of the weight vector of attributes. This approach allows us to report our results with a high degree of confidence, owing to the frequent flyer number ground truth data.
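The sketch below approximates this training setup with scikit-learn's SVC, which wraps libsvm internally; the original work uses the libsvm library directly, and the RBF parameter grid as well as the placeholder training data are assumptions made here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder training data: comparison weight vectors and match labels derived
# from the frequent flyer number (the number itself is NOT part of the features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))               # hypothetical weight vectors
y_train = (X_train.sum(axis=1) > 0).astype(int)   # hypothetical match labels

# RBF kernel: 10-fold cross-validation over C and gamma (grid values assumed).
rbf_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=10,
)
rbf_search.fit(X_train, y_train)

# Linear kernel: the three C values mentioned in the text (10 performed best).
linear_search = GridSearchCV(SVC(kernel="linear"),
                             param_grid={"C": [0.1, 1, 10]}, cv=10)
linear_search.fit(X_train, y_train)

# The fitted models would then be applied unchanged to the three testing sets,
# e.g. rbf_search.best_estimator_.predict(X_test).
```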
For each of the three classification approaches the accuracy, precision, recall and F-measure were calculated for the training set and the three testing sets. As discussed in Section 4.3, accuracy values for an unbalanced data set are usually skewed because of the disproportion between matches and non-matches. For this reason we based the evaluation on the individual precision and recall values.
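A minimal sketch of this per-set evaluation, assuming the ground-truth labels come from the frequent flyer number and the predictions from one of the three classifiers:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four measures reported for each record set."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
    }

# With many more non-matches (0) than matches (1), accuracy stays high even
# when half the matches are missed, which is why precision and recall matter.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(evaluate(y_true, y_pred))   # accuracy 0.9 but recall only 0.5
```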