The weight vectors generated during this stage have a direct effect on the quality of
the generated model. Different combinations of weight calculations for the available
fields were tested. The use of flight preference fields and shopping preference fields
was also tested. The best resulting model included the flight preference fields, but
not the shopping preference fields.
5.7 Classification
The classification stage determines whether each pair of records extracted from the blocking stage, and subsequently weighted in the weight generation stage, is a match or not. Three different approaches are evaluated for classifying the record pairs as matches and non-matches. The first is the traditional approach described by Fellegi and Sunter, the second uses a set of declarative if-then rules, and the third uses a supervised SVM classifier built with the libsvm [19] library.
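The decision logic of the Fellegi and Sunter approach can be illustrated with a short sketch. The two cut-off values below are hypothetical placeholders; the actual thresholds are derived from the training set as described later in this section.

```python
# Minimal sketch of the Fellegi-Sunter decision rule, assuming each pair's
# comparison weights have already been summed into one composite weight.
UPPER_THRESHOLD = 7.5   # hypothetical: at or above this, declare a match
LOWER_THRESHOLD = 2.0   # hypothetical: at or below this, declare a non-match

def fellegi_sunter_decision(composite_weight: float) -> str:
    """Classify a record pair from its summed comparison weight."""
    if composite_weight >= UPPER_THRESHOLD:
        return "match"
    if composite_weight <= LOWER_THRESHOLD:
        return "non-match"
    return "possible match"   # falls between the two cut-offs

print(fellegi_sunter_decision(8.3))   # -> match
print(fellegi_sunter_decision(4.1))   # -> possible match
```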
Importantly, the frequent flyer number enabled the testing and evaluation of the entire classification process. Four record sets of randomly selected records containing frequent flyer numbers were extracted from the data set, with no overlap between them, so each record was present in only one set. One of these sets was used to train the SVM classifier; for the rule-based and Fellegi and Sunter approaches the same set was used to empirically adjust the rules and classification thresholds.
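The following sketch shows how such disjoint record sets could be drawn; the field name, set size and random seed are illustrative assumptions, not the values used in the original experiments.

```python
import random

def split_disjoint_sets(records, n_sets=4, set_size=1000, seed=42):
    """Draw n_sets disjoint random samples; the first is used for training."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    sets = [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]
    return sets[0], sets[1:]   # one training set, three testing sets

# Example with dummy records carrying a (hypothetical) frequent flyer field:
dummy_records = [{"id": i, "ff_number": f"FF{i:06d}"} for i in range(5000)]
training_set, testing_sets = split_disjoint_sets(dummy_records)
```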
The cut-off thresholds of the Fellegi and Sunter approach were determined on the training set by separating the counts of matches and non-matches into different weight buckets and deriving the match and non-match thresholds from the resulting distribution. The rules for the rule-based approach were encoded according to our understanding of the data set. The rules were applied several times to the training set to determine the best group of rules and the best threshold value for them. Once the thresholds were set, the same rules were applied to the three testing sets.
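One possible realisation of this bucketing step is sketched below, assuming each training pair carries a composite weight and a ground-truth label derived from the frequent flyer number. The bucket width and the threshold-selection heuristic are illustrative assumptions.

```python
from collections import Counter

def weight_buckets(labelled_pairs, bucket_width=1.0):
    """Count ground-truth matches and non-matches per composite-weight bucket."""
    match_counts, nonmatch_counts = Counter(), Counter()
    for weight, is_match in labelled_pairs:          # [(float, bool), ...]
        bucket = int(weight // bucket_width)
        (match_counts if is_match else nonmatch_counts)[bucket] += 1
    return match_counts, nonmatch_counts

def choose_thresholds(match_counts, nonmatch_counts, bucket_width=1.0):
    """Heuristic: the lowest bucket dominated by matches gives the upper cut-off,
    the highest bucket dominated by non-matches gives the lower cut-off."""
    buckets = sorted(set(match_counts) | set(nonmatch_counts))
    upper = min(b for b in buckets if match_counts[b] > nonmatch_counts[b])
    lower = max(b for b in buckets if nonmatch_counts[b] > match_counts[b])
    return lower * bucket_width, upper * bucket_width

# Example with a few labelled training pairs (weight, is_match):
pairs = [(0.5, False), (1.2, False), (2.8, False), (3.1, True),
         (5.4, True), (6.0, True), (2.2, False), (4.9, True)]
print(choose_thresholds(*weight_buckets(pairs)))   # -> (2.0, 3.0)
```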
For the SVM classification the training set was used to train two types of classifiers, a linear classifier and an RBF classifier. For the RBF classifier, 10-fold cross-validation was used to determine the best values for the parameters C and γ. For the linear classifier, three different values of C (0.1, 1, 10) were tried and the best of the three (10) was chosen. The models generated by the SVM training were saved and subsequently applied to the three testing sets. In the training of the SVM the frequent flyer number was used only to label a pair of records as a match or a non-match; it did not form part of the weight vector of attributes. This approach allows us to report our results with a high degree of confidence, owing to the frequent flyer number ground truth data.
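The sketch below approximates this training setup with scikit-learn's SVC, which wraps libsvm internally; the original work uses the libsvm library directly, and the RBF parameter grid as well as the placeholder training data are assumptions made here purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder training data: comparison weight vectors and match labels derived
# from the frequent flyer number (the number itself is NOT part of the features).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6))               # hypothetical weight vectors
y_train = (X_train.sum(axis=1) > 0).astype(int)   # hypothetical match labels

# RBF kernel: 10-fold cross-validation over C and gamma (grid values assumed).
rbf_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
    cv=10,
)
rbf_search.fit(X_train, y_train)

# Linear kernel: the three C values mentioned in the text (10 performed best).
linear_search = GridSearchCV(SVC(kernel="linear"),
                             param_grid={"C": [0.1, 1, 10]}, cv=10)
linear_search.fit(X_train, y_train)

# The fitted models would then be applied unchanged to the three testing sets,
# e.g. rbf_search.best_estimator_.predict(X_test).
```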
For each of the three classification approaches the accuracy, precision, recall and F-measure were calculated for the training set and the three testing sets. As discussed in Section 4.3, accuracy values for an unbalanced data set are usually skewed because of the disproportion between matches and non-matches. For this reason we based the evaluation on the individual precision and recall values.
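A minimal sketch of this per-set evaluation, assuming the ground-truth labels come from the frequent flyer number and the predictions from one of the three classifiers:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four measures reported for each record set."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f_measure": f1_score(y_true, y_pred),
    }

# With many more non-matches (0) than matches (1), accuracy stays high even
# when half the matches are missed, which is why precision and recall matter.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(evaluate(y_true, y_pred))   # accuracy 0.9 but recall only 0.5
```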