classified into four categories: (i) vector-based methods, which include the Lp distance and the cosine similarity; (ii) skewness-based methods, which include computing skewness (sample and Bowley); (iii) entropy-based methods, which include the Jensen-Shannon and the Kullback-Leibler divergence [15]; and (iv) goodness-of-fit tests, which include the Kolmogorov-Smirnov and the chi-square test statistic.
Different methods for comparing probability distributions provide different information, as they measure different properties. For instance, if the skewness of a distribution is measured, all symmetric distributions will be considered similar to each other, as they all have zero skewness. However, if other properties are measured, such as the L2 distance, two symmetric distributions will, in general, differ. Using an ensemble of statistical methods provides a more accurate characterization of the observed deviation than using a single method. This is crucial for analyzing massive data sets comprising a wide range of different patterns.
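The ensemble idea above can be sketched in NumPy. The function below compares two discrete distributions with one representative statistic from several of the categories listed earlier (L2 distance, cosine similarity, Jensen-Shannon divergence, chi-square); the function name and the exact feature set are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def signature(p, q):
    """Compare two normalized discrete distributions p and q (same support)
    with several complementary statistics. A hypothetical sketch; the
    chapter's full feature set also includes skewness and the
    Kolmogorov-Smirnov statistic."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # (i) vector-based: L2 distance and cosine similarity
    l2 = np.linalg.norm(p - q)
    cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    # (iii) entropy-based: Jensen-Shannon divergence (symmetrized KL)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # (iv) goodness of fit: chi-square statistic, with q as the
    # expected distribution (bins where q == 0 are skipped)
    mask = q > 0
    chi2 = np.sum((p[mask] - q[mask]) ** 2 / q[mask])

    return np.array([l2, cosine, js, chi2])
```

Two identical distributions yield zero for every dissimilarity measure and a cosine similarity of one, while any deviation moves several components at once, which is what makes the combined vector more informative than any single statistic.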
To precisely measure the observed deviation and identify fraudulent entities, the outcomes of the different statistical methods listed above are combined into a signature vector, σ_k, specific to each publisher. Intuitively, significant deviations from the expected distribution, measured by several statistical methods, represent strong indicators of abusive click traffic. For this reason, the fraud score is modeled as a linear function of the observed deviations,
    φ_k = Σ_{j=1}^p θ_j σ_{kj} ,                                    (14.5)
where σ_{kj} indicates the j-th component of σ_k and θ_j is the weight associated with it.
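In code, Equation 14.5 amounts to one dot product per publisher. The signature values and weights below are random and purely illustrative; only the shape of the computation reflects the model.

```python
import numpy as np

# Hypothetical shapes: n = 5 publishers, p = 4 statistical features each.
rng = np.random.default_rng(0)
sigma = rng.random((5, 4))               # signature vectors sigma_k, one row per publisher
theta = np.array([0.5, 1.0, 0.25, 2.0])  # illustrative weights theta_j

# Equation 14.5: phi_k = sum_j theta_j * sigma_kj, i.e. a dot product per row.
phi = sigma @ theta
```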
The optimal set of weights, θ, in Equation 14.5 is determined by minimizing the least-squares cost function J = Σ_k ( φ_k − Σ_{j=1}^p θ_j σ_{kj} )² using a stochastic gradient
descent method trained on a small subset of publishers, which includes legitimate distributions and known attacks provided both by other automated systems and by manual investigation of the logs. The model in Equation 14.5 is then applied to a large data set of entities to predict the fraud score as a function of their IP size distribution.
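The stochastic gradient descent step can be sketched as follows. The training data here is synthetic (noiseless scores generated from known weights), standing in for the chapter's labeled set of legitimate and fraudulent publishers; the learning rate and epoch count are arbitrary choices for this toy problem.

```python
import numpy as np

def fit_weights(sigma, phi, lr=0.01, epochs=200, seed=0):
    """Stochastic gradient descent on the least-squares cost
    J = sum_k (phi_k - sum_j theta_j * sigma_kj)^2,
    visiting one publisher k at a time in random order."""
    rng = np.random.default_rng(seed)
    n, p = sigma.shape
    theta = np.zeros(p)
    for _ in range(epochs):
        for k in rng.permutation(n):
            err = sigma[k] @ theta - phi[k]       # residual for publisher k
            theta -= lr * 2.0 * err * sigma[k]    # gradient of err^2 w.r.t. theta
    return theta

# Synthetic check: recover known weights from noiseless scores.
rng = np.random.default_rng(1)
S = rng.random((50, 3))
true_theta = np.array([1.0, -0.5, 2.0])
theta_hat = fit_weights(S, S @ true_theta)
```

Because the synthetic scores are noiseless, the gradient vanishes at the true weights and the iterates converge to them; with real labels the minimizer is a least-squares compromise instead.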
14.5.5.3 Performance Results
Figure 14.14 shows the accuracy of the model in Equation 14.5 in predicting the fraud score as a function of the number of statistical methods used to compare distributions. First, the accuracy of the anomaly detection system is assessed when all methods are used. Next, the features that cause the least amount of variation in the prediction accuracy are iteratively removed until a single feature is left [13]. The training set is 10% of the entities, and the testing set comprises the remaining entities. Figure 14.14 shows that using multiple comparison methods that measure different types of deviations reduces the prediction error, down to a 3% error. This