classified into four categories: (i) vector-based methods, which include the Lp distance and the cosine similarity; (ii) skewness-based methods, which include computing skewness (sample and Bowley); (iii) entropy-based methods, which include the Jensen-Shannon and the Kullback-Leibler divergence [15]; and (iv) goodness-of-fit tests, which include the Kolmogorov-Smirnov and the chi-square test statistic.
Different methods for comparing probability distributions provide different information, as they measure different properties. For instance, if the skewness of a distribution is measured, all symmetric distributions will be considered similar to each other, as they all have zero skewness. However, if other properties are measured, such as the L2 distance, two symmetric distributions will, in general, differ. Using an ensemble of statistical methods provides a more accurate characterization of the observed deviation than using a single method. This is crucial for analyzing massive data sets comprising a wide range of different patterns.
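The ensemble idea above can be sketched in NumPy. The function below compares two discrete distributions with one representative statistic from several of the categories listed earlier (L2 distance, cosine similarity, Jensen-Shannon divergence, chi-square); the function name and the exact feature set are illustrative assumptions, not the chapter's implementation.

```python
import numpy as np

def signature(p, q):
    """Compare two normalized discrete distributions p and q (same support)
    with several complementary statistics. A hypothetical sketch; the
    chapter's full feature set also includes skewness and the
    Kolmogorov-Smirnov statistic."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # (i) vector-based: L2 distance and cosine similarity
    l2 = np.linalg.norm(p - q)
    cosine = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

    # (iii) entropy-based: Jensen-Shannon divergence (symmetrized KL)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # (iv) goodness of fit: chi-square statistic, with q as the
    # expected distribution (bins where q == 0 are skipped)
    mask = q > 0
    chi2 = np.sum((p[mask] - q[mask]) ** 2 / q[mask])

    return np.array([l2, cosine, js, chi2])
```

Two identical distributions yield zero for every dissimilarity measure and a cosine similarity of one, while any deviation moves several components at once, which is what makes the combined vector more informative than any single statistic.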
To precisely measure the observed deviation and identify fraudulent entities, the outcomes of the different statistical methods listed above are combined into a signature vector, σ_k, specific to each publisher. Intuitively, significant deviations from the expected distribution, measured by several statistical methods, represent strong indicators of abusive click traffic. For this reason, the fraud score is modeled as a linear function of the observed deviations,
    φ_k = Σ_{j=1}^p θ_j σ_{kj} ,                                    (14.5)
where σ_{kj} indicates the j-th component of σ_k and θ_j is the weight associated with it.
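In code, Equation 14.5 amounts to one dot product per publisher. The signature values and weights below are random and purely illustrative; only the shape of the computation reflects the model.

```python
import numpy as np

# Hypothetical shapes: n = 5 publishers, p = 4 statistical features each.
rng = np.random.default_rng(0)
sigma = rng.random((5, 4))               # signature vectors sigma_k, one row per publisher
theta = np.array([0.5, 1.0, 0.25, 2.0])  # illustrative weights theta_j

# Equation 14.5: phi_k = sum_j theta_j * sigma_kj, i.e. a dot product per row.
phi = sigma @ theta
```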
The optimal set of weights, θ, in Equation 14.5 is determined by minimizing the least-squares cost function J = Σ_k ( φ_k − Σ_{j=1}^p θ_j σ_{kj} )² using a stochastic gradient
descent method trained on a small subset of publishers, which includes legitimate distributions and known attacks provided both by other automated systems and by manual investigation of the logs. The model in Equation 14.5 is then applied to a large data set of entities to predict the fraud score as a function of their IP size distribution.
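The stochastic gradient descent step can be sketched as follows. The training data here is synthetic (noiseless scores generated from known weights), standing in for the chapter's labeled set of legitimate and fraudulent publishers; the learning rate and epoch count are arbitrary choices for this toy problem.

```python
import numpy as np

def fit_weights(sigma, phi, lr=0.01, epochs=200, seed=0):
    """Stochastic gradient descent on the least-squares cost
    J = sum_k (phi_k - sum_j theta_j * sigma_kj)^2,
    visiting one publisher k at a time in random order."""
    rng = np.random.default_rng(seed)
    n, p = sigma.shape
    theta = np.zeros(p)
    for _ in range(epochs):
        for k in rng.permutation(n):
            err = sigma[k] @ theta - phi[k]       # residual for publisher k
            theta -= lr * 2.0 * err * sigma[k]    # gradient of err^2 w.r.t. theta
    return theta

# Synthetic check: recover known weights from noiseless scores.
rng = np.random.default_rng(1)
S = rng.random((50, 3))
true_theta = np.array([1.0, -0.5, 2.0])
theta_hat = fit_weights(S, S @ true_theta)
```

Because the synthetic scores are noiseless, the gradient vanishes at the true weights and the iterates converge to them; with real labels the minimizer is a least-squares compromise instead.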
14.5.5.3 Performance Results
Figure 14.14 shows the accuracy of the model in Equation 14.5 in predicting the fraud score as a function of the number of statistical methods used to compare distributions. First, the accuracy of the anomaly detection system is assessed when all methods are used. Next, the features that cause the least amount of variation in the prediction accuracy are iteratively removed until a single feature is left [13]. The training set is 10% of the entities, and the testing set comprises the remaining entities. Figure 14.14 shows that using multiple comparison methods that measure different types of deviations reduces the prediction error, down to a 3% error. This