Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management

to the weakness of this periodicity compared with others. It bails out once the fraction of the discarded elements exceeds some threshold.

14.4.4 The PredictSizes Algorithm

Since PredictSizes deals with each periodicity of each IP separately, it can be massively parallelized using the MapReduce framework [20], as the size estimates are stored in files sharded by the period IDs and IPs. The algorithm combines the stable sizes of all the periodicities using a Combiner function that also performs sanity checks on the predicted sizes.
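
The per-IP independence described above can be sketched as a map step that keys each stable size by its IP and a reduce step that applies a Combiner to the sizes of one IP. This is only an in-memory illustration, not the framework code: the record layout `(ip, period_id, stable_size)`, the periodicity names, the function names, and the choice of the median as Combiner are all assumptions made for the example, and the sanity checks are left out.

```python
from collections import defaultdict
from statistics import median

def map_record(record):
    """Map step: key each stable size by its IP (record layout is hypothetical)."""
    ip, period_id, stable_size = record
    return ip, (period_id, stable_size)

def reduce_ip(ip, values, combiner=median):
    """Reduce step: combine the stable sizes of all periodicities of one IP."""
    sizes = [size for _, size in values]
    return ip, combiner(sizes)

def predict_sizes(records, combiner=median):
    """Group mapped records by IP, then reduce each group independently."""
    grouped = defaultdict(list)
    for record in records:
        key, value = map_record(record)
        grouped[key].append(value)
    # Each group depends only on its own IP, so groups could run in parallel.
    return dict(reduce_ip(ip, vals, combiner) for ip, vals in grouped.items())

# Made-up stable sizes for two IPs from the documentation address range.
records = [
    ("192.0.2.1", "daily", 4.0),
    ("192.0.2.1", "weekly", 6.0),
    ("192.0.2.1", "monthly", 5.0),
    ("192.0.2.2", "daily", 20.0),
    ("192.0.2.2", "weekly", 22.0),
]
predictions = predict_sizes(records)
print(predictions)
```

Because no reduce call reads another IP's data, the grouping key is the only coordination point, which is what makes the sharded layout a natural fit for this kind of framework.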

The main factor that influences the choice of the Combiner is the loss function of the predictions, which is application-dependent. In its simplest form, a Combiner can be a simple statistic, such as the mean, truncated mean, median, max, or min. For instance, when sizes are used for service optimization, the mean statistic minimizes the expected loss under the mean squared error loss function. An alternative Combiner is a weighted average of the stable sizes, where the weights are inversely proportional to the fraction of size outliers (extreme estimates) discarded by the StableSize function for each periodicity. A more involved analysis regresses the size estimates of a particular period (as the expected predicted size) on the stable sizes from the individual periodicities (as explanatory size(s)). However, such a regression-based Combiner can be easily influenced by the heterogeneity of IPs discussed in Section 14.4.1, such as time zones.
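
Two of the Combiner choices mentioned above, the truncated mean and the outlier-weighted average, might look roughly as follows. This is a minimal sketch under stated assumptions: the function names, the `trim_fraction` parameter, and the small `eps` term (added so that a periodicity with no discarded outliers does not get an infinite weight) are illustrative choices, not details given in the text.

```python
from statistics import mean

def truncated_mean(sizes, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the values from each end."""
    ordered = sorted(sizes)
    k = int(len(ordered) * trim_fraction)
    # Fall back to the plain mean if trimming would remove everything.
    trimmed = ordered[k:len(ordered) - k] if len(ordered) > 2 * k else ordered
    return mean(trimmed)

def weighted_combiner(sizes, discarded_fractions, eps=1e-6):
    """Weighted average; weights are inversely proportional to the fraction
    of outliers discarded for each periodicity (eps is an assumption)."""
    weights = [1.0 / (fraction + eps) for fraction in discarded_fractions]
    return sum(w * s for w, s in zip(weights, sizes)) / sum(weights)

# Made-up stable sizes, one per periodicity.
print(truncated_mean([4.0, 5.0, 5.0, 6.0, 100.0], trim_fraction=0.2))
# The periodicity that discarded fewer outliers (0.1) pulls the result toward 10.
print(weighted_combiner([10.0, 20.0], [0.1, 0.3]))
```

In the second call the weights are roughly 10 and 3.3, so the combined size lands closer to the periodicity whose StableSize run discarded fewer extreme estimates.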

The Combiner algorithm ensures that the predicted size agrees with the stable sizes of all the periodicities. A simple solution was implemented that does two sanity checks. First, it checks that the predicted size is within some factor of the stable size for each periodicity. Second, it checks that the predicted size is within a specific quantile range of all the stable sizes. If the predicted size fails either sanity check, the IP is deemed unstable, and no predicted size is produced for it. In our experiments, these simple sanity checks proved to be very effective in detecting abrupt legitimate size changes early on, and hence in reducing the false positives caused by overfiltering legitimate traffic.
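
The two sanity checks could be sketched as below. The text does not give the factor or the quantile range, so `factor=2.0`, `q_low=0.1`, and `q_high=0.9` are placeholder assumptions, as are the function names and the linearly interpolated quantile helper. The sketch assumes a non-empty list of positive stable sizes and returns `None` when the IP is deemed unstable.

```python
from statistics import mean, median

def quantile(sorted_values, q):
    """Linearly interpolated quantile of a sorted, non-empty list."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])

def checked_prediction(stable_sizes, combiner=median,
                       factor=2.0, q_low=0.1, q_high=0.9):
    """Combine stable sizes, then return None if either sanity check fails."""
    predicted = combiner(stable_sizes)
    # Check 1: within a factor of the stable size of every periodicity.
    for size in stable_sizes:
        if not (size / factor <= predicted <= size * factor):
            return None
    # Check 2: within a quantile range of all the stable sizes.
    ordered = sorted(stable_sizes)
    if not (quantile(ordered, q_low) <= predicted <= quantile(ordered, q_high)):
        return None
    return predicted

print(checked_prediction([4.0, 5.0, 6.0]))                 # consistent sizes
print(checked_prediction([4.0, 5.0, 30.0], combiner=mean)) # one periodicity disagrees
```

With these placeholder thresholds, a single periodicity that disagrees strongly with the others is enough to withhold the prediction, which matches the conservative behavior described above.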

14.4.5 Evaluating Predictions

To evaluate the prediction algorithm, an experiment was run on three months' worth of query data log files. Two metrics were measured: (i) for every period p, the agreement of the predicted sizes of the IPs with their estimated sizes during p; and (ii) the coverage (the ratio of IPs in the traffic in period p that had predictions).
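
For a single period p, the two metrics might be computed along these lines. The dictionary layout (IP mapped to size), the function names, and the use of the ratio predicted size / estimated size as the agreement measure are assumptions made for this sketch, and the sample values are invented.

```python
def coverage(ips_in_traffic, predictions):
    """Fraction of the IPs seen in period p that had a prediction."""
    seen = set(ips_in_traffic)
    if not seen:
        return 0.0
    return len(seen & set(predictions)) / len(seen)

def relative_ratios(predictions, estimates):
    """predicted size / estimated size for every IP that has both values."""
    return {
        ip: predictions[ip] / estimates[ip]
        for ip in predictions
        if ip in estimates and estimates[ip] > 0
    }

# Invented sizes for one period; the last IP has no prediction.
estimates = {"192.0.2.1": 5.0, "192.0.2.2": 8.0, "192.0.2.3": 4.0, "192.0.2.4": 2.0}
predictions = {"192.0.2.1": 5.0, "192.0.2.2": 10.0, "192.0.2.3": 3.0}

period_coverage = coverage(estimates.keys(), predictions)
ratios = relative_ratios(predictions, estimates)
print(period_coverage)
print(ratios)
```

A ratio of 1.0 marks an exact prediction, values above 1.0 are overestimates, and values below 1.0 are underestimates.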

14.4.5.1 Prediction Accuracy

To assess the prediction accuracy, a random sample of 10M IPs was collected. The relative ratio, predicted size/estimated size, is shown in Figure 14.6, where each circle represents one or multiple overlapping IP(s).

A total of 98% of the absolute errors (signed differences, as opposed to the relative ratio) are between −4 and 2, and 54% of the predictions are exact. The mean error is −0.149. All the quantiles with a step of 0.001 were calculated. The topmost four 0.001 quantiles are 5282, 5, 4, and 3, and
