to the weakness of this periodicity compared with others. It bails out once the
fraction of the discarded elements exceeds some threshold.
14.4.4 The PredictSizes Algorithm
Since PredictSizes deals with each periodicity of each IP separately, it can be
massively parallelized using the MapReduce framework [20], as the size estimates are
stored in files sharded by the period IDs and IPs. The algorithm combines all the
stable sizes of all the periodicities using a Combiner function that also performs
sanity checks on the predicted sizes.
The main factor that influences the choice of the Combiner is the loss function of
the predictions, which is application-dependent. In its simplest form, a Combiner can
be a simple statistic, such as the mean, truncated mean, median, max, or min. For
instance, when sizes are used for service optimization, the mean statistic minimizes
the expected loss under the mean squared error loss function. Another alternative
for Combiner functions is using a weighted average of the stable sizes, where the
weights are inversely proportional to the fraction of size outliers (extreme estimates)
discarded by the StableSize function for each periodicity. A more involved analysis
regresses the size estimates of a particular period (as the expected predicted size)
on the stable sizes from individual periodicities (as the explanatory variable(s)).
However, such a regression-based Combiner can be easily influenced
by the heterogeneity of IPs discussed in Section 14.4.1, such as time zones.
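The Combiner choices described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `truncated_mean`, `weighted_combiner`, and `SIMPLE_COMBINERS`, the `trim_fraction` default, and the `smoothing` term (added so that a periodicity with no discarded outliers does not cause a division by zero) are all assumptions introduced here.

```python
from statistics import mean, median


def truncated_mean(values, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the values from each end.

    trim_fraction is a hypothetical parameter; the source does not give one.
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut per end
    trimmed = ordered[k:len(ordered) - k] if k else ordered
    return mean(trimmed)


def weighted_combiner(stable_sizes, discarded_fractions, smoothing=0.05):
    """Weighted average of the stable sizes.

    Each weight is inversely proportional to the fraction of size outliers
    discarded for that periodicity. The smoothing constant is an assumption
    that keeps the weight finite when nothing was discarded.
    """
    weights = [1.0 / (f + smoothing) for f in discarded_fractions]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, stable_sizes)) / total


# Simple-statistic Combiners, keyed by name (the dictionary is illustrative).
SIMPLE_COMBINERS = {
    "mean": mean,
    "truncated_mean": truncated_mean,
    "median": median,
    "max": max,
    "min": min,
}
```

With stable sizes `[10, 12, 11, 50]`, the mean is pulled toward the extreme value while the median is not; giving the fourth periodicity a large discarded fraction in `weighted_combiner` likewise reduces its influence on the combined size.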
The Combiner algorithm ensures that the predicted size agrees with the stable
sizes of all the periodicities. A simple solution was implemented that does two
sanity checks. First, it checks that the predicted size is within some factor of the
stable size for each periodicity. Second, it checks that the predicted size is within
a specific quantile range of all the stable sizes. If the predicted size fails either
sanity check, the IP is deemed unstable, and no predicted size is produced
for it. In our experiments, these simple sanity checks proved to be very effective in
detecting abrupt legitimate size changes early on, and hence in reducing the false
positives caused by overfiltering legitimate traffic.
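The two sanity checks can be sketched as below. The text does not state the factor or the quantile range, so `max_factor`, `q_low`, and `q_high` are placeholder values, and the linear-interpolation `quantile` helper is one of several reasonable definitions. Sizes are assumed to be positive.

```python
from statistics import mean, median


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def passes_sanity_checks(predicted, stable_sizes, max_factor=2.0,
                         q_low=0.25, q_high=0.75):
    """Return True if the predicted size agrees with all stable sizes.

    max_factor, q_low, and q_high are hypothetical thresholds.
    """
    # Check 1: predicted size is within a factor of every stable size.
    for s in stable_sizes:
        if not (s / max_factor <= predicted <= s * max_factor):
            return False
    # Check 2: predicted size lies within a quantile range of the stable sizes.
    ordered = sorted(stable_sizes)
    return quantile(ordered, q_low) <= predicted <= quantile(ordered, q_high)


def predict_size(stable_sizes, combiner, **thresholds):
    """Combine the stable sizes; return None if the IP is deemed unstable."""
    predicted = combiner(stable_sizes)
    if passes_sanity_checks(predicted, stable_sizes, **thresholds):
        return predicted
    return None
```

For example, `predict_size([10, 11, 12], median)` yields a prediction, whereas `predict_size([10, 11, 100], mean)` returns `None` because the combined value is more than twice the smallest stable size.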
14.4.5 Evaluating Predictions
To evaluate the prediction algorithm, an experiment was run on three months' worth
of query data log files. Two metrics were measured: (i) for every period p , the
agreement of the predicted sizes of the IPs with their estimated sizes during p ; and
(ii) the coverage (the ratio of IPs in the traffic in period p that had predictions).
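A minimal sketch of the two metrics for a single period p, assuming the estimated and predicted sizes are available as dictionaries keyed by IP (a layout chosen here for illustration only). Agreement is expressed as the ratio of predicted size to estimated size; the function names are hypothetical.

```python
def coverage(estimated, predicted):
    """Fraction of the IPs seen in period p that have a predicted size."""
    if not estimated:
        return 0.0
    covered = sum(1 for ip in estimated if ip in predicted)
    return covered / len(estimated)


def relative_ratios(estimated, predicted):
    """predicted size / estimated size for every IP that has both values."""
    return {
        ip: predicted[ip] / estimated[ip]
        for ip in estimated
        if ip in predicted and estimated[ip] > 0
    }


# Toy data: three IPs were observed in period p, two of them had predictions.
estimated_p = {"a": 4, "b": 2, "c": 1}
predicted_p = {"a": 4, "b": 3}
period_coverage = coverage(estimated_p, predicted_p)
period_ratios = relative_ratios(estimated_p, predicted_p)
```

A ratio of 1.0 means the prediction matched the estimate exactly; IPs without a prediction lower the coverage but do not enter the ratio computation.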
14.4.5.1 Prediction Accuracy
To assess the prediction accuracy, a random sample of 10M IPs was collected. The
relative ratio, predicted size/estimated size , is shown in Figure 14.6, where each
circle represents one or more overlapping IPs.
A total of 98% of the absolute errors (signed differences, not ratios) are between
−4 and 2, and 54% of the predictions are exact. The mean of these errors is −0.149.
All the quantiles with a step of 0.001 were calculated. The topmost four 0.001
quantiles are 5282, 5, 4, and 3, and