Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management


14.3.1 The Learning Models

For each IP, estimates are calculated based on its traffic rate and diversity. A regression-based model is built using each of the following four techniques. An evaluation follows in Section 14.3.2.

Linear regression: Linear regression is one of the most widely used regression techniques; it minimizes the root mean square error.

Quantile regression: Quantile regression is more robust to outliers than linear regression because it estimates a conditional quantile instead of the conditional mean. The median (the 0.5 quantile) is often used for estimation. Using principal components of the features instead of the true features was not found to improve the results, so they are not used.
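The robustness claim can be illustrated with the pinball (quantile) loss, which is the objective quantile regression minimizes. The following is a minimal sketch rather than the chapter's implementation: it fits only a constant predictor instead of a regression on traffic features, and the function names and the sample sizes are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss of a constant prediction y_pred."""
    total = 0.0
    for y in y_true:
        diff = y - y_pred
        # Under-estimates are weighted by q, over-estimates by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


def best_constant(y_true, q):
    """Observed value that gives the lowest pinball loss as a constant prediction."""
    return min(sorted(set(y_true)), key=lambda c: pinball_loss(y_true, c, q))


# Hypothetical measured sizes for IPs with similar traffic; 500 is an outlier.
sizes = [2, 3, 3, 4, 500]
median_fit = best_constant(sizes, q=0.5)  # minimizer of the absolute error
mean_fit = sum(sizes) / len(sizes)        # minimizer of the squared error
print(median_fit, mean_fit)
```

In this toy data the median-based fit stays with the bulk of the observations, while the mean is pulled toward the single outlier.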

PCA + MARS: Multivariate adaptive regression splines (MARS) [8] and principal component analysis (PCA) [11] are used to build regression models for IP size estimation. MARS is a nonparametric regression technique that captures nonlinear behavior by building piecewise nonlinear models. The scope of the evaluation is limited to piecewise linear functions to avoid overfitting. MARS performs variable selection automatically; however, collinearity can be a problem. To reduce multicollinearity, the principal components are identified using principal component analysis. This was found to significantly improve the final results.
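As a sketch in standard MARS notation (the symbols below are assumptions for illustration and do not come from the chapter), a piecewise linear MARS model is a weighted sum of hinge functions:

```latex
\hat{f}(z) = \beta_0 + \sum_{m=1}^{M} \beta_m \, h_m(z),
\qquad
h_m(z) \in \bigl\{ \max(0,\; z_j - t),\ \max(0,\; t - z_j) \bigr\}
```

Here z would be the vector of principal components, z_j a single component, t a knot selected by the algorithm, and the beta terms the fitted coefficients. Restricting the model to piecewise linear functions presumably means that products of hinge functions, which would introduce interaction terms, are excluded.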

Percentage regression: Percentage regression [26] minimizes the relative error (the ratio of the absolute error to the true observed value).
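One way to make a relative-error objective concrete is to minimize the sum of squared relative errors, which is equivalent to weighted least squares with weights 1/y². The sketch below assumes that squared form and a single feature; both are assumptions for illustration, and the function name and data are hypothetical rather than taken from the chapter or from [26].

```python
def percentage_regression(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared relative errors.

    Equivalent to weighted least squares with weights 1 / y**2.
    Every y must be nonzero.
    """
    ws = [1.0 / (y * y) for y in ys]
    sw = sum(ws)
    # Weighted means of x and y.
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    # Weighted covariance over weighted variance gives the slope.
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    a = my - b * mx
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
a, b = percentage_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the weights shrink as y grows, errors on small values count for more than equally sized errors on large values.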

14.3.2 Gauging Estimation Accuracy

The accuracy of the estimation process is assessed by applying the estimation models to a hold-out test set. The resulting estimates are compared against the baseline measured sizes, that is, the number of trusted cookies behind these IPs. For the purpose of modeling and gauging accuracy, only the traffic from trusted cookies is used to produce the size estimates, and traffic from nontrusted cookies is ignored. In practice, however, all the traffic is used to estimate the total number of users, not only the users with trusted cookies.

The estimated sizes from the four learning models and the measured sizes are plotted in Figure 14.5 with logarithmic axes, where a circle represents one IP or multiple overlapping IPs. The line passing through (1, 1) with slope 1 represents perfect estimation. For a quantile q, let the q-quantile-curve be the set of the q-quantile points of the measured sizes across all values of the estimated sizes on the x-axis. The 0.1, 0.5, and 0.9 quantile-curves are plotted.
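The quantile-curve definition can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce Figure 14.5: estimates are assumed to be rounded to integers so that IPs with the same estimate can be grouped, the nearest-rank quantile method is an arbitrary choice, and the function names and data are hypothetical.

```python
import math
from collections import defaultdict


def quantile(values, q):
    """q-quantile of a non-empty list using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def quantile_curve(estimated, measured, q):
    """Map each estimated size to the q-quantile of the measured sizes
    of the IPs that received that estimate."""
    groups = defaultdict(list)
    for est, meas in zip(estimated, measured):
        groups[est].append(meas)
    return {est: quantile(vals, q) for est, vals in sorted(groups.items())}


# Hypothetical (estimated, measured) sizes for six IPs.
estimated = [1, 1, 1, 2, 2, 2]
measured = [1, 1, 3, 1, 2, 4]
for q in (0.1, 0.5, 0.9):
    print(q, quantile_curve(estimated, measured, q))
```

A median curve that tracks the slope-1 line means that, for each estimated size, half of the IPs have a measured size at or below the estimate.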

The median quantile-curve for the quantile-regression-based and MARS-based IP size estimates almost overlaps the perfect-estimation line for estimates above 1. The estimates from percentage regression are less spread out for small IP sizes than those of the other learning methods; for large IP sizes, however, the other methods perform better.

The different learning models are compared using the following types of error: root mean square error (RMSE), relative error (i.e., the ratio of the absolute difference between the true and the estimated value to the true value), and bucket error.
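The first two error measures can be sketched as below. Averaging the per-IP relative errors into a single number is an assumption made here for illustration, and the function names and sample values are hypothetical.

```python
import math


def rmse(true_sizes, estimates):
    """Root mean square error between measured and estimated sizes."""
    squared = [(t - e) ** 2 for t, e in zip(true_sizes, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def mean_relative_error(true_sizes, estimates):
    """Mean of |true - estimate| / true; every true size must be nonzero."""
    ratios = [abs(t - e) / t for t, e in zip(true_sizes, estimates)]
    return sum(ratios) / len(ratios)


# Hypothetical measured and estimated sizes for three IPs.
true_sizes = [1, 4, 10]
estimates = [2, 4, 6]
print(rmse(true_sizes, estimates))
print(mean_relative_error(true_sizes, estimates))
```

In this toy data the miss on the largest IP dominates the RMSE, while the miss on the smallest IP dominates the relative error, which is why the two measures can rank models differently.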

The bucket error increases as IPs get assigned to different size buckets, where buckets are based on a function of the IP size. Minimizing the bucket error is crucial to
