14.3.1 The Learning Models
For each IP, size estimates are calculated based on its traffic rate and diversity.
Regression-based models are built using the four methods below. An evaluation
follows in Section 14.3.2.
Linear regression: Linear regression is one of the most widely used regression
techniques and it minimizes the root mean square error.
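As a minimal sketch of this idea, the closed-form least-squares fit for a single feature can be written in a few lines. The data values and the names `traffic`, `size`, `fit_line`, and `rmse` are hypothetical and chosen only for illustration; the models in this section use several features.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Hypothetical data: one traffic feature per IP and the measured IP size.
traffic = [1.0, 2.0, 3.0, 4.0, 5.0]
size = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(traffic, size)
pred = [slope * t + intercept for t in traffic]
fit_error = rmse(size, pred)
```

Any other line through the same points gives a larger RMSE on this data, which is what "minimizes the root mean square error" means here.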
Quantile regression: Quantile regression is more robust to outliers than linear
regression because it estimates a conditional quantile instead of the conditional mean.
The median (0.5 quantile) is often used for estimation. Using principal components of
the features instead of the true features was not found to improve the results, and
they are hence not used.
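The robustness to outliers can be illustrated with the quantile (pinball) loss, which quantile regression minimizes. The sketch below is not a full regression: it compares two constant predictions, the mean and the median, on hypothetical sizes that contain one outlier. The names `pinball_loss` and `sizes` are illustrative.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss for a quantile level q in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

# Hypothetical measured sizes with a single large outlier.
sizes = [1, 2, 2, 3, 100]

mean_est = sum(sizes) / len(sizes)          # pulled upward by the outlier
median_est = sorted(sizes)[len(sizes) // 2]  # middle value, unaffected by it

loss_at_median = pinball_loss(sizes, [median_est] * len(sizes), 0.5)
loss_at_mean = pinball_loss(sizes, [mean_est] * len(sizes), 0.5)
```

At q = 0.5 the loss is half the mean absolute error, so the median gives the lower loss and barely moves when the outlier grows, whereas the mean follows it.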
PCA + MARS: Multivariate adaptive regression splines (MARS) [8] and principal
component analysis (PCA) [11] are used to build regression models for IP size
estimation. MARS is a nonparametric regression technique that captures nonlinear
behavior by building piecewise nonlinear models. The scope of the evaluation is
limited to piecewise linear functions to avoid overfitting. MARS performs variable
selection automatically; however, collinearity can be a problem. To reduce
multicollinearity, the principal components are identified using PCA. This
was found to significantly improve the final results.
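A piecewise linear MARS model is a weighted sum of hinge functions of the form max(0, x - t) and max(0, t - x), where t is a knot. The sketch below only evaluates such a model; the knot, the coefficients, and the names `hinge` and `piecewise_linear` are hypothetical, and neither the knot search performed by MARS nor the PCA step is shown.

```python
def hinge(x, knot, direction=1):
    """MARS-style hinge: max(0, x - knot) for direction +1, max(0, knot - x) for -1."""
    return max(0.0, direction * (x - knot))

def piecewise_linear(x, intercept, terms):
    """Evaluate intercept + sum of coef * hinge(x, knot, direction)."""
    return intercept + sum(coef * hinge(x, knot, d) for coef, knot, d in terms)

# Hypothetical fitted model with a single knot at 10:
# slope 0.2 to the left of the knot and slope 0.5 to the right of it.
intercept = 3.0
terms = [
    (0.5, 10.0, 1),    # active only for x > 10
    (-0.2, 10.0, -1),  # active only for x < 10
]

left = piecewise_linear(4.0, intercept, terms)
at_knot = piecewise_linear(10.0, intercept, terms)
right = piecewise_linear(14.0, intercept, terms)
```

Each hinge is zero on one side of its knot, so the model is linear within each region and changes slope only at the knots.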
Percentage regression: Percentage regression [26] minimizes the relative error
(the ratio of the absolute error to the true observed value).
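One way to sketch this, assuming the objective is the sum of squared relative errors (the text above does not state the exact form used in [26]), is as a weighted least-squares fit with weights 1 / y^2. The single-feature setup, the data, and the names `fit_percentage_line` and `mean_squared_relative_error` are hypothetical.

```python
def fit_percentage_line(x, y):
    """Fit y ~ slope * x + intercept by minimizing sum(((y - pred) / y) ** 2).

    This equals weighted least squares with weights 1 / y**2; all y must be nonzero.
    """
    weights = [1.0 / (yi * yi) for yi in y]
    w_sum = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / w_sum
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / w_sum
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def mean_squared_relative_error(y_true, y_pred):
    """Mean of squared (error / true value)."""
    return sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: one feature and measured sizes, including one large IP.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
measured = [2.0, 3.0, 7.0, 8.0, 30.0]

slope, intercept = fit_percentage_line(feature, measured)
pred = [slope * f + intercept for f in feature]
fit_error = mean_squared_relative_error(measured, pred)
```

Because each residual is divided by the true value, an error of 1 on an IP of size 2 counts far more than an error of 1 on an IP of size 30, so the fit favors accuracy on small IPs.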
14.3.2 Gauging Estimation Accuracy
The accuracy of the estimation process is assessed by applying the estimation models
to a held-out test set. The estimates are compared against the baseline measured
sizes, that is, the number of trusted cookies behind these IPs. For the purpose of
modeling and gauging accuracy, only the traffic from trusted cookies is used
to produce the size estimates, and traffic from nontrusted cookies is ignored.
In practice, however, all the traffic is used to estimate the total number of users,
not only the users with trusted cookies.
The estimated sizes from the four learning models and the measured sizes are
plotted in Figure 14.5 with logarithmic axes, where each circle represents one or
more overlapping IPs. The line passing through (1, 1) with slope 1 represents perfect
estimation. For a quantile q, let the q-quantile curve be the set of the q-quantile
points of the measured sizes across all values of the estimated sizes on the x-axis.
The 0.1, 0.5, and 0.9 quantile curves are plotted.
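The quantile-curve definition can be sketched as follows, assuming the estimated sizes are rounded to integers so that IPs with the same estimate can be grouped. The data, the linear-interpolation quantile rule, and the names `quantile` and `quantile_curve` are hypothetical choices, not taken from the text.

```python
from collections import defaultdict

def quantile(values, q):
    """q-quantile of a list using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def quantile_curve(estimated, measured, q):
    """For each distinct estimated size, the q-quantile of the measured sizes."""
    groups = defaultdict(list)
    for est, meas in zip(estimated, measured):
        groups[est].append(meas)
    return [(est, quantile(groups[est], q)) for est in sorted(groups)]

# Hypothetical (estimated, measured) sizes for eight IPs.
estimated = [1, 1, 1, 2, 2, 2, 2, 2]
measured = [1, 2, 3, 1, 2, 2, 3, 6]

median_curve = quantile_curve(estimated, measured, 0.5)
lower_curve = quantile_curve(estimated, measured, 0.1)
upper_curve = quantile_curve(estimated, measured, 0.9)
```

A median curve close to the slope-1 line indicates that the estimates are accurate for a typical IP, while the gap between the 0.1 and 0.9 curves shows how widely the measured sizes spread around each estimate.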
The median quantile curve for the quantile-regression-based and MARS-based IP
estimates almost overlaps the perfect-estimation line for estimates above 1. The
estimates from percentage regression are less spread out for small IPs than those of
the other learning methods; for large IPs, however, the other methods are better.
The different learning models are compared using the following types of error:
root mean square error (RMSE), relative error (i.e., the ratio of the absolute difference
between the true and the estimated value to the true value), and bucket error.
The bucket error increases as IPs are assigned to the wrong size buckets, where
buckets are based on a function of the IP size. Minimizing the bucket error is crucial to