[Figure 14.5: four panels, (a) to (d). Each panel shows the 0.1, 0.5, and 0.9 quantile curves, with the axis label "Estimated distinct user IDs" and logarithmically spaced tick marks.]
FIGURE 14.5 Comparison of the true vs. estimated IP sizes for various regression techniques.
(a) Linear regression. (b) Quantile regression. (c) PCA + MARS. (d) Percentile regression.
applications where the rough estimate of the IP size is more important than the exact
value of the size, such as the abusive traffic filtering application discussed in Section
14.5. To calculate the bucket error, each IP is bucketed based on its size. The bucket
of an IP is given by bucket(IP) = Φ(SizeOf(IP)), for some bucketing function Φ. The bucket
error of an IP is defined as the absolute difference between its true bucket
and its estimated bucket. The average bucket error for a bucket B is the mean
bucket error of the IPs whose true bucket is B, and the overall average bucket error is the
mean of these per-bucket averages over all buckets.*
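
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text leaves Φ unspecified, so the order-of-magnitude bucketing in `phi` (number of decimal digits minus one) is a hypothetical choice, and the function and variable names are invented for this example.

```python
from collections import defaultdict


def phi(size):
    """Hypothetical bucketing function (an assumption, not from the text):
    the order of magnitude of the size, i.e. its decimal digit count minus one."""
    return len(str(max(1, int(round(size))))) - 1


def average_bucket_error(true_sizes, estimated_sizes, bucket_fn=phi):
    """Return (per-bucket average error, overall average over buckets)."""
    errors_by_bucket = defaultdict(list)
    for true_size, est_size in zip(true_sizes, estimated_sizes):
        true_bucket = bucket_fn(true_size)
        est_bucket = bucket_fn(est_size)
        # Bucket error of one IP: absolute difference between its buckets,
        # grouped under the IP's true bucket.
        errors_by_bucket[true_bucket].append(abs(true_bucket - est_bucket))

    # Average bucket error for each true bucket B.
    per_bucket = {b: sum(errs) / len(errs) for b, errs in errors_by_bucket.items()}
    # Overall error: mean of the per-bucket averages, not of the per-IP errors.
    overall = sum(per_bucket.values()) / len(per_bucket)
    return per_bucket, overall


# Made-up sizes: three small IPs and one large IP.
true_sizes = [5, 8, 9, 2000]
estimated_sizes = [7, 6, 12, 150]
per_bucket, overall = average_bucket_error(true_sizes, estimated_sizes)
print(per_bucket, overall)
```

With these made-up sizes, the single large IP forms a bucket of its own and therefore carries half of the overall error, whereas it would carry only a quarter if the errors were averaged per IP.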
The following table summarizes the error values for estimating the IP sizes using the
four learning models (discussed in Section 14.3.1) on the clicks data set. There is no
* If we averaged over all the IPs instead of over all the buckets, every IP would contribute
equally to the error, which would disregard the relative importance of larger IPs.