[Figure 14.3 plot: x-axis "Query rate * arbitrary constant" (0 to 60); y-axis ticks from 0 to 9e+6.]
FIGURE 14.3
The query rate distribution PDF.
the output of RT-Log(p) is not used by RT-Abuse-Dtct(.) before p + 2. The longer the
lookahead delay, the higher the chance of filtering based on inaccurate sizes.
14.3 IP SIZE ESTIMATION
A user is defined as an entity that generates the average activity of a trusted human
for a specific application over a particular time period of length l. Cookie IDs in the
log files temporarily identify trusted users. Models of the average activity built from
cookie IDs are affected by noise from users who share cookies or clear them frequently:
the log entries of one cookie may show the activity of multiple users, or only part of
the activity of one user. This noise is inherent to the problem and is not addressed
within the scope of this study.
A regression model is built for the number of trusted cookies behind each IP, i.e., the
baseline of the measured sizes, as a function of two types of features: the activity
rate of the IP and the feature diversity of the IP.
The activity rate: For any activity, if the trusted-user activity rate follows a
Poisson distribution, then the size of an IP can be estimated from its activity rate,
because the sum of independent Poisson variables is itself Poisson with the summed
rate. Figure 14.3 shows the Google query rate distribution of ≈100M highly trusted
cookies.* If the average trusted user has a query rate of λ_m, an IP with M users is
expected to have a rate of M × λ_m.
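The relation above can be inverted to obtain a rough size estimate from an observed rate. The following is a minimal sketch, not the chapter's regression model: it assumes users behind an IP act independently, that each user's query count over the observation period is Poisson with mean lambda_m, and that only the total count for the IP is observed. The function name estimate_ip_size and the sample numbers are hypothetical.

```python
import math


def estimate_ip_size(query_count: int, lambda_m: float) -> tuple[float, float]:
    """Estimate the number of users M behind an IP from its query count.

    Assumption (illustrative): each user's count over the period is an
    independent Poisson variable with mean lambda_m, so the IP total is
    Poisson with mean M * lambda_m.
    """
    if lambda_m <= 0:
        raise ValueError("lambda_m must be positive")
    if query_count < 0:
        raise ValueError("query_count must be non-negative")

    # Maximum-likelihood estimate of M: observed total divided by per-user mean.
    size = query_count / lambda_m

    # Var(count) = M * lambda_m, so the plug-in standard error of the
    # estimate is sqrt(count) / lambda_m.
    std_err = math.sqrt(query_count) / lambda_m
    return size, std_err


# Hypothetical numbers: 120 queries in the period, 4 queries per trusted user.
size, std_err = estimate_ip_size(120, 4.0)
print(f"estimated size: {size:.1f} users (std. error {std_err:.2f})")
```

The standard error shrinks relative to the estimate as the count grows, so under these assumptions sizes of busy IPs are estimated more reliably than those of IPs with only a few queries.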
Explanatory diversity of the observed traffic: The explanatory diversity of one or
more features can be quantified in several ways. One simple way is to count the
distinct values of the feature in the IP traffic. More sophisticated ways include
calculating the perplexity of the feature in the IP traffic. A feature X (e.g., the
query) in the traffic of an IP typically assumes several values, x_1, x_2, … (all the
possible query phrases). The perplexity of a feature is calculated as
$\mathrm{Perp}(X, b) = b^{H_b(p)}$, where b is some base and
* Due to the sensitive nature of the exact distribution, the rate is scaled by an arbitrary constant.