[Figure 14.4: two scatter-plot panels, each labeled with 0.8, 0.5, and 0.2 quantile curves; one axis is ticked from 0e+00 to 6e+05.]
FIGURE 14.4 The scatter plots of query diversity and the number of users. (a) Distinct count
(raw queries) × arbitrary constant (c1). (b) Perplexity (raw queries) × arbitrary constant (c2).
is the entropy of the distribution of feature X in the
IP traffic.*
A training sample of ≈10M IPs was collected to build a regression model of the
number of users as a function of query diversity, quantified using distinct counting
and perplexity. The data used to build the regression model for query sizes is plotted
in Figure 14.4, where each circle represents one sampled IP or several overlapping
sampled IPs. Each circle shows the distinct number of trusted cookies querying
Google, and the distinct count (Figure 14.4a) and perplexity (Figure 14.4b) of the
queries issued by these trusted cookies.
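The two diversity measures named above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the function names are hypothetical, the queries are plain strings, and perplexity uses the standard single-base definition 2^H (with H the Shannon entropy in bits) rather than the scaled two-base variant mentioned in the footnote.

```python
import math
from collections import Counter


def distinct_count(queries):
    """Number of distinct queries observed (hypothetical helper)."""
    return len(set(queries))


def perplexity(queries):
    """Standard perplexity 2**H of the empirical query distribution.

    H is the Shannon entropy in bits. This is an assumed, textbook
    definition; the chapter's scaled variant is not reproduced here.
    """
    counts = Counter(queries)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("perplexity is undefined for an empty query list")
    entropy_bits = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return 2.0 ** entropy_bits
```

Under this definition, k equally frequent queries give a perplexity of k, while a single query repeated many times gives a perplexity of 1, so perplexity behaves like an "effective" distinct count that discounts rare queries.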
Because models are built from log files aggregated at the IP level, and the
overwhelming majority of IPs have very few trusted cookies behind them, sampling
noise can cause issues. If a random training sample is selected, the few IPs with large
measured sizes, that is, the IPs with numerous trusted cookies, can easily be missed.
To avoid underrepresenting IPs with large measured sizes, stratified sampling is
used [21]. IPs are bucketed into disjoint classes by their measured sizes. Each class
then contributes samples to the global sample in the same proportion that the class
holds in the global population of IPs.
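The bucketing and proportional allocation described above can be sketched as follows. The function name, the bucket edges, and the sampling fraction are all hypothetical. The guard that draws at least one IP from every non-empty class is an added assumption, not something the text states; it slightly over-represents tiny classes so that they cannot vanish from the sample.

```python
import bisect
import random
from collections import defaultdict


def stratified_sample(ip_sizes, bucket_edges, fraction, seed=0):
    """Proportional stratified sample of IPs by measured size.

    ip_sizes: dict mapping an IP to its measured size (trusted cookies).
    bucket_edges: sorted size thresholds that separate the classes.
    fraction: share of each class to draw, with 0 < fraction <= 1.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Assign every IP to exactly one disjoint size class.
    strata = defaultdict(list)
    for ip, size in ip_sizes.items():
        strata[bisect.bisect_right(bucket_edges, size)].append(ip)

    sample = []
    for bucket in sorted(strata):
        ips = sorted(strata[bucket])  # fixed order for reproducibility
        # Same fraction per class; assumed floor of one IP per class.
        k = max(1, round(fraction * len(ips)))
        sample.extend(rng.sample(ips, k))
    return sample
```

With a plain random sample at the same fraction, a class holding a single very large IP would usually be absent; here it is always drawn.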
From Figures 14.4a and b, an observation can be made about the stratified sample
of IPs. The relationship between the measured sizes, that is, the number of trusted
cookies, and query diversity shows high heteroscedasticity. That is, the variance of
the measured sizes increases with the distinct count (Figure 14.4a) and perplexity
(Figure 14.4b) of queries.
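One simple way to check for this kind of heteroscedasticity is to split the diversity axis into equal-width bins and compare the variance of the measured sizes across bins. The sketch below is an assumed diagnostic, not a procedure taken from the chapter; the function name and binning scheme are hypothetical.

```python
import statistics


def variance_by_bin(xs, ys, n_bins):
    """Population variance of ys within equal-width bins of xs.

    xs: query-diversity values (distinct count or perplexity).
    ys: measured sizes (number of trusted cookies).
    Returns one variance per bin, or None for an empty bin.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all xs are equal
    bins = [[] for _ in range(n_bins)]
    for x, y in zip(xs, ys):
        # Clamp so that the maximum x falls into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        bins[idx].append(y)
    return [statistics.pvariance(b) if b else None for b in bins]
```

If the per-bin variances grow steadily from the first bin to the last, the spread of measured sizes widens with query diversity, which is the pattern described for both panels of Figure 14.4.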
* Perplexity was verified on several data sets to exhibit a linear relationship with the number of trusted
cookie IDs behind IPs (measured sizes). Entropy does not exhibit this quality.
Due to the sensitive nature of the exact distribution, the x-axes of Figure 14.4 are scaled, and the
perplexity is calculated with two bases, $b_1 \neq b_2$, as $\mathrm{Perp}(X, b_1, b_2) = b_1^{H_{b_2}(p)}$