q_min, is set that should be satisfied by a set of legitimate clicks. Different websites have different quality scores depending on various factors, such as the services provided and the ads displayed. Thus, q_min is computed as a fixed fraction of the average quality score associated with each publisher group.
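This per-group computation can be sketched as follows; the function name and the fraction value are illustrative assumptions, not taken from the text:

```python
# Hypothetical sketch: q_min for one publisher group is a fixed fraction
# of the group's average quality score. The fraction (0.5 here) is an
# assumed value; the text does not specify it.
def compute_q_min(quality_scores, fraction=0.5):
    """quality_scores: average quality score of each publisher in the group."""
    group_avg = sum(quality_scores) / len(quality_scores)
    return fraction * group_avg
```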
For each group and each bucket, a percentile threshold, t, is computed. In real time, if a publisher receives more than t% of its traffic on this bucket, its traffic from this bucket gets filtered. To set t, a fine-grained scan of all the possible percentiles of this bucket is carried out. For each percentile, p, the traffic from all the publishers that received more than p% of their traffic from that bucket, with some binomial confidence threshold, is aggregated. If the quality score of this aggregated traffic is lower than q_min, p is set as a candidate threshold. The final threshold, t, is picked to be the candidate threshold that has the highest impact, that is, the one that discards the most traffic. This technique takes into account the observed empirical distributions, the number of available samples (IP sizes), and the desired confidence level.
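The percentile scan described above can be sketched roughly as below. The data layout and function name are assumptions, and the binomial-confidence check is omitted for brevity:

```python
def pick_threshold(publishers, q_min):
    """
    Hypothetical sketch of the percentile threshold scan for one
    (group, bucket) pair. Each publisher is a dict with:
      'bucket_frac' - fraction of that publisher's traffic in this bucket
      'clicks'      - number of clicks in this bucket
      'quality'     - average quality score of those clicks
    Returns the percentile threshold t, or None if no candidate qualifies.
    """
    best_t, best_discarded = None, 0
    for p in range(1, 100):  # fine-grained scan of candidate percentiles
        frac = p / 100.0
        # Aggregate traffic of all publishers receiving more than p% of
        # their traffic from this bucket (confidence check omitted here).
        flagged = [pub for pub in publishers if pub['bucket_frac'] > frac]
        total = sum(pub['clicks'] for pub in flagged)
        if total == 0:
            continue
        agg_quality = sum(pub['quality'] * pub['clicks'] for pub in flagged) / total
        # p is a candidate if the aggregated quality falls below q_min;
        # keep the candidate that discards the most traffic.
        if agg_quality < q_min and total > best_discarded:
            best_t, best_discarded = p, total
    return best_t
```

A lower p flags more publishers, so among qualifying candidates the smallest percentile typically discards the most traffic, matching the "highest impact" selection rule.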
Intuitively, the filtered clicks represent regions of high probability for specific
publishers, i.e., spikes in their IP size distributions, that also have a significantly
lower quality than expected for the same group of publishers and set of ads.
14.5.4.4 Performance Results
In this section, the effectiveness of the IP size histogram filter is assessed. The system is implemented using a Google-built language specifically designed to handle massive data sets using a distributed MapReduce-based infrastructure. Each phase of the above filter is distributed across a few hundred machines using the MapReduce framework [6]. For the results described in this section, τ was set to 90 days to build the threshold model, and a testing period of τ_live = 30 days was used.
Figures containing sensitive values (the quality score, the fraud scores, and the number of clicks) have been anonymized by scaling the original values by arbitrary constants, so as to preserve trends and relative differences while obscuring the absolute numbers.
14.5.4.5 IP Size Distributions
Figure 14.9a-d depicts two groups of publishers, named here A and B for anonymity purposes. Each figure is a four-dimensional plot. The x-axis represents the bucket of the IP size, while the y-axis represents the probability value. Each point is associated with a single publisher and represents the probability that the publisher receives a click of a certain size. In Figure 14.9a and c, the size of data points represents the number of clicks and the color represents the scaled fraud score. Figure 14.9b and d display the same points as in Figure 14.9a and c, with the difference that the size represents the number of clicks fed to the quality classifier system, and the color represents the scaled quality score. Circles are plotted with different sizes to represent different levels of statistical confidence.
These figures confirm on real data the motivating intuition discussed in Figure 14.8. Figure 14.9a and b shows the results on one of the largest groups, comprising hundreds of publishers. Despite the complexity of the problem and the variety of possible attacks, Figure 14.9a shows that spikes in the IP size distribution of a publisher are reliable indicators of a high fraud score. In fact, most points associated with an anomalously high probability are red, thus indicating that they are known to be abusive clicks. As an additional validation, Figure 14.9b illustrates the corresponding