q_min, is set that should be satisfied by a set of legitimate clicks. Different websites have different quality scores depending on various factors, such as the services provided and the ads displayed. Thus, q_min is computed as a fixed fraction of the average quality score associated with each publisher group.
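This per-group computation can be sketched as follows; the function name and the fraction value are illustrative assumptions, not taken from the text:

```python
# Hypothetical sketch: q_min for one publisher group is a fixed fraction
# of the group's average quality score. The fraction (0.5 here) is an
# assumed value; the text does not specify it.
def compute_q_min(quality_scores, fraction=0.5):
    """quality_scores: average quality score of each publisher in the group."""
    group_avg = sum(quality_scores) / len(quality_scores)
    return fraction * group_avg
```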
For each group and each bucket, a percentile threshold, t, is computed. In real time, if a publisher receives more than t% of its traffic on this bucket, its traffic from this bucket gets filtered. To set t, a fine-grained scan of all the possible percentiles of this bucket is carried out. For each percentile, p, the traffic from all the publishers that received more than p% of their traffic from that bucket, with some binomial confidence threshold, is aggregated. If the quality score of this aggregated traffic is lower than q_min, p is set as a candidate threshold. The final threshold, t, is picked to be the candidate threshold that has the highest impact, that is, the one that discards the most traffic. This technique takes into account the observed empirical distributions, the number of available samples (IP sizes), and the desired confidence level.
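The percentile scan described above can be sketched roughly as below. The data layout and function name are assumptions, and the binomial-confidence check is omitted for brevity:

```python
def pick_threshold(publishers, q_min):
    """
    Hypothetical sketch of the percentile threshold scan for one
    (group, bucket) pair. Each publisher is a dict with:
      'bucket_frac' - fraction of that publisher's traffic in this bucket
      'clicks'      - number of clicks in this bucket
      'quality'     - average quality score of those clicks
    Returns the percentile threshold t, or None if no candidate qualifies.
    """
    best_t, best_discarded = None, 0
    for p in range(1, 100):  # fine-grained scan of candidate percentiles
        frac = p / 100.0
        # Aggregate traffic of all publishers receiving more than p% of
        # their traffic from this bucket (confidence check omitted here).
        flagged = [pub for pub in publishers if pub['bucket_frac'] > frac]
        total = sum(pub['clicks'] for pub in flagged)
        if total == 0:
            continue
        agg_quality = sum(pub['quality'] * pub['clicks'] for pub in flagged) / total
        # p is a candidate if the aggregated quality falls below q_min;
        # keep the candidate that discards the most traffic.
        if agg_quality < q_min and total > best_discarded:
            best_t, best_discarded = p, total
    return best_t
```

A lower p flags more publishers, so among qualifying candidates the smallest percentile typically discards the most traffic, matching the "highest impact" selection rule.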
Intuitively, the filtered clicks represent regions of high probability for specific
publishers, i.e., spikes in their IP size distributions, that also have a significantly
lower quality than expected for the same group of publishers and set of ads.
14.5.4.4 Performance Results
In this section, the effectiveness of the IP size histogram filter is assessed. The system is implemented using a Google-built language specifically designed to handle massive data sets using a distributed MapReduce-based infrastructure. Each phase of the above filter is distributed across a few hundred machines using the MapReduce framework [6]. For the results described in this section, τ was set to 90 days to build the threshold model, and a testing period of τ_live = 30 days was used.
Figures containing sensitive values (the quality score, the fraud scores, and the number of clicks) have been anonymized by scaling the original values by arbitrary constants, so as to preserve trends and relative differences while obscuring the absolute numbers.
14.5.4.5 IP Size Distributions
Figure 14.9a-d depicts two groups of publishers, named here A and B for anonymity purposes. Each figure is a four-dimensional plot. The x-axis represents the bucket of the IP size, while the y-axis represents the probability value. Each point is associated with a single publisher and represents the probability that the publisher receives a click of a certain size. In Figure 14.9a and c, the size of data points represents the number of clicks and the color represents the scaled fraud score. Figure 14.9b and d display the same points as in Figure 14.9a and c, with the difference that the size represents the number of clicks fed to the quality classifier system, and the color represents the scaled quality score. Circles are plotted with different sizes to represent different levels of statistical confidence.
These figures confirm on real data the motivating intuition discussed in Figure 14.8. Figure 14.9a and b shows the results on one of the largest groups, comprising hundreds of publishers. Despite the complexity of the problem and the variety of possible attacks, Figure 14.9a shows that spikes in the IP size distribution of a publisher are reliable indicators of a high fraud score. In fact, most points associated with an anomalously high probability are red, thus indicating that they are known to be abusive clicks. As an additional validation, Figure 14.9b illustrates the corresponding