Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management

Finally, two sets of blacklists were used, the Gmail Blacklist [25] and the

Spamhaus Exploit Blacklist (XBL) [24], to determine whether or not the IP

addresses that generate fraudulent ad events are also known to generate other types

of abusive traffic. The Gmail blacklist is a list of source IPs that are likely to send email

spam. The Spamhaus XBL is a real-time database of hosts infected by exploits.
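
The cross-check described above can be pictured as a set intersection. The following is a minimal sketch, not the authors' procedure: the function name `blacklist_overlap`, the dictionary layout, and all addresses (taken from documentation-only ranges) are assumptions made for illustration.

```python
def blacklist_overlap(flagged_ips, blacklists):
    """Return, per blacklist, the fraction of flagged IPs that it also lists."""
    flagged = set(flagged_ips)
    if not flagged:
        return {name: 0.0 for name in blacklists}
    return {
        name: len(flagged & set(listed)) / len(flagged)
        for name, listed in blacklists.items()
    }


# Hypothetical data: addresses from documentation ranges, not real blacklist entries.
flagged = ["192.0.2.1", "192.0.2.2", "198.51.100.7", "203.0.113.9"]
lists = {
    "gmail": {"192.0.2.1", "203.0.113.9"},
    "spamhaus_xbl": {"192.0.2.2"},
}
overlap = blacklist_overlap(flagged, lists)
```

A high overlap would suggest that the IPs behind fraudulent ad events also take part in other abusive activity.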

14.5.4 Click Filtering

This section focuses on the general scenario, where the click traffic received by a

publisher is a mixture of both legitimate and abusive clicks. The main goal is to

automatically detect and filter out the abusive clicks.

14.5.4.1 IP Size Histogram Filter Overview

As shown in Figure 14.8, machine-generated traffic attacks naturally induce an anomalous

IP size distribution. Keeping this in mind, a detection system based on the IP size

histogram was built that automatically filters abusive clicks associated with any publisher.

The system first groups together publishers with similar legitimate IP size distributions.

Second, for each group, a statistical model of the click traffic is built based on

historical data. Since the IP size distribution might change over time, a fresh estimation

is periodically computed. Finally, the live click traffic of each publisher is partitioned into

separate buckets depending on the IP size value, and sets of clicks from any publisher that

violate the computed model with some statistical confidence are filtered out.*
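
The last step can be sketched as follows. This is an illustrative stand-in rather than the system described in the chapter: the power-of-two bucketing, the binomial z-score test, the cutoff `MIN_CLICKS`, and the names `anomalous_buckets` and `filter_clicks` are all assumptions, and the per-group model is reduced to an expected share of clicks per bucket.

```python
import math
from collections import Counter

MIN_CLICKS = 100  # hypothetical cutoff for a "statistically significant" volume


def bucket_of(ip_size):
    """Map an integer IP size (>= 1) to a power-of-two bucket index."""
    return ip_size.bit_length() - 1


def anomalous_buckets(ip_sizes, expected_fraction, z_threshold=3.0):
    """Return buckets whose click count significantly exceeds the group model.

    ip_sizes: one IP size per click, for a single publisher.
    expected_fraction: bucket index -> expected share of clicks for the group.
    """
    n = len(ip_sizes)
    if n < MIN_CLICKS:
        return set()  # too little data for a sound estimate; skip the publisher
    counts = Counter(bucket_of(s) for s in ip_sizes)
    flagged = set()
    for bucket, observed in counts.items():
        p = expected_fraction.get(bucket, 0.0)
        mean = n * p
        std = math.sqrt(n * p * (1.0 - p))
        if std == 0.0:
            # Degenerate model (p is 0 or 1): flag only if the count exceeds it.
            if observed > mean:
                flagged.add(bucket)
        elif (observed - mean) / std > z_threshold:
            # Normal approximation to the binomial; one-sided test for excess.
            flagged.add(bucket)
    return flagged


def filter_clicks(ip_sizes, expected_fraction, z_threshold=3.0):
    """Drop every click that falls into an anomalous bucket."""
    bad = anomalous_buckets(ip_sizes, expected_fraction, z_threshold)
    return [s for s in ip_sizes if bucket_of(s) not in bad]


# Hypothetical group model and publisher traffic.
model = {0: 0.50, 1: 0.30, 2: 0.15, 3: 0.05}
traffic = [1] * 100 + [2] * 60 + [5] * 30 + [10] * 110
flagged = anomalous_buckets(traffic, model)
kept = filter_clicks(traffic, model)
```

In this made-up example the model expects about 5% of clicks in bucket 3, but the publisher receives 110 of 300 there, so that bucket is flagged and its clicks are dropped. Filtering a whole bucket is a simplification chosen to keep the sketch short.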

14.5.4.2 Grouping Publishers

Identifying a proper grouping of publishers is the first fundamental step in combating

machine-generated traffic. As observed in Section 14.5.1, the type of services provided

by the publisher's website and the type of traffic driven to her website affect the IP size

distribution of a publisher. Furthermore, the distribution is also influenced by the geolocation of

the source IP addresses visiting her website. The rationale behind this is that different

countries have different IP size distributions for various reasons, such as heavy use of

proxies, population density vs. number of IP addresses available, and government policies.

For these reasons, publishers are grouped together if they provide the same type

of service, receive clicks from the same type of connecting device (e.g., desktops,

smartphones, tablets), and from IP addresses assigned to the same country. For

instance, if a publisher receives clicks from more than one type of device, its traffic is

split depending on the type of devices, and accordingly assigned to different groups.

This provides a fine-grained grouping of publishers, which takes into account the

various factors that affect the IP size.
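
The grouping rule above can be sketched as a keyed aggregation. This is a minimal illustration under assumed data: the `Click` record, its field names, and the sample values are hypothetical, and how the service type, device type, and country are derived for each click is not covered here.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    publisher: str
    service: str   # type of service offered by the publisher's website
    device: str    # e.g. "desktop", "smartphone", "tablet"
    country: str   # country the source IP address is assigned to
    ip_size: int


def group_clicks(clicks):
    """Map (service, device, country) -> publisher -> list of IP sizes."""
    groups = defaultdict(lambda: defaultdict(list))
    for c in clicks:
        groups[(c.service, c.device, c.country)][c.publisher].append(c.ip_size)
    return groups


# Hypothetical clicks: pub_a receives traffic from two device types,
# so its traffic is split across two groups.
clicks = [
    Click("pub_a", "news", "desktop", "US", 3),
    Click("pub_a", "news", "smartphone", "US", 40),
    Click("pub_b", "news", "desktop", "US", 5),
]
groups = group_clicks(clicks)
```

Keying on the full triple means a single publisher can appear in several groups, one per combination of device type and country seen in its traffic.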

14.5.4.3 Threshold Model for Legitimate Click Traffic

After grouping publishers, a statistical threshold model of the click traffic associated

with each group is computed. First, the click traffic received by any publisher within

the same group, over a time period τ, is aggregated. Next, a minimum quality score,

* Publishers that do not receive a statistically significant number of clicks in the period considered are

excluded from this analysis, since there is not enough information to provide a statistically sound estimation.
