Finally, two sets of blacklists were used, the Gmail Blacklist [25] and the
Spamhaus Exploit Blacklist (XBL) [24], to determine whether or not the IP
addresses that generate fraudulent ad events are also known to generate other types
of abusive traffic. The Gmail blacklist is a list of source IPs that are likely to send email
spam. The Spamhaus XBL is a real-time database of hosts infected by exploits.
14.5.4 Click Filtering
This section focuses on the general scenario, where the click traffic received by a
publisher is a mixture of both legitimate and abusive clicks. The main goal is to
automatically detect and filter out the abusive clicks.
14.5.4.1 IP Size Histogram Filter Overview
As shown in Figure 14.8, machine-generated traffic attacks naturally induce an anoma-
lous IP size distribution. Keeping this in mind, a detection system based on the IP size
histogram was built that automatically filters abusive clicks associated with any pub-
lisher. The system first groups together publishers with similar legitimate IP size distri-
butions. Second, for each group, a statistical model of the click traffic is built based on
historical data. Since the IP size distribution might change over time, a fresh estimation
is periodically computed. Finally, the live click traffic of each publisher is partitioned into
separate buckets depending on the IP size value, and sets of clicks of any publisher that
violate the computed model with some statistical confidence are filtered out.*
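The bucketing step can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the text does not say how bucket boundaries are chosen, so power-of-two buckets are used here, and the check against a per-bucket ceiling (the hypothetical max_share mapping) is only a placeholder for the statistical model, not the model itself.

```python
from collections import Counter


def bucket_of(ip_size: int) -> int:
    """Map an IP size to a power-of-two bucket index (assumed scheme).

    Bucket b holds IP sizes in the range [2**b, 2**(b + 1)).
    """
    if ip_size < 1:
        raise ValueError("ip_size must be a positive integer")
    return ip_size.bit_length() - 1


def flag_buckets(ip_sizes, max_share):
    """Return the sorted bucket indices whose click share exceeds a ceiling.

    ip_sizes:  one IP size value per click received by a publisher.
    max_share: hypothetical mapping from bucket index to the largest
               fraction of clicks the group model tolerates in that bucket.
               Buckets missing from the mapping are never flagged.
    """
    counts = Counter(bucket_of(size) for size in ip_sizes)
    total = sum(counts.values())
    flagged = []
    for bucket, n_clicks in counts.items():
        if n_clicks / total > max_share.get(bucket, 1.0):
            flagged.append(bucket)
    return sorted(flagged)


# Ten clicks for one publisher: six of them come from IPs of size 100.
publisher_ip_sizes = [1, 1, 2, 3, 100, 100, 100, 100, 100, 100]
suspicious = flag_buckets(publisher_ip_sizes, {0: 0.5, 1: 0.5, 6: 0.3})
```

In this example the six clicks with IP size 100 fall into bucket 6 and make up 60% of the publisher's traffic, which exceeds the assumed 30% ceiling, so only that bucket is flagged for filtering.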
14.5.4.2 Grouping Publishers
Identifying a proper grouping of publishers is the first fundamental step in combating
machine-generated traffic. As observed in Section 14.5.1, the type of services provided
by the publisher's website and the type of traffic driven to her website affect the IP size
distribution of a publisher. Furthermore, the distribution is also influenced by the geolocation of
the source IP addresses visiting her website. The rationale is that different
countries have different IP size distributions for various reasons, such as heavy use of
proxies, population density relative to the number of IP addresses available, and government policies.
For these reasons, publishers are grouped together if they provide the same type
of service, receive clicks from the same type of connecting device (e.g., desktops,
smartphones, tablets), and from IP addresses assigned to the same country. For
instance, if a publisher receives clicks from more than one type of device, its traffic is
split depending on the type of devices, and accordingly assigned to different groups.
This provides a fine-grained grouping of publishers, which takes into account the
various factors that affect the IP size.
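The grouping rule can be sketched in Python as follows. The Click record and its field names are hypothetical, introduced only for illustration. The sketch assumes each click already carries the publisher's service type, the connecting device type, and the country of the source IP. Grouping per click rather than per publisher is what splits a multi-device publisher's traffic across groups.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    """Hypothetical click record; the field names are assumptions."""
    publisher_id: str
    service_type: str   # type of service the publisher provides
    device_type: str    # e.g. "desktop", "smartphone", "tablet"
    country: str        # country the source IP is assigned to
    ip_size: int


def group_clicks(clicks):
    """Group clicks by (service type, device type, country)."""
    groups = defaultdict(list)
    for click in clicks:
        key = (click.service_type, click.device_type, click.country)
        groups[key].append(click)
    return dict(groups)


clicks = [
    Click("pub1", "search", "desktop", "US", 3),
    Click("pub1", "search", "smartphone", "US", 1),
    Click("pub2", "search", "desktop", "US", 7),
]
groups = group_clicks(clicks)
```

Here pub1 receives clicks from two device types, so its traffic lands in two groups, while its desktop traffic shares a group with pub2.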
14.5.4.3 Threshold Model for Legitimate Click Traffic
After grouping publishers, a statistical threshold model of the click traffic associated
with each group is computed. First, the click traffic received by any publisher within
the same group, over a time period τ, is aggregated. Next, a minimum quality score,
* Publishers that do not receive a statistically significant number of clicks in the period considered are
excluded from this analysis, since there is not enough information to provide a statistically sound estimation.