[Figure 14.1 diagram. Labels: legitimate users; publisher's friend; botnet; bot-master;
advertising network (commissioner); publisher sites thispagemakesmoney.com,
thispagetoo.com, and iwontmakemoney.com, each hosting ads.]
FIGURE 14.1 Three publishers contract with an advertising network to host ads for a
commission on each click on these ads. They illustrate three types of traffic: (1) ads on the
publisher site thispagemakesmoney.com are clicked only by legitimate users (white pointers);
(2) ads on thispagetoo.com are clicked by both legitimate users and fraudsters (gray pointers);
and (3) iwontmakemoney.com uses a large botnet to generate fake traffic.
Combating abusive traffic is at the heart of analyzing large data sets because it
involves building statistical models at a global scale. This chapter explains how a
statistical framework was built at Google to combat abusive traffic by estimating the
number of users of a specific application that are currently sharing a public IP
address, hereinafter called the size of the IP. From a data analysis perspective, this
is a very challenging problem because the range of sizes is huge, the size of any
IP can change abruptly, and the sizes of a significant portion of the IP space need to
be estimated. The estimation techniques presented are scalable, parallelizable using
the MapReduce framework [6], and provide statistically sound and timely estimates
of the IP sizes. They rely solely on passively mining aggregated application log data,
without probing machines or deploying active content like Java applets.
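To make the map and reduce structure of such a log-mining job concrete, here is a minimal
sketch in plain Python. It is not the chapter's estimator, which is developed in later
sections. It only counts distinct user tokens per IP as a naive stand-in for the size of
the IP. The log schema (an IP paired with a user token), the sample records, and the
function names are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical aggregated log records: (ip, user_token) pairs.
# The schema and the sample values are assumptions, not taken from the chapter.
LOG = [
    ("198.51.100.7", "u1"),
    ("198.51.100.7", "u2"),
    ("198.51.100.7", "u1"),
    ("203.0.113.9", "u3"),
]

def map_phase(records):
    """Emit one (ip, user_token) key-value pair per log record."""
    for ip, token in records:
        yield ip, token

def reduce_phase(pairs):
    """Group pairs by IP and count the distinct tokens seen on each IP."""
    # sorted() stands in for the shuffle step that brings equal keys together.
    for ip, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield ip, len({token for _, token in group})

# Naive per-IP size proxy: number of distinct tokens observed on the IP.
naive_sizes = dict(reduce_phase(map_phase(LOG)))
```

Because each IP is reduced independently of the others, the work splits naturally across
machines, which is the property the text relies on when it calls the techniques
parallelizable.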
The chapter also describes how IP size estimation is employed to detect traffic
anomalies. The detection technology relies on the observation that various machine-
generated traffic attacks share a common characteristic: they induce an anomalous
deviation from the IP size distribution expected from legitimate users. The detection
technology is based on a fundamental characteristic of these attacks and is thus
robust (e.g., to DHCP re-assignment) and hard to evade or reverse engineer, even if
the spammers are aware of its existence. Most importantly, it has low complexity and
is parallelizable using MapReduce.
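The idea of flagging a deviation from the expected IP size distribution can be sketched as
follows. This is an illustration under stated assumptions, not the chapter's detection
statistic: the logarithmic bucketing, the total variation distance, the threshold value,
and the sample data are all hypothetical choices made for this example.

```python
from collections import Counter

def bucket(size):
    """Map an IP size (>= 1) to a log2 bucket: 1 -> 0, 2-3 -> 1, 4-7 -> 2, ..."""
    return size.bit_length() - 1

def size_distribution(click_sizes):
    """Fraction of clicks that fall in each IP-size bucket."""
    counts = Counter(bucket(s) for s in click_sizes)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two bucketed distributions."""
    buckets = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in buckets)

# Each list holds the estimated size of the IP behind one click (made-up data).
expected = size_distribution([1, 1, 2, 3, 4, 8, 1, 2])  # baseline from legitimate traffic
observed = size_distribution([1, 1, 1, 1, 1, 1, 1, 2])  # one publisher's traffic

THRESHOLD = 0.3  # arbitrary cutoff, chosen only for this example
deviation = total_variation(expected, observed)
is_anomalous = deviation > THRESHOLD
```

In this toy data the publisher's clicks concentrate on IPs of size 1 far more than the
baseline does, so the distance exceeds the cutoff. Each publisher's distribution can be
computed and compared independently, which keeps the check easy to parallelize.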
The methodologies presented here, while mainly motivated by combating publishers'
attacks, can also be applied to detect advertisers' attacks and other
machine-generated anomalous traffic, for example to combat DDoS attacks and to
detect spam on social networks.
The rest of the chapter is organized as follows. Section 14.2 discusses the
challenges and the main framework. Section 14.3 discusses building statistical models
for size estimation. Section 14.4 discusses predicting the size of each possible IP in