Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management


[Figure labels: Legitimate users; Publisher's friend; Botnet; Bot-master; Advertising network (commissioner); publisher sites thispagemakesmoney.com, thispagetoo.com, and iwontmakemoney.com, each hosting an Ad.]

FIGURE 14.1 Three publishers contract with an advertising network to host ads for a commission for each click on these ads. They illustrate three types of traffic: (1) ads on the publisher site thispagemakesmoney.com are clicked only by legitimate users (white pointers); (2) ads on thispagetoo.com are clicked by both legitimate users and fraudsters (gray pointers); and (3) iwontmakemoney.com uses a large botnet to generate fake traffic.

Combating abusive traffic is at the heart of analyzing large data sets because it involves building statistical models at a global scale. This chapter explains how a statistical framework was built at Google to combat abusive traffic by estimating the number of users of a specific application that currently share a public IP address, hereinafter called the size of the IP. From a data analysis perspective, this is a very challenging problem because the range of sizes is huge, the size of any IP can change abruptly, and the sizes of a significant portion of the IP space need to be estimated. The estimation techniques presented are scalable and parallelizable using the MapReduce framework [6], and they provide statistically sound and timely estimates of the IP sizes. They rely solely on passively mining aggregated application log data, without probing machines or deploying active content such as Java applets.
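
To make the map and reduce structure of passive log mining concrete, the following is a minimal single-process sketch in Python. It is not the estimator developed in this chapter: the log schema `(ip, user_id)`, the function names, and the use of a distinct-identifier count as a stand-in for the IP size are all assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical log schema: one (ip, user_id) pair per application log entry.
LogRecord = Tuple[str, str]


def map_phase(records: Iterable[LogRecord]) -> Iterator[Tuple[str, str]]:
    """Emit one (key, value) pair per log entry, keyed by IP address."""
    for ip, user_id in records:
        yield ip, user_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, as a MapReduce runtime would between phases."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(ip: str, user_ids: List[str]) -> Tuple[str, int]:
    """Naive proxy for the IP size: count distinct identifiers seen on the IP."""
    return ip, len(set(user_ids))


def naive_ip_sizes(records: Iterable[LogRecord]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on an in-memory log."""
    groups = shuffle(map_phase(records))
    return dict(reduce_phase(ip, ids) for ip, ids in groups.items())
```

Because each IP is reduced independently of every other IP, the reduce step can be spread across many workers, which is the property that makes this style of analysis parallelizable.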

The chapter also describes how IP size estimation is employed to detect traffic anomalies. The detection technology relies on the observation that various machine-generated traffic attacks share a common characteristic: they induce an anomalous deviation from the IP size distribution expected from legitimate users. The detection technology is based on a fundamental characteristic of these attacks and is thus robust (e.g., to DHCP re-assignment) and hard to evade or reverse engineer, even if the spammers are aware of its existence. Most importantly, it has low complexity and is parallelizable using MapReduce.
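
The idea of flagging a deviation from the expected IP size distribution can be sketched as follows. This is only an illustration under stated assumptions, not the detection method of the chapter: the power-of-two bucketing, the total variation distance, the threshold of 0.3, and all function names are hypothetical choices.

```python
from collections import Counter
from typing import Dict, Iterable


def size_bucket(size: int) -> int:
    """Map an IP size to a power-of-two bucket: 1 -> 0, 2-3 -> 1, 4-7 -> 2, ..."""
    if size < 1:
        raise ValueError("IP size must be at least 1")
    return size.bit_length() - 1


def bucket_distribution(sizes: Iterable[int]) -> Dict[int, float]:
    """Fraction of traffic falling into each IP size bucket."""
    counts = Counter(size_bucket(s) for s in sizes)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}


def total_variation(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Total variation distance between two bucket distributions (0 to 1)."""
    buckets = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in buckets)


def is_anomalous(observed_sizes: Iterable[int],
                 expected: Dict[int, float],
                 threshold: float = 0.3) -> bool:
    """Flag traffic whose IP size distribution strays far from the expected one.

    The threshold is an assumed value chosen for this example only.
    """
    observed = bucket_distribution(observed_sizes)
    return total_variation(observed, expected) > threshold
```

In this sketch, `expected` would be built from traffic believed to be legitimate, and `observed_sizes` would hold the estimated size of the source IP of each click received by one publisher.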

The methodologies presented here, while mainly motivated by combating publishers' attacks, can be applied to detect advertisers' attacks, as well as other machine-generated anomalous traffic, including combating DDoS and detecting spam on social networks.

The rest of the chapter is organized as follows. Section 14.2 discusses the challenges and the main framework. Section 14.3 discusses building statistical models for size estimation. Section 14.4 discusses predicting the size of each possible IP in
