Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management


14.5.2 Machine-Generated Attacks and IP Size Distributions

Machine-generated attacks can be performed in various ways, depending on the

resources available, motivations, and skills of the attackers. For instance, if an

attacker controls a large number of hosts through a botnet, the attack can be highly

distributed across the available hosts to maximize the overall amount of traffic

generated while maintaining a low-activity profile for each individual host. This

type of attack is referred to as a botnet-based attack. Conversely, if an attacker

controls only a few hosts but still wants to generate a large amount of traffic, she can use

anonymizing proxies, such as Tor nodes, to hide the actual source IPs involved. This

type of attack is referred to as a proxy-based attack. Botnet- and proxy-based attacks

are two diverse examples in the wide spectrum of possible attacks using machine-

generated traffic, in terms of both the resources required and level of sophistication.

Figure 14.8 illustrates these two attacks and how they affect the IP size distribution

associated with a publisher. Let us assume a priori knowledge

of the expected IP size distribution based on historical data. The curve marked as

“Reference PDF” represents the expected distribution of IP sizes. Figure 14.8a depicts

an example of a botnet-based attack. Bots are typically end-user machines and have

a relatively small IP size. Intuitively, this is because end-user machines are easier to

compromise than large, well-maintained proxies. As a result, a botnet-based attack

generates a higher-than-expected number of clicks from IPs of small size. Analogously, a

proxy-based attack skews the IP size distribution toward large IP sizes because a

higher-than-expected number of clicks comes from large proxies, as in Figure 14.8b.

The attacks in Figure 14.8 represent two opposite scenarios. However, despite

their differences, they both can be revealed as a deviation from the expected IP size

distribution. Most attacks induce an unexpected deviation of the IP size distribution.

In fact, different deviations represent different signatures of attacks.
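
The idea of revealing an attack as a deviation from the expected IP size distribution can be sketched in a few lines. This is only an illustration under stated assumptions, not the detection method used in the chapter: the logarithmic bucketing, the total variation distance, and the helper names (size_bucket, to_pdf, total_variation, mean_bucket_shift) are all hypothetical choices made here, and the input is toy data rather than real click logs.

```python
from collections import Counter


def size_bucket(ip_size: int) -> int:
    """Logarithmic bucket index: sizes 1, 2-3, 4-7, ... map to 0, 1, 2, ...

    The bucketing scheme is an assumption made for this sketch.
    """
    return max(ip_size, 1).bit_length() - 1


def to_pdf(ip_sizes, n_buckets):
    """Empirical distribution of clicks over IP size buckets.

    `ip_sizes` holds one IP size per click and must be non-empty.
    Sizes beyond the last bucket are clamped into it.
    """
    counts = Counter(min(size_bucket(s), n_buckets - 1) for s in ip_sizes)
    total = sum(counts.values())
    return [counts[b] / total for b in range(n_buckets)]


def total_variation(observed, reference):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    return 0.5 * sum(abs(o - r) for o, r in zip(observed, reference))


def mean_bucket_shift(observed, reference):
    """Change in the mean bucket index.

    Negative values mean extra mass at small IP sizes (botnet-like skew);
    positive values mean extra mass at large IP sizes (proxy-like skew).
    """
    return sum(b * (o - r) for b, (o, r) in enumerate(zip(observed, reference)))


# Toy data (not real traffic): one IP size per click.
reference_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
botnet_sizes = reference_sizes + [1] * 4 + [2] * 4   # extra clicks from small IPs
proxy_sizes = reference_sizes + [128] * 8            # extra clicks from large IPs

reference_pdf = to_pdf(reference_sizes, 8)
botnet_pdf = to_pdf(botnet_sizes, 8)
proxy_pdf = to_pdf(proxy_sizes, 8)

for name, pdf in (("botnet", botnet_pdf), ("proxy", proxy_pdf)):
    print(name, total_variation(pdf, reference_pdf), mean_bucket_shift(pdf, reference_pdf))
```

On this toy input both scenarios produce a nonzero distance from the reference, while the sign of the bucket shift separates them: negative for the botnet-like sample and positive for the proxy-like one. A real system would need a threshold calibrated on historical data, which this sketch does not attempt.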

14.5.3 The Data Set

The data set used in this analysis is the advertisement click logs collected at Google

from a sample of hundreds of thousands of different publisher websites. These logs

were used to gain insights into modern machine-generated traffic attacks, as well as to test

and evaluate the performance of this anomaly detection system on real data. In this

section, the data set and the specific features used in this study are briefly described.

The IPs were bucketed, and from each bucket, 100k click logs were sampled for a

period of 90 consecutive days. The total number of samples varies each day, but on average there were

1M IPs. The analysis and development rely on the following fields in each entry:

(i) the source IP address that generated the click; (ii) the publisher ID, a unique iden-

tifier associated with each publisher; (iii) the timestamp when the click occurred;

and (iv) the abusive flag: a binary flag that indicates whether or not the click was

tagged by any of the existing detection systems.
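
The four fields listed above can be pictured as a simple record type. The sketch below is a hypothetical representation: the class name, field names, and types are assumptions made for illustration, since the actual log schema is not given in the text. The small aggregation helper is likewise only an example of how the abusive flag might be summarized per publisher.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ClickLogEntry:
    """One click log entry; names and types are assumed, not the real schema."""
    source_ip: str       # (i) source IP address that generated the click
    publisher_id: str    # (ii) unique identifier of the publisher
    timestamp: datetime  # (iii) time at which the click occurred
    abusive: bool        # (iv) True if any existing detection system tagged the click


def abusive_fraction_by_publisher(entries):
    """Fraction of clicks carrying the abusive flag, per publisher."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for entry in entries:
        totals[entry.publisher_id] += 1
        if entry.abusive:
            flagged[entry.publisher_id] += 1
    return {pub: flagged[pub] / totals[pub] for pub in totals}


# Toy entries using documentation-reserved IP ranges (not real data).
sample_entries = [
    ClickLogEntry("192.0.2.10", "pub-a", datetime(2012, 1, 1, 12, 0), False),
    ClickLogEntry("192.0.2.11", "pub-a", datetime(2012, 1, 1, 12, 5), True),
    ClickLogEntry("198.51.100.7", "pub-b", datetime(2012, 1, 1, 13, 0), False),
]

print(abusive_fraction_by_publisher(sample_entries))
```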

14.5.3.1 Assessing the Quality of Traffic

A Google-internal classifier is used that takes click logs of network traffic as input

and determines the likelihood that the network traffic is fraudulent machine-generated
