14.5.2 Machine-Generated Attacks and IP Size Distributions
Machine-generated attacks can be performed in various ways, depending on the
resources available, motivations, and skills of the attackers. For instance, if an
attacker controls a large number of hosts through a botnet, the attack can be highly
distributed across the available hosts to maximize the overall amount of traffic
generated while maintaining a low-activity profile for each individual host. This
type of attack is referred to as a botnet-based attack. Conversely, if an attacker
controls only a few hosts but still wants to generate a large amount of traffic, she can use
anonymizing proxies, such as Tor nodes, to hide the actual source IPs involved. This
type of attack is referred to as a proxy-based attack. Botnet- and proxy-based attacks
are two diverse examples in the wide spectrum of possible attacks using machine-
generated traffic, in terms of both the resources required and level of sophistication.
Figure 14.8 illustrates these two attacks and how they affect the IP size distribution
associated with a publisher. Let us assume a priori knowledge
of the expected IP size distribution based on historical data. The curve marked as
“Reference PDF” represents the expected distribution of IP sizes. Figure 14.8a depicts
an example of a botnet-based attack. Bots are typically end-user machines and have
a relatively small IP size. Intuitively, this is because end-user machines are easier to
compromise than large well-maintained proxies. As a result, a botnet-based attack
generates a higher-than-expected number of clicks from IPs of small size. Analogously, a
proxy-based attack skews the IP size distribution toward large IP sizes because a
higher-than-expected number of clicks comes from large proxies, as in Figure 14.8b.
The attacks in Figure 14.8 represent two opposite scenarios. However, despite
their differences, they both can be revealed as a deviation from the expected IP size
distribution. Most attacks induce an unexpected deviation of the IP size distribution.
In fact, different deviations represent different signatures of attacks.
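The idea of reading an attack as a deviation from the reference distribution can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the detection method used in the study: the power-of-two bucketing, the function names (`bucket_of`, `size_pdf`, `deviation`), and the toy reference PDF are all hypothetical.

```python
from collections import Counter


def bucket_of(ip_size: int) -> int:
    """Power-of-two bucket index for an IP size >= 1 (assumed bucketing)."""
    return ip_size.bit_length() - 1


def size_pdf(click_ip_sizes):
    """Empirical PDF over buckets, built from the IP size behind each click."""
    counts = Counter(bucket_of(s) for s in click_ip_sizes)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def deviation(observed, reference):
    """Per-bucket difference (observed - reference); positive means excess clicks."""
    buckets = sorted(set(observed) | set(reference))
    return {b: observed.get(b, 0.0) - reference.get(b, 0.0) for b in buckets}


# Toy reference PDF over four buckets (hypothetical, not real data).
reference = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}

# Botnet-like traffic: most clicks come from IPs of small size.
botnet_like = deviation(size_pdf([1] * 6 + [2, 4, 8, 8]), reference)

# Proxy-like traffic: most clicks come from IPs of large size.
proxy_like = deviation(size_pdf([1, 2, 4] + [8] * 7), reference)

# The bucket with the largest excess hints at the kind of skew.
print(max(botnet_like, key=botnet_like.get))
print(max(proxy_like, key=proxy_like.get))
```

In this toy setting, the sign and location of the largest per-bucket excess play the role of the "signature" mentioned above: an excess in the low buckets resembles Figure 14.8a, and an excess in the high buckets resembles Figure 14.8b.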
14.5.3 The Data Set
The data set used in this analysis is the advertisement click logs collected at Google
from a sample of hundreds of thousands of different publisher websites. These logs
were used to gain insights into modern machine-generated traffic attacks, as well as to test
and evaluate the performance of this anomaly detection system on real data. In this
section, the data set and the specific features used in this study are briefly described.
The IPs were bucketed, and from each bucket, 100k click logs were sampled over a
period of 90 consecutive days. The total number of samples varies each day, but on average there were
1M IPs. The analysis and development rely on the following fields in each entry:
(i) the source IP address that generated the click; (ii) the publisher ID, a unique iden-
tifier associated with each publisher; (iii) the timestamp when the click occurred;
and (iv) the abusive flag: a binary flag that indicates whether or not the click was
tagged by any of the existing detection systems.
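As a rough illustration of the four fields listed above, a log entry could be modeled as follows. The class name, field names, and types are assumptions made for this sketch (the actual log schema is not given), and the helper simply shows one way the abusive flag might be aggregated per publisher.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickEntry:
    source_ip: str     # (i) source IP address that generated the click
    publisher_id: str  # (ii) unique identifier of the publisher
    timestamp: int     # (iii) time of the click (assumed: Unix seconds)
    abusive: bool      # (iv) True if an existing detection system tagged the click


def flagged_fraction_by_publisher(entries):
    """Fraction of each publisher's clicks that carry the abusive flag."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for entry in entries:
        totals[entry.publisher_id] += 1
        if entry.abusive:
            flagged[entry.publisher_id] += 1
    return {pub: flagged[pub] / totals[pub] for pub in totals}


# Made-up entries for illustration only (documentation IP ranges).
sample = [
    ClickEntry("192.0.2.1", "pub-a", 1_700_000_000, False),
    ClickEntry("192.0.2.2", "pub-a", 1_700_000_060, True),
    ClickEntry("198.51.100.7", "pub-b", 1_700_000_120, False),
]
print(flagged_fraction_by_publisher(sample))
```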
14.5.3.1 Assessing the Quality of Traffic
A Google-internal classifier is leveraged that takes as input click logs of network traffic
and determines the likelihood that the network traffic is fraudulent machine-generated