Large-Scale Network Traffic Analysis for Estimating the Size of IP Addresses and Detecting Traffic Anomalies - Large Scale and Big Data: Processing and Management


a timely manner. Section 14.5 discusses using the size information for combating

abuse. Section 14.6 reviews the related work, and we conclude in Section 14.7.

14.2 IP SIZE: CHALLENGES AND APPROACH

In [19], the sizes of the IPs were defined based on two dimensions: application and

time. Each IP has a specific size for each application, depending on the number of

human users of this application. The query size of an IP is the number of humans

querying a search engine, for example, Google, which may differ from the number of

users clicking ads. Thus, sizes should be estimated using the log files of the

application whose activity is subject to estimation.

The second dimension for defining the size is time. The number of human users

behind an IP changes over time, for instance, when the IP observes a flash crowd,

i.e., an unexpected surge in usage, or when it gets reassigned to households and/or

companies. The size estimates should be issued frequently enough to cope with

these frequent changes. This calls for a short estimation time period. On the other

hand, the estimation period should be long enough to yield enough IP coverage, and

enough traffic per IP to produce statistically sound estimates.
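The trade-off between a short and a long estimation period can be made concrete with a minimal sketch. It assumes log records are available as hypothetical `(timestamp_seconds, application, ip)` tuples; the function names, the window lengths, and the `min_records` threshold are illustrative assumptions, not part of the framework in [19]. The sketch groups records per application and per time window, then reports which IPs carry enough traffic in a window to be worth estimating.

```python
from collections import defaultdict


def window_stats(records, window_seconds):
    """Group log records into fixed-length windows, per application.

    records: iterable of (timestamp_seconds, application, ip) tuples
             (a hypothetical record layout, assumed for illustration).
    Returns {(application, window_index): {ip: record_count}}.
    """
    stats = defaultdict(lambda: defaultdict(int))
    for ts, app, ip in records:
        window = int(ts // window_seconds)
        stats[(app, window)][ip] += 1
    return {key: dict(per_ip) for key, per_ip in stats.items()}


def usable_ips(per_ip_counts, min_records):
    """Return the IPs with at least min_records records in one window."""
    return {ip for ip, n in per_ip_counts.items() if n >= min_records}


# Toy records: sizes are kept separate per application and per window.
records = [
    (10, "search", "192.0.2.1"),
    (20, "search", "192.0.2.1"),
    (3700, "search", "192.0.2.1"),
    (30, "search", "192.0.2.2"),
    (4000, "ads", "192.0.2.2"),
]

hourly = window_stats(records, 3600)   # short period: fresher, sparser
daily = window_stats(records, 86400)   # long period: staler, denser
```

With the toy data, the hourly windows spread the traffic of `192.0.2.1` over two buckets, while the daily window collects all three of its search records in one. A longer window therefore lets more IPs clear a given `min_records` threshold, at the cost of reacting more slowly to reassignments and flash crowds.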

14.2.1 Estimation Challenges and Methodology

While estimating the sizes of individual IPs has ramifications for the security field,

the primary concern is violating user privacy. The work in [19] preserves

user privacy by estimating sizes of IPs using the application-level log files. First, the

application users are assumed to be only temporarily identified, for example, with

cookie IDs in the case of HTTP-based log files. Thus, no Personally Identifiable

Information, such as the name or the email address, is revealed. Second, no

individual machines are tracked. Third, the framework uses application log data aggregated

at the IP level. Over 30% of dynamic IPs are reassigned every 1 to 3 days [27], and

thus an IP is considered a temporary identification of a user. Finally, the majority of

the users share IPs. This is illustrated in Figure 14.2, where 10M random IPs (from

Google ad click log files) are shared by 26.9M total estimated users.

Estimating sizes from the log files is not straightforward. Naïve counting of

distinct user identifications, for example, cookie IDs or user agents (UAs), per IP fails

FIGURE 14.2 The estimated sizes of 10M random IPs. (Log-scale plot; x-axis: estimated distinct count (user IDs), from 1 to 10,000; y-axis ticks from 1e+00 to 1e+06.)
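For reference, the naïve baseline that the text describes, counting distinct user identifiers per IP, can be sketched as follows. This is only the baseline the text says falls short, not the estimation method of [19]. The `(ip, user_id)` record layout and the function name `naive_ip_sizes` are assumptions made for illustration.

```python
from collections import defaultdict


def naive_ip_sizes(log_records):
    """Count distinct user identifiers observed per IP.

    log_records: iterable of (ip, user_id) pairs, where user_id is a
                 temporary identifier such as a cookie ID (hypothetical
                 layout, assumed for illustration).
    Returns {ip: number_of_distinct_user_ids}.
    """
    ids_per_ip = defaultdict(set)
    for ip, user_id in log_records:
        ids_per_ip[ip].add(user_id)
    return {ip: len(ids) for ip, ids in ids_per_ip.items()}


# Toy log: one IP shared by two cookie IDs, one IP with a single cookie ID.
sample_log = [
    ("192.0.2.1", "cookie-a"),
    ("192.0.2.1", "cookie-b"),
    ("192.0.2.1", "cookie-a"),  # repeat visit, not a new user
    ("192.0.2.2", "cookie-c"),
]
sizes = naive_ip_sizes(sample_log)
```

The count treats every distinct identifier as one human user, so its quality depends entirely on how well identifiers map to people.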
