The name of the file to be processed is passed to Weblog as the first argument on the
command line. A FileInputStream fin is opened from this file and an InputStreamReader
is chained to fin. This InputStreamReader is buffered by chaining it to an
instance of the BufferedReader class. The file is processed line by line in a for loop.
Each pass through the loop places one line in the String variable entry. The string
entry is then split into two substrings: ip, which contains everything before the first
space, and theRest, which is everything from the first space to the end of the string.
The position of the first space is determined by entry.indexOf(" "). The substring ip is
converted to an InetAddress object using getByName(). getHostName() then looks up the
hostname. Finally, the hostname and everything else on the line (theRest) are printed on
System.out. Output can be sent to a new file through the standard means for redirecting
output.
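
The steps just described can be sketched roughly as follows. This is a minimal
illustration rather than the book's own listing: the class name WeblogSketch, the helper
methods splitEntry() and resolve(), the explicit UTF-8 charset, and the choice to print
an unresolvable entry unchanged are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

class WeblogSketch {

  // Splits an entry at the first space: [0] is the address, [1] is everything
  // from the first space onward. Returns null if the entry contains no space.
  static String[] splitEntry(String entry) {
    int index = entry.indexOf(" ");
    if (index < 0) return null;
    return new String[] {entry.substring(0, index), entry.substring(index)};
  }

  // Replaces the leading address with its hostname. The entry is returned
  // unchanged if it cannot be split or resolved (an assumption of this sketch).
  static String resolve(String entry) {
    String[] parts = splitEntry(entry);
    if (parts == null) return entry;
    try {
      InetAddress address = InetAddress.getByName(parts[0]);
      return address.getHostName() + parts[1];
    } catch (UnknownHostException ex) {
      return entry;
    }
  }

  public static void main(String[] args) throws IOException {
    // FileInputStream -> InputStreamReader -> BufferedReader, as in the text.
    // The UTF-8 charset is an assumption; the text does not name an encoding.
    try (FileInputStream fin = new FileInputStream(args[0]);
         InputStreamReader in = new InputStreamReader(fin, StandardCharsets.UTF_8);
         BufferedReader bin = new BufferedReader(in)) {
      for (String entry = bin.readLine(); entry != null; entry = bin.readLine()) {
        System.out.println(resolve(entry));
      }
    }
  }
}
```

Keeping the string splitting separate from the lookup makes the parsing step easy to
check without touching the network.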
Weblog is more efficient than you might expect. Most web browsers generate multiple
logfile entries per page served, because there's an entry in the log not just for the page
itself but for each graphic on the page. And many visitors request multiple pages while
visiting a site. DNS lookups are expensive and it simply doesn't make sense to look up
each site every time it appears in the logfile. The InetAddress class caches requested
addresses. If the same address is requested again, it can be retrieved from the cache
much more quickly than from DNS.
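
One rough way to observe the cache is to time the same lookup twice in one run. The
sketch below is only an illustration: the class name CacheTimingSketch, the helper
timeLookup(), and the placeholder host www.example.com are assumptions, and it times
forward lookups through getByName() only. The actual numbers depend on the JDK, the
operating system's resolver, and the network.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class CacheTimingSketch {

  // Times one forward lookup of host, in nanoseconds.
  static long timeLookup(String host) {
    long start = System.nanoTime();
    try {
      InetAddress.getByName(host);
    } catch (UnknownHostException ex) {
      // A failed lookup is still timed; the JDK also caches failures briefly.
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    // "www.example.com" is only a placeholder; pass a host name to override it.
    String host = args.length > 0 ? args[0] : "www.example.com";
    long first = timeLookup(host);   // normally has to ask the resolver
    long second = timeLookup(host);  // normally answered from the cache
    System.out.println("first lookup:  " + first / 1000 + " microseconds");
    System.out.println("second lookup: " + second / 1000 + " microseconds");
  }
}
```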
Nonetheless, this program could certainly be faster. In my initial tests, it took more than
a second per log entry. (Exact numbers depend on the speed of your network connection,
the speed of the local and remote DNS servers, and network congestion when the program
is run.) The program spends a huge amount of time sitting and waiting for DNS
requests to return. Of course, this is exactly the problem multithreading is designed to
solve. One main thread can read the logfile and pass off individual entries to other
threads for processing.
A thread pool is absolutely necessary here. Over the space of a few days, even
low-volume web servers can generate a logfile with hundreds of thousands of lines. Trying
to process such a logfile by spawning a new thread for each entry would rapidly bring
even the strongest virtual machine to its knees, especially because the main thread can
read logfile entries much faster than individual threads can resolve domain names and
die. Consequently, reusing threads is essential. The number of threads is stored in a
tunable parameter, numberOfThreads, so that it can be adjusted to fit the VM and network
stack. (Launching too many simultaneous DNS requests can also cause problems.)
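
A fixed-size pool of this kind might be set up along the following lines. This is a
sketch under stated assumptions, not the book's listing: the class name
PooledLookupSketch, the methods processAll() and resolve(), and the pool size of 4 are
hypothetical, and the lookup is passed in as a function so the pooling logic can be
exercised without a network.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PooledLookupSketch {

  // Tunable pool size; 4 is an arbitrary placeholder, not a recommendation.
  static final int numberOfThreads = 4;

  // Runs lookup on every entry using a fixed pool of reusable threads and
  // returns the results in the same order as the input.
  static List<String> processAll(List<String> entries,
      Function<String, String> lookup, int threads) {
    ExecutorService executor = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String entry : entries) {
        Callable<String> task = () -> lookup.apply(entry);
        futures.add(executor.submit(task)); // queued until a thread is free
      }
      List<String> results = new ArrayList<>();
      for (int i = 0; i < futures.size(); i++) {
        try {
          results.add(futures.get(i).get()); // blocks until task i is done
        } catch (InterruptedException ex) {
          Thread.currentThread().interrupt();
          results.add(entries.get(i)); // fall back to the raw entry
        } catch (ExecutionException ex) {
          results.add(entries.get(i)); // fall back to the raw entry
        }
      }
      return results;
    } finally {
      executor.shutdown();
    }
  }

  // Replaces the leading address of an entry with its hostname, if possible.
  static String resolve(String entry) {
    int index = entry.indexOf(' ');
    if (index < 0) return entry;
    try {
      return InetAddress.getByName(entry.substring(0, index)).getHostName()
          + entry.substring(index);
    } catch (UnknownHostException ex) {
      return entry;
    }
  }

  public static void main(String[] args) throws IOException {
    // Reading the whole file at once keeps the sketch short; a very large
    // logfile would be better read and submitted line by line.
    List<String> entries = Files.readAllLines(Paths.get(args[0]));
    for (String line : processAll(entries, PooledLookupSketch::resolve, numberOfThreads)) {
      System.out.println(line);
    }
  }
}
```

Because the tasks are queued rather than each given a thread of its own, at most
numberOfThreads lookups are in flight at any moment, however fast the entries are read.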
This program is now divided into two classes. The first class, LookupTask, shown in
Example 4-11, is a Callable that parses a logfile entry, looks up a single address, and
replaces that address with the corresponding hostname. This doesn't seem like a lot of
work and CPU-wise, it isn't. However, because it involves a network connection, and