The name of the file to be processed is passed to Weblog as the first argument on the
command line. A FileInputStream fin is opened from this file and an InputStreamReader
is chained to fin. This InputStreamReader is buffered by chaining it to an instance of
the BufferedReader class. The file is processed line by line in a for loop.
Each pass through the loop places one line in the String variable entry. entry is then
split into two substrings: ip, which contains everything before the first space, and
theRest, which is everything from the first space to the end of the string. The position
of the first space is determined by entry.indexOf(" "). The substring ip is converted
to an InetAddress object using getByName(). getHostName() then looks up the hostname.
Finally, the hostname and everything else on the line (theRest) are printed on
System.out. Output can be sent to a new file through the standard means for redirecting
output.
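
The steps just described can be sketched roughly as follows. This is a minimal illustration, not the book's listing: the class name WeblogSketch and the helper methods split() and resolve() are assumptions introduced here, as is the use of try-with-resources and the choice to print a line unchanged when it has no space or its address cannot be resolved.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.InetAddress;
import java.net.UnknownHostException;

class WeblogSketch {

    // Hypothetical helper: splits an entry at the first space.
    // [0] is ip (everything before the space), [1] is theRest
    // (from the space to the end). Returns null if there is no space.
    static String[] split(String entry) {
        int index = entry.indexOf(" ");
        if (index < 0) return null;
        return new String[] { entry.substring(0, index), entry.substring(index) };
    }

    // Hypothetical helper: replaces the leading address with its hostname.
    // Falls back to the unchanged entry if the line cannot be processed.
    static String resolve(String entry) {
        String[] parts = split(entry);
        if (parts == null) return entry;
        try {
            InetAddress address = InetAddress.getByName(parts[0]);
            return address.getHostName() + parts[1];
        } catch (UnknownHostException ex) {
            return entry;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("Usage: java WeblogSketch logfile");
            return;
        }
        // FileInputStream -> InputStreamReader -> BufferedReader, as in the text
        try (FileInputStream fin = new FileInputStream(args[0]);
             Reader in = new InputStreamReader(fin);
             BufferedReader bin = new BufferedReader(in)) {
            // One line per pass through the for loop
            for (String entry = bin.readLine(); entry != null; entry = bin.readLine()) {
                System.out.println(resolve(entry));
            }
        } catch (IOException ex) {
            System.err.println("Exception: " + ex);
        }
    }
}
```

Running it as java WeblogSketch access.log > resolved.log (both filenames are placeholders) would send the rewritten entries to a new file through ordinary shell redirection.
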
Weblog is more efficient than you might expect. Most web browsers generate multiple
logfile entries per page served, because there's an entry in the log not just for the page
itself but for each graphic on the page. And many visitors request multiple pages while
visiting a site. DNS lookups are expensive and it simply doesn't make sense to look up
each site every time it appears in the logfile. The InetAddress class caches requested
addresses. If the same address is requested again, it can be retrieved from the cache
much more quickly than from DNS.
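
One informal way to observe the effect of caching is to time the same lookup twice. The sketch below is only an illustration: the class name CacheTimingSketch, the helper timeLookup(), and the placeholder hostname are assumptions, and it measures forward lookups with getByName() only. The actual timings depend on the JVM's cache settings, the operating system's resolver, and the network, so no particular output should be expected.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class CacheTimingSketch {

    // Hypothetical helper: returns how long one forward lookup of the
    // given name takes, in nanoseconds.
    static long timeLookup(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Placeholder hostname; pass a real one on the command line
        String host = args.length > 0 ? args[0] : "www.example.com";
        long first = timeLookup(host);   // likely goes out to DNS
        long second = timeLookup(host);  // may be answered from the cache
        System.out.println("first lookup:  " + first / 1000 + " microseconds");
        System.out.println("second lookup: " + second / 1000 + " microseconds");
    }
}
```
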
Nonetheless, this program could certainly be faster. In my initial tests, it took more than
a second per log entry. (Exact numbers depend on the speed of your network connection,
the speed of the local and remote DNS servers, and network congestion when the program
is run.) The program spends a huge amount of time sitting and waiting for DNS
requests to return. Of course, this is exactly the problem multithreading is designed to
solve. One main thread can read the logfile and pass off individual entries to other
threads for processing.
A thread pool is absolutely necessary here. Over the space of a few days, even
low-volume web servers can generate a logfile with hundreds of thousands of lines. Trying
to process such a logfile by spawning a new thread for each entry would rapidly bring
even the strongest virtual machine to its knees, especially because the main thread can
read logfile entries much faster than individual threads can resolve domain names and
die. Consequently, reusing threads is essential. The number of threads is stored in a
tunable parameter, numberOfThreads, so that it can be adjusted to fit the VM and network
stack. (Launching too many simultaneous DNS requests can also cause problems.)
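
The thread-reuse idea can be sketched with a fixed-size pool from java.util.concurrent. This is a simplified illustration rather than the book's implementation: the class name PooledLookupSketch and the method processAll() are assumptions, and the per-entry work is passed in as a generic Function so the pooling logic can be shown without any network access. Only numberOfThreads comes from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PooledLookupSketch {

    // Hypothetical method: applies lookup to every entry using at most
    // numberOfThreads worker threads and returns results in input order.
    static List<String> processAll(List<String> entries,
                                   Function<String, String> lookup,
                                   int numberOfThreads)
            throws InterruptedException, ExecutionException {
        // A fixed pool reuses the same threads instead of spawning one per entry
        ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String entry : entries) {
                Callable<String> task = () -> lookup.apply(entry);
                futures.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } finally {
            executor.shutdown(); // let the worker threads exit once idle
        }
    }
}
```

This version holds one Future per entry in memory, which is fine for a sketch but may need rethinking for a logfile with hundreds of thousands of lines.
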
This program is now divided into two classes. The first class, LookupTask, shown in
Example 4-11, is a Callable that parses a logfile entry, looks up a single address, and
replaces that address with the corresponding hostname. This doesn't seem like a lot of
work and CPU-wise, it isn't. However, because it involves a network connection, and