Table 2: Cleaned web log data stored in main table

IP Address      Date          Time      URL Request
144.214.36.91   07/May/2001   22:42:04  A.htm
144.214.36.91   07/May/2001   22:45:06  B.htm
144.214.36.91   07/May/2001   22:49:15  D.htm
144.214.36.91   07/May/2001   22:52:44  E.htm
144.214.36.91   07/May/2001   23:40:00  B.htm
144.214.36.91   07/May/2001   23:42:00  A.htm
144.214.36.92   07/May/2001   23:43:05  A.htm
144.214.36.92   07/May/2001   23:46:06  B.htm
144.214.36.92   07/May/2001   23:47:30  C.htm
144.214.36.93   07/May/2001   23:47:50  E.htm
144.214.36.93   07/May/2001   23:48:15  C.htm

time field because separating them makes it easier to compute the time spent on each
page. The access time field contains both the access date and the access time, separated
by a colon (:).
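
As a minimal sketch of this split, assume the raw access time field looks like `07/May/2001:22:42:04` (date, a colon, then time, consistent with the values in Table 2). The function names below are hypothetical, and treating the gap between two consecutive requests as the time spent on a page is an illustrative assumption, not something the text specifies.

```python
from datetime import datetime


def split_access_time(field):
    """Split 'DD/Mon/YYYY:HH:MM:SS' at the first colon into (date, time)."""
    date_part, time_part = field.split(":", 1)
    return date_part, time_part


def seconds_between(date_a, time_a, date_b, time_b):
    """Seconds elapsed between two (date, time) pairs in Table 2's format."""
    fmt = "%d/%b/%Y %H:%M:%S"  # %b parses English month abbreviations such as 'May'
    start = datetime.strptime(f"{date_a} {time_a}", fmt)
    end = datetime.strptime(f"{date_b} {time_b}", fmt)
    return int((end - start).total_seconds())


# The first two rows of Table 2 (A.htm, then B.htm, from the same IP address).
first = split_access_time("07/May/2001:22:42:04")
second = split_access_time("07/May/2001:22:45:06")
stay_on_a = seconds_between(*first, *second)
```

Only the first colon is used as the separator, because the time part itself contains colons.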
When the web server cannot retrieve the requested files successfully, the failure is reflected
in the value of the status. The status value for a successful file retrieval is 200, while
that of an unsuccessful retrieval is 400 or greater. When the file is reloaded from the web
server, the status is 304. Therefore, records with a status value other than 200
are eliminated. Moreover, some fields are enclosed in special characters at the beginning or
end. These characters must be removed before the records are stored in the
database. Figure 6 shows the pseudo-code for data preprocessing.
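
The following is a minimal sketch of the two cleaning rules described above. It is not the pseudo-code of Figure 6. It assumes each record has already been split into a dictionary of string fields, and the field names (`ip`, `access_time`, `url`, `status`) and the set of enclosing characters are hypothetical choices made for illustration.

```python
# Assumed set of characters that may enclose a field (brackets, quotes, spaces).
ENCLOSING_CHARS = '[]" '


def clean_records(records):
    """Strip enclosing characters from every field and keep only status-200 records."""
    cleaned = []
    for record in records:
        stripped = {name: value.strip(ENCLOSING_CHARS) for name, value in record.items()}
        if stripped["status"] != "200":
            continue  # drop failed retrievals (400 or greater) and 304 reloads
        cleaned.append(stripped)
    return cleaned


raw = [
    {"ip": "144.214.36.91", "access_time": "[07/May/2001:22:42:04", "url": '"A.htm"', "status": "200"},
    {"ip": "144.214.36.91", "access_time": "[07/May/2001:22:43:10", "url": '"X.htm"', "status": "404"},
    {"ip": "144.214.36.91", "access_time": "[07/May/2001:22:44:00", "url": '"A.htm"', "status": "304"},
]
valid = clean_records(raw)
```

In this made-up sample, only the first record survives, with the leading bracket and the quotes removed from its fields.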
After all irrelevant records are removed from the web log file, the valid records are
stored in the main table, as shown in Table 2.
Step 1.2: User Identification and Session Identification
The cleaning techniques discussed earlier are used to preprocess a given web server
log. After data cleaning, the log entries must be partitioned into logical clusters using one
or a series of transaction identification modules. In the best case, we rely on the values in
the rfcname and/or logname fields to identify a user accurately. In most cases, however, these
fields are empty. In the absence of such information, the host name or IP address is
the only available means of identifying a user. In an ideal scenario, each user is allocated a
unique IP address when accessing a web site. However, this assumption does not always hold. For
example, some Internet Service Providers (ISPs) randomly assign an IP address to each
user's request (dynamic IP assignment), and some repeat users access the web each time from
a different machine or web browser.
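
The fallback order described above can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the function name `user_key` and the `host` key are hypothetical, the order in which rfcname and logname are tried is arbitrary, and a lone hyphen is treated as empty because the Common Log Format uses `-` for a missing value.

```python
def user_key(record):
    """Return an identifier for the user behind a log record.

    Prefer rfcname or logname when one is present; otherwise fall back
    to the host name or IP address.
    """
    for field in ("rfcname", "logname"):
        value = record.get(field, "").strip()
        if value and value != "-":  # "-" is treated as an empty field
            return f"{field}:{value}"
    return f"host:{record['host']}"


anonymous = user_key({"rfcname": "-", "logname": "", "host": "144.214.36.91"})
named = user_key({"rfcname": "-", "logname": "alice", "host": "144.214.36.92"})
```

Because of the problems listed above, a host-based key alone may merge several users or split one user across keys.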
Thus, we use the host name combined with the user navigation session (user session) to
identify a user. A user session consists of all the page references made by a user during a single
visit to a web site. Identifying user sessions is similar to the problem of identifying individual