Figure 5: Common log format
files are mapped into a relational database. After the data is cleansed, the web log is loaded
into a data warehouse (a relational database), and new implicit data, such as the frequency of
occurrence of access paths and the time spent by each visitor on each page, are calculated. The
database also facilitates information extraction and data summarization based on individual
attributes. In the second step, web mining techniques predict and discover interesting user
access paths. After the initial load of the web log into the data warehouse, whenever
a user access path is recorded in the web log file, a corresponding update is made to the
frame metadata, which triggers an online update of the user access patterns of web pages and
generates path traversal patterns. In summary, the system provides the services listed
in Table 1.
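As a rough illustration of the implicit data mentioned above, the sketch below derives the time spent on each page from consecutive request timestamps and counts how often each access path occurs. The function names and the input shapes (time-ordered `(timestamp, page)` pairs, and paths given as sequences of pages) are assumptions made for this example; the text does not describe how the warehouse actually computes these values.

```python
from collections import Counter
from datetime import datetime

def time_spent_per_page(visits):
    """visits: time-ordered list of (timestamp, page) pairs for one visitor.

    Returns (page, seconds) pairs. The last page has no following request,
    so its viewing time is unknown and is left out.
    """
    return [
        (page, (next_time - time).total_seconds())
        for (time, page), (next_time, _) in zip(visits, visits[1:])
    ]

def path_frequencies(paths):
    """paths: iterable of access paths, each a sequence of pages.

    Returns a Counter mapping each path (as a tuple) to its occurrence count.
    """
    return Counter(tuple(path) for path in paths)
```

Estimating viewing time from the gap between two requests is a common simplification; it cannot tell reading time from idle time.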
Web Access Log
An important source of information about web site visitors is the server transfer log
file, known as the access log (web log file). Every transaction between the
server and the browser is recorded there with the date and time, the IP address (or domain name) of the
client making the request for each page on the site, the status of that request, the number
of bytes transferred to the requester, and so on. We analyze users' activities on a web site using
server log files (access logs). There are several log formats. The most popular one
is the Common Log Format (CLF), which most web servers support. The common log
format appears in Figure 5.
Example: Raw Data of the Access Log
144.214.121.52 - - [31/Mar/2001:20:38:11 +0800] "GET /an_cityu.gif HTTP/1.1" 200 90713
144.214.121.52 - - [31/Mar/2001:20:39:31 +0800] "GET /Courses.htm HTTP/1.1" 200 1213
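Raw entries like these can be split into named fields before they are loaded into a database. The following is a minimal sketch, assuming each line follows the Common Log Format layout (host, identity, user, timestamp, request line, status, bytes). The pattern `CLF_PATTERN` and the function `parse_clf_line` are names chosen for this example, not part of the system described here.

```python
import re
from datetime import datetime

# One Common Log Format line:
# host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    # Typographic quotes sometimes replace straight quotes in copied log text.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Assumes English month abbreviations (the default C locale).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the size field means no bytes were returned.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request line is normally "METHOD path protocol"; pad if it is shorter.
    parts = entry["request"].split()
    entry["method"], entry["path"], entry["protocol"] = (parts + [None] * 3)[:3]
    return entry
```

Returning `None` for lines that do not match lets the caller count or skip malformed entries instead of stopping the load.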
Step 1: Data Preprocessing
An important step of knowledge discovery is data preprocessing. Since not all the
material within the log file is useful for the mining process, a data preparation process must
be performed first. Here we focus on techniques used to preprocess server-level web access
log files, namely access logs in Common Log Format. After data cleaning, the log entries
must be partitioned into logical clusters using one or a series of transaction identification
modules, which include user and session identification.
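The cleaning and identification steps can be sketched as follows. This is an illustration under stated assumptions, not the method used by the system: entries are assumed to be dicts with `host`, `time`, `path`, and `status` keys, the host address stands in for the user, and both the list of ignored file suffixes and the 30-minute idle timeout are hypothetical choices made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical settings for this sketch; the text does not specify them.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")
SESSION_TIMEOUT = timedelta(minutes=30)

def clean(entries):
    """Keep successful page requests; drop embedded resources such as images."""
    return [
        e for e in entries
        if e["status"] == 200
        and not e["path"].lower().endswith(IGNORED_SUFFIXES)
    ]

def identify_sessions(entries):
    """Group entries by host, then start a new session after each long idle gap."""
    by_user = {}
    for e in sorted(entries, key=lambda e: e["time"]):
        by_user.setdefault(e["host"], []).append(e)
    sessions = []
    for host, visits in by_user.items():
        current = [visits[0]]
        for prev, cur in zip(visits, visits[1:]):
            if cur["time"] - prev["time"] > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(cur)
        sessions.append((host, current))
    return sessions

# Two hand-made entries matching the raw data shown earlier.
sample = [
    {"host": "144.214.121.52", "time": datetime(2001, 3, 31, 20, 38, 11),
     "path": "/an_cityu.gif", "status": 200},
    {"host": "144.214.121.52", "time": datetime(2001, 3, 31, 20, 39, 31),
     "path": "/Courses.htm", "status": 200},
]
sessions = identify_sessions(clean(sample))
```

Identifying users by host address alone is a simplification: visitors behind a shared proxy appear as one user, so real systems often combine the address with other fields.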