Step 1.1: Data Loading and Cleaning
A web access log is a plain text file, so each field in the file must be identified before
the log can be analyzed. Fields are separated by spaces, and some fields are enclosed in
special characters such as double quotation marks, slashes, or opening and closing square
brackets. These characters are therefore used to identify the individual fields.
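As a minimal sketch of this field identification, the Python snippet below splits one log line into named fields with a regular expression. The text does not name the log layout, so the pattern assumes the widely used Common Log Format; the names `LOG_PATTERN` and `parse_line` and the sample line are illustrative, not taken from the source.

```python
import re

# Assumption: the log follows the Common Log Format. The source only says that
# fields are space-separated and that some are wrapped in quotes or brackets.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '                      # timestamp in square brackets
    r'"(?P<method>\S+) (?P<path>\S+)'             # request line in double quotes
    r'(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical sample entry used only for illustration.
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
fields = parse_line(sample)
```

Here the square brackets delimit the timestamp and the double quotes delimit the request, which is how the access method and access path are recovered even though the request itself contains spaces.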
A large proportion of the log file relates to the graphics and pictures that make up the
pages; these entries provide no information about how the web site is used. Data cleaning
is therefore the first step performed in the web usage mining process. Because web usage
mining investigates the sequence of access paths followed by visitors, all log entries whose
access path field has a picture filename suffix such as “.jpg”, “.JPG”, “.gif” or “.GIF”
are removed. Likewise, records whose filename ends in “counter.cgi” are eliminated.
Records whose access method field shows a method other than “GET” (e.g., “PUT”, “POST”,
“HEAD”) are also eliminated. It needs to separate the access
Figure 6: Pseudo-code for data preprocessing
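The cleaning rules described in the text can be sketched in Python as follows. This is not the pseudo-code of Figure 6, which is not reproduced here; it is an illustrative filter that assumes each log entry has already been parsed into a dict with hypothetical `method` and `path` keys.

```python
# Suffixes named in the text; entries ending in these are treated as pictures.
IMAGE_SUFFIXES = (".jpg", ".JPG", ".gif", ".GIF")

def keep_entry(method, path):
    """Return True if a log entry survives the cleaning rules from the text."""
    if method != "GET":                  # drop PUT, POST, HEAD, and so on
        return False
    if path.endswith(IMAGE_SUFFIXES):    # drop picture requests
        return False
    if path.endswith("counter.cgi"):     # drop hit-counter requests
        return False
    return True

def clean(entries):
    """Keep only the entries that describe page accesses made with GET."""
    return [e for e in entries if keep_entry(e["method"], e["path"])]

# Hypothetical parsed entries used only for illustration.
entries = [
    {"method": "GET", "path": "/index.html"},
    {"method": "GET", "path": "/images/logo.gif"},
    {"method": "POST", "path": "/form.html"},
    {"method": "GET", "path": "/cgi-bin/counter.cgi"},
    {"method": "GET", "path": "/courses/list.html"},
]
cleaned = clean(entries)
```

The suffix test follows the text literally, so a path with a query string after the filename would not be caught; handling that case would be an extension beyond what the source describes.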