In our experiments, we used a max-idle threshold of 1800 seconds (30 minutes), a value commonly used in Web usage mining studies.
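The max-idle rule splits one user's request stream into sessions whenever the gap between consecutive requests exceeds the threshold. A minimal Python sketch of this sessionization step (the function name and the (timestamp, url) input format are illustrative assumptions, not the chapter's actual preprocessing code):

```python
MAX_IDLE = 1800  # seconds; a longer gap between requests starts a new session


def split_sessions(requests):
    """Split one user's (timestamp, url) requests, assumed sorted by
    timestamp, into sessions using the max-idle threshold."""
    sessions = []
    current = []
    last_ts = None
    for ts, url in requests:
        # Close the current session if the idle gap exceeds the threshold.
        if last_ts is not None and ts - last_ts > MAX_IDLE:
            sessions.append(current)
            current = []
        current.append(url)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


# Requests at t=0 and t=100 fall in one session; t=4000 starts a new one.
print(split_sessions([(0, "A"), (100, "B"), (4000, "C")]))
```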
Prediction Model Learning
Once the Web logs have been preprocessed, the next step is to build prediction models for the navigation patterns of Web users. If P1, P2, P3, ..., Pi are the pages browsed by the user so far, the prediction model tries to predict the next page, Pi+1.
We use an N-gram based prediction model. An N-gram is a string of N characters drawn from an alphabet; its order is N, the number of characters it contains. In the context of this work, the alphabet is the set of URLs of Web pages on the Web server, so an N-gram is a sequence of N URLs.
After Web server logs are preprocessed into a number of sessions, an N-gram prediction model can
be built as follows:
1. Each session is decomposed into a set of overlapping subpaths of length N.
2. These subpaths are entered into an N-gram table T as N-grams.
3. For each N-gram in T, the pages that immediately follow it in any session are recorded, together with their occurrence counts.
4. The probability of each possible next page is calculated from these occurrence counts.
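The four steps above can be sketched in Python as follows (a minimal illustration, assuming sessions are given as lists of URL strings; the function name is ours, not the chapter's):

```python
from collections import defaultdict


def build_ngram_table(sessions, n):
    """Build an N-gram prediction table from preprocessed sessions.

    Maps each length-n path (a tuple of URLs) to a dict of
    next-page -> probability.
    """
    # Step 1-3: slide a window of length n over each session and count
    # the page immediately following each n-gram.
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(len(session) - n):
            ngram = tuple(session[i:i + n])
            next_page = session[i + n]
            counts[ngram][next_page] += 1
    # Step 4: convert occurrence counts to probabilities.
    table = {}
    for ngram, nexts in counts.items():
        total = sum(nexts.values())
        table[ngram] = {page: c / total for page, c in nexts.items()}
    return table


# The two example sessions from the text:
sessions = [["A", "B", "C", "D", "H"], ["B", "C", "D", "G"]]
table = build_ngram_table(sessions, 3)
# table[("A", "B", "C")] == {"D": 1.0}
# table[("B", "C", "D")] == {"H": 0.5, "G": 0.5}
```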
For example, given a log file consisting of the following sessions, a 3-gram model for prediction can
be built as follows:
A, B, C, D, H
B, C, D, G
The first session is decomposed into two 3-grams: “A, B, C” and “B, C, D.” The second session is
decomposed into one 3-gram: “B, C, D.” For 3-gram “A, B, C,” the next page would be D, while for
3-gram “B, C, D,” the next page may be G or H. The complete N-gram table, T, is given in Table 1.
The N-gram table T is used as our prediction model. Given an observed path, it is matched against
N-grams in T. The predicted next pages and their corresponding probabilities of the matching entries
in T are returned for shortcut recommendations.
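A lookup against the table can be sketched as follows (again an illustrative Python fragment under our own naming; the small table literal is the one derived from the example sessions above):

```python
def predict_next(table, observed_path, n):
    """Match the last n pages of the observed path against the N-gram
    table; return (next_page, probability) pairs, highest probability
    first, or an empty list when there is no matching entry."""
    if len(observed_path) < n:
        return []
    key = tuple(observed_path[-n:])
    predictions = table.get(key, {})
    return sorted(predictions.items(), key=lambda kv: -kv[1])


# Table as built from the example sessions "A,B,C,D,H" and "B,C,D,G".
table = {
    ("A", "B", "C"): {"D": 1.0},
    ("B", "C", "D"): {"H": 0.5, "G": 0.5},
}
print(predict_next(table, ["X", "B", "C", "D"], 3))
```

All matching predictions are returned, not just the most probable one, so the caller can offer several shortcut recommendations at once.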
The algorithm for constructing an N-gram model is given in Figure 1. It is modified from the algorithm of Su et al. (2000) so that it stores all possible next pages for an N-gram, rather than only the most probable one.
This algorithm has a time complexity of O(|L| * |S|), where |L| is the number of sessions and |S| is the maximum session length. The running time is dominated by the first for loop: its outer loop runs |L| times and its inner loop runs at most |S| times, giving the time complexity of O(|L| * |S|).
An N-gram based prediction model was proposed by Su et al. (2000), who evaluated the model's accuracy without a specific application. In our study, the N-gram model is evaluated on its effectiveness in improving mobile Web navigation. To suit our application, the model is modified so that a lookup operation for an N-gram returns all predictions, instead of only the most probable one.