\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid d_{j-1}, d_{j-2}, p_{j-1}, p_j, p_{j+1}) \tag{8.7}
\]
\[
p(D \mid P) \approx \prod_{j=1}^{n} p(d_j \mid p_{j-1}, p_j, p_{j+1}), \tag{8.8}
\]
where the model given by Eq. 8.8 has the same constrained output language as
the model of Eq. 8.6, i.e., an output language consistent with the definitions of the
events {start, middle, end} given by Eq. 8.5.
8.4.2 Sequence Model Estimation
The problem of determining the different sequence models introduced in the previous
section amounts to estimating a probability of the form p(x | p_c, y), with, e.g., x denoting
a document type and y a history of document types. As outlined in Section 8.3, a
bag-of-words model is used for the page content^7 p_c, i.e., p_c = {(c_1, w_1), ..., (c_n, w_n)},
with c_j denoting the number of occurrences of word w_j on the page, yielding
\[
p(x \mid p_c, y) = \frac{p(p_c \mid x, y)\, p(x, y)}{p(p_c, y)} \propto \prod_{j=1}^{n} p(w_j \mid x, y)^{c_j}\, p(x, y), \tag{8.9}
\]
whereby in the last step the constant factor 1/p(p_c, y) has been omitted. As can be
seen from Eq. 8.9, the sequence model estimation is reduced to the determination of
the probabilities p(w_j | x, y) and p(x, y). These probabilities are estimated empirically
from sample documents (training examples) for the various events (x, y).
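To make the use of Eq. 8.9 concrete, each candidate event (x, y) can be scored in log space with its estimated probabilities and the best-scoring event selected. The following Python sketch assumes dictionary-based models and whitespace tokenization; the function names, the fallback probability for words without an estimate, and the data layout are illustrative assumptions rather than details of the system described here.

```python
import math
from collections import Counter

def log_score(page_text, word_probs, prior, unseen_prob=1e-9):
    """Unnormalized log form of Eq. 8.9 for one event (x, y):
    sum_j c_j * log p(w_j | x, y) + log p(x, y)."""
    counts = Counter(page_text.split())  # bag of words: word w_j -> count c_j
    score = math.log(prior)              # log p(x, y)
    for word, c in counts.items():
        # fall back to a small probability for words without an estimate (assumption)
        score += c * math.log(word_probs.get(word, unseen_prob))
    return score

def classify(page_text, models):
    """Return the event (x, y) whose model yields the highest score;
    `models` maps each event to a pair (word_probs, prior)."""
    return max(models, key=lambda xy: log_score(page_text, *models[xy]))
```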
For a typical training corpus, provided by the customer, the statistics available for
determining the word probabilities p(w | x, y) are very sparse.^8 Given such statistics,
overfitting to the training data is a common problem. Smoothing techniques, like those
developed for language modeling [9], are a common tool for addressing sparse statistics
by reserving some probability mass for unobserved events. When determining the
conditional word probabilities p(w | x, y), words that have been observed in the training
data would be assigned lower probabilities than their maximum likelihood estimates,
whereas unobserved words would be assigned higher probabilities than their maximum
likelihood estimates. Statistical learning methods, e.g., [10, 11], which make use of
regularization theory, allow the tradeoff between memorization and generalization to
be determined in a more principled way than with the smoothing techniques mentioned
above. The learning method adopted here for estimating the sequence model is a
Support Vector Machine (SVM) [10]. Support Vector Machines are known to be well
suited for text applications even with a small number of training examples [12]. This
is an important aspect for the commercial use of the system, since the process of
gathering, preparing, and cleaning up training examples is time consuming and expensive.
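As a minimal illustration of the smoothing idea (not the technique of [9], and not the SVM-based estimation actually adopted here), the following Python sketch applies additive smoothing to the word counts gathered for one event (x, y); the parameter alpha and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def smoothed_word_probs(training_pages, vocabulary, alpha=1.0):
    """Additive (add-alpha) smoothing of p(w | x, y), estimated from the
    training pages observed for one event (x, y). When the vocabulary is
    large compared to the observed counts, observed words end up slightly
    below their maximum likelihood estimates and unobserved words receive
    a small positive probability instead of zero."""
    counts = Counter()
    for page in training_pages:
        counts.update(page.split())  # whitespace tokenization (assumption)
    total = sum(counts.values())
    denominator = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}
```

Because the pseudo-count alpha is added for every vocabulary word, probability mass is shifted from frequently observed words towards rare and unobserved ones, which is the redistribution described above; the regularized learning approach adopted in the system controls this tradeoff in a more principled way.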
7 Here, p_c indicates both content models we are considering: the page content at
a given time step as well as the content of the pages p_{j-1}, p_j, p_{j+1} at a time step j.
8 For a typical training corpus, almost all words occur rarely, with word counts of one or two.
 