and it is determined by how semantically close the words in the two sentences are (i.e., lexical cohesion) and by
the use of other linguistic devices, most notably pronouns.
For instance, if you look at these three sentences:
1. “Ciro is the best pizza maker in town.”
2. “He serves super fresh ingredients on a very thin crust pizza!”
3. “They do not think Incendies is still playing at a movie theater on Granville.”
Sentence 1 and sentence 2 are very cohesive. The word “pizza” mentioned in 1 is repeated in 2
and the subject of 1 “Ciro” agrees in number and gender with the pronoun “He” in 2. Furthermore,
the word “pizza” in 1 is also semantically close to the words “crust” and “ingredients” in 2, because
the crust is a part of a pizza, and because a pizza, being a type of food, is made of ingredients. In
contrast, sentence 1 and sentence 3 are not cohesive, since their words are semantically quite distant
and the subject of sentence 1 “Ciro” does not agree in number with the pronoun “They” in sentence 3.
One of the first and most influential methods for topic segmentation based on lexical cohesion
is TextTiling, which was developed in the 1990s [Hearst, 1997].
An extremely simplified version of the TextTiling algorithm can be described as follows
(see Jurafsky and Martin [2008] Chapter 21 for details). Two adjacent sliding windows covering
blocks of (let us say) five sentences are moved down on the target document. At the onset, the
first window covers the first five sentences of the document, while the second will cover the block
from sentence 6 to sentence 10 (see top of Figure 3.1). At each iteration, the two windows are slid
one sentence down. So, at the second step the two windows will cover the 2-6 and 7-11 sentence
blocks, respectively. At each iteration, a lexical cohesion score between the two sentence blocks is also
computed and assigned to the gap between the two blocks. Such a score intuitively measures to what
extent the words in the two blocks overlap (see Chapter 4 for a discussion of cosine similarity scores).
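As a sketch, such a gap score can be computed as the cosine similarity between the bag-of-words vectors of the two blocks. The function name and the toy tokenizer below are illustrative, not Hearst's exact formulation:

```python
import math
import re
from collections import Counter

def block_cohesion(block_a, block_b):
    """Cosine similarity between the bag-of-words vectors of two
    sentence blocks (each block is a list of sentence strings)."""
    bag_a = Counter(w for s in block_a for w in re.findall(r"\w+", s.lower()))
    bag_b = Counter(w for s in block_b for w in re.findall(r"\w+", s.lower()))
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

On the three example sentences above, this score ranks the pair (1, 2), which shares the word "pizza," higher than the pair (1, 3), which shares only function words.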
Once the end of the document is reached, the algorithm looks at the plot of the cohesion scores
collected at each gap between two adjacent blocks. Whenever there is a deep valley in the cohesion
function, the gap corresponding to the bottom of the valley (where the blocks were minimally
similar) is returned as a plausible topic segment boundary. For instance, if the plot in Figure 3.1
represented the cohesion scores computed by TextTiling on a given document, gaps 11 and 25 would
be good candidates as segment boundaries. Notice that selecting the bottom of the deep valleys of
the cohesion function matches the assumption that text from two different segments should be
minimally cohesive.
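The whole sliding-window procedure can be sketched as follows. The block size `k`, the tokenizer, and the valley-depth criterion are illustrative choices for this simplified version, not the parameters used by Hearst [1997]:

```python
import math
import re
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def segment_boundaries(sentences, k=5, depth=0.1):
    """Very simplified TextTiling: score every gap between two adjacent
    k-sentence blocks, then return the gaps at the bottom of valleys.

    Gap i separates sentences[i-k:i] from sentences[i:i+k]. A gap is a
    boundary candidate if its score is a local minimum lying at least
    `depth` below both neighboring gap scores (illustrative criterion).
    """
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = {}
    for i in range(k, len(sentences) - k + 1):
        left = sum((bags[j] for j in range(i - k, i)), Counter())
        right = sum((bags[j] for j in range(i, i + k)), Counter())
        scores[i] = cosine(left, right)
    gaps = sorted(scores)
    boundaries = []
    for prev, cur, nxt in zip(gaps, gaps[1:], gaps[2:]):
        if (scores[cur] <= scores[prev] - depth
                and scores[cur] <= scores[nxt] - depth):
            boundaries.append(cur)
    return boundaries
```

On a document whose first half repeats one vocabulary and whose second half repeats another, the deepest valley falls at the gap where the vocabulary switches, and that gap is returned as the boundary.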
The basic ideas behind TextTiling have since been refined into more sophisticated algorithms
(e.g., see Choi [2000] and Utiyama and Isahara [2001]), which still represent challenging baselines
for more recent approaches.
Probabilistic Topic Modeling: A novel approach to topic modeling called Latent Dirichlet
Allocation (LDA) was presented in Blei et al. [2003] (see Blei and Lafferty [2009] for a gentle