and it is determined by how semantically close the words in the two sentences are (i.e., lexical cohesion) and by
the use of other linguistic devices, most notably pronouns.
For instance, if you look at these three sentences:
1. “Ciro is the best pizza maker in town.”
2. “He serves super fresh ingredients on a very thin crust pizza!”
3. “They do not think Incendies is still playing at a movie theater on Granville.”
Sentence 1 and sentence 2 are very cohesive. The word “pizza” mentioned in 1 is repeated in 2
and the subject of 1 “Ciro” agrees in number and gender with the pronoun “He” in 2. Furthermore,
the word “pizza” in 1 is also semantically close to the words “crust” and “ingredients” in 2, because
the crust is a part of a pizza, and because a pizza, being a type of food, is made of ingredients. In
contrast, sentence 1 and sentence 3 are not cohesive, since their words are semantically quite distant
and the subject of sentence 1 “Ciro” does not agree in number with the pronoun “They” in sentence 3.
One of the first and most influential methods for topic segmentation based on lexical cohesion
is TextTiling, which was developed in the 1990s [Hearst, 1997].
An extremely simplified version of the TextTiling algorithm can be described as follows
(see Jurafsky and Martin [2008] Chapter 21 for details). Two adjacent sliding windows covering
blocks of (let us say) five sentences are moved down on the target document. At the onset, the
first window covers the first five sentences of the document, while the second will cover the block
from sentence 6 to sentence 10 (see top of Figure 3.1). At each iteration, the two windows are slid
one sentence down. So, at the second step the two windows will cover the 2-6 and 7-11 sentence
blocks, respectively. At each iteration, a lexical cohesion score between the two sentence blocks is also
computed and assigned to the gap between the two blocks. Such a score intuitively measures to what
extent the words in the two blocks overlap (see Chapter 4 for a discussion of cosine similarity scores).
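As a sketch, such a gap score can be computed as the cosine similarity between the bag-of-words vectors of the two blocks. The function name and the toy tokenizer below are illustrative, not Hearst's exact formulation:

```python
import math
import re
from collections import Counter

def block_cohesion(block_a, block_b):
    """Cosine similarity between the bag-of-words vectors of two
    sentence blocks (each block is a list of sentence strings)."""
    bag_a = Counter(w for s in block_a for w in re.findall(r"\w+", s.lower()))
    bag_b = Counter(w for s in block_b for w in re.findall(r"\w+", s.lower()))
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

On the three example sentences above, this score ranks the pair (1, 2), which shares the word "pizza," higher than the pair (1, 3), which shares only function words.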
Once the end of the document is reached, the algorithm looks at the plot of the cohesion scores
collected at each gap between two adjacent blocks. Whenever there is a deep valley in the cohesion
function, the gap corresponding to the bottom of the valley (where the blocks were minimally
similar) is returned as a plausible topic segment boundary. For instance, if the plot in Figure 3.1
represented the cohesion scores computed by TextTiling on a given document, gaps 11 and 25 would
be good candidates as segment boundaries. Notice that selecting the bottom of the deep valleys of
the cohesion function matches the assumption that text from two different segments should be
minimally cohesive.
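The whole sliding-window procedure can be sketched as follows. The block size `k`, the tokenizer, and the valley-depth criterion are illustrative choices for this simplified version, not the parameters used by Hearst [1997]:

```python
import math
import re
from collections import Counter

def cosine(bag_a, bag_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(bag_a[w] * bag_b[w] for w in bag_a)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def segment_boundaries(sentences, k=5, depth=0.1):
    """Very simplified TextTiling: score every gap between two adjacent
    k-sentence blocks, then return the gaps at the bottom of valleys.

    Gap i separates sentences[i-k:i] from sentences[i:i+k]. A gap is a
    boundary candidate if its score is a local minimum lying at least
    `depth` below both neighboring gap scores (illustrative criterion).
    """
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    scores = {}
    for i in range(k, len(sentences) - k + 1):
        left = sum((bags[j] for j in range(i - k, i)), Counter())
        right = sum((bags[j] for j in range(i, i + k)), Counter())
        scores[i] = cosine(left, right)
    gaps = sorted(scores)
    boundaries = []
    for prev, cur, nxt in zip(gaps, gaps[1:], gaps[2:]):
        if (scores[cur] <= scores[prev] - depth
                and scores[cur] <= scores[nxt] - depth):
            boundaries.append(cur)
    return boundaries
```

On a document whose first half repeats one vocabulary and whose second half repeats another, the deepest valley falls at the gap where the vocabulary switches, and that gap is returned as the boundary.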
The basic ideas behind TextTiling have since been refined into more sophisticated algorithms
(e.g., see Choi [2000] and Utiyama and Isahara [2001]), which still represent challenging baselines
for more recent approaches.
Probabilistic Topic Modeling: A novel approach to topic modeling called Latent Dirichlet
Allocation (LDA) was presented in Blei et al. [2003] (see Blei and Lafferty [2009] for a gentle