Finding sentences
We tend to think of identifying sentences as a simple process. In English, we
look for terminating characters such as a period, question mark, or exclamation mark.
However, as we will see in Chapter 3, Finding Sentences, it is not always that simple.
Embedded periods, as in "Dr. Smith" or "204 SW. Park Street", make it harder to
tell where a sentence actually ends.
This process is also called Sentence Boundary Disambiguation (SBD). It is a more
significant problem in English than in languages such as Chinese or Japanese, which have
unambiguous sentence delimiters.
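To make the difficulty concrete, here is a minimal sketch of a deliberately naive splitter that breaks the text after every period, question mark, or exclamation mark followed by whitespace. The class name NaiveSentenceSplitter and its split method are hypothetical and use only the JDK; this is not how any real SBD tool works, only an illustration of why terminator characters alone are not enough.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSentenceSplitter {

    // Naive rule (an assumption for illustration only): a sentence ends at
    // '.', '?' or '!' whenever that character is followed by whitespace.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String piece : text.split("(?<=[.?!])\\s+")) {
            if (!piece.isEmpty()) {
                sentences.add(piece);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Dr. Smith lives at 204 SW. Park Street. Is he home?";
        // Print each piece in brackets so the wrong breaks are easy to spot.
        for (String piece : split(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The text contains two sentences, but the naive rule also breaks after "Dr." and "SW.", so it returns four pieces. Resolving such cases is the disambiguation part of SBD.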
Identifying sentences is useful for a number of reasons. Some NLP tasks, such as POS
tagging and entity extraction, work on individual sentences. Question-answering applications
also need to identify individual sentences. For these processes to work correctly, sentence
boundaries must be determined correctly.
The following example demonstrates how sentences can be found using the Stanford
DocumentPreprocessor class. This class will generate a list of sentences based on
either simple text or an XML document. The class implements the Iterable interface,
allowing it to be easily used in a for-each statement.
Start by declaring a string containing the sentences, as shown here:
String paragraph = "The first sentence. The second sentence.";
Create a StringReader object based on the string. This class supports simple
read-type methods and is passed as the argument to the DocumentPreprocessor constructor:
Reader reader = new StringReader(paragraph);
DocumentPreprocessor documentPreprocessor =
    new DocumentPreprocessor(reader);
The DocumentPreprocessor object now gives access to the sentences of the paragraph.
The next statement creates a list of strings to hold the sentences found:
List<String> sentenceList = new LinkedList<String>();
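The same pattern (a Reader goes in, an Iterable of sentences comes out, and a for-each loop fills the list) can be sketched with the JDK alone. The class SimpleSentenceSource below is a hypothetical stand-in, not part of the Stanford library, and it locates boundaries with java.text.BreakIterator rather than the Stanford tokenizer, so its results may differ from those of DocumentPreprocessor.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration; NOT the Stanford DocumentPreprocessor.
public class SimpleSentenceSource implements Iterable<String> {

    private final List<String> sentences = new ArrayList<>();

    public SimpleSentenceSource(Reader reader) {
        // Read the whole input into memory.
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Let the JDK's BreakIterator locate the sentence boundaries.
        BreakIterator boundaries = BreakIterator.getSentenceInstance(Locale.US);
        boundaries.setText(text.toString());
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
    }

    // Implementing Iterable is what allows the for-each loop in main.
    @Override
    public Iterator<String> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        String paragraph = "The first sentence. The second sentence.";
        Reader reader = new StringReader(paragraph);
        List<String> sentenceList = new LinkedList<String>();
        for (String sentence : new SimpleSentenceSource(reader)) {
            sentenceList.add(sentence);
        }
        System.out.println(sentenceList);
    }
}
```

Because the class implements Iterable, the calling code needs no knowledge of how the boundaries were found; swapping in a different detector would leave the for-each loop unchanged.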