Java Reference
In-Depth Information
What makes SBD difficult?
Breaking text into sentences is difficult for a number of reasons:
• Punctuation is frequently ambiguous
• Abbreviations often contain periods
• Sentences may be embedded within each other by the use of quotes
• With more specialized text, such as tweets and chat sessions, we may need to
consider the use of new lines or the completion of clauses
Punctuation ambiguity is best illustrated by the period. It is frequently used to mark the
end of a sentence. However, it can be used in a number of other contexts as well, including
abbreviations, numbers, e-mail addresses, and ellipses. Other punctuation characters, such
as question marks and exclamation marks, are also used in embedded quotes and in
specialized text, such as code, that may appear in a document.
Periods are used in a number of situations:
• To terminate a sentence
• To end an abbreviation
• To end an abbreviation and terminate a sentence
• For ellipses
• For ellipses at the end of a sentence
• Embedded in quotes or brackets
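One way to see how these cases behave in practice is to run sample text through the JDK's own sentence iterator, java.text.BreakIterator. The following is a minimal sketch; the class name BreakIteratorDemo and the sample text are illustrative assumptions, and how the iterator treats each of the cases listed above depends on its built-in rules, so the output is worth inspecting rather than assuming.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative helper (hypothetical name): lists the pieces that the JDK's
// rule-based sentence iterator produces for a given text.
final class BreakIteratorDemo {

    static List<String> sentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = iterator.first();
        // Walk from boundary to boundary until the iterator reports DONE.
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Sample text chosen to include a number, an ellipsis, and a question mark.
        String text = "The price rose 3.5 percent... Did anyone notice? Nobody did.";
        for (String piece : sentences(text)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

The pieces returned always cover the input from start to end, so any whitespace between sentences stays attached to one of them.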
Most sentences we encounter end with a period, which makes them easy to identify.
However, when a sentence ends with an abbreviation, it is a bit more difficult to identify
where it ends. The following sentence contains abbreviations with periods:
"Mr. and Mrs. Smith went to the ball."
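To see why this sentence is awkward, consider a naive rule that ends a sentence after every period, question mark, or exclamation mark that is followed by whitespace. The sketch below is an illustration only; the class name NaiveSentenceSplitter is an assumption, and the rule is deliberately too simple.

```java
// Illustrative naive splitter (hypothetical name): treats every '.', '!' or '?'
// followed by whitespace as a sentence boundary.
final class NaiveSentenceSplitter {

    static String[] split(String text) {
        // The lookbehind keeps the punctuation attached to the preceding piece.
        return text.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String text = "Mr. and Mrs. Smith went to the ball.";
        for (String piece : split(text)) {
            System.out.println("[" + piece + "]");
        }
        // Prints:
        // [Mr.]
        // [and Mrs.]
        // [Smith went to the ball.]
    }
}
```

The single sentence is cut into three pieces because the periods after "Mr" and "Mrs" are indistinguishable, to this rule, from a sentence-ending period.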
In the next two sentences, we have an abbreviation that occurs at the end of the sentence:
"He was an agent of the CIA."
"He was an agent of the C.I.A."
In the last sentence, each letter of the abbreviation is followed by a period. Although not
common, this may occur and we cannot simply ignore it.
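One possible way to handle both kinds of abbreviation is sketched below. It is a heuristic under stated assumptions, not a definitive implementation: the class name AbbreviationAwareSplitter, the tiny TITLES list, and the rule that a dotted initialism ends a sentence only at the end of the text or before a capitalized word are all assumptions made for illustration. It requires Java 9 or later for Set.of.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative heuristic splitter (hypothetical name and rules).
final class AbbreviationAwareSplitter {

    // Assumed, deliberately tiny abbreviation list; real text needs a far larger one.
    private static final Set<String> TITLES = Set.of("mr.", "mrs.", "ms.", "dr.");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(tokens[i]);
            String next = (i + 1 < tokens.length) ? tokens[i + 1] : null;
            if (endsSentence(tokens[i], next)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without end punctuation
        }
        return sentences;
    }

    static boolean endsSentence(String token, String next) {
        if (token.endsWith("!") || token.endsWith("?")) {
            return true;
        }
        if (!token.endsWith(".")) {
            return false;
        }
        if (TITLES.contains(token.toLowerCase())) {
            return false; // a known title such as "Mr." never ends a sentence here
        }
        if (token.matches("(\\p{Alpha}\\.)+")) {
            // Dotted initialism such as "C.I.A.": assume a boundary only at the end
            // of the text or when the next word starts with an uppercase letter.
            return next == null || Character.isUpperCase(next.charAt(0));
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(split("Mr. and Mrs. Smith went to the ball."));
        System.out.println(split("He was an agent of the C.I.A. She left early."));
    }
}
```

The capitalization test is easily fooled, for example when an initialism is followed by a proper noun in the middle of a sentence, which is one reason why simple rules alone are rarely enough.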