Another issue that makes SBD difficult is trying to determine whether or not a word is an
abbreviation. We cannot simply treat all uppercase sequences as abbreviations. Perhaps
the user typed in a word in all caps by accident or the text was preprocessed to convert all
characters to lowercase. Also, some abbreviations consist of a sequence of uppercase and
lowercase letters. To handle abbreviations, a list of valid abbreviations is sometimes used.
However, the abbreviations are often domain-specific.
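The list-based approach can be sketched in a few lines of Java. This is only a minimal illustration: the class name AbbreviationLookup and the handful of entries in the set are hypothetical, and a real list would be much larger and chosen for the domain. Tokens are lowercased before the lookup so that case differences do not matter. Set.of assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Set;

final class AbbreviationLookup {
    // Hypothetical, deliberately tiny list; a real one would be domain-specific.
    private static final Set<String> ABBREVIATIONS =
            Set.of("dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e.", "ph.d.");

    // Returns true if the token, ignoring case, appears in the list.
    static boolean isAbbreviation(String token) {
        return ABBREVIATIONS.contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isAbbreviation("Dr."));
        System.out.println(isAbbreviation("Ph.D."));
        System.out.println(isAbbreviation("right."));
    }
}
```

A period that ends a listed token would then not be treated as a sentence boundary by itself, although an abbreviation can still be the last word of a sentence.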
Ellipses can further complicate the problem. They may be found as a single character
(Extended ASCII 0x85, as in Windows-1252, or Unicode U+2026, the horizontal ellipsis) or
as a sequence of three periods. In addition, Unicode defines the vertical ellipsis (U+22EE)
and the presentation form for vertical horizontal ellipsis (U+FE19). Besides these, there
are HTML encodings such as &hellip;. In Java source code, these characters are written as
Unicode escapes, for example \u2026 for the horizontal ellipsis. These variations on
encoding illustrate the need for good preprocessing of text before it is analyzed.
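One possible preprocessing step is to map the common ellipsis encodings to a single character before looking for sentence boundaries. The following is a sketch under stated assumptions, not a complete normalizer: the class name EllipsisNormalizer is hypothetical, only three periods and two HTML encodings are handled, and the choice of U+2026 as the target form is arbitrary.

```java
final class EllipsisNormalizer {
    private static final String ELLIPSIS = "\u2026"; // U+2026 HORIZONTAL ELLIPSIS

    // Rewrites three periods and two HTML encodings as the single character U+2026.
    static String normalize(String text) {
        return text
                .replace("&hellip;", ELLIPSIS)  // named HTML entity
                .replace("&#8230;", ELLIPSIS)   // decimal HTML character reference
                .replace("...", ELLIPSIS);      // three consecutive periods
    }

    public static void main(String[] args) {
        System.out.println(normalize("And the list goes on and on and ..."));
    }
}
```

After this step, later stages only need to recognize one form of the ellipsis instead of several.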
The next two sentences illustrate possible uses of ellipses:
"And then there was … one."
"And the list goes on and on and …"
The second sentence was terminated by an ellipsis. In some situations, as suggested by the
MLA handbook (http://www.mlahandbook.org/fragment/public_index), we can use brackets
to distinguish ellipses that have been added from ellipses that were part of the original
text, as shown here:
"The people […] used various forms of transportation […]" (Young 73).
We will also find sentences embedded in another sentence, such as:
The man said, "That's not right."
Exclamation marks and question marks present other problems, even though the occurrence
of these characters is more limited than that of the period. There are places other
than at the end of a sentence where exclamation marks can occur. In the case of some
words, such as Yahoo!, the exclamation mark is a part of the word. In addition, multiple
exclamation marks are used for emphasis, such as "Best wishes!!" This can lead to the
identification of multiple sentences where they do not actually exist.
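The repeated exclamation mark problem can be reduced by treating a run of terminators as a single boundary. The sketch below is a simplified illustration and not a full SBD implementation: the class name TerminatorRuns and the regular expression are assumptions made for this example. A boundary is taken to be one or more of '.', '!', or '?' followed by whitespace or the end of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TerminatorRuns {
    // One or more terminators, followed by whitespace or the end of the input.
    private static final Pattern BOUNDARY = Pattern.compile("[.!?]+(?=\\s|$)");

    // Splits the text after each run of terminators, keeping the run with its sentence.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = BOUNDARY.matcher(text);
        int start = 0;
        while (matcher.find()) {
            String sentence = text.substring(start, matcher.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = matcher.end();
        }
        // Keep any trailing text that has no terminator.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Best wishes!! See you soon."));
        System.out.println(split("I use Yahoo! every day."));
    }
}
```

With this pattern, "Best wishes!!" stays in one piece. The second call still splits after Yahoo!, because nothing in the pattern knows that the exclamation mark belongs to the word; handling that case would need something like a list of such words.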