for (Chunk sentence : sentences) {
    System.out.println("["
        + slice.substring(sentence.start(), sentence.end()) + "]");
}
The output follows. The model still has problems with sentences that end with an
ellipsis, so a period was added to the end of the last sentence before the text was
processed.
[When determining the end of sentences we need to consider
several factors.]
[Sentences may end with exclamation marks!]
[Or possibly questions marks?]
[Within sentences we may find numbers like 3.14159,
abbreviations such as found in Mr. Smith, and possibly
ellipses either within a sentence …, or at the end of a
sentence….]
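The workaround described above, adding a period when the text ends with an ellipsis, can be sketched as a small preprocessing step. The class and method names below are hypothetical and are not part of LingPipe; the sketch assumes the ellipsis is either the single character … or three ASCII periods.

```java
public class EllipsisWorkaround {

    // Hypothetical helper (not a LingPipe API): appends a period when the
    // text ends with an ellipsis, so the sentence model sees a terminator.
    static String terminateTrailingEllipsis(String text) {
        // Ignore trailing whitespace when looking for the ellipsis.
        String trimmed = text.replaceAll("\\s+$", "");
        boolean unicodeEllipsis = trimmed.endsWith("\u2026");
        // Three ASCII periods, but not four (which is already terminated).
        boolean asciiEllipsis = trimmed.endsWith("...") && !trimmed.endsWith("....");
        if (unicodeEllipsis || asciiEllipsis) {
            return trimmed + ".";
        }
        return text; // nothing to fix; return the input unchanged
    }

    public static void main(String[] args) {
        String last = "or at the end of a sentence\u2026";
        System.out.println(terminateTrailingEllipsis(last));
    }
}
```

The returned string would then be passed to the chunker in place of the original text.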
Although the IndoEuropeanSentenceModel class works reasonably well for English
text, it may not always work well for specialized text. In the next section, we will
examine the use of the MedlineSentenceModel class, which has been trained to work
with medical text.
Using the MedlineSentenceModel class
The LingPipe sentence model uses MEDLINE, which is a large collection of biomedical
literature. This collection is stored in XML format and is maintained by the United States
National Library of Medicine (http://www.nlm.nih.gov/).
LingPipe uses its MedlineSentenceModel class to perform SBD. This model has
been trained against the MEDLINE data. It takes plain text as input, which is first split
into tokens and whitespace. The MEDLINE model is then used to find the sentences in the text.
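To illustrate what splitting text into tokens and whitespace means, here is a minimal sketch in plain Java. It is not LingPipe's tokenizer: the class name, the method, and the splitting rules (runs of letters or digits form a token, and every other non-whitespace character is its own token) are simplifying assumptions. The sketch keeps two parallel lists, with one whitespace entry before each token and one after the last token, so the original text can be rebuilt from them.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenWhitespaceSketch {

    // Simplified stand-in for a tokenizer (an assumption, not LingPipe code).
    // After the call, whitespaces.get(i) is the whitespace preceding
    // tokens.get(i), and the last whitespace entry follows the final token,
    // so whitespaces.size() == tokens.size() + 1.
    static void tokenize(String text, List<String> tokens, List<String> whitespaces) {
        int i = 0;
        int n = text.length();
        while (true) {
            // Collect the (possibly empty) run of whitespace.
            int wsStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            whitespaces.add(text.substring(wsStart, i));
            if (i >= n) {
                break;
            }
            // Collect one token: an alphanumeric run or a single other character.
            int tokStart = i;
            if (Character.isLetterOrDigit(text.charAt(i))) {
                while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                    i++;
                }
            } else {
                i++;
            }
            tokens.add(text.substring(tokStart, i));
        }
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<String> whitespaces = new ArrayList<>();
        tokenize("Hi there.", tokens, whitespaces);
        System.out.println(tokens);             // [Hi, there, .]
        System.out.println(whitespaces.size()); // 4
    }
}
```

A sentence model can then look at each token together with the whitespace around it to decide whether a period ends a sentence or belongs to an abbreviation or a number.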
In the next example, we will use a paragraph from
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3139422/ to demonstrate the use of the
model, as declared here:
paragraph = "HepG2 cells were obtained from the American Type Culture "
    + "Collection (Rockville, MD, USA) and were used only until "