Finding Sentences - Natural Language Processing with Java

for (Chunk sentence : sentences) {
    System.out.println("["
        + slice.substring(sentence.start(), sentence.end()) + "]");
}

The output follows. The model still has trouble with sentences that end with an ellipsis, so a period was added to the end of the last sentence before the text was processed.

[When determining the end of sentences we need to consider
several factors.]
[Sentences may end with exclamation marks!]
[Or possibly questions marks?]
[Within sentences we may find numbers like 3.14159,
abbreviations such as found in Mr. Smith, and possibly
ellipses either within a sentence …, or at the end of a
sentence….]
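The period after the final ellipsis in the last sentence above comes from that preprocessing step. A minimal sketch of such a step is shown here, using only the JDK. The class and method names are hypothetical and are not part of LingPipe, and the rule it applies (append a period only when the text ends with an ellipsis) is an assumption about how the workaround might be automated.

```java
final class EllipsisWorkaround {
    private EllipsisWorkaround() {}

    // Hypothetical helper (not a LingPipe API): appends a period when the
    // text ends with an ellipsis, so a sentence model sees a final stop.
    // Text that does not end with an ellipsis is returned unchanged.
    static String closeTrailingEllipsis(String text) {
        // Ignore trailing whitespace when looking for the ellipsis.
        String trimmed = text.replaceAll("\\s+$", "");
        boolean unicodeEllipsis = trimmed.endsWith("\u2026");
        // Three ASCII dots count as an ellipsis; four dots already include a period.
        boolean asciiEllipsis = trimmed.endsWith("...") && !trimmed.endsWith("....");
        return (unicodeEllipsis || asciiEllipsis) ? trimmed + "." : text;
    }

    public static void main(String[] args) {
        System.out.println(closeTrailingEllipsis("or at the end of a sentence\u2026"));
        System.out.println(closeTrailingEllipsis("Or possibly questions marks?"));
    }
}
```

The helper would be applied to the input text before it is handed to the sentence chunker, so the rest of the processing stays unchanged.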

Although the IndoEuropeanSentenceModel class works reasonably well for English text, it may not always work well for specialized text. In the next section, we will examine the use of the MedlineSentenceModel class, which has been trained to work with medical text.

Using the MedlineSentenceModel class

This LingPipe sentence model is based on MEDLINE, which is a large collection of biomedical literature. The collection is stored in XML format and is maintained by the United States National Library of Medicine (http://www.nlm.nih.gov/).

LingPipe uses its MedlineSentenceModel class to perform SBD. This model has been trained against the MEDLINE data. It takes plain text and splits it into tokens and whitespace. The MEDLINE model is then used to find the sentences in the text.
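To make the "tokens and whitespace" representation concrete, here is a small JDK-only sketch. It is not LingPipe's tokenizer: the class and method names are hypothetical, and it splits on whitespace only, whereas LingPipe's tokenizer factories typically split more finely (for example, separating punctuation). What it illustrates is the shape of the data: two parallel lists in which each whitespace entry precedes the token at the same index, plus one trailing whitespace entry, so the original text can be rebuilt exactly.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenWhitespaceSketch {
    private TokenWhitespaceSketch() {}

    // Fills 'tokens' with the non-whitespace runs and 'whites' with the
    // whitespace runs around them. 'whites' always ends up one element
    // longer than 'tokens'; entries may be empty strings.
    static void tokenize(String text, List<String> tokens, List<String> whites) {
        int i = 0;
        int n = text.length();
        while (true) {
            int whiteStart = i;
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            whites.add(text.substring(whiteStart, i));
            if (i >= n) {
                break;
            }
            int tokenStart = i;
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokens.add(text.substring(tokenStart, i));
        }
    }

    // Interleaves the two lists again to recover the original text.
    static String rebuild(List<String> tokens, List<String> whites) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(whites.get(i)).append(tokens.get(i));
        }
        sb.append(whites.get(tokens.size()));
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<String> whites = new ArrayList<>();
        tokenize("HepG2 cells were obtained", tokens, whites);
        System.out.println(tokens);
        System.out.println(whites.size() - tokens.size());
    }
}
```

A sentence model working on this representation can then report sentence ends as token positions, because the whitespace list preserves everything needed to map those positions back to the original text.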

In the next example, we will use a paragraph from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3139422/ to demonstrate the use of the model, as declared here:

paragraph = "HepG2 cells were obtained from the American Type Culture "
    + "Collection (Rockville, MD, USA) and were used only until "
