Finding word dependencies using the GrammaticalStructure class
Another approach to parsing text is to use the LexicalizedParser object created in the previous section in conjunction with the TreebankLanguagePack interface. A treebank is a text corpus that has been annotated with syntactic or semantic information, providing information about a sentence's structure. The first major treebank was the Penn Treebank (http://www.cis.upenn.edu/~treebank/). Treebanks can be created manually or semiautomatically.
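To make the idea of an annotated corpus more concrete, here is a minimal sketch, independent of the Stanford API, that reads one Penn Treebank-style bracketed annotation using only the JDK. The class BracketedTreeSketch, its Node type, and the hand-written annotation of the sample sentence are hypothetical and serve only as an illustration; a real treebank reader has to handle many more cases.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reads one bracketed parse such as "(S (NP (DT The) (NN cow)))". */
class BracketedTreeSketch {

    /** A node holds a label (phrase category, POS tag, or word) and its children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
        boolean isLeaf() { return children.isEmpty(); }
    }

    private final String text;
    private int pos = 0;

    private BracketedTreeSketch(String text) { this.text = text; }

    /** Parses a single bracketed tree and rejects trailing text. */
    static Node parse(String bracketed) {
        BracketedTreeSketch reader = new BracketedTreeSketch(bracketed.trim());
        Node root = reader.readNode();
        reader.skipSpaces();
        if (reader.pos != reader.text.length()) {
            throw new IllegalArgumentException("Trailing text at index " + reader.pos);
        }
        return root;
    }

    private Node readNode() {
        skipSpaces();
        if (pos < text.length() && text.charAt(pos) == '(') {
            pos++;                                  // consume '('
            Node node = new Node(readToken());      // label follows the bracket
            skipSpaces();
            while (pos < text.length() && text.charAt(pos) != ')') {
                node.children.add(readNode());      // nested subtree or word
                skipSpaces();
            }
            if (pos >= text.length()) {
                throw new IllegalArgumentException("Missing ')'");
            }
            pos++;                                  // consume ')'
            return node;
        }
        return new Node(readToken());               // a bare token is a word (leaf)
    }

    private String readToken() {
        skipSpaces();
        int start = pos;
        while (pos < text.length()
                && !Character.isWhitespace(text.charAt(pos))
                && text.charAt(pos) != '(' && text.charAt(pos) != ')') {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a label at index " + start);
        }
        return text.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    /** Collects the words (leaf labels) from left to right. */
    static List<String> words(Node node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.isLeaf()) {
            out.add(node.label);
        } else {
            for (Node child : node.children) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        // Hand-written annotation for illustration; not actual parser output.
        String annotated = "(S (NP (DT The) (NN cow)) "
                + "(VP (VBD jumped) (PP (IN over) (NP (DT the) (NN moon)))) (. .))";
        Node root = parse(annotated);
        System.out.println(root.label + " -> " + words(root));
    }
}
```

The bracketed form pairs each word with a part-of-speech tag and groups the tagged words into phrases, which is the kind of structural information a treebank adds to plain text.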
The next example illustrates how a simple string can be tokenized and then parsed. A tokenizer factory creates a tokenizer, which splits the sentence into a list of tokens. The CoreLabel class that we discussed in the Using the LexicalizedParser class section is used here:
    String sentence = "The cow jumped over the moon.";
    // The empty string passes no tokenizer options
    TokenizerFactory<CoreLabel> tokenizerFactory =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    Tokenizer<CoreLabel> tokenizer =
        tokenizerFactory.getTokenizer(new StringReader(sentence));
    List<CoreLabel> wordList = tokenizer.tokenize();
    // parseTree is assumed to be a Tree variable declared earlier
    parseTree = lexicalizedParser.apply(wordList);
The TreebankLanguagePack interface specifies methods for working with a treebank. In the following code, a series of objects is created, culminating in the creation of a TypedDependency instance, which is used to obtain dependency information about elements of a sentence. A GrammaticalStructureFactory instance is created and used to create a GrammaticalStructure instance. As this class's name implies, it stores the grammatical relationships between elements in the tree:
    TreebankLanguagePack tlp =
        lexicalizedParser.treebankLanguagePack();
    GrammaticalStructureFactory gsf =
        tlp.grammaticalStructureFactory();
GrammaticalStructure gs =