Finding People and Things - Natural Language Processing with Java

Techniques for name recognition

Several NER techniques are available. Some use regular expressions, and others are based
on a predefined dictionary. Regular expressions are highly expressive and can isolate
entities that follow a predictable pattern. A dictionary of entity names can be compared
against the tokens of a text to find matches.
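
As a rough illustration of these two techniques, the following sketch finds phone numbers with a regular expression and names with a dictionary lookup. The class name SimpleNer, the phone number pattern, the sample sentence, and the dictionary contents are all assumptions made for this example; the tokenization is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleNer {
    // Illustrative pattern for phone numbers written as 800-555-1234.
    private static final Pattern PHONE =
            Pattern.compile("\\b\\d{3}-\\d{3}-\\d{4}\\b");

    /** Returns every substring of the text that matches the phone pattern. */
    static List<String> findPhoneNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = PHONE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    /** Returns the tokens of the text that appear in the dictionary. */
    static List<String> findDictionaryMatches(String text, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        // Naive tokenization: split on any run of non-letter characters.
        for (String token : text.split("[^\\p{L}]+")) {
            if (dictionary.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Call Joe at 800-555-1234 before he leaves for Boston.";
        System.out.println(findPhoneNumbers(text));
        System.out.println(findDictionaryMatches(text, Set.of("Joe", "Boston")));
    }
}
```

The dictionary lookup here only matches single tokens, so a multi-word name would need a phrase-level comparison instead.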

Another common NER approach uses trained models to detect the presence of entities. These
models depend on the type of entity we are looking for and on the target language. A model
that works well for one domain, such as web pages, may not work well for a different
domain, such as medical journals.

A model is trained on an annotated block of text that identifies the entities of interest.
Several measures are used to assess how well a model has been trained:

• Precision: The percentage of entities found that exactly match the spans in the
evaluation data
• Recall: The percentage of entities defined in the corpus that were found in the same
location
• Performance measure: The harmonic mean of precision and recall, given by
F1 = 2 * Precision * Recall / (Recall + Precision)

We will use these measures when we cover the evaluation of models.
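
The three measures can be sketched in a few lines of Java. This is a minimal sketch rather than the evaluation code of any particular library: the class name NerMetrics is hypothetical, and each entity is assumed to be represented by some value (here a string such as "0:3:PERSON") that compares equal only when the span and type match exactly.

```java
import java.util.HashSet;
import java.util.Set;

class NerMetrics {
    /** Fraction of predicted entities that also appear in the gold annotations. */
    static <T> double precision(Set<T> predicted, Set<T> gold) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) countMatches(predicted, gold) / predicted.size();
    }

    /** Fraction of gold entities that the model found. */
    static <T> double recall(Set<T> predicted, Set<T> gold) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        return (double) countMatches(predicted, gold) / gold.size();
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2 * precision * recall / (recall + precision);
    }

    // Counts the entities present in both sets (exact matches only).
    private static <T> int countMatches(Set<T> predicted, Set<T> gold) {
        Set<T> common = new HashSet<>(predicted);
        common.retainAll(gold);
        return common.size();
    }

    public static void main(String[] args) {
        // Hypothetical spans encoded as "start:end:type".
        Set<String> gold = Set.of("0:3:PERSON", "20:26:LOCATION");
        Set<String> predicted =
                Set.of("0:3:PERSON", "20:26:LOCATION", "30:34:PERSON", "40:45:LOCATION");
        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        System.out.println("Precision: " + p);
        System.out.println("Recall: " + r);
        System.out.println("F1: " + f1(p, r));
    }
}
```

In this made-up example the model finds both gold entities but also reports two spurious ones, so recall is perfect while precision, and therefore F1, is pulled down.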

NER is also known as entity identification and entity chunking. Chunking is the analysis
of text to identify its parts, such as nouns, verbs, or other components. As humans, we tend
to chunk a sentence into distinct parts. These parts form a structure that we use to
determine its meaning. The NER process will create spans of text such as "Queen of England".
However, there may be other entities within these spans, such as "England".
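
One way to picture nested entities is as character ranges where one range lies inside another. The sketch below assumes a hypothetical Span helper with half-open offsets; it is not the span type of any specific NLP library, and the sample sentence is invented for illustration.

```java
class SpanNesting {
    /** Half-open character range [start, end) within a text (hypothetical helper). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        /** True when the other span lies entirely within this one. */
        boolean contains(Span other) {
            return start <= other.start && other.end <= end;
        }

        /** Returns the part of the text that this span covers. */
        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /** Locates the first occurrence of a phrase, or returns null if it is absent. */
    static Span find(String text, String phrase) {
        int start = text.indexOf(phrase);
        return start < 0 ? null : new Span(start, start + phrase.length());
    }

    public static void main(String[] args) {
        String text = "The Queen of England visited.";
        Span title = find(text, "Queen of England");
        Span country = find(text, "England");
        System.out.println(title.coveredText(text));
        System.out.println(country.coveredText(text));
        System.out.println("Nested: " + title.contains(country));
    }
}
```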
