Developing scientific business applications using open source search and visualisation technologies - Open Source Software in Life Science Research

'Birmingham University, Alabama'. Other clues need to be used to identify which institute is actually being referenced.

A Java program to perform this disambiguation was developed around a number of institute names and locations. These locations are held in another SOLR index, which provides fast lookup to allow rapid disambiguation. The service is also made available via a web interface so that other programs can use it. The disambiguation proceeds through a number of steps, each designed to 'hone' the search further if no single match was found in the previous steps:

■ check the address to see if it contains any 'stem' words that uniquely identify key organisations (e.g. 'AstraZeneca', 'Pfizer'). If so, match against that organisation;
■ split the provided address into parts (split on a comma) and identify the parts that reference any country, state and city information from the address (by matching against data in the index);
■ at the same time, check to see if any of the address parts contain an 'institute-like term' (e.g. 'University'), in which case only that part is used in the next steps of the search (otherwise all non-location address parts are used);
■ look for an exact match for the institute name with the information held in the SOLR index. This can also be honed by state/country, if that information is available. In addition, common variant spellings of key institute-type names (e.g. 'Universitet') are automatically normalised by the SOLR searching process;
■ if still no match is found, a 'bag-of-words' search is performed to check whether all the words in the institute name appear, in any order, within an institute synonym held in the index.
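As a rough illustration of the exact-match step, honed by country or state, a lookup against a SOLR index might be expressed as a query URL like the one built below. This is only a sketch: the host, the core name (`institutes`) and the field names (`name_exact`, `country`, `state`) are hypothetical, since the actual schema is not described here. Only the standard Solr request parameters `q`, `fq` and `wt` are relied upon, and the example is written in Python for brevity although the original program was in Java.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical host and core name; the real index location is not given in the text.
SOLR_SELECT_URL = "http://localhost:8983/solr/institutes/select"


def _phrase(value):
    """Wrap a value in double quotes so Solr treats it as a single phrase."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_exact_match_url(institute_name, country=None, state=None):
    """Build a select URL for an exact institute-name lookup.

    The field names name_exact, country and state are assumptions.
    """
    params = [("q", f"name_exact:{_phrase(institute_name)}"), ("wt", "json")]
    # Filter queries narrow ("hone") the result set when location is known.
    if country:
        params.append(("fq", f"country:{_phrase(country)}"))
    if state:
        params.append(("fq", f"state:{_phrase(state)}"))
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_exact_match_url("University of Birmingham", country="United Kingdom")
    print(url)
    # Decode the query string again to show the parameters that would be sent.
    print(parse_qs(urlsplit(url).query))
```

Keeping the location constraints in `fq` rather than in `q` means the name match and the location narrowing stay independent, so the same name query can be reused with or without location information.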

If no match is found or multiple matches still exist, the application returns the address part that had an institute-like term in it as the institute (as well as any location information extracted from the full address). In future, further refinements could be incorporated to handle mergers and acquisitions.
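The cascade of steps described above can be sketched roughly as follows. The original program was written in Java and queried a SOLR index; this Python sketch replaces the index with small in-memory tables. All function names, table names and sample records are illustrative assumptions, not the authors' code or data, and the behaviour when no address part contains an institute-like term is not specified in the text (the sketch returns `None` in that case).

```python
import re

# Illustrative stand-ins for the data held in the SOLR index (all assumed).
STEM_WORDS = {"astrazeneca": "AstraZeneca", "pfizer": "Pfizer"}
LOCATIONS = {
    "uk": ("country", "United Kingdom"),
    "usa": ("country", "United States"),
    "sweden": ("country", "Sweden"),
    "alabama": ("state", "Alabama"),
    "birmingham": ("city", "Birmingham"),
}
INSTITUTE_TERMS = {"university", "institute", "college", "hospital"}
SPELLING_VARIANTS = {"universitet": "university"}
INSTITUTES = [
    {"name": "University of Birmingham", "country": "United Kingdom", "state": None,
     "synonyms": ["university of birmingham", "birmingham university"]},
    {"name": "University of Alabama at Birmingham", "country": "United States",
     "state": "Alabama",
     "synonyms": ["university of alabama at birmingham", "birmingham university"]},
    {"name": "Stockholm University", "country": "Sweden", "state": None,
     "synonyms": ["stockholm university"]},
]


def normalise(text):
    """Lower-case, tokenise and map variant spellings to a common form."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [SPELLING_VARIANTS.get(w, w) for w in words]


def hone(records, location):
    """Keep only records consistent with any known country/state."""
    for key in ("country", "state"):
        if key in location:
            records = [r for r in records if r[key] == location[key]]
    return records


def disambiguate(address):
    # Step 1: a stem word uniquely identifies a key organisation.
    lowered = address.lower()
    for stem, organisation in STEM_WORDS.items():
        if stem in lowered:
            return {"institute": organisation, "matched": True, "location": {}}

    # Step 2: split on commas and separate location parts from the rest.
    location, other = {}, []
    for part in (p.strip() for p in address.split(",")):
        if not part:
            continue
        hit = LOCATIONS.get(part.lower())
        if hit:
            location[hit[0]] = hit[1]
        else:
            other.append(part)

    # Step 3: prefer the part containing an institute-like term.
    institute_parts = [p for p in other if INSTITUTE_TERMS & set(normalise(p))]
    candidate = institute_parts[0] if institute_parts else " ".join(other)
    words = normalise(candidate)

    matches = []
    if words:
        # Step 4: exact match on the normalised name, honed by location.
        exact = [r for r in INSTITUTES
                 if any(normalise(s) == words for s in r["synonyms"])]
        matches = hone(exact, location)
        # Step 5: bag-of-words match, tried only if nothing matched so far.
        if not matches:
            bag = [r for r in INSTITUTES
                   if any(set(words) <= set(normalise(s)) for s in r["synonyms"])]
            matches = hone(bag, location)

    if len(matches) == 1:
        return {"institute": matches[0]["name"], "matched": True, "location": location}
    # Fallback: none or several matches, so return the institute-like part as given.
    fallback = institute_parts[0] if institute_parts else None
    return {"institute": fallback, "matched": False, "location": location}


if __name__ == "__main__":
    for example in ("Birmingham University, Alabama", "Birmingham University, UK"):
        print(example, "->", disambiguate(example))
```

In this sketch the two 'Birmingham University' records share a synonym, so the name alone is ambiguous and only the state or country extracted from the address selects a single institute. Without any location information both records match and the fallback returns the raw address part.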

Text tagging

The main body of text from the documents is automatically scanned for key scientific entities of interest to AstraZeneca (e.g. genes, diseases and biomedical observations). This uses a text markup system called Peregrine, which was developed by the BioSemantics group at Erasmus University
