Biomedical Engineering Reference
In-Depth Information
'Birmingham University, Alabama'. Other clues need to be used to
identify which institute is actually being referenced.
A Java program to perform this disambiguation was developed around
a number of institute names and locations. These locations are held in
another SOLR index, which provides fast lookup to allow rapid
disambiguation. This service is also made available via a web interface to
allow other programs to use it. The program works through a number of
steps, each designed to 'hone' the search further if no single match is
found in the previous steps:
- check the address to see if it contains any 'stem' words that uniquely
  identify key organisations (e.g. 'AstraZeneca', 'Pfizer'). If so, match
  against that organisation;
- split the provided address into parts (split on a comma) and identify
  the parts that reference any country, state and city information from
  the address (by matching against data in the index);
- at the same time check to see if any of the address parts contain an
  'institute-like term' (e.g. 'University'), in which case only that part is
  used in the next steps of the search (otherwise all non-location address
  parts are used);
- look for an exact match for the institute name with the information
  held in the SOLR index. This can also be honed by state/country, if
  that information is available. In addition, common spellings of key
  institute-type names (e.g. 'Universitet') are automatically normalised
  by the SOLR searching process;
- if still no match is found, a 'bag-of-words' search is performed to
  check whether all the words in the institute name appear, in any
  order, within an institute synonym.
If no match is found or multiple matches still exist, the application returns
the address part that had an institute-like term in it as the institute (as
well as any location information extracted from the full address). In
future, further refinements could be incorporated to handle mergers and
acquisitions.
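
The cascade of steps described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the original Java program. Small in-memory tables stand in for the SOLR index, and every name in it (STEM_WORDS, LOCATIONS, INSTITUTES, normalise, hone, disambiguate) and every table entry is an assumption made for illustration only.

```python
import re

# Hypothetical in-memory stand-ins for the SOLR index; all entries are illustrative.
STEM_WORDS = {"astrazeneca": "AstraZeneca", "pfizer": "Pfizer"}
LOCATIONS = {"uk": "country", "usa": "country", "alabama": "state", "birmingham": "city"}
INSTITUTE_TERMS = {"university", "institute", "college"}
SPELLINGS = {"universitet": "university"}  # variant spelling -> normalised form
# (canonical name, synonyms, state or country used for honing)
INSTITUTES = [
    ("University of Birmingham",
     ["university of birmingham", "birmingham university"], "uk"),
    ("University of Alabama at Birmingham",
     ["university of alabama at birmingham", "birmingham university"], "alabama"),
]


def normalise(text):
    """Lower-case, tokenise and normalise variant spellings."""
    return [SPELLINGS.get(w, w) for w in re.findall(r"[a-z0-9]+", text.lower())]


def hone(matches, regions):
    """Narrow matches by state/country; keep them all if nothing narrows."""
    narrowed = [m for m in matches if m[2] in regions]
    return narrowed or matches


def disambiguate(address):
    # Step 1: a stem word that uniquely identifies a key organisation.
    lowered = address.lower()
    for stem, organisation in STEM_WORDS.items():
        if stem in lowered:
            return {"institute": organisation, "location": {}, "matched": True}

    # Step 2: split on commas and separate location parts from the rest.
    location, rest = {}, []
    for part in (p.strip() for p in address.split(",")):
        key = " ".join(normalise(part))
        if key in LOCATIONS:
            location[LOCATIONS[key]] = key
        elif key:
            rest.append(part)

    # Step 3: prefer the part containing an institute-like term.
    inst_parts = [p for p in rest if INSTITUTE_TERMS & set(normalise(p))]
    query = normalise(inst_parts[0] if inst_parts else " ".join(rest))
    regions = {location[k] for k in ("state", "country") if k in location}

    # Step 4: exact match against the known synonyms, honed by region.
    matches = hone([i for i in INSTITUTES if " ".join(query) in i[1]], regions)

    # Step 5: bag-of-words match, ignoring word order.
    if not matches:
        matches = hone(
            [i for i in INSTITUTES
             if query and any(set(query) <= set(s.split()) for s in i[1])],
            regions,
        )

    if len(matches) == 1:
        return {"institute": matches[0][0], "location": location, "matched": True}
    # Fallback: no match or still ambiguous, so return the institute-like part.
    fallback = inst_parts[0] if inst_parts else None
    return {"institute": fallback, "location": location, "matched": False}
```

With these illustrative tables, 'Birmingham University, Alabama' resolves to a single institute because the state narrows two exact matches down to one, whereas 'Birmingham University' on its own stays ambiguous and falls through to the fallback, which returns the raw address part.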
Text tagging
The main body of text from the documents is automatically scanned for
key scientific entities of interest to AstraZeneca (e.g. genes, diseases and
biomedical observations). This uses a text markup system called Peregrine,
which was developed by the BioSemantics group at Erasmus University
 