Biomedical Engineering Reference
Figure 14.1 Schematic overview of the system from accessing
document sources (1), standardising them into a
single format (2), converting these and performing any
disambiguation clean-up steps (3), indexing (4),
re-tagging as appropriate (5) and finally development
into business-driven systems (6)
The first version of KOL Miner utilised Lucene indexing technology,
but this proved to be too restrictive in that it did not provide an
enterprise-class index and rapid search. In addition, SOLR proved to be
more scalable, especially as its faceting feature enabled the structured
querying of a large number of documents without having to load individual
document details.
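To illustrate why faceting avoids loading individual documents, a minimal sketch follows. The core URL and the field name `author` are hypothetical and do not come from KOL Miner. The query parameters (`rows=0`, `facet=true`, `facet.field`, `facet.limit`) are standard SOLR parameters, and the sample response is hand-written in the flat value/count shape that SOLR returns by default for JSON output.

```python
from urllib.parse import urlencode

# Hypothetical core URL, used only for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/documents/select"


def build_facet_query(field, query="*:*", limit=10):
    """Build a select URL that asks for facet counts but no documents."""
    params = {
        "q": query,
        "rows": 0,              # return counts only, no document details
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


def parse_facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written response in the default facet shape (not real data).
sample_response = {
    "response": {"numFound": 5, "docs": []},
    "facet_counts": {"facet_fields": {"author": ["Smith J", 3, "Jones A", 2]}},
}

print(build_facet_query("author"))
print(parse_facet_counts(sample_response, "author"))
```

Because `rows` is zero, the response carries only the per-value counts, which is what allows a large document set to be summarised without fetching each record.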
A document pipeline process was developed to process the documents
as they became available. This is summarised in Figure 14.1.
Each process was quite specific and kept componentised for efficiency,
as steps could be replaced easily as better solutions were identified.
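The componentised design can be sketched as an ordered list of interchangeable functions. This is only an illustration under assumed names: the step functions, the dictionary-based document and the sample text are all hypothetical placeholders, not the actual KOL Miner components.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Step = Callable[[Document], Document]


def standardise(doc: Document) -> Document:
    """Placeholder for step 2: put every source into one common format."""
    return {**doc, "format": "common"}


def clean_up(doc: Document) -> Document:
    """Placeholder for step 3: trim surrounding whitespace from the text."""
    return {**doc, "text": doc["text"].strip()}


def clean_up_v2(doc: Document) -> Document:
    """A replacement clean-up step that also collapses repeated whitespace."""
    return {**doc, "text": " ".join(doc["text"].split())}


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Apply each step in order; each step receives the previous output."""
    for step in steps:
        doc = step(doc)
    return doc


raw = {"text": "  Some   abstract text  "}
steps = [standardise, clean_up]
print(run_pipeline(raw, steps))

# Swapping one component leaves the rest of the pipeline untouched.
steps[1] = clean_up_v2
print(run_pipeline(raw, steps))
```

Each step shares the same signature, so a better solution can replace an existing one by changing a single entry in the list.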
14.4.1 Raw documents
Data feeds are collected from a number of different sources on a regular
basis (either daily or weekly). By default, all these feeds come in the form
of XML files. However, they are collected using a number of different
technologies, including:
 