Developing scientific business applications using open source search and visualisation technologies - Open Source Software in Life Science Research

Biomedical Engineering Reference


Figure 14.1: Schematic overview of the system, from accessing document sources (1), standardising them into a single format (2), converting these and performing any disambiguation clean-up steps (3), indexing (4), re-tagging as appropriate (5) and finally development into business-driven systems (6).

The first version of KOL Miner utilised Lucene indexing technology, but this proved to be too restrictive in that it did not provide an enterprise-class index and rapid search. In addition, SOLR proved to be more scalable, especially as its faceting feature enabled the structured querying of a large number of documents without having to load individual document details.
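As an illustration of how faceting avoids loading individual documents, the following minimal sketch builds a SOLR facet request that returns only per-value counts (`rows=0`) and then reads those counts from the JSON response. The core name `kol`, the field name `author` and the helper function names are hypothetical and are not taken from KOL Miner; only the standard SOLR parameters `q`, `rows`, `facet`, `facet.field`, `facet.limit` and `wt` are assumed.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_field, limit=10):
    """Build a SOLR select URL that asks for facet counts only.

    rows=0 tells SOLR not to return any documents, so only the
    aggregated counts per facet value come back.
    """
    params = {
        "q": query,
        "rows": 0,                   # no individual documents
        "facet": "true",             # switch faceting on
        "facet.field": facet_field,  # hypothetical field name
        "facet.limit": limit,        # top-N facet values
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def parse_facet_counts(response, facet_field):
    """Turn SOLR's flat [value, count, value, count, ...] list into a dict.

    This assumes the default flat JSON layout of facet_fields.
    """
    flat = response["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical core and field, for illustration only.
url = build_facet_url("http://localhost:8983/solr/kol", "oncology", "author")
```

The response to such a request carries one count per facet value, which is enough to populate a structured filter or summary view without fetching the matching documents.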

A document pipeline process was developed to process the documents as they became available. This is summarised in Figure 14.1.

Each process was quite specific and was kept componentised for efficiency, so that steps could be replaced easily as better solutions were identified.
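One way to picture such a componentised pipeline is as an ordered list of independent steps, each taking a document and returning a modified copy, so that any single step can be swapped without touching the others. The sketch below is illustrative only: the step names, the dictionary-based document and the `replace_step` helper are assumptions, not a description of the actual KOL Miner implementation.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Step = Callable[[Document], Document]


def standardise(doc: Document) -> Document:
    """Hypothetical step: mark the document as being in the single common format."""
    return {**doc, "format": "xml"}


def clean_up(doc: Document) -> Document:
    """Hypothetical step: collapse stray whitespace in the author name."""
    return {**doc, "author": " ".join(doc.get("author", "").split())}


def run_pipeline(doc: Document, steps: List[Step]) -> Document:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        doc = step(doc)
    return doc


def replace_step(steps: List[Step], old: Step, new: Step) -> List[Step]:
    """Return a new pipeline with one step swapped for a better one."""
    return [new if step is old else step for step in steps]


steps = [standardise, clean_up]
result = run_pipeline({"author": "  J.   Smith "}, steps)
```

Because each step only depends on the document passed to it, a better disambiguation routine could be substituted with `replace_step` while the remaining steps stay unchanged.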

14.4.1 Raw documents

Data feeds are collected from a number of different sources on a regular basis (either daily or weekly). By default, all these feeds come in the form of XML files. However, they are collected using a number of different technologies, including:
