The PsyGrid Experience: Using Web Services in the Study of Schizophrenia

abstract

The key aim of the PsyGrid project was the creation of an information system to ascertain and characterise a large, representative cohort of schizophrenics, beginning from their first episode of psychosis. The cohort was to be drawn from eight geographically dispersed regions of England, covering in total one-sixth of the entire population. In order to meet the current and future requirements we needed to build a secure distributed system, which not only could support remote data collection, but could also be integrated with other data sets, applications, and workflows for statistical analysis. We concluded that a service-oriented architecture was required and that the implementation technology should be Web services. In this article we present the design, deployment and operation of the PsyGrid data collection system as a case study in applying Web services to health informatics. The major problems we faced were related to the deployment of Web services into an existing network infrastructure, but overall we found Web services to be the most suitable middleware technology.

introduction

In the 1970s Archie Cochrane (Cochrane, 1972) and colleagues alerted the medical profession to the need to weed out subjectivity and anecdote from clinical practice. At the same time there was a move to improve the safety of medicines. Since then the evidence-based care movement has grown and is now accepted by most healthcare professionals to be best practice. However, there are serious problems with the evidence on which we base healthcare: it is expensive to produce; it takes a long time to produce; it takes a long time to influence clinical practice; it is based on clinicians’ more often than patients’ perceptions of important outcomes, which might not match; and it is crude, relating to the average participant and simple treatment definitions under ideal/trial conditions, often long ago. In other words, it gives a low-resolution picture of how your patient might respond to treatment. Most of the health informatics literature on electronic health records and “evidence into practice” is about weaving the existing evidence-base into healthcare decision-making. The role of clinical information systems in improving the evidence-base, however, has been neglected. This situation is changing: PsyGrid, which is funded under the UK Medical Research Council’s E-Science programme, is focused on providing informatics employing e-Science principles for improving clinical trials and longitudinal studies in mental health research (Ainsworth et al., 2006).

PsyGrid employs service-oriented architecture computing techniques and technologies in the implementation of a system that will remove the barriers to epidemiological research. The process of epidemiological research has three phases: the establishment and characterisation of a large, representative cohort from a geographically distributed population; the integration of the cohort data with other data sources to provide additional characterisation; and the formulation of a hypothesis and generation of the corresponding predictions. For the establishment and characterisation of a cohort, many epidemiological studies use paper-based data collection systems. A computer-based data collection system which enables geographically distributed data collection would alleviate much of the labour, tedium, and error that are inherent in paper-based data collection. Such a system is required to store personal, confidential medical data, and this data must be sent across a network from remote data entry clients to a data repository server. The immediate goals in developing the system were to ensure high quality data; to ensure privacy and confidentiality of sensitive patient data; to enable data collection to be performed at any location; and to ensure that the system was scaleable to cover larger populations and that it was highly available. In the long term the system needs to be enhanced and extended to provide a platform for the epidemiological study of schizophrenia and intervention research for predicting and preventing adverse outcomes.

We selected Web services as the underlying middleware because they offer firewall traversal, interoperability, modularity, and loose coupling, which together provide the ability to extend the capabilities of the system incrementally as it grows.

The health informatics findings from PsyGrid will be generalisable well beyond its disease focus. By treating every new patient as a participant in a longitudinal study, it will start to test a new model of care and research combined. We believe that this combination is essential to providing a timely and more flexible evidence base for future healthcare. This future could be called “high resolution healthcare”; it would encompass “personalised medicine”, self-care decision-support, efficient and opportunistic clinical trials, complex (including genomic) epidemiology, and tactical development of local services based on local environmental factors and outcomes at the population level. High-resolution care and research requires information systems to link relevant data, methods and people in a clear and timely fashion.

In the following section we examine the clinical need for PsyGrid, and then we analyse the requirements that flow from this. Afterwards, we provide a description of the NHS Information Technology infrastructure into which PsyGrid will be deployed. The rationale for using Web services is then presented, followed by an overview of the PsyGrid system architecture and the interactions between the Web services. We report our experience of implementing Web services, and provide a detailed description of the system’s functionality and implementation. Thereafter, we capture our experiences of deploying and operating the system, and finally, we cover related work, future work, and a discussion of our findings.

clinical overview of psygrid

Mental disorders are a large public health burden, whose treatment accounts for about 22% of the UK National Health Service (NHS) budget (Davies & Drummond, 1994). The major disorders are schizophrenia and bipolar disorder, each with a lifetime prevalence of about 0.5%. Both are major public health challenges, with costs of health and social care in the UK of over £2 billion annually, a similar order of magnitude to cancer or ischaemic heart disease. Schizophrenia usually starts in early adult life and leads to persistent disability in most cases. It arises out of a complex interaction of genetic and environmental risk factors.

The evidence base for the treatment of psychotic disorders is underdeveloped (NICE, 2003). Interventions can be divided into drug treatments, psychological treatments, and service-level interventions. Historically, randomised, controlled trials have been few in number and uneven in quality (Thornley & Adams, 1998). Rational service planning is constrained by the current lack of knowledge about the comparative incidence rates and course of psychotic disorders in different regions of the UK. Models purporting to predict local service usage on the basis of demographic indicators are not well developed (Glover et al., 1998). For example, the extent to which an urban environment contributes to the risk of psychosis has only recently been understood and quantified (Pederson & Mortensen, 2001).

The focus on first episode psychosis (FEP) in PsyGrid arises out of a convergence of key questions in clinical science that require large, well characterised, representative cohorts to address them, along with recent developments in the NHS and NHS R&D to facilitate such research. Delayed detection and treatment is a widespread problem and predicts poor clinical outcome (Marshall et al., 2005).

Schizophrenia typically first occurs in one to two people in every 10,000 of a population each year. However, about two-thirds of episodes of schizophrenia start before 35; this proportion is higher in men, who more commonly suffer from the disorder anyway. PsyGrid subjects are drawn from eight different geographic regions of England that together cover one-sixth of the population. This coverage could be expected to yield up to 1,000 patients per year who would meet the eligibility criteria for the project.

requirements

In this section we discuss the key requirements of the system that shaped the system architecture.

• Privacy and Confidentiality: The PsyGrid data repository stores sensitive clinical data describing a patient’s mental illness. Ensuring a patient’s privacy and the confidentiality of their data was of prime importance. A fine-grained access control system was required which would only permit users with the necessary privileges to access and operate on the data in the repository. The ethical approval granted to PsyGrid further restricted the data we could store, mandating that the data stored be anonymous. However, removing certain identifying data items completely, such as post code and date of birth, restricts the hypotheses that can be tested and so these data items needed to be transformed such that they are still useful, but are not identifying.

• Remote Data Entry: The goal of the project is to study the occurrence and outcome of schizophrenia, beginning from a patient’s first episode, over the course of the next 12 months. Subjects were to be drawn initially from eight geographically dispersed locales covering in total one-sixth of the population of England, thus providing a highly representative sample. Consequently, a client-server architecture was required, where data collection clients provide the user interface for data entry, and the server hosts a centralised data repository, which persists the assessment data. In line with the privacy and confidentiality requirements, all communications between client and server had to be secure against interception attacks.

• Paperless Data Collection: PsyGrid was required to be a paperless data collection system, where the data is entered directly into the system at the time of collection. The implication of this requirement was that the system must be as natural and easy to use during a client interview as a pen and paper form. Since data capture happens in real time, it also implies that the system must always be available for data entry. Due to the nature of the illness, if a schizophrenic is willing and able to be interviewed, then the opportunity should not be lost. The absence of a paper record from which the data could be recovered necessitates a backup system that can recover data up to the last committed transaction.

• Off-line Data Entry: The clinical researchers needed the capability to interview clients in any location, such as the hospital for the initial assessment or the client’s home for any of the follow-up assessments. Consequently, a connection to the network cannot be guaranteed to be present and so the data entry system must be capable of working off-line on a portable computer. When a connection to the network becomes available, the data entered whilst off-line will be uploaded. This requirement places a lot of complexity into the data entry client, and so a rich client application was preferable to a Web-browser-based solution.

• High availability: This requirement was implied by paperless data collection. The central data repository must be hosted on a high availability hardware platform, with data storage redundancy, to ensure continuous operation. The high availability architecture also must have the capability of performing a live upgrade to the system software and to modify a data set definition.

• Scaleable: The initial sizing of the system was for eight remote locations with 25 users using a single data set, which it was anticipated would contain approximately 1,500 subjects. However, we worked on the assumption that the initial deployed system would be required to support 32 remote locations with 100 users and four different data sets.

• Data Quality: Data quality (or the lack of it) is a major problem for all evidence based medical research. Electronic data entry systems can restrict the range of valid values and enforce them in a way that is not possible with pen and paper. Wherever possible we needed to restrict the ranges of data elements and provide an intuitive user interface for our end users.

• Extension: The data collection system is the first phase of the PsyGrid system. One vision for PsyGrid is to provide an epidemiology work-bench with the capability to integrate the PsyGrid data set with other data sets, such as socio-economic data. This will enable the aetiology of schizophrenia to be studied in the population at large, and the results can be used directly in service planning for the treatment of the disease. The myGrid (Goble et al., 2003) toolkit provides this capability for the bioinformatics domain and could easily be adapted to epidemiology. This improved understanding of the lifestyle and environmental causes of schizophrenia, when combined with genetic information, could be used to predict an individual’s risk of developing schizophrenia, and so early intervention treatment could be effectively targeted.

• Remote Management: The PsyGrid system was required to support a geographically dispersed user base, with minimal technical support available on site. Therefore the goal was to build a system that could be managed remotely with no ongoing on site support required. All software upgrades needed to be performed remotely and ideally automatically. This requirement suggested to us that a browser-based Web application, which could be managed centrally, would be the easiest solution, but this solution would not support off-line data collection.

• User Defined and User Managed: PsyGrid was required to be simple to use and simple to extend. PsyGrid needed to support multiple independent data collection projects that could be designed and managed by their users. To simplify the creation of a new data set, the user interface for entering data should not need to be defined as well, as the data collection client application must be able to render this solely from the data set definition. The user interfaces for the management tools needed to be intuitive and easy to use.

• Security Credential Management: Based on our end user community we ruled out the possibility of using file system based Public Key Infrastructure (PKI). It is unrealistic to expect clinical users to deal with the complexity of managing certificates and keys themselves, and it would be a deterrent to the use of the system, which was not our aim. However, we did not want to preclude the use of a hardware token-based PKI approach in the future, and so the system needed to be designed with a PKI security infrastructure that would enable this to happen in the future. Consequently, the system needed to provide the ability to translate a user name and password credential into a PKI credential.

• Federation of Systems: Over time we envisage that multiple PsyGrid data repositories could be deployed. By supporting federation of these systems, much larger data sets could be used in analysis. In this federated scenario, there would be multiple instances of PsyGrid, where each one is operated by a different autonomous organisation and has its own user directory and security policy, which can be used collaboratively, such that a user from one PsyGrid in the federation can access another system without repeated authentication.

THE OPERATING ENVIRONMENT

The National Health Service (NHS) in England and Wales provides publicly available healthcare for the whole population. The NHS comprises largely autonomous Trusts, which are responsible for procuring, deploying and operating their own IT infrastructure and applications. Consequently, no two Trusts are the same, and a large variety of infrastructure exists. To address the problems inherent in the NHS IT infrastructure, and to enable the deployment of NHS wide applications, such as the Electronic Care Record, the UK government embarked on the £6 billion “Connecting for Health” IT infrastructure project (formerly the National Programme for IT). The aim of Connecting for Health is to modernise the NHS IT infrastructure and provide a set of global applications that will enable mobility of patients and their clinical data between Trusts. The first phase of the project was to deploy a new NHS-wide network infrastructure, which would provide high-speed links (>100Mb/s) between all Trusts via the backbone network. This infrastructure is commonly known as N3. The backbone would then be able to host applications global to the NHS, which are known as spine services. These services include the National Care Record, the Secondary Usage Service and Picture Archiving and Communication Services, and will be rolled out in subsequent phases. The current state of the NHS IT infrastructure can best be described as transitional.

During the design of the PsyGrid data collection system we made the following assumptions about the NHS network infrastructure. Firstly, we assumed that the N3 high bandwidth infrastructure would be ubiquitous; secondly that each Trust would allow direct HTTP/HTTPS outbound connections to be made to any public NHS server in any trust on any port; and thirdly that inbound HTTP and HTTPS connections would be possible on ports 80, 443, and 8443 from any source address on the NHS network.

Motivation for Using Web services

Service-oriented architectures have been successfully applied in the bioinformatics domain as typified by the myGrid (Goble et al., 2003) project. One of the motivations for the PsyGrid project was to provide the same ability to integrate and analyse multiple data sources using the same workflow approach based on Web services (Oinn et al., 2004). This was the primary factor in our decision to pursue a service-oriented architecture. Exposing our data repository in this way enables the data to be analysed as part of a workflow that tests an epidemiological hypothesis. The need for a service-oriented architecture leads directly to Web services as the implementation technology. Perhaps the most compelling advantage Web services have over other technologies is the ability to pass through firewalls, which the analysis of the operating environment (presented previously) identified as particularly important. The availability of open-source implementations of Web service containers and protocol stacks also played a role, as PsyGrid was to be made freely available to the community, and it should be possible for other groups to deploy a PsyGrid system without purchasing any software. The existence of open Web service standards was also a factor in the decision, as it would enable the PsyGrid system to be used from disparate implementations and operating systems, removing some of the barriers that are commonly found in heterogeneous environments such as the NHS. This is critical to fostering adoption of PsyGrid, as we do not have to mandate one particular platform or implementation. Finally, Web services bring with them the benefits of loose coupling. The system would be deployed and operational for data collection early in the project, long before much of the functionality for data integration and workflow-based statistical analysis could be completed. As additional functionality became available we needed the capability to add this into the operational system incrementally with no disruption to the deployed system.

However, we did have concerns about using Web services in the PsyGrid project. We were not aware of the use of Web services in the NHS and so we would not be able to benefit from the experience of others. Also, many of the Web services standards are new and potentially immature. The same would be true of the open source implementations of these standards. Finally, Web services have a reputation for being very slow in comparison to other technologies offering similar functionality. There was concern that a system based on Web services would not scale.

system architecture

The data collection system has four main architectural components. The data repository stores the datasets, the data collection client provides the user interface for data entry, the security system provides authentication and authorization capability, and the project manager client provides the tools to set up and manage data collection projects. The data repository has been implemented as two Web services, and the security system is composed of three Web services. The data collection client and the project manager are clients of the Web services. The project manager client application is still under development and we do not describe it further. Figure 1 shows the architecture of the system and the principal interactions between the components.

Figure 1. PsyGrid system architecture showing principal interactions between clients, Web services and databases.  


Data Repository

The two Web services, which together form the data repository, are the Repository service and the Transformers service. The Repository service provides a Web service interface to the backend database, which stores the data set definitions and the collected patient data. The operations on the Web service fall into two groups. The first set of operations enables the remote management of data set definitions such that they can be installed and updated. These operations are used by the Project Manager client application. The other set is used by the Data Collection client application and allows patient data to be added, updated, and retrieved. The implementations of these operations allow for client applications that must operate off-line. For instance, a client is not required to download an existing record when new documents are to be completed; the repository manages appending the new document instances to the existing record during the save process. The Repository service enforces access control on its exposed operations to ensure data privacy and confidentiality. The Transformers service provides a set of operations that enable data to be anonymised. For example, a SHA-1 operation is provided to return the hash value of the input. For any data value contained within a data set, a transformer can be specified. The input data type and the output data type are specified along with the URL of the Transformer Web service and the operation to be invoked. The Transformers Web service effectively provides a configurable data pre-processor, which is primarily used for de-identification of identifiable data between it being entered by a user and stored in the database.
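To make the transformer concept concrete, the following minimal sketch shows two de-identifying transformations of the kind described: a SHA-1 hash and the reduction of a full date to month and year. It is written in Python for brevity rather than in the Java used for PsyGrid. The function names, the in-memory registry and the assumed DD-MM-YYYY date format are illustrative assumptions and are not taken from the PsyGrid code base.

```python
import hashlib
from datetime import datetime


def sha1_transform(value: str) -> str:
    """Return the hexadecimal SHA-1 digest of the input string."""
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


def date_to_month_year(value: str) -> str:
    """Reduce a day-month-year date (assumed DD-MM-YYYY) to MM-YYYY."""
    parsed = datetime.strptime(value, "%d-%m-%Y")
    return parsed.strftime("%m-%Y")


# Hypothetical registry mapping an operation name to its implementation.
# It stands in for the "service URL plus operation" lookup described in
# the text; the real system invokes a remote Web service instead.
TRANSFORMERS = {
    "sha1": sha1_transform,
    "monthYear": date_to_month_year,
}


def apply_transformer(operation: str, value: str) -> str:
    """Apply the named transformer to a value before it is stored."""
    return TRANSFORMERS[operation](value)
```

Keeping the transformation behind a named operation means a data set definition only has to record which operation applies to which entry, which is consistent with the configurable pre-processor role described above.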

security system

The three services implemented for the security system are the Authentication service, the Attribute Authority and the Policy Authority. Together they provide a Role Based Access Control (RBAC) (Sandhu et al., 1996) system for PsyGrid. In RBAC, users are assigned privileges, and authorisation decisions are based upon the possession of the required privilege. There are three components in RBAC. The first is a privilege manager, which maps a user to their privileges. The second is a policy decision point (PDP) that is used to control access to a resource. The third component is a Policy Decision Function (PDF), which is used to make a decision on whether a user has sufficient privileges to access a resource.

The Authentication service provides a port type for the user to log in to the system. It is essentially a Web service wrapper around an on-line certificate authority, which itself authenticates users against an LDAP directory. If authentication is successful, then the on-line certificate authority will issue an X.509 PKI credential, bound to the user’s identity. This credential is then used to authenticate with the other Web services in the system which require mutual authentication over SSL.

In PsyGrid, the Attribute Authority (AA) provides the privilege management function. It stores the list of projects that are active on the system. For each project it records its name, a unique identifier, a list of the sub-groups of the project, and the roles that users can take on in this project. It also maintains a registry of users and their privileges. The user’s privileges are maintained on a per project basis. For each project the user is a member of, the privileges granted to the user (role or group membership) are listed. The AA issues Security Assertion Mark-up Language (SAML) tokens that bind a user’s identity to their privileges in a project. The AA digitally signs these statements, which guarantees their authenticity. Any entity that trusts the AA can accept its assertion about a user’s privileges.

The Policy Authority (PA) maintains the security policy. It stores multiple policies, such that each data collection project can have a unique policy. A policy consists of statements, and each statement has an action, target, and a rule. The rule is a Boolean logic expression composed of operators (AND, OR, NOT) and privileges. A rule may be composed of many sub-expressions. The action is the operation the user wishes to perform, and the target is the resource on which they want to perform it. In the following example, in order to invoke the “getRecordSummary” method (the “action”) on the data repository to retrieve the records owned by the North West hub (the “target”), then you must either be a Clinical Researcher belonging to the North West hub or be the Clinical Project Manager (the “rule”):

ACTION = “ACTION_DR_GET_RECORD_SUMMARY”, TARGET = “NORTH_WEST_HUB”, RULE = {{ROLE = CLINICAL_RESEARCHER AND GROUP = NORTH_WEST_HUB} OR {ROLE = CLINICAL_PROJECT_MANAGER}}
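The way such a statement could be evaluated can be sketched as follows. This is a minimal illustration in Python, not the PsyGrid implementation: the tuple encoding of rules, the `Statement` class and the deny-by-default behaviour when no statement matches are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

# A privilege is a (kind, name) pair, e.g. ("ROLE", "CLINICAL_RESEARCHER").
Privilege = Tuple[str, str]


@dataclass(frozen=True)
class Statement:
    """One policy statement: an action, a target and a rule."""
    action: str
    target: str
    # A rule is either a privilege or ("AND" | "OR" | "NOT", sub-rule, ...).
    rule: tuple


def evaluate(rule: tuple, privileges: Set[Privilege]) -> bool:
    """Recursively evaluate a Boolean rule against the user's privileges."""
    head = rule[0]
    if head == "AND":
        return all(evaluate(sub, privileges) for sub in rule[1:])
    if head == "OR":
        return any(evaluate(sub, privileges) for sub in rule[1:])
    if head == "NOT":
        return not evaluate(rule[1], privileges)
    return rule in privileges  # leaf: a single required privilege


def make_policy_decision(statements: Iterable[Statement], action: str,
                         target: str, privileges: Set[Privilege]) -> bool:
    """Grant access only if a matching statement's rule is satisfied."""
    return any(
        s.action == action and s.target == target
        and evaluate(s.rule, privileges)
        for s in statements
    )


# The example statement from the text, in the assumed tuple encoding.
POLICY = [
    Statement(
        "ACTION_DR_GET_RECORD_SUMMARY",
        "NORTH_WEST_HUB",
        ("OR",
         ("AND", ("ROLE", "CLINICAL_RESEARCHER"), ("GROUP", "NORTH_WEST_HUB")),
         ("ROLE", "CLINICAL_PROJECT_MANAGER")),
    )
]
```

With this encoding, a Clinical Researcher in the North West hub and a Clinical Project Manager are both granted access, while a Clinical Researcher from another hub is refused.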

The current policy for the First Episode Psychosis study has approximately 600 statements, covering some 30 actions and 10 targets.

The access enforcement function (AEF) provides the policy decision function. It is a client side API for the PA, which can be invoked from any Web service that protects a resource. The AEF requires the caller to supply the target and action, and either the user’s identity, or a signed SAML assertion that can be verified. The PA, AA and Repository all use the AEF to protect the operations that are exposed as Web services.

The on-line certificate authority is used to issue short-lived user credentials. However, an alternative source of authority, the off-line PsyGrid Infrastructure Certificate Authority (CA), is used to issue long-lived credentials. This gives us two levels of trust, determined by the certificate authority that issued the end entity’s credentials. Those in possession of a credential from the PsyGrid Infrastructure CA, typically servers, are able to invoke services on behalf of other users. In this case, the identity presented during authentication and the identity of the subject in the SAML assertion need not be the same. This is known as delegation.

Because a user’s roles are identified by signed SAML assertions and access control is role-based, federating multiple data collection systems only requires the policy decision function to accept SAML assertions signed by the other attribute authorities participating in the federation. In the current implementation, the PA is configured with a list of AAs that it trusts. This federated model is similar in design to the Shibboleth (Shibboleth, 2006) system but has been implemented for Web services, whereas Shibboleth is used to protect Web pages.
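The trust check at the heart of this federation model can be sketched in a few lines of Python. Note the substitution: PsyGrid uses SAML assertions signed with PKI credentials, whereas this sketch uses a JSON payload and an HMAC with a per-authority shared key purely so that it runs with the standard library alone. The issuer names, keys and function names are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, Iterable, List, Optional


def sign_assertion(issuer: str, key: bytes, user: str,
                   privileges: Iterable[str]) -> dict:
    """Attribute authority side: bind a user to privileges and sign it.

    HMAC-SHA256 stands in for the PKI signature on a real SAML assertion.
    """
    payload = json.dumps(
        {"issuer": issuer, "user": user, "privileges": sorted(privileges)},
        sort_keys=True,
    )
    signature = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def accept_assertion(assertion: dict,
                     trusted_keys: Dict[str, bytes]) -> Optional[List[str]]:
    """Policy authority side: return the privileges, or None if rejected.

    An assertion is accepted only if its issuer is in the configured list
    of trusted attribute authorities and its signature verifies.
    """
    claims = json.loads(assertion["payload"])
    key = trusted_keys.get(claims["issuer"])
    if key is None:
        return None  # issuer is not a trusted attribute authority
    expected = hmac.new(
        key, assertion["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, assertion["signature"]):
        return None  # payload was altered or signed with another key
    return claims["privileges"]
```

Under this scheme, adding another PsyGrid instance to the federation amounts to adding its attribute authority to `trusted_keys`; no change is needed to the policy statements themselves.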

Data collection client Application (dcca)

The PsyGrid data collection client application provides the functionality required for the entry, secure storage and secure transmission of data collected from longitudinal studies. To achieve this, it interacts with the data repository and the security subsystem. A Web-services interface is used by the data client to communicate with the other systems, including the data repository, attribute authority, policy authority, and certificate authority. All communication is encrypted through the use of SSL.

inter service communication

To ensure confidentiality and privacy, all components in the PsyGrid system communicate over secure, encrypted communications links.

Figure 2. Interaction between services when the user logs in and saves a document

We selected transport layer security (TLS) in conjunction with HTTP as the transport for our SOAP-based Web services. We use TLS in mutual authentication mode to ensure both server and client can be sure of the identity of the other party. TLS was chosen over message level security (MLS), as specified in WS-Security, because it is widely available and implementations are mature.

The basic interaction between the client, the data repository and the security system can be illustrated by the example of a user logging into the system and saving a document, as shown in the interaction sequence in Figure 2. Prior to this the user has been working off-line and completed a patient assessment form, which they now wish to save.

1. The user launches the data collection client application and logs in with their user name and password. The DCCA passes the login information to the Authentication service, which attempts to authenticate with the credential repository. The credential repository in turn attempts to authenticate the user with the directory; if successful, it uses its on-line certificate authority to generate a temporary PKI credential for the user, which contains the user’s Distinguished Name as the principal.

2. The DCCA next contacts the attribute authority, which issues a signed SAML assertion for the user, based on the Distinguished Name in the credential supplied during mutual authentication between DCCA and the AA, containing their privileges for the requested data collection project.

3. The DCCA then invokes the saveRecord() operation on the repository, passing the Record which contains the document(s) to be saved and the signed SAML assertion.

4. The Web service uses the AEF to invoke the Policy Authority’s makePolicyDecision() operation, including the action (saveRecord), target (determined from the owning group of the record) and SAML assertion. Based on the user’s privileges listed in the SAML assertion and the stored policy, the Policy Authority will either grant or deny access.

5. The document to be saved has a date entry marked as requiring date transformation. The date is to be converted from day-month-year to month-year, and the appropriate transformer is invoked.

6. The document has another entry that is marked as requiring encryption. The Transformer service’s encrypt operation, which returns a SHA-1 hash of the value, is invoked.

7. The document is stored in the Repository’s database, and the unique identifier of the persisted object is returned as an indication of success.
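The seven steps above can be traced in a small, self-contained simulation. This is a sketch in Python under strong simplifying assumptions: the directory, attribute registry, policy and database are in-memory dictionaries, the credential and the assertion are plain dictionaries rather than X.509 and signed SAML objects, and the field names `date_of_birth` and `postcode` are chosen only for illustration.

```python
import hashlib
import uuid
from datetime import datetime

# In-memory stand-ins (all hypothetical) for the LDAP directory, the
# attribute authority registry, the policy and the repository database.
DIRECTORY = {"alice": "secret"}
PRIVILEGES = {"alice": {"ROLE=CLINICAL_RESEARCHER", "GROUP=NORTH_WEST_HUB"}}
POLICY = {
    ("saveRecord", "NORTH_WEST_HUB"):
        {"ROLE=CLINICAL_RESEARCHER", "GROUP=NORTH_WEST_HUB"},
}
DATABASE = {}


def login(user: str, password: str) -> dict:
    """Step 1: authenticate against the directory, issue a credential."""
    if DIRECTORY.get(user) != password:
        raise PermissionError("authentication failed")
    return {"dn": "CN=" + user}


def get_assertion(credential: dict) -> dict:
    """Step 2: the attribute authority states the user's privileges."""
    user = credential["dn"].split("=", 1)[1]
    return {"subject": credential["dn"],
            "privileges": PRIVILEGES.get(user, set())}


def make_policy_decision(action: str, target: str, assertion: dict) -> bool:
    """Step 4: grant only if every required privilege is held."""
    required = POLICY.get((action, target))
    return required is not None and required <= assertion["privileges"]


def save_record(record: dict, assertion: dict) -> str:
    """Steps 3 to 7: authorise, transform marked entries, then persist."""
    if not make_policy_decision("saveRecord", record["group"], assertion):
        raise PermissionError("access denied")
    stored = dict(record)
    # Step 5: day-month-year is reduced to month-year.
    stored["date_of_birth"] = datetime.strptime(
        record["date_of_birth"], "%d-%m-%Y").strftime("%m-%Y")
    # Step 6: the identifying value is replaced by its SHA-1 hash.
    stored["postcode"] = hashlib.sha1(
        record["postcode"].encode("utf-8")).hexdigest()
    # Step 7: persist and return the identifier of the stored object.
    record_id = str(uuid.uuid4())
    DATABASE[record_id] = stored
    return record_id
```

The point of the sketch is the ordering: the access decision is made before any transformation or storage takes place, and only the transformed values ever reach the database.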

Once the user has logged into the system, the temporary credentials can be used until they expire. The DCCA will automatically refresh the end user’s temporary credential before it becomes invalid by invoking the login() operation of the Authentication service, using the user name and password that were cached by the client when the user logged in. This is done without further interaction from the user. Thus the infrastructure is secured by PKI, and yet there is no burden on the user of managing any more than a user name and password. It would not require any changes to the backend system to integrate a hardware token approach; only the DCCA would be affected.
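A client-side refresh of this kind might look like the following Python sketch. The `CredentialManager` class, the `expires_at` field and the 60-second safety margin are assumptions made for illustration; the text does not specify how far in advance of expiry the DCCA refreshes the credential.

```python
import time
from typing import Callable


class CredentialManager:
    """Caches the user name and password and renews the temporary
    credential shortly before it expires, without prompting the user."""

    def __init__(self, login: Callable[[str, str], dict], user: str,
                 password: str, refresh_margin: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._login = login            # e.g. a call to the login() operation
        self._user = user
        self._password = password      # cached at first login
        self._margin = refresh_margin  # seconds before expiry to renew
        self._clock = clock            # injectable for testing
        self._credential = login(user, password)

    def current(self) -> dict:
        """Return a credential that is valid for at least the margin."""
        remaining = self._credential["expires_at"] - self._clock()
        if remaining <= self._margin:
            # Renew silently using the cached user name and password.
            self._credential = self._login(self._user, self._password)
        return self._credential
```

Every outgoing Web service call would obtain its credential through `current()`, so renewal happens lazily at the moment a credential is needed rather than on a timer.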

interface Description

Each of the Web services exposes a number of operations, and the parameters required range in complexity from simple integers or strings, to complex documents that describe an entire data set. For the Repository, Authentication and Transformer Web services, the interface description in WSDL is generated from the Java implementation of the interface. For the PA and AA services, the Java stubs are generated from the WSDL definition, and the messages are documents defined by an XML schema. The number and range of operations is too great to describe each one in detail, but selected examples are presented in Table 1.

Table 1. Examples of operations exposed by the PsyGrid Web services

| Web Service | Operation | Input Parameters | Output | Description |
|---|---|---|---|---|
| Transformer | encrypt | value:String | String | Returns a SHA-1 hash of the input string. |
| Authentication | login | userId:String, password:char[] | String | Authenticates the user and returns a PKI credential stored in a Java KeyStore, which is base-64 encoded and returned as a string. |
| Repository | getRecordsByGroups | project:String, groups:String[], saml:String | Record[] | Returns an array of Java Record objects (as defined by the Repository data model) that exist in the specified project and groups within the project. |
| Policy Authority | makePolicyDecision | project:ProjectType, action:ActionType, target:TargetType, saml:String | boolean | Determines whether the user’s privileges are sufficient to perform the requested action on the specified target. The input is an XML document containing the project, action, target and SAML assertion. |

 

Web service implementation experiences

In this section we discuss three detailed aspects of implementing our Web services together with the lessons learned.

Using WS-Security

Our original intention was to use the WSS4J implementation of the WS-Security standard for retrieving, transporting and verifying the signed SAML assertions used in the access control system. Whilst the transport and verification functionality met our needs, retrieval from the Attribute Authority required that the data collection project identifier be included in the request. There was no easy way to communicate this from the client application through the Axis Web service stack to the WSS4J handler that implemented the SAML retrieval function. The only way we found to do this was to set the information on the Axis Call object, but this required that the Web service client API stubs were extended to pass this information through to Axis, which in turn meant that they could not be generated automatically from the WSDL definition of the service. Consequently, we moved the SAML functionality into the application layer and added the SAML assertion as a parameter to each operation of the repository Web service that required access control.

Document vs. RPC Encoding

Within PsyGrid we have used both styles of encoding for our services. The Authentication and the Repository services use RPC encoding, and are generated using Java2WSDL, directly from the Java class definitions. Conversely, the Policy Authority and the Attribute Authority use document-literal encoding. An XML schema defines the content of the documents, which is used in the WSDL definitions of the services. WSDL2Java was then used to generate the Java classes for the implementation. The Repository object model was implemented making full use of Java’s complex types for managing collections. Java2WSDL was unable to map these complex types into the corresponding XML representation, and so a Data Transfer Object (DTO) layer was implemented which mapped the complex collection types into simple arrays. We then used the DTO classes to generate the WSDL for the Repository. In a similar way we also implemented a DTO layer for the Policy Authority and Attribute Authority. The WSDL2Java implementation was unable to handle abstract types in the XSD definition, although they were required for the natural implementation of the object models of these two services. Consequently the XSD was restructured to eliminate the use of abstract types, and the DTO layer was implemented to restore them in the Policy Authority and Attribute Authority services.
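The role of the DTO layer can be sketched as follows. This is a minimal illustration under our own assumptions, not the PsyGrid classes: DocumentModel and DocumentDTO are hypothetical names, and a single list of entry names stands in for the richer collections of the real object model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical domain class that uses a Java collection internally. */
final class DocumentModel {
    final String name;
    final List<String> entryNames = new ArrayList<>();

    DocumentModel(String name) {
        this.name = name;
    }
}

/** Hypothetical DTO that exposes only simple types and arrays to the WSDL generator. */
final class DocumentDTO {
    String name;
    String[] entryNames;

    /** Flattens the collection into an array for transport. */
    static DocumentDTO fromModel(DocumentModel model) {
        DocumentDTO dto = new DocumentDTO();
        dto.name = model.name;
        dto.entryNames = model.entryNames.toArray(new String[0]);
        return dto;
    }

    /** Rebuilds the collection-based model on the receiving side. */
    DocumentModel toModel() {
        DocumentModel model = new DocumentModel(name);
        model.entryNames.addAll(Arrays.asList(entryNames));
        return model;
    }
}
```

The cost of this approach is that every model class needs a matching DTO and two conversion methods, which must be kept in step as the model evolves.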

SSL Implementation

Axis provides a default SSL socket factory for the management of SSL connections, but allows a custom socket factory to be configured. The default factory was unsuitable for our needs, as the DCCA has to refresh the PKI credential each time it expires, and so we implemented a PsyGrid-specific socket factory. The PsyGridClientSocketFactory Java class provides a specialised key manager and a certificate manager that are capable of updating the Java SSL subsystem with the new PKI credentials.
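One way to build a key manager whose credential can be replaced at runtime is to delegate to a swappable instance of the standard javax.net.ssl.X509KeyManager interface. The sketch below shows that idea only; it is not the PsyGridClientSocketFactory implementation, and both class names are hypothetical. FixedAliasKeyManager is a stand-in that holds no real keys and exists only to demonstrate the swap.

```java
import java.net.Socket;
import java.security.Principal;
import java.security.PrivateKey;
import java.security.cert.X509Certificate;
import java.util.concurrent.atomic.AtomicReference;
import javax.net.ssl.X509KeyManager;

/** Key manager whose underlying credential can be swapped at runtime (hypothetical name). */
final class RefreshableKeyManager implements X509KeyManager {
    private final AtomicReference<X509KeyManager> delegate;

    RefreshableKeyManager(X509KeyManager initial) {
        this.delegate = new AtomicReference<>(initial);
    }

    /** Called after a new temporary credential has been obtained. */
    void refresh(X509KeyManager replacement) {
        delegate.set(replacement);
    }

    @Override public String[] getClientAliases(String keyType, Principal[] issuers) {
        return delegate.get().getClientAliases(keyType, issuers);
    }
    @Override public String chooseClientAlias(String[] keyTypes, Principal[] issuers, Socket socket) {
        return delegate.get().chooseClientAlias(keyTypes, issuers, socket);
    }
    @Override public String[] getServerAliases(String keyType, Principal[] issuers) {
        return delegate.get().getServerAliases(keyType, issuers);
    }
    @Override public String chooseServerAlias(String keyType, Principal[] issuers, Socket socket) {
        return delegate.get().chooseServerAlias(keyType, issuers, socket);
    }
    @Override public X509Certificate[] getCertificateChain(String alias) {
        return delegate.get().getCertificateChain(alias);
    }
    @Override public PrivateKey getPrivateKey(String alias) {
        return delegate.get().getPrivateKey(alias);
    }
}

/** Minimal stand-in used only to demonstrate the swap; it holds no real keys. */
final class FixedAliasKeyManager implements X509KeyManager {
    private final String alias;

    FixedAliasKeyManager(String alias) {
        this.alias = alias;
    }

    @Override public String[] getClientAliases(String keyType, Principal[] issuers) { return new String[] {alias}; }
    @Override public String chooseClientAlias(String[] keyTypes, Principal[] issuers, Socket socket) { return alias; }
    @Override public String[] getServerAliases(String keyType, Principal[] issuers) { return null; }
    @Override public String chooseServerAlias(String keyType, Principal[] issuers, Socket socket) { return null; }
    @Override public X509Certificate[] getCertificateChain(String alias) { return null; }
    @Override public PrivateKey getPrivateKey(String alias) { return null; }
}
```

In a real client the wrapper would be passed to SSLContext.init() in the KeyManager array, and refresh() would be given a key manager built from the newly issued KeyStore. New SSL handshakes would then be expected to present the new credential.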

Logging

All Web services implement logging of each attempt to access the operations they provide. This provides a full audit trail of the usage of the system. Each log entry records the entity invoking the operation (from the X.509 certificate), the user requesting access (from the SAML assertion), the source IP address from which the request originated, a time stamp, and the name of the operation being invoked. If a request to access a Web service is denied, this is also logged. A common logging API was implemented for use by all Web services, which uses the Apache Log4J framework.
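The content of a log entry can be sketched as below. The field order and the key=value layout are assumptions made for illustration, and the class name is hypothetical; the sketch only builds the entry text and deliberately leaves out the Log4J call that would write it.

```java
import java.time.Instant;

/** Builds one audit line from the fields listed above (hypothetical layout). */
final class AuditEntrySketch {

    static String format(Instant when, String callerDn, String user,
                         String sourceIp, String operation, boolean granted) {
        return String.join(" ",
                when.toString(),                 // time stamp
                "caller=\"" + callerDn + "\"",   // entity from the X.509 certificate
                "user=" + user,                  // user from the SAML assertion
                "ip=" + sourceIp,                // source IP address of the request
                "op=" + operation,               // name of the operation invoked
                granted ? "GRANTED" : "DENIED"); // denied requests are logged too
    }
}
```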

Implementation Details

We describe the implementation of three of the four core components of the data collection system architecture. The project manager client application has not been implemented at the time of writing.

Data Repository

The PsyGrid data repository has two principal objectives: to store the definition and structure of the data to be stored in the repository, and to store the data itself. It also performs a number of ancillary functions, which will be described below.

To define the data to be collected, a three-level hierarchy has been created. At the top level of the hierarchy a single data repository may define a number of datasets, each dataset being equivalent to a single data collection project. At the next level of the hierarchy, a number of documents may be defined within each dataset, each document being equivalent to a single paper-based assessment form. The repository also caters for documents that are to be completed multiple times within a single study (for instance for longitudinal studies), and allows documents to be grouped into those that are intended to be completed together.

The final level of the hierarchy allows each document to contain a number of entries, with each entry being equivalent to a single item of data to be collected. Provision has been made for collecting a variety of different types of data, including, but not restricted to, numeric data, textual data, and selection from a list of options. The entries in a document may be logically grouped into sections, and it is possible to define sections that are to be completed multiple times, but with a different context each time. This simplifies the definition of documents with repetitive entries.

More advanced behaviour for entries is also supported. Each entry may have one or more validation rules defined for it to prevent illegal data from being entered. Also, an entry type is provided for which data is not entered directly; instead, the value is calculated from the values entered for other entries in the document. This can be used, for example, to calculate the overall value of an assessment scale involving multiple questions. Finally, for entries where the value is selected from a list of options, it is possible to define the user’s flow through a document in response to the option they select, by enabling or disabling subsequent entries in the document.
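A validation rule and a calculated entry might look like the following minimal sketch. The names RangeRule and SumDerivedEntry are hypothetical, and a numeric range check and a simple sum are assumed here as the simplest representative cases; the repository’s actual rule and calculation types are not described in enough detail to reproduce.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical validation rule: a numeric entry must lie within a closed range. */
final class RangeRule {
    final double min;
    final double max;

    RangeRule(double min, double max) {
        this.min = min;
        this.max = max;
    }

    boolean isValid(double value) {
        return value >= min && value <= max;
    }
}

/** Hypothetical derived entry whose value is the sum of other entries in the document. */
final class SumDerivedEntry {
    final List<String> sourceEntries;

    SumDerivedEntry(List<String> sourceEntries) {
        this.sourceEntries = sourceEntries;
    }

    /** Returns the total, or NaN while any source entry is still unanswered. */
    double calculate(Map<String, Double> answers) {
        double total = 0;
        for (String name : sourceEntries) {
            Double value = answers.get(name);
            if (value == null) {
                return Double.NaN;
            }
            total += value;
        }
        return total;
    }
}
```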

For storing the data collected a similar three-level hierarchy is used, to match the hierarchy of the data definition. At the top level of the hierarchy a record represents a single instance of a dataset. At the bottom level of the hierarchy is a response to a single entry; each response maintains a collection of values which represent an add-only store of all entered data for the response, with accompanying provenance metadata.
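The add-only store kept by each response can be sketched as an append-only list in which the latest element is the current value. The class and field names below are hypothetical, and the provenance fields (user, time, reason) are an assumption about what such metadata would minimally contain.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** One entered value with hypothetical provenance metadata. */
final class ProvenancedValue {
    final String value;
    final String enteredBy;
    final Instant enteredAt;
    final String reason;

    ProvenancedValue(String value, String enteredBy, Instant enteredAt, String reason) {
        this.value = value;
        this.enteredBy = enteredBy;
        this.enteredAt = enteredAt;
        this.reason = reason;
    }
}

/** Append-only history of the values entered for a single response. */
final class ResponseHistory {
    private final List<ProvenancedValue> values = new ArrayList<>();

    /** Values are only ever appended; earlier values are never overwritten. */
    void record(String value, String user, String reason) {
        values.add(new ProvenancedValue(value, user, Instant.now(), reason));
    }

    /** The most recently entered value, or null if nothing has been entered. */
    String currentValue() {
        return values.isEmpty() ? null : values.get(values.size() - 1).value;
    }

    /** Read-only view of every value entered so far, oldest first. */
    List<ProvenancedValue> history() {
        return Collections.unmodifiableList(values);
    }
}
```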

The PsyGrid data repository is implemented as a set of persistable Java classes, mapped to a relational database using the open-source Hibernate object-relational mapping package. As well as reducing development time by eliminating the need to hand-craft SQL, this approach also allows the PsyGrid data repository to be deployed using any of the back-end databases supported by Hibernate, which include MySQL (used during development), IBM DB2 (used for the production deployment) and the majority of other RDBMS products. The data repository also provides scheduling functionality: for a dataset representing a longitudinal study, it can be configured to send e-mail reminders to the responsible persons when the next set of documents is ready to be completed for each record.

Data collection client

The PsyGrid data client is a rich-client application that automatically generates a visual representation of a data set from a definition provided by the data repository. It represents the data set by using a combination of simple user interface elements such as radio buttons, combo boxes and text fields, as well as more complex ones such as tables. The data client also has a specialised date widget that allows the user to select a date from a calendar-like interface. Finally, support for grouping is available in the form of sections.

The application presents a clean and simple interface using a wizard-like approach providing the user with a unique path to follow in order to complete an assessment. This approach was taken considering that users of the system are unlikely to have a technical background. In addition, context-sensitive help is available throughout the application. The data set definition is the source of the information used by the context-sensitive help, providing maximum flexibility. In keeping with making the usage of the application as intuitive as possible, the application is able to fully restore its state after being terminated. So, if for some reason an assessment had to be interrupted, the user could simply close the application without having to worry about losing data.

The accuracy of the data entered is of great importance, and humans entering data are likely to occasionally make mistakes. The data client attempts to mitigate this problem by providing instant (“as you type”) feedback on the validity of the data inserted. If the data entered is not valid, an icon is displayed next to the entry to indicate it. An explanation is also available and it is shown as a tool tip when the mouse cursor hovers over the icon. Validation rules to govern this process can be defined per data set and can be applied to individual entries. A common source of errors is the incorrect usage of measurement units. In order to solve this issue, the client application displays the units defined for an entry in a combo box, allowing the user to choose from one of them. Finally, entries whose values are calculated from values in other entries do not require human intervention at all. The calculated score is updated in real-time as the values are entered in the relevant entries. Additional features provided by the client are a “review mode”, and specific support for unanswered questions. The former allows a user to examine a completed assessment in order to correct any mistakes. Any correction requires an annotation describing the reason for the change. An audit trail is retained making it possible for a change to be reverted if necessary. Unanswered questions are common when collecting data, so special support is present to allow the user to select the reason for not answering the question. The set of options available is defined per repository. This information can be useful in the analysis stage and when revising the contents of an assessment.

The client has two modes of operation, “off-line” and “on-line”. The former provides a subset of the functionality available in the latter mode. The ability to save records, to retrieve data sets, and to review already submitted data is only available in the on-line mode. While in the off-line mode, the information collected is stored in an encrypted form, and as soon as the application enters the on-line mode, the data is uploaded to the data repository and deleted from local storage.

security

The Authentication service provides a Web service interface to the myProxy credential repository (Basney et al., 2005) that has been configured to act as an on-line certificate authority (CA). The Attribute Authority is implemented as a set of persistable Java classes, mapped to a relational database. The definition of a project includes the project name, a list of sub-groups which exist for this project, and a list of the valid roles which a user can be assigned. The AA knows each user by their Distinguished Name, and the AA stores the list of projects and privileges for the user. The AA provides a Web service interface, which exposes operations to query project information and user privileges. The other major function of the AA is the issuing of signed SAML assertions, which bind a user’s identity to their privileges. We have used the OpenSAML implementation for this. The Policy Authority makes access control decisions, based on the stored policy and the user privileges supplied in the SAML assertion. Policies are implemented as sets of persistable Java objects, and any number of policies may be stored; typically there will be one policy for each data collection project. The PA verifies the validity and integrity of a SAML assertion and confirms it comes from a trusted AA. It then checks the policy to determine if the supplied privileges are sufficient for the requested target and action. In this context a target corresponds to an object or object group in the data repository and the action is the operation to be performed.
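The final step of the decision, checking the supplied privileges against the stored policy, can be sketched as a lookup with a deny-by-default result. This is an illustration under our own assumptions: PolicySketch is a hypothetical class, privileges are reduced to role names, and the verification of the SAML assertion’s signature and issuer, which the PA performs first, is omitted.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory policy for one data collection project. */
final class PolicySketch {
    // Maps "action|target" to the set of roles allowed to perform that action.
    private final Map<String, Set<String>> rules = new HashMap<>();

    void allow(String action, String target, String role) {
        rules.computeIfAbsent(action + "|" + target, key -> new HashSet<>()).add(role);
    }

    /** Grants access only if one of the asserted roles is permitted; denies otherwise. */
    boolean makePolicyDecision(String action, String target, Set<String> assertedRoles) {
        Set<String> permitted = rules.getOrDefault(action + "|" + target, Collections.<String>emptySet());
        for (String role : assertedRoles) {
            if (permitted.contains(role)) {
                return true;
            }
        }
        return false;
    }
}
```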

Client-side APIs have been developed for both the PA and AA to hide the details of accessing the Web service. This means the services using the access control functionality need only call the makePolicyDecision() function provided by the API.

Deployment and Operation

Third Party software components

All software components developed as part of PsyGrid are open source, licensed under the GPL. We rely on a number of third-party software components, and to ensure that anyone can deploy their own PsyGrid, we have used only free, open source components. Most notably we have used Apache Tomcat (version 5.5.12) as our Web service container and Apache Axis (version 1.3) as our Web service stack. OpenLDAP was used for the directory, which provides user authentication, combined with the myProxy on-line credential repository. Our backend database in the production system is IBM DB2, although in our development environment we use MySQL. Hibernate (version 3.1) provides an object-relational mapping (ORM) service, which enables any database supported by Hibernate to be used.

Hardware Architecture

PsyGrid is a production system and so it was designed to be continuously available. The deployment configuration is shown in Figure 3. High availability was achieved by using a pair of servers to host the Web services and database. These two servers are deployed with an identical software stack, which consists of the Web service container (Apache Tomcat), the LDAP directory (OpenLDAP), the database, which is IBM DB2 in the production system, and the myProxy on-line certificate authority. The disk used for storage of the data is a RAID 5 array. The servers are identical in every respect, and use the Linux Heartbeat application to manage high availability, using a shared “virtual” IP address. At any one time one server is active and the other is in standby. The active server is assigned the virtual IP address by Heartbeat. The URL of the Web services, which the client uses to access the system, resolves in the NHS network DNS to this virtual IP address. The active server has the RAID 5 disk mounted, which stores the database. Heartbeat manages failure detection, and can switch activity from the active server to the standby under a number of conditions. Heartbeat will determine that the active server has failed if the active server does not respond to the heartbeat message sent by the standby server via both IP and a direct serial cable connection. It will also determine the active server to have failed if the Web services do not respond to a getVersion() request, which would indicate a Tomcat failure. The active server continually monitors the health of the DB2 and Tomcat processes. A switch of activity proceeds as follows:

1. The standby node forces the other node to shutdown, releasing the virtual IP address and unmounting the database disk.

2. The standby node mounts the database disk.

3. Tomcat is started and the Web services initialised. myProxy and the LDAP server are always running.

4. The virtual IP address is taken over by the standby server, which is now the active server.

Since all persistent data for the AA, PA and Repository services is stored on the shared disk, there is no need for replication of their data. The LDAP directory uses the OpenLDAP replication service to keep the two directories synchronised, and myProxy maintains no state data. A switch of activity has been measured to take 12 seconds on the production system, which causes minimal interruption to service.

There was no opportunity to perform scaling tests on the system before it had to be placed into production due to time constraints, though for the initial deployment we were confident that our hardware platform would be more than powerful enough. We have long since passed the initial sizing requirements and we now support three independent data collection projects, with a combined user base of fifty distributed across 16 sites, and there is no degradation in performance. As and when performance becomes an issue, the use of Web services will enable us to seamlessly add new hardware and redistribute the services among the available hardware.

Figure 3. PsyGrid Data Collection System deployed for FEP data collection (DCCA = Data Collection Client Application, AA = Attribute Authority, PA = Policy Authority, DT = Data Transformer, DR = Data Repository, Auth = Authentication Service)


 

software upgrade

Our requirement was that a software upgrade should not cause an interruption to service. There are two different types of modification that need to be made to the running system.

The first of these is an upgrade to the Web services that occurs when bugs are fixed or new functionality has been added. The redundant hardware architecture employed enables the upgrade to be completed with minimal service interruption. The process we used is as follows:

1. Suspend the heartbeat application so that fail-over cannot occur.

2. On the standby machine, remove the currently deployed Web services that are to be upgraded.

3. Deploy the new versions of the Web services.

4. Restart heartbeat and manually switch activity. The upgraded Web services now become active.

5. Repeat steps 1, 2, and 3 so that the other server is upgraded.

6. Repeat step 4 to ensure that the deployment to the second server has been successful.

Changes to the interface of Web services require the data collection client to be upgraded as well. The client is distributed using Java WebStart, which checks for a new version each time the client is launched by the user. If a newer version exists, the client is automatically updated. If the user is running the client during the upgrade, then any change to an existing Web service port type will cause problems. Consequently we maintain backwards compatibility between consecutive releases of a Web service. The addition of new port types to a Web service can be safely deployed without affecting running clients. To assist in debugging and user support, all Web services implement a getVersion() port type which returns the version information set at compile time. We do not use Tomcat’s Web service “hot deploy” capability as it proved to be unreliable, producing runtime errors. Instead the Web archive file and the deployment directory of the service are completely deleted before the new version is deployed.

The second type of modification we need to make to a running system is modification of the data sets. This is required when either a bug is found or a change request from the project clinicians is made. The repository Web service enables a live update to the data set to be made using a “patching” port type. This enables the complete object graph to be downloaded, modified by the patching client and then saved in the repository. Each data set records its own patch level.

The combination of Web services and the high availability hardware architecture enables us to upgrade and patch the system remotely and in a way that is largely transparent to our users.

Remote connectivity

Deploying PsyGrid across the eight NHS Trusts proved to be very difficult. The NHS code of connection requires that only computers owned and managed by an NHS Trust may connect to the NHS network. Consequently, the purchase of hardware for the PsyGrid project and the configuration of those computers had to be devolved to the local NHS Trusts. We also required information from the Trusts about the source IP addresses the remote client computers would present to the firewall protecting the Central Manchester and Manchester Children’s Hospital Trust (CMMCHT), which was the host for the central PsyGrid servers, so that this could be configured accordingly. The CMMCHT policy is to treat the NHS network link as hostile, and so connection was denied by default. The IP addresses were needed so that they could be added to the “white list” of trusted remote computers. Acquiring this information proved to be very difficult. The first hurdle that had to be overcome was to find the person who was able to release this information, and then to explain to them that we were a legitimate research project, providing our supporting documentation. The response we often received to our first approach was one of suspicion that we were attempting to obtain restricted security information using social engineering techniques. The situation was further compounded by the fact that NHS IT departments are often overloaded with the work of keeping the IT systems used for the provision of clinical care running. Whilst we encountered no refusals to help support PsyGrid, research projects such as ours have a low priority and so help was provided on a best-effort basis. We concluded that we needed to minimise our dependence on NHS Trust IT staff as much as possible, as each dependency could turn into a project delay.
We were fortunate in that we were able to negotiate with CMMCHT a relaxation in their firewall policy such that access to our servers would be possible from any computer on the NHS network. Without this relaxation it would be impossible to scale the deployment of PsyGrid across the NHS in the future.

Once the central system was deployed the next step was to test connectivity from the remote NHS Trusts. The initial assumptions we made about the NHS network infrastructure (in a previous section) were very quickly shown to be wrong. We discovered that there were three different classes of NHS Trusts, differentiated by the way they allowed connections to be made to the NHS network. The first type of Trust caused no problems as they permitted direct connection to the NHS network with no restrictions on the ports that could be used. The second type of Trust employs HTTP proxy servers between the Trust’s network and the NHS network. These proxy servers only permit HTTP traffic on port 80 and HTTPS traffic on port 443. The third type of Trust also employs proxy servers, but additionally requires the user to authenticate to the proxy server. None of the Trusts that require proxy authentication were using the standard HTTP proxy server authentication method; all were using the proprietary Microsoft NTLM authentication protocol, which requires the user to supply their Windows domain credentials.

Whilst it may have been possible to negotiate with the individual Trusts to remove these restrictions so that a direct connection could be made in line with our initial assumptions, we had already concluded that relying on NHS Trusts for this was not viable or scalable. Even if we could negotiate direct connection, there would be no guarantee that this would not be withdrawn in the future without our knowledge. Therefore we had to extend and refactor the system to take into account the reality of the different types of Trusts. The first task was to use only the standard ports for HTTP(S). We were using two ports for HTTPS traffic, 443 and 8443, with Tomcat configured to perform server-side authentication and mutual authentication respectively. The data collection application Web Start Web service and the authentication service were hosted on the server-authenticated connector and the other Web services required mutual authentication. We now only had one port we could use for HTTPS and one port for HTTP. The data collection application Web Start Web service was moved to the HTTP connector on port 80. The authentication service could not be moved to an unencrypted connection, however, since the user name and password would be sent in the clear to the authentication service. At the same time, the only available HTTPS port had to employ mutual authentication, because the other Web services use the distinguished name from the client certificate to make access control decisions (for the Policy Authority and Attribute Authority) and to log service invocation. We therefore had to place the authentication service on this connector and distribute with the data collection client a default certificate that would enable authentication to the server, so that login may proceed. The consequence of this is that the client application is distributed with a certificate with a long lifetime, which could be used to mutually authenticate with the PsyGrid servers.
However, the actual access-control decision on all services requires either a valid signed SAML assertion or a certificate issued by the PsyGrid Infrastructure CA. This default certificate has the distinguished name “CN=nobody, O=user, O=psygrid, C=uk”, and is issued by the PsyGrid Online CA, and so cannot be used to do anything more than complete the SSL mutual authentication. It cannot be used to retrieve a SAML assertion (as there is no user with this Distinguished Name), nor are certificates from this CA accepted as valid for access without a SAML assertion.

The next task was to provide the ability to tunnel SSL through proxy servers. By default the version of Apache Axis (version 1.3) we were using did not provide support for SSL tunnelling through proxy servers. The existing PsyGridClientSocketFactory was extended to provide this behaviour, and a configuration dialog was added to the client application to set the proxy server address and port number. If a proxy server is configured, the PsyGridClientSocketFactory sets up an SSL tunnel through the proxy, using the HTTP CONNECT method. One negative consequence of this is that it requires the end user to perform this configuration, which often requires a call to PsyGrid technical support to guide them through the process.

Finally we had to deal with authenticating proxies. Basic and digest authentication are standard extensions of the HTTP protocol, but the NTLM protocol is Microsoft proprietary, and the specification is not published. However, the protocol has been reverse engineered and the Apache Commons HTTP Client provides an implementation. It also implements the basic and digest authentication methods. Axis 1.3 does not use the HTTP Client, and so we again had to integrate this into the PsyGridClientSocketFactory, so that it would perform authentication if this was configured. We had to further extend the configuration dialog so that the authentication type could be selected, and a further user name and password dialog is displayed during login if proxy authentication is configured.
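The tunnelling handshake itself is simple, and a minimal sketch may help. The code below builds the HTTP CONNECT request sent to the proxy and interprets the proxy’s status line; it covers only the basic authentication scheme, since NTLM needs a multi-step challenge and response exchange that is not shown. ProxyTunnelSketch and its method names are hypothetical, and this is not the PsyGridClientSocketFactory code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Illustrative helpers for opening a tunnel through an HTTP proxy (hypothetical names). */
final class ProxyTunnelSketch {

    /**
     * Builds the request that asks the proxy to open a tunnel to host:port.
     * If proxyUser is non-null, a basic Proxy-Authorization header is added.
     */
    static String connectRequest(String host, int port, String proxyUser, String proxyPassword) {
        StringBuilder request = new StringBuilder();
        request.append("CONNECT ").append(host).append(':').append(port).append(" HTTP/1.1\r\n");
        request.append("Host: ").append(host).append(':').append(port).append("\r\n");
        if (proxyUser != null) {
            String credentials = proxyUser + ":" + proxyPassword;
            String encoded = Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
            request.append("Proxy-Authorization: Basic ").append(encoded).append("\r\n");
        }
        request.append("\r\n"); // blank line ends the request headers
        return request.toString();
    }

    /** Returns true if the proxy's status line reports a 2xx response, meaning the tunnel is open. */
    static boolean tunnelEstablished(String statusLine) {
        String[] parts = statusLine.trim().split(" ");
        return parts.length >= 2 && parts[1].length() == 3 && parts[1].charAt(0) == '2';
    }
}
```

Once the proxy has answered with a 2xx status, the client can layer SSL over the same connection, for example with SSLSocketFactory.createSocket(Socket, String, int, boolean), so that the proxy only ever sees encrypted traffic.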

discussion

The operating environment in the NHS has undoubtedly provided the biggest challenge to PsyGrid. The high degree of heterogeneity between NHS Trusts, the use of proprietary protocols and the firewall restrictions, which we only really began to understand when we tried to make the system operational, all caused the system to be reworked to overcome these restrictions. This is the single most important lesson learnt: a thorough understanding of the operating environment is required when the system is being designed. However, these obstacles were not insurmountable, and this shows that Web service implementations have reached a level of maturity that is able to cope with such a hostile environment. There is no doubt that a Web application based around off-the-shelf products would not have encountered these problems, but this would not have met our current or future requirements. We identified restrictions in the Web service tooling which defined a lowest common denominator for the way we structured our data types, whether the starting point was Java or XML. We anticipate our decision to use Web services will be fully justified in subsequent phases of the project as we begin to extend the system to include Web service orchestration and workflow techniques. Performance of the Web services, which we thought might cause problems, has not proved to be an issue to date, and we currently support three data collection projects encompassing 16 locations and 50 users.

The adoption of the PsyGrid system for mental health research in the UK has resulted in modification to the business processes of data collection in three ways. Firstly, data collection is now paperless, and data no longer needs to be transcribed from paper into an electronic format for analysis. This requires fewer resources, and the risk of introducing errors in the transcription is eliminated. Secondly, because PsyGrid can operate off-line, clinical researchers are able to enter data at any time in any location, including patients’ homes, and consequently they have more flexibility in their working patterns. Thirdly, we believe that the system is easier to use, and the restrictions on data input imposed in the user interface will lead to higher data quality. An informal analysis indicates that this is likely to be true, and we will report on a thorough evaluation in the future.

Grid computing techniques and technology are spreading rapidly through the health informatics community. Web services underpin the majority of current Grid middleware and consequently the use of Web services is becoming widespread. The caCORE project (Phillips et al., 2006) and the closely allied caBIG from the National Cancer Institute Centre for Bioinformatics have developed a set of tools and applications that provide much of the functionality of PsyGrid for the cancer domain. We are evaluating whether the caCORE SDK can be used to implement the data set designer function in the Project Manager application. PsyGrid is closely allied with three sister projects, namely CancerGrid (Brenton et al., 2005), NeuroGrid (Geddes et al., 2006) and VOTES (Virtual Organisations for Trials and Epidemiological Studies) (Stell et al., 2006). All these projects have a common theme, which is to develop middleware that will enable distributed collaboration and resource sharing in their specific domain. CancerGrid is focused on supporting clinical trials in cancer research. NeuroGrid is developing a medical imaging Grid, which provides a searchable, distributed file system for the curation of brain images, and workflow tools for analysis. Finally, VOTES is focused on the security aspects of dynamic virtual organisations in a clinical context, and their application to the integration and query of clinical data sets arising from routine care, which can be used to identify eligible clinical trial participants.

We have described the PsyGrid data collection system, which is being used to record the patient assessments in the longitudinal FEP cohort study. The next phase of the project will see the development of tools for epidemiology, which will enable the testing of hypotheses about the causes of schizophrenia, and identify environmental risk factors. This will require the core PsyGrid data set to be integrated with existing data sets such as census data from the Office of National Statistics and mental health service data from the Mental Health Minimum Data Set. We plan to expose these data sets using grid-computing techniques. The OGSA-DAI protocol (Chervenak et al., 2003) provides a uniform method of accessing data sources and OGSA-DQP (Alpdemir et al., 2004) provides the capability to perform distributed queries over multiple data sets. A range of Web services will be created to provide functionality for data cleaning and statistical analysis. To orchestrate the Web services developed for epidemiology we will use the myGrid e-science workbench that provides a creation and execution environment.

We will add a clinical trial manager component to the data collection system. This will be implemented as an additional Web service and will provide the ability to randomise treatments according to a user-configurable algorithm.

In the long term we plan to further develop PsyGrid so that it can be used to predict which patients are most at risk and, given their early symptoms from brain imaging and psychiatric assessments combined with genetic data, to target the most effective treatments to prevent adverse outcomes. This will require integration with further data sources collected as part of clinical care and the development of a decision support system for clinicians. The longitudinal data collected during the FEP cohort study will provide the evidence base for making these predictions.
