Cluster architecture (Bioinformatics)

1. Clusters

Cluster computing allows mass-market PC and server systems to be networked together to form an extremely cost-effective system capable of handling supercomputer-scale workloads. Cluster sizes can range from two machines up through many thousands of interconnected systems.

Biologists have found that clusters can be used both to expand the scale of existing informatics research efforts and to investigate research areas previously written off as prohibitively expensive or computationally infeasible.

The use of the term “cluster” in this article refers to systems operating on the same network or within the same cabinet, datacenter, building, or campus. This differs from “grid computing”, a term typically associated with the use of many clusters or diverse distributed systems linked together via the Internet or other wide-area networking (WAN) technologies. Clusters typically have a single administrative domain, whereas grids can be composed of geographically separated systems and services, each with its own administrative domain and access policies. The term “Beowulf cluster” typically refers to systems purpose-built for parallel computation.

2. Life science cluster characteristics

For maximum utility, flexibility, and capability, scientific and research goals should be the primary drivers of cluster architecture decisions. To do otherwise risks unintended consequences that limit how the cluster can be used as a research or data processing tool.

The performance characteristics and runtime requirements of the intended scientific application mix play a major role in hardware selection and overall system design. Researchers with a significant need for bioinformatics sequence analysis often find that many of their applications are performance-bound by the amount of physical memory (RAM) in a machine and the speed of the underlying storage and I/O subsystems. Users running large parallel applications will find that the bandwidth and latency characteristics of the cluster network are often the most important factor in optimizing performance and throughput. Some applications, including chemistry and molecular modeling codes, can be CPU-bound and run best on systems with very fast CPUs and high data transfer rates between processor, onboard cache, and external memory.

Understanding the performance-affecting requirements of the scientific application mix is essential when planning new clusters or even upgrading existing systems. Wherever possible, benchmarks reflecting real-world usage and workflows should be performed.
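One lightweight way to act on this advice is to time a representative job on candidate hardware before committing to a purchase or upgrade. The sketch below is a minimal Python timing harness; the example `blastall` invocation in the comment is a hypothetical placeholder for whatever real-world workload the cluster will actually run.

```python
import statistics
import subprocess
import time

def benchmark(command, runs=3):
    """Run a representative workload several times and return the
    median wall-clock time in seconds.

    `command` is a list of program arguments.  Using the median of
    several runs damps warm-up and caching effects.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, check=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Hypothetical usage -- substitute a real analysis pipeline:
# benchmark(["blastall", "-p", "blastn", "-d", "nt", "-i", "query.fa"])
```

Comparing medians of the same real workload across candidate node configurations is far more informative than synthetic CPU or disk benchmarks.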

3. Serial or “batch computing” versus parallel computing

Unfortunately, many cluster references, resources, and cluster “kits” are biased toward a style of parallel cluster computing that is not commonly used in many life science settings. In parallel computing, a single application is designed to run across many systems simultaneously. The most commonly seen parallel applications are based on the PVM (Parallel Virtual Machine) or MPI (Message Passing Interface) standards.

The use of PVM- or MPI-aware parallel applications tends to be rare in the life sciences. The exception tends to be in the areas of molecular modeling and computational chemistry, where there exists a significant body of parallel software available and in use.

A far more common requirement is the need to repeatedly run large numbers of traditional nonparallel scientific applications or algorithms. Each application instance becomes a stand-alone job that can be efficiently scheduled and independently distributed across a cluster.

Large computational biology problems such as bioinformatics sequence analysis fit nicely into this paradigm – every large analysis task is capable of being broken down into individual pieces that can be executed in any order, independent of any other segment. This approach is known as “serial” or “batch” computing. Problems that can be broken up for serial or batch distribution across a cluster are also referred to as “embarrassingly parallel” problems.
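The independence of the individual pieces is what makes this paradigm so convenient: each unit of work can run on any node, in any order, with no communication between them. The toy Python sketch below illustrates the idea on a single machine, using GC-content calculation (a fraction-of-G/C-bases measure invented here purely for illustration) as the stand-alone unit of work; on a real cluster each call would instead be a separate scheduled job.

```python
from multiprocessing import Pool

def gc_content(sequence):
    """One stand-alone unit of work: fraction of G/C bases in a sequence."""
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def analyse(sequences, workers=4):
    """Each sequence is an independent job; because no piece depends on
    any other, the results are correct regardless of execution order."""
    with Pool(workers) as pool:
        return pool.map(gc_content, sequences)

if __name__ == "__main__":
    print(analyse(["ATGCGC", "ATATAT", "GGGGCC"]))
```

Swapping `Pool.map` for a DRM submission command is conceptually all that is needed to scale the same decomposition from one machine to hundreds of cluster nodes.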

The workflow bias toward serial computing rather than parallel computing is one of the main distinguishing characteristics of life science clusters.

4. Cluster topology

Clusters tend to use variations on a “portal” architecture in which all of the cluster compute nodes are kept isolated on a private network (Figure 1). The cluster is managed and used via a portal machine attached to both the public organizational network and the private cluster network. Additional servers, storage devices, database servers, and management servers are also “multi-homed” to both networks as needed.

A schematic representation of the portal-style cluster architecture is shown in Figure 2.

Advantages of the portal architecture approach include:

• Easier management and administration. Cluster operators are free to control, customize, and modify essential network services such as DHCP, TFTP, PXE, LDAP, NIS, and so on, without affecting the organizational network.

Figure 1 A small bioinformatics research cluster using Apple G4 and Intel Xeon-based server systems

Figure 2 Logical view – portal architecture

• Security and abstraction of computing resources. The architecture prevents large numbers of compute nodes from being directly accessible to the public network. Cluster users are encouraged to think of the cluster nodes as anonymous and interchangeable.

In instances in which jobs running on the cluster may need to communicate with systems or services outside of the cluster (database or LIMS systems, etc.), it is a simple matter to set up NAT (network address translation) or proxy services.

5. Network and interconnects

General-purpose clusters use standard switched Ethernet networking components as the primary method of interconnecting cluster nodes. Some cluster operators find that running a second private “management” network alongside the primary network has administrative advantages. The cost of copper-based Gigabit Ethernet networking products has plummeted to the point where it has become the default choice for clusters of all sizes.

Multiple Gigabit Ethernet links can be “trunked” or bonded together to achieve higher performance. In many cases, the cluster services that best benefit from added bandwidth and network performance capability are cluster file-servers and other data staging or storage systems.

Ethernet networking may not be suitable for all network-dependent tasks and use-cases. In particular, some parallel PVM or MPI applications may be performance limited when run over an Ethernet network due to the relatively high latency between transported packets. Some parallel or global distributed file-system technologies also prefer or may even require the use of a special high-speed, low-latency interconnect.

There are a number of available products and technologies aimed at providing clusters with a higher-performance interconnect. Examples include Myrinet and Infiniband. These can be deployed cluster-wide to complement an existing Ethernet network or deployed to a limited subset of cluster nodes to support parallel applications.

Given the relative lack of latency-sensitive, massively parallel scientific software in the life sciences, the use of interconnect technologies other than Ethernet is quite rare.

6. Distributed resource management

An essential component, especially for systems supporting multiple groups or research efforts, is the software layer that handles resource allocation and all aspects of job scheduling and execution across many machines. Generally known as “distributed resource management” (DRM) systems, these software products are critical to successful cluster operation. The most commonly seen DRM products in life science settings are Platform LSF and Sun Grid Engine; other DRM suites include the Portable Batch System (PBS) and Condor. Proper selection and configuration of the DRM software layer is extremely important, as the DRM layer is the “glue” that ties the cluster together.
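To make the DRM's role concrete, the sketch below generates a Grid Engine-style "array job" script, the standard way to hand a batch of identical, independent tasks to the scheduler in one submission. The `#$ -t` directive and `$SGE_TASK_ID` variable are real Sun Grid Engine conventions; the input filenames and the `blastall` command template are hypothetical placeholders.

```python
def make_job_script(inputs, command_template):
    """Render a Grid Engine-style array job script: one task per input.

    The scheduler runs the same script once per task, with the
    SGE_TASK_ID environment variable selecting which input that
    task processes.  Filenames and the command are illustrative.
    """
    lines = [
        "#!/bin/sh",
        "#$ -cwd",                 # run each task from the submission directory
        f"#$ -t 1-{len(inputs)}",  # array job: task IDs 1..N
        "case $SGE_TASK_ID in",
    ]
    for i, path in enumerate(inputs, start=1):
        lines.append(f"  {i}) INPUT={path} ;;")
    lines += ["esac", command_template]
    return "\n".join(lines) + "\n"

script = make_job_script(
    ["chunk_01.fa", "chunk_02.fa", "chunk_03.fa"],
    "blastall -i $INPUT -o $INPUT.out",  # hypothetical analysis command
)
```

Submitting the rendered script once (e.g. with `qsub`) lets the DRM queue, dispatch, and retry all of the tasks, which is exactly the bookkeeping a cluster operator does not want to do by hand.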

7. Storage

The most commonly encountered performance-limiting bottleneck in life science clusters is the speed of both local and network-resident storage systems. Network attached storage (NAS) devices providing NFS-based file systems are often used as a way of making vast amounts of raw research data available for analysis within the cluster. These devices (or the network itself) can quickly be saturated on even moderately busy clusters.

The cost of acquiring a few terabytes of raw NAS capacity can differ by $50,000 USD or more between competing storage products. Fortunately, this wide price range reflects a healthy ecosystem of differentiated storage products offering different levels of performance, resiliency, cost, and capability.

A popular method of increasing local disk performance in life science clusters involves populating compute nodes with multiple large but inexpensive ATA drives that are mirrored or striped together via software RAID. In addition to vastly increased local I/O performance, these disks can also be used to cache popular databases or files from the central file-server. Very significant amounts of cluster network traffic and file-server load can be eliminated simply by staging data to the local compute nodes prior to launching a large analytical job.
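The staging step described above is easy to automate. The sketch below is a minimal, assumption-laden example: it copies a file from shared storage to local scratch disk only when no identical local copy exists, comparing MD5 digests (reading whole files into memory, so suitable only for modestly sized files; the paths are illustrative).

```python
import hashlib
import shutil
from pathlib import Path

def stage(shared_path, scratch_dir):
    """Copy a data file from the central file-server to fast local disk,
    skipping the copy when an identical cached copy is already present.

    Paths are illustrative; identity is checked with an MD5 digest,
    which reads the whole file -- fine for a sketch, not for huge DBs.
    """
    src = Path(shared_path)
    dst = Path(scratch_dir) / src.name

    def digest(p):
        return hashlib.md5(p.read_bytes()).hexdigest()

    if dst.exists() and digest(dst) == digest(src):
        return dst          # cache hit: no file-server traffic at all
    shutil.copy2(src, dst)  # cache miss: stage once, reuse for many jobs
    return dst
```

Calling such a routine at the start of each job means a popular database crosses the network once per node rather than once per job, which is precisely where the file-server load savings come from.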

8. Reducing administrative burden

Clusters of loosely interconnected server systems can represent a significant operational challenge. Several inexpensive hardware or software-based methodologies can greatly reduce the amount of effort needed to maintain cluster systems. Approaches include various techniques for performing unattended operating system installations (or reinstallation), remote power control products, and serial console access concentrators.
