Scientific Data Analysis - Scientific Data Management

provides the mathematical foundation for establishing a hierarchical database

for crystallography. While this is not possible for all types of databases

in materials science, data mining and informatics provide a framework for

identifying, searching, and organizing descriptors that form the foundation

of databases, and hence provide the key for data analysis of databases in

materials science.

8.6 Parallel R for High-Performance Analytics: Applications to Biology

This last section focuses on a different aspect of data analysis, namely, the role

of parallel processing in the analysis of massive amounts of data. We describe

how a commonly used statistical analysis package can be enhanced to handle

large-data problems in biology.

R [88] is an open-source software platform for statistical computing and

graphics. It is broadly used by the statistics, bioinformatics, engineering, and other

communities. R supports diverse statistical analysis tasks such as linear

regression, classic statistical tests, time-series analysis, and clustering. It also

provides a variety of graphical functions such as histograms, pie charts, and

3D surface plots. More importantly, R provides easy-to-use hooks that let

external developers add extension packages. The major drawback of R is its

lack of scalability to massive datasets, which are quite common in scientific

domains. For instance, a typical output from mass spectrometry proteomics

measurements for a single bacterial genome easily reaches gigabytes, while

the output from a climate simulation reaches terabytes. The most

straightforward approach to address this challenge is to equip R with high-performance,

scalable, parallel processing capabilities.

There are a number of requirements that any solution to the parallelization

of R should meet to be practical and to be easily adopted by a

broad community of users. First, it should ideally attain performance

comparable to that of parallel solutions for compiled languages

such as C or Fortran. While R is a scripting language, it is written on top of C and

provides mechanisms for calling functions written in such languages. Second,

the interface for calling parallel analysis routines should mimic the original R

interface and, ideally, should not require users to change their R code to run

in parallel mode. Third, it should carry enough intelligence to detect, on the

end user's behalf, which parts of the code are parallelizable and which parts

are not. Finally, it should require little or no knowledge of

parallel computing from the end user.
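
The second and fourth requirements (a parallel interface that mimics the serial one and hides the parallel machinery) can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration and does not represent any of the parallel R systems discussed in this section. The names serial_apply, parallel_apply, and column_mean are hypothetical, and threads stand in for whatever workers a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_apply(func, items):
    """Apply func to every item, one after another (the 'original' interface)."""
    return [func(x) for x in items]


def parallel_apply(func, items, max_workers=4):
    """Hypothetical drop-in counterpart: same required arguments, same result order.

    Threads keep the sketch portable; CPU-bound work in CPython would
    normally need worker processes to see a real speedup.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, like the serial version.
        return list(pool.map(func, items))


def column_mean(column):
    """Toy analysis routine standing in for a statistical computation."""
    return sum(column) / len(column)


# The caller changes only the function name; the arguments stay the same.
columns = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
serial_result = serial_apply(column_mean, columns)
parallel_result = parallel_apply(column_mean, columns)
```

Because both functions take the same required arguments and return results in the same order, switching a call site from serial_apply to parallel_apply requires no other change, which is the property the second requirement asks for.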

Two major approaches to enable parallel processing with R have been

introduced (Figure 8.8). The first approach offers message-passing capabilities
