provides the mathematical foundation for establishing a hierarchical database
for crystallography. While this is not possible for all types of databases
in materials science, data mining and informatics provide a framework for
identifying, searching, and organizing descriptors that form the foundation
of databases; and hence provide the key for data analysis of databases in
materials science.
8.6 Parallel R for High-Performance Analytics:
Applications to Biology
This last section focuses on a different aspect of data analysis, namely, the role
of parallel processing in the analysis of massive amounts of data. We describe
how a commonly used statistical analysis package can be enhanced to apply
it to large-data problems in biology.
R [88] is an open-source software platform for statistical computing and
graphics. It is broadly used by the statistics, bioinformatics, engineering, and
other communities. R supports diverse statistical analysis tasks such as linear
regression, classic statistical tests, time-series analysis, and clustering. It also
provides a variety of graphical functions such as histograms, pie charts, and
3D surface plots. More importantly, R provides easy-to-use hooks for adding
extension packages by external developers. The major drawback of R is the
lack of scalability to massive datasets, which are quite common in scientific
domains. For instance, a typical output from mass spectrometry proteomics
measurements for a single bacterial genome easily reaches gigabytes, while
the output from a climate simulation reaches terabytes. The most straightforward
approach to address this challenge is to equip R with high-performance,
scalable, parallel processing capabilities.
Any solution for parallelizing R should meet several requirements if it is
to be practical and easily adopted by a broad community of users. First, it
should ideally attain performance comparable to that of parallel solutions
for compiled languages, like C or Fortran. While R is a scripting language, it
is written on top of C and provides mechanisms for calling functions written
in such languages. Second, the interface for calling parallel analysis routines
should mimic the original R interface, and ideally, should not require users to
change their R code to run in a parallel mode. Third, it should carry enough
intelligence to detect on the end-user's behalf which parts of the code are
parallelizable and which parts are not. Finally, it should require little or no
knowledge of parallel computing from the end user.
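The second and fourth requirements can be illustrated with a minimal sketch.
It is written in Python rather than R purely for illustration, and it is not
taken from any of the systems discussed in this section. The names serial_mean,
parallel_mean, and workers are hypothetical. Threads stand in for real compute
nodes; a production system would distribute the chunks across processes or
machines. The point is only that the parallel routine accepts the same call as
the serial one, so existing user code does not need to change.

```python
from concurrent.futures import ThreadPoolExecutor


def serial_mean(values):
    """Reference serial routine: arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def parallel_mean(values, workers=4):
    """Hypothetical drop-in replacement with the same call shape as serial_mean.

    `workers` is optional, so existing calls of the form f(values) keep working.
    """
    values = list(values)
    if not values:
        raise ValueError("mean requires at least one value")
    # Split the data into at most `workers` contiguous chunks.
    chunk = -(-len(values) // workers)  # ceiling division
    chunks = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # Each worker reduces its own chunk (threads are an assumption made
    # for this sketch, standing in for separate processes or nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    # Combine step: merge the partial results into the final answer.
    return sum(partial_sums) / len(values)


data = [float(i) for i in range(1, 1001)]
print(serial_mean(data), parallel_mean(data))
```

A caller who wrote serial_mean(data) only has to swap the function name, and
the chunking, dispatch, and combine steps stay hidden behind the interface.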
Two major approaches to enable parallel processing with R have been
introduced (Figure 8.8). The first approach offers message-passing capabilities