from diverse sources without having a shared standard for the data collection
(Hall et al. 2005). For example, screening and identifying complex chemical
mixtures in the natural environment are difficult because there are so many possible
mixtures and the mixtures change temporally and spatially (Casey et al. 2004).
A second example involves conducting gene-screening analysis to differentiate
among tens of thousands of genes or single-nucleotide polymorphisms along a
hypothesized disease pathway with only a small number of subjects. Overzeal-
ous findings of a positive association are a consequence of this high-dimensional
problem (Rajaraman and Ullman 2011). Mining that type of data could pose
serious challenges to validity and utility when the data come from across geo-
graphic and disciplinary boundaries and have heterogeneous quality standards.
A special danger with huge datasets is the problem of multiple comparisons,
which can lead to massive false-positive results.
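To make that danger concrete, the short simulation below (an illustrative sketch, not an analysis from the studies cited above) screens thousands of pure-noise features against an outcome for a small number of subjects, much like the gene-screening example: hundreds of tests come out "significant" at p < 0.05 even though no true association exists, while a Bonferroni correction removes essentially all of them. The sample sizes and thresholds are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_subjects = 50       # small study, as in the gene-screening example
n_features = 10_000   # many candidate markers, none truly associated

outcome = rng.normal(size=n_subjects)
features = rng.normal(size=(n_features, n_subjects))  # pure noise

# Test every feature against the outcome and collect p-values.
p_values = np.array([stats.pearsonr(f, outcome)[1] for f in features])

naive_hits = int((p_values < 0.05).sum())
bonferroni_hits = int((p_values < 0.05 / n_features).sum())

print(f"'significant' at p < 0.05:   {naive_hits}")       # roughly 500
print(f"after Bonferroni correction: {bonferroni_hits}")  # essentially 0
```

Less conservative false-discovery-rate procedures, such as Benjamini-Hochberg, are a common middle ground when some true signals are expected.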
Also with such data, there is sometimes a dominance of bias over randomness:
increasing the amount of data generally reduces variances, sometimes close to
zero, but it does not reduce bias. In fact, it may even increase bias by diverting
attention from the basic quality of the data.
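The variance-versus-bias point can be seen in a few lines of code. In the invented measurement model below, every observation carries the same fixed systematic error; averaging more observations makes the estimator's spread collapse toward zero, yet its error stays stuck at the bias. The numbers are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

true_value = 10.0
bias = 0.5      # fixed systematic error in every measurement
noise_sd = 2.0  # random measurement error

for n in (10, 1_000, 100_000):
    # Repeat the whole study 200 times to see the estimator's spread.
    estimates = [
        (true_value + bias + noise_sd * rng.standard_normal(n)).mean()
        for _ in range(200)
    ]
    print(f"n={n:>7}: sd of estimate = {np.std(estimates):.4f}, "
          f"mean error = {np.mean(estimates) - true_value:+.3f}")

# The standard deviation shrinks as n grows, but the error stays near +0.5.
```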
Another challenge involves the modeling of complex biologic systems (such as
pathway models, physiologically based pharmacokinetic and pharmacodynamic
models, and hospital admission data). Information from a
small number of static datasets is insufficient to support a large number of un-
known model parameters. Two approaches are widely used: fixing some parameters
at values that have only weak support from external systems (Wang et al. 1997)
and tightening the range of variation of the parameter values by imposing
probabilistic distributions in a Bayesian approach, such as Markov chain Monte
Carlo estimation (Bois 2000).
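As an illustration of the Bayesian route, the sketch below fits a single uncertain parameter, an invented first-order elimination rate, with a minimal random-walk Metropolis sampler; a lognormal prior keeps the parameter in a plausible range. The model, prior, and data are assumptions made for this example, not methods or results from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented data: concentrations from a one-compartment decay model.
times = np.array([1.0, 2.0, 4.0, 8.0])
obs = np.array([7.9, 6.3, 4.1, 1.6])
c0, sigma = 10.0, 0.5  # assumed known initial dose and measurement error

def log_posterior(k):
    if k <= 0:
        return -np.inf
    pred = c0 * np.exp(-k * times)                        # model prediction
    log_lik = -0.5 * np.sum(((obs - pred) / sigma) ** 2)  # Gaussian errors
    # Lognormal prior tightens the elimination rate to a plausible range.
    log_prior = -np.log(k) - 0.5 * ((np.log(k) - np.log(0.2)) / 0.5) ** 2
    return log_lik + log_prior

# Random-walk Metropolis sampler.
k, lp, samples = 0.2, log_posterior(0.2), []
for _ in range(20_000):
    k_new = k + 0.02 * rng.standard_normal()
    lp_new = log_posterior(k_new)
    if np.log(rng.random()) < lp_new - lp:  # accept or reject the proposal
        k, lp = k_new, lp_new
    samples.append(k)

post = np.array(samples[5_000:])            # discard burn-in
print(f"posterior mean k = {post.mean():.3f}, 95% interval = "
      f"({np.quantile(post, 0.025):.3f}, {np.quantile(post, 0.975):.3f})")
```

Note that the tight-looking posterior interval is conditional on the assumed model; if the model itself is wrong, the interval can understate the real uncertainty.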
Those methods may give the user an unwarranted sense of truth when there are
substantial uncertainties in the true model.
As informatics and data mining become standard, techniques for data analysis
will be increasingly hybrid, combining mathematical, computational, graphical,
and statistical tools and qualitative methods to conduct data exploration, ma-
chine learning, modeling, and decision-making. Developing in-house capability
will help EPA to adopt and apply the new techniques.
Data Sharing and Distribution
EPA devotes substantial resources to the public sharing of data resources.
It also provides support and encouragement to software and application (app)
developers for the creation of both institutional and consumer applications for
accessing, presenting, and analyzing available environmental data. One example
is the Toxics Release Inventory. Others being developed are the EPA Saves
Your Skin mobile telephone app, which provides ZIP code-based ultraviolet
index information to help the public take action to protect their skin, and an
air-quality index mobile app, which feeds air-quality information based on ZIP
code.
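At bottom, the ZIP code lookup such an app performs is a parameterized web request. The sketch below shows the general pattern in Python using only the standard library; the endpoint URL, parameter names, and response format are placeholders invented for illustration, not EPA's actual service interface.

```python
import json
import urllib.parse
import urllib.request

# Placeholder endpoint; a real client would use the service's documented
# API, parameter names, and key scheme.
BASE_URL = "https://api.example.gov/airquality/current"

def current_air_quality(zip_code: str) -> dict:
    """Fetch current air-quality data for a ZIP code (hypothetical API)."""
    query = urllib.parse.urlencode({"zipCode": zip_code, "format": "json"})
    with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    print(current_air_quality("20460"))  # ZIP code of EPA headquarters
```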
The agency has made strides in analytic and simulation activities, as shown in
the leadership role that it has played in computational toxicology (see