the data as it is generated before it is stored on disk or archived. Processing of the data as it is being generated can be done on the I/O processors of large supercomputers, or offline on smaller clusters.
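The in-situ style of processing described here can be illustrated with a toy sketch that reduces each chunk of data as it arrives, before anything is archived; the data source and all names are hypothetical:

```python
import math

def data_stream(n_chunks, chunk_size):
    """Hypothetical source: yields chunks of samples as they are generated."""
    for c in range(n_chunks):
        yield [math.sin(0.01 * (c * chunk_size + i)) for i in range(chunk_size)]

# Reduce each chunk as it arrives, before anything is written to disk:
# keep a running count, sum, and sum of squares for mean and variance.
count = total = total_sq = 0.0
for chunk in data_stream(n_chunks=10, chunk_size=1000):
    count += len(chunk)
    total += sum(chunk)
    total_sq += sum(x * x for x in chunk)

mean = total / count
variance = total_sq / count - mean * mean
```

The same accumulate-as-you-go structure applies whether the reduction runs on a supercomputer's I/O processors or offline on a smaller cluster; only the summary statistics, not the raw stream, need be retained.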

Efficient indexing methods for searching and subsetting large datasets are becoming essential in many applications. Emerging indexing methods that take advantage of the fact that scientific data only grows and does not change over time have proven effective in terms of index size and performance. New database management systems designed specifically for scientific data are emerging. These include so-called vertical database systems and systems that support data models for scientific data structures, such as arrays.

Scientific data analysis will continue to be an art of trying various
methods such as filtering, de-noising, dimensionality reduction, and feature
extraction. The challenge is to provide facilities where such iterative explorations can take place with minimal effort from the scientists and with good
response time. The trend toward parallelizing analysis codes and packages
will continue as the volume of data to be analyzed grows. Scientific data visualization is another aspect of data analysis that is indispensable in many
applications. Many of the techniques needed for effective real-time interaction
with the data, such as multi-resolution analysis, require data structures that
can be searched efficiently on parallel machines. Such techniques will continue to evolve, and will likely be most effective when running on facilities close to where the data is stored.

Integration of data from multiple disciplines
is already an extremely important problem. Experience in the Geoscience do-
main has shown that the best chance to succeed is to develop standard data
formats and ontologies that various tools can adhere to. In practice, the adoption of standards is an evolutionary process. Once such standards are
developed, data transformations from legacy formats will continue to be applied for some time.

Streaming data is becoming a highly challenging problem as the speed and quantity of data generated by sensor devices, experiments, and satellites increase. Processing the data streams in parallel is an obvious technique, but generating indexes as the data streams in, as well as computing approximate summaries, are effective techniques as well.

As more and
more data is collected, the metadata associated with it becomes essential.
Many datasets have lost their value over time because the metadata was not adequately collected or was lost. The challenge is to develop systems that automatically collect metadata on the data's structure and content, on the way it was generated, and on its provenance.
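The kind of automatic metadata capture described above can be sketched as follows; the schema fields, file names, and helper names here are purely illustrative, not a standard:

```python
import hashlib
import json
import os
import platform
from datetime import datetime, timezone

def collect_metadata(path, generator, params):
    """Record structural, content, and provenance metadata for a data file.

    `generator` and `params` describe how the file was produced; the field
    names below are illustrative, not an established metadata schema.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),   # structural information
        "sha256": digest,                      # content fingerprint
        "created": datetime.now(timezone.utc).isoformat(),
        "host": platform.node(),               # where the data was generated
        "generator": generator,                # program or instrument used
        "parameters": params,                  # inputs needed to reproduce it
    }

# Example: record provenance alongside the data file itself.
with open("run_042.dat", "wb") as f:
    f.write(b"\x00" * 1024)
meta = collect_metadata("run_042.dat", generator="sim v1.3", params={"dt": 0.01})
with open("run_042.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```

The point of capturing this at generation time, rather than reconstructing it later, is exactly the one made above: metadata that is not collected when the data is produced is often lost for good.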

Finally, it is evident that workflow management is becoming an essential
part of managing the data generation, data processing, and data analysis
processes. Many tasks need to be performed soon after the raw data is gener-
ated, and workflow systems are needed to perform these in a timely manner.
Workflow systems are extremely useful for repetitive tasks, such as running a simulation repeatedly with various parameters and then generating summaries and graphs while the data is being generated in order to monitor progress.
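A minimal sketch of such a parameter sweep, with an obviously toy `simulate` function standing in for a real simulation code:

```python
import csv
import itertools
import statistics

def simulate(dt, viscosity):
    """Toy stand-in for a real simulation run; returns sampled values."""
    return [dt * viscosity * i for i in range(100)]

# Run the simulation over a grid of parameters and append a one-line
# summary per run, so progress can be monitored while the sweep runs.
with open("summary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["dt", "viscosity", "mean", "max"])
    for dt, visc in itertools.product([0.01, 0.02], [1.0, 2.0, 5.0]):
        samples = simulate(dt, visc)
        writer.writerow([dt, visc, statistics.mean(samples), max(samples)])
```

A real workflow system adds dependency tracking, failure recovery, and parallel dispatch on top of this loop, but the basic structure (enumerate parameters, run, summarize) is the same.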
There are still many challenges in this relatively new area, including simple