the data as it is generated before it is stored on disk or archived. Processing of the data as it is being generated can be done on the I/O processors of large supercomputers, or offline on smaller clusters.
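The in-situ style of processing described here can be illustrated with a toy sketch that reduces each chunk of data as it arrives, before anything is archived; the data source and all names are hypothetical:

```python
import math

def data_stream(n_chunks, chunk_size):
    """Hypothetical source: yields chunks of samples as they are generated."""
    for c in range(n_chunks):
        yield [math.sin(0.01 * (c * chunk_size + i)) for i in range(chunk_size)]

# Reduce each chunk as it arrives, before anything is written to disk:
# keep a running count, sum, and sum of squares for mean and variance.
count = total = total_sq = 0.0
for chunk in data_stream(n_chunks=10, chunk_size=1000):
    count += len(chunk)
    total += sum(chunk)
    total_sq += sum(x * x for x in chunk)

mean = total / count
variance = total_sq / count - mean * mean
```

The same accumulate-as-you-go structure applies whether the reduction runs on a supercomputer's I/O processors or offline on a smaller cluster; only the summary statistics, not the raw stream, need be retained.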

Efficient indexing methods for searching and subsetting large datasets are becoming essential in many applications. Emerging indexing methods that take advantage of the fact that scientific data only grows and does not change over time have proven effective in terms of index size and performance. New database management systems designed specifically for scientific data are emerging. These include so-called vertical database systems and systems that support data models for scientific data structures, such as arrays.

Scientific data analysis will continue to be an art of trying various
methods such as filtering, de-noising, dimensionality reduction, and feature
extraction. The challenge is to provide facilities where such iterative explorations can take place with minimal effort from the scientists and with good
response time. The trend toward parallelizing analysis codes and packages
will continue as the volume of data to be analyzed grows. Scientific data visualization is another aspect of data analysis that is indispensable in many
applications. Many of the techniques needed for effective real-time interaction
with the data, such as multi-resolution analysis, require data structures that
can be searched efficiently on parallel machines. Such techniques will continue to evolve, and will likely be most effective when running on facilities close to where the data is stored.

Integration of data from multiple disciplines
is already an extremely important problem. Experience in the Geoscience do-
main has shown that the best chance to succeed is to develop standard data
formats and ontologies that various tools can adhere to. In practice, the adoption of standards is an evolutionary process. Once such standards are
developed, data transformations from legacy formats will continue to be applied for some time.

Streaming data is becoming a highly challenging problem as the speed and quantity of data generated by sensor devices, experiments, and satellites increase. Processing the data streams in parallel is an obvious technique, but generating indexes as the data streams in, as well as computing approximate summaries, are effective techniques as well.

As more and
more data is collected, the metadata associated with it becomes essential.
Many datasets have lost their value over time because the metadata was not adequately collected or was lost. The challenge is to develop systems that automatically collect metadata on the data's structure and content, on the way it was generated, and on its provenance.
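The kind of automatic metadata capture described above can be sketched as follows; the schema fields, file names, and helper names here are purely illustrative, not a standard:

```python
import hashlib
import json
import os
import platform
from datetime import datetime, timezone

def collect_metadata(path, generator, params):
    """Record structural, content, and provenance metadata for a data file.

    `generator` and `params` describe how the file was produced; the field
    names below are illustrative, not an established metadata schema.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "file": os.path.basename(path),
        "size_bytes": os.path.getsize(path),   # structural information
        "sha256": digest,                      # content fingerprint
        "created": datetime.now(timezone.utc).isoformat(),
        "host": platform.node(),               # where the data was generated
        "generator": generator,                # program or instrument used
        "parameters": params,                  # inputs needed to reproduce it
    }

# Example: record provenance alongside the data file itself.
with open("run_042.dat", "wb") as f:
    f.write(b"\x00" * 1024)
meta = collect_metadata("run_042.dat", generator="sim v1.3", params={"dt": 0.01})
with open("run_042.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
```

The point of capturing this at generation time, rather than reconstructing it later, is exactly the one made above: metadata that is not collected when the data is produced is often lost for good.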

Finally, it is evident that workflow management is becoming an essential
part of managing the data generation, data processing, and data analysis
processes. Many tasks need to be performed soon after the raw data is gener-
ated, and workflow systems are needed to perform these in a timely manner.
Workflow systems are extremely useful for repetitive tasks, such as running a simulation repeatedly with various parameters and then generating summaries and graphs while the data is being generated in order to monitor progress.
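A minimal sketch of such a parameter sweep, with an obviously toy `simulate` function standing in for a real simulation code:

```python
import csv
import itertools
import statistics

def simulate(dt, viscosity):
    """Toy stand-in for a real simulation run; returns sampled values."""
    return [dt * viscosity * i for i in range(100)]

# Run the simulation over a grid of parameters and append a one-line
# summary per run, so progress can be monitored while the sweep runs.
with open("summary.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["dt", "viscosity", "mean", "max"])
    for dt, visc in itertools.product([0.01, 0.02], [1.0, 2.0, 5.0]):
        samples = simulate(dt, visc)
        writer.writerow([dt, visc, statistics.mean(samples), max(samples)])
```

A real workflow system adds dependency tracking, failure recovery, and parallel dispatch on top of this loop, but the basic structure (enumerate parameters, run, summarize) is the same.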
There are still many challenges in this relatively new area, including simple