points can save much more power than keeping the disk contents static. Several algorithms for packing files onto disks have been developed, and it has been shown that skewed allocations of files to disks, i.e., allocations that make a few disks very active while the rest remain relatively idle, perform much better than balanced file allocations. It has also been observed that traditional caching policies aimed only at increasing performance are not always optimal in terms of power consumption. While more research is required on power-aware caching policies and prefetching algorithms that can further reduce the power consumption of disk systems, we expect such algorithms to become part of any file system in the future.
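The skewed-allocation idea can be sketched as a greedy first-fit packing that fills one disk as full as possible before touching the next, leaving the remaining disks idle so they can be spun down. The function name, file sizes, and capacity below are illustrative, not taken from any particular published algorithm.

```python
# Sketch of skewed file allocation: concentrate files on as few
# disks as possible (first-fit decreasing), so untouched disks
# stay idle and can be spun down to save power.

def skewed_allocation(file_sizes, disk_capacity):
    """Greedily pack file sizes onto disks; returns one list of
    assigned sizes per active disk."""
    disks = [[]]                 # files placed on each active disk
    free = [disk_capacity]       # remaining space on each disk
    for size in sorted(file_sizes, reverse=True):
        for i, space in enumerate(free):
            if size <= space:    # first disk with enough room
                disks[i].append(size)
                free[i] -= size
                break
        else:                    # no active disk had room: activate one
            disks.append([size])
            free.append(disk_capacity - size)
    return disks

files = [40, 30, 20, 10, 25, 15]   # sizes in GB, hypothetical
packed = skewed_allocation(files, disk_capacity=100)
print(len(packed), "active disks instead of", len(files))
```

A balanced allocation would spread these six files over many disks, keeping all of them spinning; the skewed packing keeps the workload on a small active set, which is what makes spin-down possible.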
Co-Location of Data and Analysis
When the volume of data could fit on a workstation or even a small cluster, the common practice was to copy the data so that it could be analyzed locally. As the volume of data grows, it is becoming prohibitively expensive to copy hundreds of terabytes or petabytes to scientists' sites. Another consideration is the need to manage the analysis software. As the volume of data grows, many analysis packages do not scale, and a new generation of parallel analysis tools is being developed. The cost of purchasing and managing cluster machines and the parallel analysis software is moving out of reach for many scientists.
The obvious alternative is a “data-side analysis facility”: the analysis is performed on a shared facility that is co-located with the data. Such a facility could be built on a medium-size cluster (perhaps a few thousand cores) and include various parallel analysis software packages. The size of the facility will depend on the number of users it is expected to serve. However, in addition to managing the users' resource consumption, the facility needs to include a way for scientists to express the analysis process. As explained in
Chapter 8, the analysis process can be viewed as a pipeline, consisting of steps
such as feature extraction, dimensionality reduction, and pattern recognition.
Similarly, a visual analysis pipeline, as discussed in Chapter 9, can consist of
steps such as filtering, mapping, and rendering. More generally, this process
can be an acyclic workflow where flows can branch and join together. Thus,
workflow systems need to be deployed on such facilities to facilitate data-side
analysis. Further, metadata and provenance capabilities need to be supported.
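The acyclic-workflow view described above can be made concrete with a small dependency graph. The step names follow the pipelines mentioned in the text; the graph structure and the provenance record are a minimal illustrative sketch, not the interface of any particular workflow system.

```python
# Sketch of a data-side analysis workflow as an acyclic graph.
# Each step lists its predecessors; flows can branch and join.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

workflow = {
    "feature_extraction":       [],
    "dimensionality_reduction": ["feature_extraction"],
    "pattern_recognition":      ["dimensionality_reduction"],
    "rendering":                ["feature_extraction"],  # branch
}

# A valid execution order: every step runs after its predecessors.
order = list(TopologicalSorter(workflow).static_order())

# Toy provenance record: which inputs each executed step consumed.
provenance = [{"step": s, "inputs": workflow[s]} for s in order]
print(order)
```

A real workflow system would attach executables, input files, and outputs to each node, but the core scheduling question, running steps only after their predecessors, reduces to exactly this topological ordering.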
There are already some limited examples where analysis scripts are sent to facilities near the data for execution. However, a general-purpose facility needs to provide a way for the user to specify the analysis components to be used by the workflow, the inputs needed for each step, and the outputs produced. This facility should insulate the user from unnecessary details of where