points can save much more power than keeping the disk contents static. Several algorithms for packing files onto disks have been developed, and it has been shown that skewed allocations of files to disks, i.e., allocations that make a few disks very active while the rest remain relatively idle, perform much better than balanced file allocations. It has also been observed that traditional caching policies aimed only at increasing performance are not always optimal in terms of power consumption. While more research is required on power-aware caching policies and prefetching algorithms that can further reduce the power consumption of disk systems, we expect such algorithms to become part of any file system in the future.
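The skewed-allocation idea can be sketched as a greedy first-fit packing that fills one disk as full as possible before touching the next, leaving the remaining disks idle so they can be spun down. The function name, file sizes, and capacity below are illustrative, not taken from any particular published algorithm.

```python
# Sketch of skewed file allocation: concentrate files on as few
# disks as possible (first-fit decreasing), so untouched disks
# stay idle and can be spun down to save power.

def skewed_allocation(file_sizes, disk_capacity):
    """Greedily pack file sizes onto disks; returns one list of
    assigned sizes per active disk."""
    disks = [[]]                 # files placed on each active disk
    free = [disk_capacity]       # remaining space on each disk
    for size in sorted(file_sizes, reverse=True):
        for i, space in enumerate(free):
            if size <= space:    # first disk with enough room
                disks[i].append(size)
                free[i] -= size
                break
        else:                    # no active disk had room: activate one
            disks.append([size])
            free.append(disk_capacity - size)
    return disks

files = [40, 30, 20, 10, 25, 15]   # sizes in GB, hypothetical
packed = skewed_allocation(files, disk_capacity=100)
print(len(packed), "active disks instead of", len(files))
```

A balanced allocation would spread these six files over many disks, keeping all of them spinning; the skewed packing keeps the workload on a small active set, which is what makes spin-down possible.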
Co-Location of Data and Analysis
When the volume of data could fit on a workstation or even a small cluster, the common practice was to copy the data so that it could be analyzed locally. As the volume of data grows, it is becoming prohibitively expensive to copy hundreds of terabytes or petabytes to scientists' sites. Another consideration is the need to manage the analysis software. As the volume of data grows, many analysis packages do not scale, and a new generation of parallel analysis tools is being developed. The cost of purchasing and managing cluster machines and the parallel analysis software is moving out of reach for many scientists.
The obvious alternative is a “data-side analysis facility”: the analysis is performed on a shared facility that is co-located with the data. Such a facility could be built on a medium-size cluster (perhaps a few thousand cores) and include various parallel analysis software packages. The size of the facility will depend on the number of users it is expected to serve. However, in addition to managing the users' resource consumption, the facility needs to include a way for scientists to express the analysis process. As explained in
Chapter 8, the analysis process can be viewed as a pipeline, consisting of steps
such as feature extraction, dimensionality reduction, and pattern recognition.
Similarly, a visual analysis pipeline, as discussed in Chapter 9, can consist of
steps such as filtering, mapping, and rendering. More generally, this process
can be an acyclic workflow where flows can branch and join together. Thus,
workflow systems need to be deployed on such facilities to facilitate data-side
analysis. Further, metadata and provenance capabilities need to be supported.
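The acyclic-workflow view described above can be made concrete with a small dependency graph. The step names follow the pipelines mentioned in the text; the graph structure and the provenance record are a minimal illustrative sketch, not the interface of any particular workflow system.

```python
# Sketch of a data-side analysis workflow as an acyclic graph.
# Each step lists its predecessors; flows can branch and join.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

workflow = {
    "feature_extraction":       [],
    "dimensionality_reduction": ["feature_extraction"],
    "pattern_recognition":      ["dimensionality_reduction"],
    "rendering":                ["feature_extraction"],  # branch
}

# A valid execution order: every step runs after its predecessors.
order = list(TopologicalSorter(workflow).static_order())

# Toy provenance record: which inputs each executed step consumed.
provenance = [{"step": s, "inputs": workflow[s]} for s in order]
print(order)
```

A real workflow system would attach executables, input files, and outputs to each node, but the core scheduling question, running steps only after their predecessors, reduces to exactly this topological ordering.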
There are already some limited examples where analysis scripts are sent to facilities near the data for execution. However, a general-purpose facility needs to provide a way for the user to specify the analysis components to be used by the workflow, the inputs needed for each step, and the outputs produced. This facility should insulate the user from unnecessary details of where