them separately. This provides a means to allow for column-oriented data
access.
ROOT provides an extensive set of data analysis frameworks, which makes
analyses of high-energy physics data convenient and interactive. Its interpreted
C++ environment also makes it possible to write the arbitrarily complex analyses that
some users desire. Since each ROOT file can be processed independently, the
ROOT system also offers huge potential for parallel processing on a cluster of
commodity computers. This is a nice feature that enabled the cash-strapped
physicists to effectively process petabytes of data before anyone else could. The
ROOT system is now being extensively used by many scientific applications,
and has even gained some fans in the commercial world. More information
about ROOT can be found at http://root.cern.ch/.
6.7.1.3 MapReduce Parallel Analysis System
The MapReduce parallel data analysis model has gained considerable attention
recently [27, 28, 50]. Under this model, a user only needs to write a map function
and a reduce function in order to make use of a large cluster of computers.
This ease of use is particularly attractive because many other parallel data
analysis systems require much more programming effort. This approach has
been demonstrated to be effective in a number of commercial settings.
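
To make the division of labor concrete, the following is a minimal sketch of a user-supplied map function and reduce function for counting words. The names map_fn, reduce_fn, and run_mapreduce are illustrative assumptions, and run_mapreduce is only a single-process stand-in for a real run-time system, not the interface of any particular MapReduce implementation.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit (word, 1) for every word in one line of text."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: sum all counts that were emitted for one word."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy, single-process stand-in for the run-time system (assumption)."""
    groups = defaultdict(list)
    # Map phase: pass every input record to the user's map function.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: call the user's reduce function once per distinct key.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


# Input records are (line number, line of text) pairs.
records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

Only map_fn and reduce_fn are the user's responsibility in this sketch; everything inside run_mapreduce is what a run-time system would take over on a cluster.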
There are a number of different implementations of the MapReduce system
following the same design principle. In particular, there is an open-source
implementation from the Apache Hadoop project that is available for anyone
to use. To use this system, one needs to place the data on a parallel file
system supported by the MapReduce run-time system. The run-time system
distributes the work across the processors, selects the
appropriate data files for each processor, and passes the data records from the
files to the map and reduce functions. The run-time system also coordinates
the parallel tasks, collects the final results, and recovers
from any errors.
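
The responsibilities of the run-time system described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how Hadoop or any other implementation is built: threads stand in for cluster processors, in-memory lists stand in for data files on a parallel file system, and the names run_job and run_with_retry are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, arg, max_attempts=3):
    """Re-run a failed task, as a run-time system might after a worker error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(arg)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt


def run_job(splits, map_fn, reduce_fn, workers=4):
    """Distribute map and reduce tasks over a pool of workers."""

    def map_task(split):
        # Pass each record of one input split to the user's map function.
        return [pair for key, value in split for pair in map_fn(key, value)]

    def reduce_task(item):
        key, values = item
        return reduce_fn(key, values)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One map task per input split (a split stands in for a data file).
        mapped = list(pool.map(lambda s: run_with_retry(map_task, s), splits))
        # Coordination step: group intermediate values by key.
        groups = defaultdict(list)
        for part in mapped:
            for key, value in part:
                groups[key].append(value)
        # One reduce task per key; collect the final results.
        reduced = list(
            pool.map(lambda kv: run_with_retry(reduce_task, kv),
                     sorted(groups.items()))
        )
    return dict(reduced)


def max_map(key, value):
    """User map function: value is a (station, reading) pair."""
    station, reading = value
    yield station, reading


def max_reduce(key, values):
    """User reduce function: keep the largest reading per station."""
    return key, max(values)


splits = [[(0, ("a", 3)), (1, ("b", 7))], [(2, ("a", 9)), (3, ("b", 1))]]
result = run_job(splits, max_map, max_reduce)
```

The user-written part is again limited to max_map and max_reduce; scheduling, grouping, result collection, and retries all live in the stand-in run-time code.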
The MapReduce system treats all data records as key/value pairs. The
primary mechanism offered in this model is an iterator (identified by a key).
Recall that the ROOT system also provides a similar iterator for data access.
Another similarity is that both ROOT and MapReduce can operate on large
distributed data. The key difference between ROOT and MapReduce is that
the existing MapReduce systems rely on underlying parallel file systems for
managing and distributing the data, while the ROOT system uses a set of
daemons to deliver the files to the parallel jobs. In a MapReduce system,
the content of the data is opaque to the run-time system and the user has
to explicitly extract the necessary information for processing. In the ROOT
system, an event has a known definition and accessing the attributes of an
event therefore requires less work.
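
The difference in user effort can be illustrated with a small sketch. In the first function the record content is opaque, so the user code must parse it before use; in the second, the event has a known definition and its attributes are read directly. The comma-separated record layout, the Event class, and the energy threshold are all assumptions made for illustration, and the typed half only mimics the idea of a known event definition; it does not use ROOT's actual interface.

```python
from dataclasses import dataclass


def map_opaque(key, value):
    """Opaque record: the user must know and parse the layout.

    Assumed layout for this sketch: "run,event_id,energy".
    """
    fields = value.split(",")
    energy = float(fields[2])
    if energy > 50.0:
        yield fields[0], energy


@dataclass
class Event:
    """A known event definition (hypothetical, for illustration only)."""
    run: str
    event_id: int
    energy: float


def select_typed(events):
    """Known definition: attributes are accessed by name, with no parsing."""
    for ev in events:
        if ev.energy > 50.0:
            yield ev.run, ev.energy


raw = ["r1,1,42.0", "r1,2,63.5", "r2,1,88.1"]
opaque_hits = [hit for i, line in enumerate(raw) for hit in map_opaque(i, line)]

events = [Event("r1", 1, 42.0), Event("r1", 2, 63.5), Event("r2", 1, 88.1)]
typed_hits = list(select_typed(events))
```

Both versions select the same events; the opaque version simply moves the knowledge of the record layout into every map function that touches the data.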
The data access mechanism provided by a MapReduce system can be considered
row-oriented because all values associated with a key are read into