of the century, so this is plausible). The complete run for the century took 42 minutes
on a single EC2 High-CPU Extra Large instance.
To speed up the processing, we need to run parts of the program in parallel. In theory, this
is straightforward: we could process different years in different processes, using all the
available hardware threads on a machine. There are a few problems with this, however.
First, dividing the work into equal-size pieces isn't always easy or obvious. In this case,
the file size for different years varies widely, so some processes will finish much earlier
than others. Even if they pick up further work, the whole run is dominated by the longest
file. A better approach, although one that requires more work, is to split the input into
fixed-size chunks and assign each chunk to a process.
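As a minimal sketch of the fixed-size chunk idea, the Python snippet below splits a list of records into chunks of at most a given size. The (year, temperature) record shape, the function name fixed_size_chunks, and the sample values are assumptions made for illustration; they are not the real input format or part of the original program.

```python
from typing import Iterator, List, Tuple

# Assumed, simplified record shape: (year, temperature).
Record = Tuple[int, int]


def fixed_size_chunks(records: List[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Yield successive chunks holding at most chunk_size records each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# Illustrative data only: note that the 1950 records end up in two chunks.
records = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]
chunks = list(fixed_size_chunks(records, 2))
```

Each chunk could then be handed to a separate worker, for example with multiprocessing.Pool.map, so that no worker is stuck with one unusually large year.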
Second, combining the results from independent processes may require further
processing. In this case, the result for each year is independent of other years, and they
may be combined by concatenating all the results and sorting by year. If using the fixed-size
chunk approach, the combination is more delicate. For this example, data for a particular
year will typically be split into several chunks, each processed independently. We'll end
up with the maximum temperature for each chunk, so the final step is to look for the
highest of these maximums for each year.
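The combination step for the chunked approach might look like the following sketch. It assumes each chunk is a list of (year, temperature) pairs; the names max_per_year and combine_maxima and the sample values are hypothetical, chosen only to show the shape of the computation.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed, simplified record shape: (year, temperature).
Record = Tuple[int, int]


def max_per_year(chunk: Iterable[Record]) -> Dict[int, int]:
    """Return the maximum temperature seen for each year within one chunk."""
    maxima: Dict[int, int] = {}
    for year, temp in chunk:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return maxima


def combine_maxima(partials: Iterable[Dict[int, int]]) -> List[Tuple[int, int]]:
    """Merge per-chunk maxima, keeping the highest value per year, sorted by year."""
    combined: Dict[int, int] = {}
    for partial in partials:
        for year, temp in partial.items():
            if year not in combined or temp > combined[year]:
                combined[year] = temp
    return sorted(combined.items())


# Illustrative data only: both years are spread across several chunks.
chunks = [[(1949, 111), (1950, 0)], [(1950, 22), (1949, 78)], [(1950, -11)]]
partials = [max_per_year(chunk) for chunk in chunks]
result = combine_maxima(partials)
```

Because taking a maximum is associative and commutative, the partial results can be merged in any order and the answer for each year is the same as if the year had been processed in one piece.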
Third, you are still limited by the processing capacity of a single machine. If the best time
you can achieve is 20 minutes with the number of processors you have, then that's it. You
can't make it go faster. Also, some datasets grow beyond the capacity of a single machine.
When we start using multiple machines, a whole host of other factors come into play,
mainly falling into the categories of coordination and reliability. Who runs the overall
job? How do we deal with failed processes?
So, although it's feasible to parallelize the processing, in practice it's messy. Using a
framework like Hadoop to take care of these issues is a great help.