Database Reference
In-Depth Information
In fact, big data has become more and more common. The deinition of big data is a moving
target as disk sizes grow, but you're working with big data if you have trouble processing it in
memory or even storing it on disk. There's a lot of information locked in that much data, but
getting it out can be a real problem. The recipes in this chapter will help address these issues.
Currently, the most common way to distribute computation in production is to use the
MapReduce algorithm, which was originally described in a research paper by Google,
although the company has moved away from using this algorithm ( http://research.
google.com/archive/mapreduce.html ). The MapReduce process should be familiar
to anyone who uses Clojure, as MapReduce is directly inspired by functional programming.
Data is partitioned over all the computers in the cluster. An operation is mapped across the
input data, with each computer in the cluster performing part of the processing. The outputs
of the map function are then accumulated using a Reduce operation. Conceptually, this is
very similar to the reducers library that we discussed in some of the recipes in Chapter 4 ,
Improving Performance with Parallel Programming . The following diagram illustrates the three
stages of processing:
map
reduce
data
Clojure itself doesn't have any features for distributed processing. However, it has an excellent
interoperability with the Java Virtual Machine, and Java has a number of libraries and systems
for creating and using distributed systems. In the recipes of this chapter, we'll primarily
use Hadoop ( http://hadoop.apache.org/ ), and we'll especially focus on Cascading
( http://www.cascading.org/ ) and the Clojure wrapper for this library, Cascalog
( https://github.com/nathanmarz/cascalog ) . This toolchain makes distributed
processing simple.
All these systems also work using just a single server, such as a developer's working
computer. This is what we'll use in most of the recipes in this chapter. The code should
work in a multiserver, cluster environment too.
 
Search WWH ::




Custom Search