Defining new Cascalog operators
Cascalog comes with a number of operators; however, you'll often need to define your own,
as we saw in the Aggregating data in Cascalog recipe.
For different uses, Cascalog defines a number of different categories of operators, each with
different properties. Some are run in the map phase of processing, and some are run in the
reduce phase. The ones in the map phase can use a number of extra optimizations, so if you
can push some of your processing into that stage, you'll get better performance. In this recipe,
you'll see which categories of operators are on the map side and which are on the reduce side.
We'll also provide an example of each and see how they fit into the larger processing model.
Getting ready
For this recipe, we'll use the same dependencies and inclusions that we did in the Initializing
Cascalog and Hadoop for distributed processing recipe. We'll also use the Doctor Who
companion data from that recipe.
How to do it…
As mentioned earlier, Cascalog allows you to specify a number of different operator types. Each type
is used in a different situation and with different classes of problems and operations. Let's
take a look at each type of operator.
Creating map operators
Map operators transform data in the map phase, with one input row being mapped to one
output row. A simple example of a custom map operator is an operator that triples all the
numbers that pass through it:
(defmapfn triple-value [x] (* 3 x))
Something similar to this can be used to rescale all the values in a field.
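To see the operator at work, here is a minimal sketch of a query that applies it. It assumes Cascalog 2.x and its Hadoop dependencies are on the classpath, and it uses a small in-memory vector of numbers as the generator instead of the companion data; the names tripled-rows, ?x, and ?tripled are made up for this illustration.

```clojure
(require '[cascalog.api :refer :all])

;; The map operator from the text: one input row in, one output row out.
(defmapfn triple-value [x] (* 3 x))

;; ??<- runs the query and returns the result rows instead of writing to a tap.
;; The vector of 1-tuples is an in-memory generator (illustrative data only).
(def tripled-rows
  (??<- [?x ?tripled]
        ([[1] [2] [3]] ?x)
        (triple-value ?x :> ?tripled)))
```

Each result row pairs an input number with its triple. The order of the rows is not guaranteed, so compare the results as a set if you need to check them.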
Creating map concatenation operators
Map concatenation operators transform data in the map phase, but each input row can
map to one output row, many output rows, or none. These operators return a sequence, and
each item in the sequence is a new output row. For example, this operator splits a string on
whitespace, and each token becomes a new output row. We'll use this in the following query to
count the number of names that each companion had:
(defmapcatfn split [string] (string/split string #"\s+"))
(?<- (stdout)
     [?name ?count]
     (full-name _ ?name)
     (split ?name :> ?token)
     (c/count ?count))
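The row-level behavior of a map concatenation operator is close to that of mapcat in plain Clojure. The following sketch imitates the query above without Cascalog, only to show how one input row fans out into several output rows before counting. The sample names and the helper names (names, split-name, token-rows, name-counts) are hypothetical and are not taken from the recipe's data file.

```clojure
(require '[clojure.string :as string])

;; Illustrative input rows, one full name per row.
(def names ["Sarah Jane Smith" "Leela" "Rose Tyler"])

;; Same splitting logic as the split operator: one string in, many tokens out.
(defn split-name [s]
  (string/split s #"\s+"))

;; Each input row yields zero or more [name token] output rows.
(def token-rows
  (mapcat (fn [n] (map (fn [t] [n t]) (split-name n)))
          names))

;; Counting the output rows per name mirrors what c/count does per group.
(def name-counts
  (frequencies (map first token-rows)))
```

Here token-rows holds six rows, and name-counts maps "Sarah Jane Smith" to 3, "Leela" to 1, and "Rose Tyler" to 2. In Cascalog the same fan-out happens on the map side, and the count runs as an aggregator over the rows grouped by ?name.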
 