In other words, Scalding provides an almost pure expression of the DAG for this Cascading flow.
This point underscores the expressiveness of the functional programming paradigm.
Figure 4-1. Conceptual flow diagram for “Example 3: Customized Operations”
Examining this app line by line, the first thing to note is that we extend the Job() base
class in Scalding:
class Example3(args: Args) extends Job(args) { ... }
Next, the source tap reads tuples from a data set in TSV format. The input is expected to have a
header row, followed by doc_id and text as the fields. The tap identifier for the data set gets specified
by a --doc command-line parameter:
Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
  .read
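Concretely, a hypothetical input file passed via the --doc parameter might look like this (tab-separated, with the header row that skipHeader = true discards; the data is invented for illustration):

```
doc_id	text
doc01	A rain shadow is a dry area on the lee side
doc02	This sinking, dry air produces a rain shadow
```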
The flatMap() function in Scalding is equivalent to a generator in Cascading. It maps
each element to a list, then flattens that list—emitting a Cascading result tuple for each
item in the returned list. In this case, it splits text into tokens based on RegexSplitGen
erator :
.flatMap('text -> 'token) { text: String => text.split("[ \\[\\]\\(\\),.]") }
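Plain Scala collections exhibit exactly the same flatMap semantics that Scalding applies per tuple: each element maps to a list, and the lists are flattened into one stream of results. The sketch below (a local illustration, not Scalding code) uses the same regex as the example; note that the split can emit empty strings, which is why the next steps in the flow scrub and filter the tokens:

```scala
object FlatMapDemo {
  // Split each document on the same delimiter class used in the
  // Scalding example: space, brackets, parentheses, comma, period.
  def tokenize(docs: List[String]): List[String] =
    docs.flatMap { text => text.split("[ \\[\\]\\(\\),.]") }

  def main(args: Array[String]): Unit =
    // Adjacent delimiters produce empty-string tokens.
    println(tokenize(List("a dog, a cat")))
}
```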
In essence, Scalding extends the collections API in Scala. Scala has functional constructs
such as map , reduce , filter , etc., built into the language, so the Cascading operations have
been integrated as operations on its parallel iterators . In other words, the notion of a
pipe in Scalding is the same as a distributed list. That provides a powerful abstraction
for large-scale parallel processing. Keep that in mind for later.
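The "pipe as distributed list" analogy can be made concrete with an ordinary Scala List: the same flatMap, filter, and groupBy calls that run locally here run in parallel across a cluster when applied to a Scalding pipe. This is a local sketch of a word count, not code from the example app:

```scala
object PipeAnalogy {
  // flatMap/filter/groupBy on a List mirror the corresponding
  // operations on a Scalding pipe, just executed in-memory.
  def wordCount(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))      // like flatMap('text -> 'token)
      .filter(_.nonEmpty)         // like filter('token)
      .groupBy(_.toLowerCase)     // like groupBy('token)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```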
The mapTo() function in the next line shows how to call a customized function for
scrubbing tokens. This is substantially simpler to do in Scalding:
.mapTo('token -> 'token) { token: String => scrub(token) }
.filter('token) { token: String => token.length > 0 }
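For reference, a minimal sketch of what a scrub() helper might look like, assuming "scrubbing" means trimming whitespace and lowercasing; the app's actual implementation may do more:

```scala
object Scrub {
  // Hypothetical token scrubber: normalize case and strip
  // surrounding whitespace. Empty results are removed by the
  // filter() step that follows in the flow.
  def scrub(token: String): String =
    token.trim.toLowerCase
}
```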
 