words, Scalding provides an almost pure expression of the DAG for this Cascading flow.
This point underscores the expressiveness of the functional programming paradigm.
Figure 4-1. Conceptual flow diagram for “Example 3: Customized Operations”
Examining this app line by line, the first thing to note is that we extend the Job() base class in Scalding:

    class Example3(args : Args) extends Job(args) {
      ...
    }
Next, the source tap reads tuples from a data set in TSV format. This expects to have a header, then doc_id and text as the fields. The tap identifier for the data set gets specified by a --doc command-line parameter:

    Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
      .read
The flatMap() function in Scalding is equivalent to a generator in Cascading. It maps each element to a list, then flattens that list, emitting a Cascading result tuple for each item in the returned list. In this case, it splits text into tokens, much as RegexSplitGenerator does in Cascading:

    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
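The same split-and-flatten behavior can be seen on an ordinary Scala collection, which is exactly the analogy Scalding draws. The sample text here is invented for illustration:

```scala
// flatMap on a plain Scala List mirrors what Scalding's
// flatMap does per tuple: each document's text maps to a
// list of tokens, and all the lists are flattened into one
// stream of tokens.
object FlatMapSketch {
  def main(args: Array[String]): Unit = {
    val texts = List("A rain shadow", "dry area (leeward)")
    val tokens = texts.flatMap { text =>
      text.split("[ \\[\\]\\(\\),.]").toList
    }
    // Note: the "(" delimiter yields an empty token, which is
    // why the app filters out zero-length tokens afterward.
    println(tokens)
  }
}
```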
In essence, Scalding extends the collections API in Scala. Scala has functional constructs such as map, reduce, filter, etc., built into the language, so the Cascading operations have been integrated as operations on its parallel iterators. In other words, the notion of a pipe in Scalding is the same as a distributed list. That provides a powerful abstraction for large-scale parallel processing. Keep that in mind for later.
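To make the "pipe as distributed list" idea concrete, here is the serial analogue in plain Scala collections; the grouping step stands in for a Scalding groupBy, and the token data is invented:

```scala
// The collections API that Scalding mirrors: map, filter,
// and grouping are built into Scala, and a pipe behaves
// like a distributed List of tuples. Counting tokens with
// groupBy here is the serial analogue of a Scalding
// groupBy over a pipe. Sample tokens are illustrative.
object DistributedListSketch {
  def main(args: Array[String]): Unit = {
    val tokens = List("rain", "shadow", "rain", "dry")
    val counts: Map[String, Int] =
      tokens.groupBy(identity).map { case (t, ts) => (t, ts.size) }
    println(counts)
  }
}
```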
The mapTo() function in the next line shows how to call a customized function for scrubbing tokens. This is substantially simpler to do in Scalding:

    .mapTo('token -> 'token) { token : String => scrub(token) }
    .filter('token) { token : String => token.length > 0 }
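The app defines its own scrub method; the version below is a hypothetical stand-in, used only to show the map-then-filter pattern on a plain Scala list:

```scala
// Hypothetical scrub: trim whitespace and lowercase each
// token. The actual Example 3 app defines its own scrub
// method; this sketch just demonstrates the mapTo + filter
// pattern on an ordinary collection.
object ScrubSketch {
  def main(args: Array[String]): Unit = {
    def scrub(token: String): String = token.trim.toLowerCase
    val tokens = List(" Rain ", "", "Shadow")
    val cleaned = tokens.map(scrub).filter(_.length > 0)
    println(cleaned)   // empty tokens are dropped after scrubbing
  }
}
```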