words, Scalding provides an almost pure expression of the DAG for this Cascading flow.
This point underscores the expressiveness of the functional programming paradigm.
Figure 4-1. Conceptual flow diagram for “Example 3: Customized Operations”
Examining this app line by line, the first thing to note is that we extend the Job() base class in Scalding:

    class Example3(args : Args) extends Job(args) {
      ...
    }
Next, the source tap reads tuples from a data set in TSV format. This expects to have a header, then doc_id and text as the fields. The tap identifier for the data set gets specified by a --doc command-line parameter:

    Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
      .read
The flatMap() function in Scalding is equivalent to a generator in Cascading. It maps each element to a list, then flattens that list, emitting a Cascading result tuple for each item in the returned list. In this case, it splits text into tokens, much as RegexSplitGenerator does in Cascading:

    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
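The same split-and-flatten behavior can be seen on an ordinary Scala collection, which is exactly the analogy Scalding draws. The sample text here is invented for illustration:

```scala
// flatMap on a plain Scala List mirrors what Scalding's
// flatMap does per tuple: each document's text maps to a
// list of tokens, and all the lists are flattened into one
// stream of tokens.
object FlatMapSketch {
  def main(args: Array[String]): Unit = {
    val texts = List("A rain shadow", "dry area (leeward)")
    val tokens = texts.flatMap { text =>
      text.split("[ \\[\\]\\(\\),.]").toList
    }
    // Note: the "(" delimiter yields an empty token, which is
    // why the app filters out zero-length tokens afterward.
    println(tokens)
  }
}
```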
In essence, Scalding extends the collections API in Scala. Scala has functional constructs such as map, reduce, filter, etc., built into the language, so the Cascading operations have been integrated as operations on its parallel iterators. In other words, the notion of a pipe in Scalding is the same as a distributed list. That provides a powerful abstraction for large-scale parallel processing. Keep that in mind for later.
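To make the "pipe as distributed list" idea concrete, here is the serial analogue in plain Scala collections; the grouping step stands in for a Scalding groupBy, and the token data is invented:

```scala
// The collections API that Scalding mirrors: map, filter,
// and grouping are built into Scala, and a pipe behaves
// like a distributed List of tuples. Counting tokens with
// groupBy here is the serial analogue of a Scalding
// groupBy over a pipe. Sample tokens are illustrative.
object DistributedListSketch {
  def main(args: Array[String]): Unit = {
    val tokens = List("rain", "shadow", "rain", "dry")
    val counts: Map[String, Int] =
      tokens.groupBy(identity).map { case (t, ts) => (t, ts.size) }
    println(counts)
  }
}
```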
The mapTo() function in the next line shows how to call a customized function for scrubbing tokens. This is substantially simpler to do in Scalding:

    .mapTo('token -> 'token) { token : String => scrub(token) }
    .filter('token) { token : String => token.length > 0 }
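The app defines its own scrub method; the version below is a hypothetical stand-in, used only to show the map-then-filter pattern on a plain Scala list:

```scala
// Hypothetical scrub: trim whitespace and lowercase each
// token. The actual Example 3 app defines its own scrub
// method; this sketch just demonstrates the mapTo + filter
// pattern on an ordinary collection.
object ScrubSketch {
  def main(args: Array[String]): Unit = {
    def scrub(token: String): String = token.trim.toLowerCase
    val tokens = List(" Rain ", "", "Shadow")
    val cleaned = tokens.map(scrub).filter(_.length > 0)
    println(cleaned)   // empty tokens are dropped after scrubbing
  }
}
```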