Think about it: what are some of the most common words posted online in comments? Words that are not the most common words in “polite” English? Do you really want those words to bubble up in your text analytics? In automated systems that leverage unsupervised learning, this can lead to highly embarrassing situations. Caveat machinator.
Next, let's consider working with a Joiner in Cascading. We have two pipes: one for the “scrubbed” token stream and another for the stop words list. We want to filter all instances of tokens from the stop words list out of the token stream. If we weren't working in MapReduce, a naive approach would be to load the stop words list into a hashtable, then iterate through the token stream to look up each token in the hashtable and delete it if found. If we were coding in Hadoop directly, a less naive approach would be to put the stop words list into the distributed cache and have a job step load it during setup, then rinse/lather/repeat the naive approach described earlier.
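As a minimal sketch of that naive in-memory approach, assuming everything fits on a single machine; the class and method names here are illustrative, not part of any Cascading API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NaiveStopWordFilter {
  // drop every token that appears in the stop words set;
  // a HashSet gives O(1) expected lookup per token
  public static List<String> filter( List<String> tokens, Set<String> stopWords ) {
    return tokens.stream()
        .filter( token -> !stopWords.contains( token ) )
        .collect( Collectors.toList() );
  }

  public static void main( String[] args ) {
    Set<String> stopWords = new HashSet<>( Arrays.asList( "a", "an", "the" ) );
    List<String> tokens = Arrays.asList( "the", "rain", "in", "spain" );
    System.out.println( filter( tokens, stopWords ) ); // [rain, in, spain]
  }
}
```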
Instead we leverage the workflow orchestration in Cascading. One might write a custom operation, as we did in the previous example, e.g., a custom Filter that performs lookups on a list of stop words. That's extra work, and not particularly efficient in parallel anyway.
Cascading provides for joins on pipes, and conceptually a left outer join solves our requirement to filter stop words. Think of joining the token stream with the stop words list: when the stop word side of a joined tuple is non-null, the join has identified a stop word, so discard that tuple.
Understand that there's a big problem with using joins at scale in Hadoop. Outside the context of a relational database, arbitrary joins do not perform well. Suppose you have N items in one tuple stream and M items in another, and you want to join them. In the general case, an arbitrary join requires N × M operations and also introduces a data dependency, such that the join cannot be performed in parallel. If both N and M are relatively large, say in the millions of tuples, then we'd end up processing on the order of 10¹² operations on a single processor, which defeats the purpose in terms of leveraging MapReduce.
Fortunately, if some of that data is sparse, then we can use specific variants of joins to compute more efficiently in parallel. A join has a lefthand side (LHS) branch and one or more righthand side (RHS) branches. Cascading includes HashJoin for the case where the data for all but one branch is small enough to fit into memory. In other words, given some insight about the “shape” of the data, we can join a large data set (nonsparse) with one or more small data sets (sparse) in memory. HashJoin implements a nonblocking “asymmetrical join” or “replicated join,” where the leftmost side will not block (accumulate into memory) in order to complete the join, but the rightmost sides will. So we put the sparser data on the righthand side to leverage the performance benefits of HashJoin.
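Putting those pieces together, a minimal sketch of the stop word filter as a replicated left outer join might look like the following, assuming docPipe (the large token stream) and stopPipe (the small stop words list) were assembled earlier in the workflow; the field names "token" and "stop" are illustrative:

```java
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

public class StopWordJoin {
  public static Pipe filterStopWords( Pipe docPipe, Pipe stopPipe ) {
    Fields token = new Fields( "token" ); // join key in the token stream (LHS)
    Fields stop = new Fields( "stop" );   // join key in the stop words list (RHS)

    // replicated left outer join: the sparse stop words list sits on the
    // righthand side, so only that branch accumulates into memory
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

    // after the left join, a tuple that matched a stop word carries a
    // non-null "stop" value; keep only the tuples where "stop" is empty
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

    return tokenPipe;
  }
}
```

From there, a Retain subassembly could project away the now-empty stop field before the results are written out.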