Think about it: what are some of the most common words posted online in comments? Words that are not the most common words in “polite” English? Do you really want those words to bubble up in your text analytics? In automated systems that leverage unsupervised learning, this can lead to highly embarrassing situations. Caveat machinator.
Next, let's consider working with a Joiner in Cascading. We have two pipes: one for the “scrubbed” token stream and another for the stop words list. We want to filter all instances of tokens from the stop words list out of the token stream. If we weren't working in MapReduce, a naive approach would be to load the stop words list into a hashtable, then iterate through the token stream to look up each token in the hashtable and delete it if found. If we were coding in Hadoop directly, a less naive approach would be to put the stop words list into the distributed cache and have a job step load it during setup, then rinse/lather/repeat the naive approach described earlier.
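As a minimal sketch of that naive in-memory approach, assuming everything fits on a single machine; the class and method names here are illustrative, not part of any Cascading API:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NaiveStopWordFilter {
  // drop every token that appears in the stop words set;
  // a HashSet gives O(1) expected lookup per token
  public static List<String> filter( List<String> tokens, Set<String> stopWords ) {
    return tokens.stream()
        .filter( token -> !stopWords.contains( token ) )
        .collect( Collectors.toList() );
  }

  public static void main( String[] args ) {
    Set<String> stopWords = new HashSet<>( Arrays.asList( "a", "an", "the" ) );
    List<String> tokens = Arrays.asList( "the", "rain", "in", "spain" );
    System.out.println( filter( tokens, stopWords ) ); // [rain, in, spain]
  }
}
```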
Instead we leverage the workflow orchestration in Cascading. One might write a custom operation, as we did in the previous example, e.g., a custom Filter that performs lookups on a list of stop words. That's extra work, and not particularly efficient in parallel anyway.
Cascading provides for joins on pipes, and conceptually a left outer join solves our requirement to filter stop words. Think of joining the token stream with the stop words list: when the stop word side of a joined tuple is non-null, the join has identified a stop word, so discard that tuple.
Understand that there's a big problem with using joins at scale in Hadoop. Outside the context of a relational database, arbitrary joins do not perform well. Suppose you have N items in one tuple stream and M items in another, and you want to join them. In the general case, an arbitrary join requires N × M operations and also introduces a data dependency, such that the join cannot be performed in parallel. If both N and M are relatively large, say in the millions of tuples, then we'd end up processing on the order of 10¹² operations on a single processor, which defeats the purpose in terms of leveraging MapReduce.
Fortunately, if some of that data is sparse, then we can use specific variants of joins to compute more efficiently in parallel. A join has a lefthand side (LHS) branch and one or more righthand side (RHS) branches. Cascading includes HashJoin for the case where the data for all but one branch is small enough to fit into memory. In other words, given some insight about the “shape” of the data, we can join a large data set (nonsparse) with one or more small data sets (sparse) in memory. HashJoin implements a nonblocking “asymmetrical join” or “replicated join,” where the leftmost side will not block (accumulate into memory) in order to complete the join, but the rightmost sides will. So we put the sparser data on the righthand side to leverage the performance benefits of HashJoin.
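Putting those pieces together, a minimal sketch of the stop word filter as a replicated left outer join might look like the following, assuming docPipe (the large token stream) and stopPipe (the small stop words list) were assembled earlier in the workflow; the field names "token" and "stop" are illustrative:

```java
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.tuple.Fields;

public class StopWordJoin {
  public static Pipe filterStopWords( Pipe docPipe, Pipe stopPipe ) {
    Fields token = new Fields( "token" ); // join key in the token stream (LHS)
    Fields stop = new Fields( "stop" );   // join key in the stop words list (RHS)

    // replicated left outer join: the sparse stop words list sits on the
    // righthand side, so only that branch accumulates into memory
    Pipe tokenPipe = new HashJoin( docPipe, token, stopPipe, stop, new LeftJoin() );

    // after the left join, a tuple that matched a stop word carries a
    // non-null "stop" value; keep only the tuples where "stop" is empty
    tokenPipe = new Each( tokenPipe, stop, new RegexFilter( "^$" ) );

    return tokenPipe;
  }
}
```

From there, a Retain subassembly could project away the now-empty stop field before the results are written out.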