Lady Gaga may have millions. This can cause a highly skewed distribution of values per
key during the reduce tasks. The effect is that many tasks will start during a reduce
phase, and most finish relatively quickly. A few "straggler" tasks (e.g., Lady Gaga's set
of followers) continue processing, perhaps for many hours. Overall, the cluster utilization
metrics drop because only a few tasks are running; however, the app itself cannot
progress until all of its reduce tasks complete. A potential workaround is to filter the
outlier keys that have huge sets of values and process them in a different branch of the
app.
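The workaround can be illustrated outside of Cascading with a minimal sketch in plain Python. It assumes the data is a list of (account, follower) pairs and that any key with more values than a chosen threshold counts as an outlier. The function name `split_by_key_skew`, the sample data, and the threshold of 100 are hypothetical, chosen only for illustration; this is not Cascading API code.

```python
from collections import Counter

def split_by_key_skew(pairs, threshold):
    """Partition (key, value) pairs into a normal branch and an outlier branch.

    A key is treated as an outlier when it has more than `threshold` values.
    Both the function and the threshold are illustrative assumptions.
    """
    pairs = list(pairs)
    # Count how many values each key carries.
    counts = Counter(key for key, _ in pairs)
    heavy_keys = {key for key, n in counts.items() if n > threshold}
    # Route each pair to one of the two branches.
    normal = [(k, v) for k, v in pairs if k not in heavy_keys]
    outliers = [(k, v) for k, v in pairs if k in heavy_keys]
    return normal, outliers, heavy_keys

# Hypothetical sample: three ordinary pairs plus one heavily skewed key.
follows = [("bob", "alice"), ("bob", "carol"), ("dave", "alice")]
follows += [("gaga", f"fan{i}") for i in range(1000)]

normal, outliers, heavy_keys = split_by_key_skew(follows, threshold=100)
```

In a real workflow the counting step would itself be a distributed aggregation, and the outlier branch would typically use a different strategy (for example, sampling or pre-aggregation) so that no single reduce task receives the entire set of values for one key.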
Other Resources
This topic is intended to be an introduction to Cascading and related open source
projects. There are several resources online for learning about Cascading in much more
detail:
User Guide
JavaDoc API Guide
SDK and Sample Apps
Extensions
Conjars Maven repo
There is also a large community of Cascading users, with active discussions on the
cascading-user email forum. If you have a problem with a Cascading app (or Cascalog,
Scalding, PyCascading, Cascading.JRuby, etc.), generate your flow diagram as a DOT file
and post a note to the email list.