or scaling existing Python processing scripts from a single machine to a distributed
environment.
Python-based MapReduce frameworks for Hadoop streaming provide a great deal
of utility for building more complex multistep pipelines while keeping code simple
and manageable. Frameworks such as mrjob and Dumbo can run locally and on
existing Hadoop clusters, and they can also use cloud-based services such as Elastic
MapReduce as a processing environment. Given these advantages, it is almost always
a good idea to use one of these frameworks when building anything more complicated
than a simple, single-step streaming pipeline.
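As an illustration, here is a minimal sketch of a multistep mrjob job, modeled on the word-frequency examples in the mrjob documentation; the class and script names are illustrative. The first step counts word occurrences, and the second step selects the most frequent word.

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWord(MRJob):
    # A two-step pipeline: count every word, then select the
    # word with the highest total count.
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word),
        ]

    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Funnel (total, word) pairs to a single key so the next
        # step can compare all of the totals in one reducer call.
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # Tuples compare element-wise, so max() yields the
        # (count, word) pair with the largest count.
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()

Saved as, say, most_used_word.py, the same script can be tested locally with python most_used_word.py input.txt and then run unchanged against a Hadoop cluster with the -r hadoop runner option or on Elastic MapReduce with -r emr.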
Hadoop streaming scripts and frameworks can be very useful for a large number
of tasks, but as data workflows grow more complex, even streaming scripts can
become hard to manage. Performance is another consideration: tools written in
scripting languages that communicate through the Hadoop streaming API may be
easy to build, but they can ultimately be slower than tools that interact directly
with the raw Hadoop API.
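For comparison, a simple single-step streaming pipeline is just a pair of scripts that read lines from standard input and write tab-separated key/value pairs to standard output, with Hadoop sorting the output by key between the map and reduce phases. A word-count job in this style might look like the following sketch (the script names are illustrative):

#!/usr/bin/env python
# mapper.py: emit each word with a count of 1.
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word. Streaming input
# arrives sorted by key, so equal words are on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))
    current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Even at two scripts, the bookkeeping in the reducer hints at why chaining several such steps together by hand quickly becomes hard to manage.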
In the next chapter, we take a look at some Hadoop-based tools designed to manage
data workflows.
 