or scaling existing Python processing scripts from a single machine to a distributed
environment.
Python-based MapReduce frameworks for Hadoop streaming provide a great deal
of utility for building more complex multistep pipelines while keeping code simple
and manageable. Frameworks such as mrjob and Dumbo can run locally and on
existing Hadoop clusters, and they can also use cloud-based services such as Elastic
MapReduce as a processing environment. Given these advantages, it is almost always
a good idea to use one of these frameworks when building anything more complicated
than a simple, single-step streaming pipeline.
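As an illustration, here is a minimal sketch of a multistep mrjob job, modeled on the word-frequency examples in the mrjob documentation; the class and script names are illustrative. The first step counts word occurrences, and the second step selects the most frequent word.

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

class MRMostUsedWord(MRJob):
    # A two-step pipeline: count every word, then select the
    # word with the highest total count.
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word),
        ]

    def mapper_get_words(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Funnel (total, word) pairs to a single key so the next
        # step can compare all of the totals in one reducer call.
        yield None, (sum(counts), word)

    def reducer_find_max_word(self, _, count_word_pairs):
        # Tuples compare element-wise, so max() yields the
        # (count, word) pair with the largest count.
        yield max(count_word_pairs)

if __name__ == '__main__':
    MRMostUsedWord.run()

Saved as, say, most_used_word.py, the same script can be tested locally with python most_used_word.py input.txt and then run unchanged against a Hadoop cluster with the -r hadoop runner option or on Elastic MapReduce with -r emr.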
Hadoop streaming scripts and frameworks can be very useful for a large number
of tasks, but as data workflows grow more complex, even streaming scripts can
become hard to manage. Performance is another consideration: tools written in
scripting languages that communicate through the Hadoop streaming API may be
easy to build, but they can ultimately be slower than tools that interact directly
with the raw Hadoop API.
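For comparison, a simple single-step streaming pipeline is just a pair of scripts that read lines from standard input and write tab-separated key/value pairs to standard output, with Hadoop sorting the output by key between the map and reduce phases. A word-count job in this style might look like the following sketch (the script names are illustrative):

#!/usr/bin/env python
# mapper.py: emit each word with a count of 1.
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py: sum the counts for each word. Streaming input
# arrives sorted by key, so equal words are on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
        continue
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))
    current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Even at two scripts, the bookkeeping in the reducer hints at why chaining several such steps together by hand quickly becomes hard to manage.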
In the next chapter, we take a look at some Hadoop-based tools designed to manage
data workflows.
 