Listing 8.2 Example of custom Python scripts for manipulating data from stdin: input_filter.py

#!/usr/bin/python
import string
import sys

# Accept only lowercase letters and spaces; everything else is stripped
legal_characters = string.ascii_lowercase + ' '

for line in sys.stdin:
    # Normalize to lowercase, then drop any characters outside the legal set
    line = line.lower()
    print ''.join(c for c in line if c in legal_characters)
Listing 8.3 Example of custom Python scripts for manipulating data from stdin: output_unique.py

#!/usr/bin/python
import string
import sys

for line in sys.stdin:
    # Split the line into terms, remove duplicates, and print them in sorted order
    terms = line.split()
    unique_terms = list(set(terms))
    print sorted(unique_terms)
# Pipe data through both scripts
> echo 'best test is 2 demo... & demo again!' | \
python input_filter.py | python output_unique.py
['again', 'best', 'demo', 'is', 'test']
Now we are starting to get somewhere; this simple Python example gives us some hints for how to transform unstructured source data into something with more clarity. However, if we ran these scripts over many gigabytes or terabytes of source files on a single server, it would take forever. Well, not literally forever, but it probably would not run fast enough for you to be able to convince anyone to give you a raise.
If we could run these simple scripts at the same time on many files at once, using a large number of machines, the transformation task would be much faster. This type of process is known as parallelization. Providing automatic parallelization is how Apache Hadoop and the MapReduce framework come to our aid. The power of using Hadoop to process large datasets comes from the ability to deploy it on a cluster of machines and let the framework handle the complexities of managing parallel tasks for you.
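To make that connection concrete, the following is a minimal sketch of how scripts like these could be submitted as a Hadoop Streaming job, with input_filter.py acting as the mapper and output_unique.py as the reducer. The streaming jar location varies by Hadoop version, and the HDFS input and output paths shown here are illustrative assumptions rather than values from the example above; a real job would also typically need reducer logic that aggregates terms across all lines instead of deduplicating each line independently.

# Sketch only: submit the two scripts as a Hadoop Streaming job
# (jar path and HDFS directories are assumed for illustration)
> hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files input_filter.py,output_unique.py \
    -input /user/demo/raw_text \
    -output /user/demo/unique_terms \
    -mapper "python input_filter.py" \
    -reducer "python output_unique.py"

Each mapper instance runs the filter script over one split of the input files in parallel, while the framework takes care of distributing the work across machines, shuttling data between the map and reduce stages, and collecting the final output.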
There are many ways to deploy Hadoop: using physical hardware, using a collection of virtual machines, and even purchasing access to an already running cluster complete with administrative software. Because there is such a large variety of ways to deploy