Listing 8.2 Example of custom Python scripts for manipulating data from stdin: input_filter.py

#!/usr/bin/python
import string
import sys

# Accept only lowercase letters and spaces; everything else is stripped
legal_characters = string.ascii_lowercase + ' '

for line in sys.stdin:
    # Normalize to lowercase, then drop any characters outside the legal set
    line = line.lower()
    print ''.join(c for c in line if c in legal_characters)
Listing 8.3 Example of custom Python scripts for manipulating data from stdin: output_unique.py

#!/usr/bin/python
import string
import sys

for line in sys.stdin:
    # Split the line into terms, remove duplicates, and print them in sorted order
    terms = line.split()
    unique_terms = list(set(terms))
    print sorted(unique_terms)
# Pipe data through both scripts
> echo 'best test is 2 demo... & demo again!' | \
python input_filter.py | python output_unique.py
['again', 'best', 'demo', 'is', 'test']
Now we are starting to get somewhere; this simple Python example gives us some hints for how to transform unstructured source data into something with more clarity. However, if we ran these scripts over many gigabytes or terabytes of source files on a single server, it would take forever. Well, not literally forever, but it probably would not run fast enough for you to be able to convince anyone to give you a raise.
If we could run these simple scripts at the same time on many files at once, using a large number of machines, the transformation task would be much faster. This type of process is known as parallelization. Providing automatic parallelization is how Apache Hadoop and the MapReduce framework come to our aid. The power of using Hadoop to process large datasets comes from the ability to deploy it on a cluster of machines and let the framework handle the complexities of managing parallel tasks for you.
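To make that connection concrete, the following is a minimal sketch of how scripts like these could be submitted as a Hadoop Streaming job, with input_filter.py acting as the mapper and output_unique.py as the reducer. The streaming jar location varies by Hadoop version, and the HDFS input and output paths shown here are illustrative assumptions rather than values from the example above; a real job would also typically need reducer logic that aggregates terms across all lines instead of deduplicating each line independently.

# Sketch only: submit the two scripts as a Hadoop Streaming job
# (jar path and HDFS directories are assumed for illustration)
> hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files input_filter.py,output_unique.py \
    -input /user/demo/raw_text \
    -output /user/demo/unique_terms \
    -mapper "python input_filter.py" \
    -reducer "python output_unique.py"

Each mapper instance runs the filter script over one split of the input files in parallel, while the framework takes care of distributing the work across machines, shuttling data between the map and reduce stages, and collecting the final output.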
There are many ways to deploy Hadoop: using physical hardware, using a collection of virtual machines, and even purchasing access to an already running cluster complete with administrative software. Because there is such a large variety of ways to deploy