Lightweight Programming - Graph Analysis and Visualization

The following Python script reads each line, replaces Timothy with Tim,
removes a leading SMTP:, removes the domain name from the e-mail
address, and strips accents (combining marks) from Unicode characters:

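import re
import unicodedata

# Open the raw input file and the cleaned output file.
filein = open("origData.csv", "r")
fileout = open("cleanData.csv", "w")

for line in filein:
    # Shorten the name and drop the SMTP: prefix.
    line = line.replace("Timothy", "Tim").replace("SMTP:", "")
    # Remove the domain part of each e-mail address.
    line = re.sub("@[a-zA-Z0-9_.-]*", "", line)
    # Decompose accented characters, then drop the combining marks.
    line = "".join(x for x in unicodedata.normalize("NFKD", line)
                   if unicodedata.category(x) != "Mn")
    fileout.write(line)

filein.close()
fileout.close()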

The first two lines of this script import Python libraries that add
functionality to Python. The first (re) handles regular expressions, which
are a way to search and replace on string patterns. The second (unicodedata)
provides Unicode data functionality (that is, handling strings with complex
characters from a wide variety of languages).
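
As a small illustration of the unicodedata step, the sketch below isolates the accent-stripping expression from the script. The helper name strip_accents and the sample string are assumptions made for this example and do not appear in the original script.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD), then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Hypothetical sample value, not taken from origData.csv.
print(strip_accents("José Müller"))  # prints: Jose Muller
```

Only the combining marks are removed, so the base letters survive; characters that have no decomposition pass through unchanged.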

The next two lines open the original data file and the output data file. Then,
the main loop reads each line from the input file and processes it:

• line.replace("oldstring","newstring") is a simple find and
  replace. replace() can be chained in a sequence for multiple
  substitutions.

• re.sub() is used to perform regular expression substitution. To
  remove the domain name from an e-mail, the expression
  @[a-zA-Z0-9_.-]* matches an initial @ symbol followed by any number
  of characters from the bracketed set: letters (a-z, A-Z), digits (0-9),
  underscore, period, and hyphen.
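
The two substitutions in the list above can be tried on a single string before running the full script. This is a minimal sketch: the function name clean_address and the sample record are assumptions made for illustration, while the replace() chain and the pattern are the ones used in the script.

```python
import re

def clean_address(line):
    """Apply the chained replace() calls, then strip the e-mail domain."""
    line = line.replace("Timothy", "Tim").replace("SMTP:", "")
    # The match stops at the first character outside the bracketed set,
    # such as the comma that separates CSV fields.
    return re.sub(r"@[a-zA-Z0-9_.-]*", "", line)

# Hypothetical sample record, not taken from origData.csv.
sample = "SMTP:Timothy.Smith@example.com,Sales"
print(clean_address(sample))  # prints: Tim.Smith,Sales
```

Because the comma is not in the character class, the substitution removes only the domain and leaves the rest of the CSV row intact.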
