The following Python script reads each line, replaces Timothy with Tim,
removes the SMTP: prefix, removes the domain name from the e-mail
address, and strips accents and other combining marks from Unicode characters:
import re
import unicodedata
filein = open("origData.csv", "r")
fileout = open("cleanData.csv", "w")
for line in filein:
    # Shorten the name and drop the SMTP: prefix
    line = line.replace("Timothy", "Tim").replace("SMTP:", "")
    # Remove the @ sign and the domain that follows it
    line = re.sub(r"@[a-zA-Z0-9_.-]*", "", line)
    # Decompose accented characters, then drop the combining marks
    line = "".join(x for x in unicodedata.normalize("NFKD", line)
                   if unicodedata.category(x) != "Mn")
    fileout.write(line)
filein.close()
fileout.close()
The first two lines of this script import modules from the Python standard
library. The first, re, handles regular expressions, which are a way to
search for and replace string patterns. The second, unicodedata,
provides Unicode data functionality (that is, handling strings with complex
characters from a wide variety of languages).
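To see what these steps do to a single record, the per-line logic can be pulled into a small function and tried on a made-up string. This is only a sketch: the function name clean_line and the sample name and address are hypothetical and are not part of the original script.

```python
import re
import unicodedata


def clean_line(line):
    """Apply the script's three clean-up steps to one string (hypothetical helper)."""
    # Chained literal replacements: shorten the name, drop the SMTP: prefix.
    line = line.replace("Timothy", "Tim").replace("SMTP:", "")
    # Remove the @ sign and the run of domain characters that follows it.
    line = re.sub(r"@[a-zA-Z0-9_.-]*", "", line)
    # NFKD splits an accented letter into a base letter plus combining marks;
    # dropping category "Mn" (nonspacing marks) leaves only the base letter.
    decomposed = unicodedata.normalize("NFKD", line)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Made-up sample record, for illustration only.
sample = "Timothy M\u00fcller,SMTP:tmuller@example.com\n"
print(clean_line(sample), end="")  # prints: Tim Muller,tmuller
```

Note that the accented letter is not deleted; only its accent is removed, so ü becomes u. Characters that have no decomposition into a base letter plus combining marks pass through unchanged.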
The next two lines of the script open the original data file and the output
data file. Then, the main loop reads each line from the input file and processes it:
line.replace("oldstring","newstring") is a simple find and
replace. replace() can be chained in a sequence for multiple
substitutions.
re.sub() is used to perform regular expression substitution. To
remove the domain name from an e-mail address, the expression
@[a-zA-Z0-9_.-]* matches an initial @ symbol followed by any
number of characters, each of which is any of the following: