Mapping Foreclosures - Data Mashups in R

to submit viable addresses to the geocoder. Here are some typical addresses that our regular expression should match:

3509 N. Lee St.

2120-2128 E. Allegheny Ave.

7601 Crittenden St., #E-10

370 Tomlinson Place

2311 N. 33rd St.

6822-24 Old York Rd.

335 W. School House Lane

These are not addresses and should not be matched:

2,700 sq. ft. BRT# 124077100 Improvements: Residential Property

</b> C.P. June Term, 2009 No. 00575    

R has built-in functions that support Perl-compatible regular expressions. For more information on regular expressions, see Mastering Regular Expressions (O'Reilly) and Regular Expression Pocket Reference (O'Reilly).

With some minor deletions to clean up address idiosyncrasies, we should be able to correctly identify street addresses among the mess of other data contained in properties.html. We'll use a single regular expression pattern to do the matching. For clarity, we can break the pattern into the familiar elements of an address (number, name, suffix):

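> stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
> stName<-"([NSEW]\\. )?[0-9A-Z ]+"
> # The stSuf definition is missing from this excerpt; the suffix list below is an assumed reconstruction
> stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
> myStPat<-paste(stNum,stName,stSuf,sep=" ")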
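Before using these pieces, it can help to look at what the string literals actually contain. The following is a small sketch that reuses only the stNum definition from the text; the variable numOk and the three test strings are illustrative choices, not part of the original example.

```r
# The street-number pattern from the text, written as an R string literal.
stNum <- "^[0-9]{2,5}(\\-[0-9]+)?"

# cat() prints the characters the regex engine receives: the doubled
# backslash in the source code is a single backslash in the string.
cat(stNum, "\n")   # ^[0-9]{2,5}(\-[0-9]+)?

# nchar() counts that backslash once: one backslash plus one hyphen.
nchar("\\-")       # 2

# The piece on its own accepts plain and hyphenated street numbers,
# and rejects a string that starts with a single digit and a comma.
numOk <- grepl(stNum, c("3509", "6822-24", "2,700"), perl = TRUE)
numOk              # TRUE TRUE FALSE
```

Checking each piece separately like this makes it easier to see which element is responsible when the combined pattern fails to match.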

Note that each backslash in the pattern must itself be escaped with a second backslash, because R treats a single backslash inside a string literal as the start of an escape sequence. Let's test this pattern against our examples using R's grep() function:

> grep(myStPat,"6822-24 Old York Rd.",perl=TRUE,value=FALSE,ignore.case=TRUE)
[1] 1
> grep(myStPat,"2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
+ perl=TRUE,value=FALSE,ignore.case=TRUE)
integer(0)

The result, [1] 1, is the index of the matching element in the input; because we tested only one string at a time, it simply shows that our target address string matched. The result integer(0) shows that the non-address string did not match. We also have to strip out text that we don't want in our addresses, such as extra punctuation (like quotes or commas), or sheriff's office designations that follow street names:
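One way such a cleanup step might look is sketched below. It is not the original code: the helper cleanAddress() is hypothetical, the stSuf suffix list is an assumption (its definition does not appear in this excerpt), and the sketch handles only the punctuation and unit designation visible in the sample addresses above. Sheriff's office designations are not shown in this excerpt, so they are not handled here.

```r
# Pattern pieces from the text. stSuf is an ASSUMED suffix list.
stNum   <- "^[0-9]{2,5}(\\-[0-9]+)?"
stName  <- "([NSEW]\\. )?[0-9A-Z ]+"
stSuf   <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
myStPat <- paste(stNum, stName, stSuf, sep = " ")

# The sample strings listed earlier: seven addresses, then two non-addresses.
candidates <- c(
  "3509 N. Lee St.",
  "2120-2128 E. Allegheny Ave.",
  "7601 Crittenden St., #E-10",
  "370 Tomlinson Place",
  "2311 N. 33rd St.",
  "6822-24 Old York Rd.",
  "335 W. School House Lane",
  "2,700 sq. ft. BRT# 124077100 Improvements: Residential Property",
  "</b> C.P. June Term, 2009 No. 00575"
)

# Hypothetical cleanup helper: drop everything from a "#" onward (with any
# comma and spaces just before it), then remove leftover quotes and commas.
cleanAddress <- function(x) {
  x <- gsub(",? *#.*$", "", x, perl = TRUE)
  x <- gsub("[\",]", "", x, perl = TRUE)
  trimws(x)
}

# grepl() returns one TRUE/FALSE per input string, so the whole vector
# can be tested in a single call.
matched <- grepl(myStPat, cleanAddress(candidates),
                 perl = TRUE, ignore.case = TRUE)
data.frame(candidate = candidates, matched = matched)
```

Under these assumptions, the seven addresses match after cleanup and the two non-address strings do not. Without the cleanup, "7601 Crittenden St., #E-10" would fail, because the pattern requires the suffix to end the string.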
