> badStrings<-"(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|, Unit.+|<font size=\"[0-9]\">|Apt\\..+| #.+$|[,\"]|\\s+$)"
Test this against some examples using R's gsub() function:
> gsub(badStrings,'',"119 Hagy's Mill Rd. a/k/a 119 Spring Lane",
perl=TRUE)
[1] "119 Hagy's Mill Rd."
> gsub(badStrings,'',"3229 Hurley St. - Premise A",perl=TRUE)
[1] "3229 Hurley St."
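Because gsub() is vectorized over its input, the pattern can also be checked against several strings in one call. The following is a minimal sketch: the first two strings are the examples above, while the last three addresses are invented for illustration and are not taken from the property data. The name testAddresses is likewise only an assumption made for this example.

```r
# Same pattern as above, assembled with paste0() so that no line break
# ends up inside the regular expression.
badStrings <- paste0(
  "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|, Unit.+",
  "|<font size=\"[0-9]\">|Apt\\..+| #.+$|[,\"]|\\s+$)"
)

# The first two strings come from the text; the rest are made-up test cases.
testAddresses <- c(
  "119 Hagy's Mill Rd. a/k/a 119 Spring Lane",
  "3229 Hurley St. - Premise A",
  "1234 N. Broad St., Unit 4",
  "500 Market St. #12",
  "42 Elm Ave.   "
)

# gsub() applies the substitution to every element of the vector.
cleaned <- gsub(badStrings, "", testAddresses, perl = TRUE)
print(cleaned)
```

Each element should come back with its trailing qualifier (alias, premise, unit, suite number, or whitespace) removed, leaving only the street address.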
Let's encapsulate this address parsing in a function that accepts an HTML file and returns
a vector, a one-dimensional ordered collection with a specific data type, in this case
character. Copy and paste this entire block into your R console:
#input: HTML filename
#returns: character vector of street addresses
#         (for later geocoding and plotting with PBSmapping)
getAddressesFromHTML<-function(myHTMLDoc){
  myStreets<-vector(mode="character",0)
  stNum<-"^[0-9]{2,5}(\\-[0-9]+)?"
  stName<-"([NSEW]\\. )?([0-9A-Z ]+)"
  stSuf<-"(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
  badStrings<-paste(
    "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,",
    "Unit.+|<font size=\"[0-9]\">|Apt\\..+| #.+$|[,\"]|\\s+$)")
  myStPat<-paste(stNum,stName,stSuf,sep=" ")
  for(line in readLines(myHTMLDoc)){
    line<-gsub(badStrings,'',line,perl=TRUE)
    matches<-grep(myStPat,line,perl=TRUE,
                  value=FALSE,ignore.case=TRUE)
    if(length(matches)>0){
      myStreets<-append(myStreets,line)
    }
  }
  myStreets
}
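The loop above grows myStreets one element at a time with append(). As an alternative sketch, the same cleaning and filtering can be applied to all lines at once, since gsub() and grepl() both operate on whole vectors. The function name getAddressesVectorized and the sample file contents below are hypothetical, introduced only for this illustration; the patterns are copied from the function above.

```r
# Hypothetical vectorized variant of getAddressesFromHTML():
# same patterns, but no explicit loop.
getAddressesVectorized <- function(myHTMLDoc) {
  stNum  <- "^[0-9]{2,5}(\\-[0-9]+)?"
  stName <- "([NSEW]\\. )?([0-9A-Z ]+)"
  stSuf  <- "(St|Ave|Place|Blvd|Drive|Lane|Ln|Rd)(\\.?)$"
  badStrings <- paste(
    "(\\r| a\\/?[kd]\\/?a.+$| - Premise.+$| assessed as.+$|,",
    "Unit.+|<font size=\"[0-9]\">|Apt\\..+| #.+$|[,\"]|\\s+$)")
  myStPat <- paste(stNum, stName, stSuf, sep = " ")

  # Clean every line, then keep only the lines that look like addresses.
  cleaned <- gsub(badStrings, "", readLines(myHTMLDoc), perl = TRUE)
  cleaned[grepl(myStPat, cleaned, perl = TRUE, ignore.case = TRUE)]
}

# Invented sample input, written to a temporary file for the demonstration.
sampleFile <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<font size=\"2\">3229 Hurley St. - Premise A",
  "Some unrelated text",
  "1500 N. Broad St., Unit 4",
  "</body></html>"
), sampleFile)

sampleStreets <- getAddressesVectorized(sampleFile)
print(sampleStreets)
unlink(sampleFile)
```

On this small sample, only the two address lines should survive, each stripped of its markup and qualifiers. Both versions should return the same vector for the same file; the vectorized form simply avoids copying myStreets on every append() call.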
We can test this function on our downloaded HTML file:
> streets<-getAddressesFromHTML("properties.html")
> length(streets)
[1] 1264