Geocoding Addresses (GIS and Spatial Analysis)

What kind of addresses can be geocoded? Software like ArcGIS can deal with addresses in a number of formats. A standard address would be something like 1922 Jones St, and an address in this format is potentially geocodable. An address in the form of 22nd St and Bowman Ave, indicating an intersection, can also be codable. Sometimes the conjunction between the two streets in an intersection can cause the address not to be recognized, but in ArcGIS there is some flexibility in the recognition of conjunctions: ampersands can be used (&), slashes can be used (/), as well as the word "and." The user has control of this process and can specify alternative forms. What if an address is misspelled? The geocoding engine of ArcGIS can be set to varying levels of sensitivity, that is to how much of an exact match the address has to be to what is in the database in order for the program to geocode the address successfully. Once you are familiar with the forms of address expected by the software, you can edit the set of addresses you want to geocode to more closely conform to the expectations of the software.
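To make the idea of spelling sensitivity concrete, the following is a minimal sketch of how a misspelled street name might be compared with a reference name. It is not the algorithm ArcGIS uses; the function name `spelling_similarity`, the threshold constant, and the use of Python's `difflib` are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher


def spelling_similarity(recorded, reference):
    """Return a 0-100 similarity between two street names (case-insensitive)."""
    # Normalize case and collapse repeated spaces before comparing.
    a = " ".join(recorded.upper().split())
    b = " ".join(reference.upper().split())
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Hypothetical threshold for this sketch; a stricter setting demands a closer match.
SPELLING_SENSITIVITY = 80

for recorded in ("Robnson", "Jones"):
    score = spelling_similarity(recorded, "Robinson")
    print(recorded, score, score >= SPELLING_SENSITIVITY)
```

Raising the threshold in a sketch like this rejects more near-misses, which mirrors the trade-off between a strict and a lenient sensitivity setting.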

There are two kinds of geocoding, automatic and interactive. In the automatic method, issues such as sensitivity and the form of intersections are very important, because the software will try to geocode every address without any input from the investigator. However, depending on the sensitivity settings and the accuracy of the source of the addresses to be geocoded, the success rate for automatic geocoding could vary from 20 percent to 80 percent of the addresses you would like to locate on a map.

Interactive geocoding is the way you make up for the failures of automatic geocoding. In interactive mode, ArcMap presents you with a list of addresses it was not able to decide where to code. The program may also list some possible locations for the uncoded address, and you can investigate these locations by zooming to the map. However, before we can illustrate the process of geocoding, we must create what ArcGIS calls an address locator. An address locator is a set of commands and descriptions that together inform ArcGIS about the format of the address database you are using and how to use that database to compare with the address you wish to geocode and make automatic geocoding decisions. In order to begin this process, click on the icon for the Toolbox on the tool bar to the left of the help button (arrow with a question mark), as in Figure 1.15.

FIGURE 1.15 ArcGIS Toolbox menu

Locate the geocoding sub-toolbox on the menu, and click on it.

Figure 1.16 Selecting the command to create an address locator

Double-clicking on the highlighted command line in Figure 1.16 brings up the Address Locator submenu. A number of selections and identifications are made on this submenu to create the address locator for geocoding.

FIGURE 1.17 Creating an address locator

First, click on the folder icon to the right of the line below "Address Locator Style" to designate the style of address locator database to be used here. There are a number of options that make for a flexible system: the address database we are using here, shown in Figure 1.14 above, consists of U.S. streets in a file; you can also designate a location for the address locator file, either on your own hard drive (local) or on a server (ArcSDE Server).

FIGURE 1.18 Address Locator Style selection

Next, click on the file icon to the right of the box labeled "Reference Data" and identify the location of the street map database; clicking on the downward arrow next to the box reveals the databases that ArcGIS has already located as possible reference files. The next selection is to identify the role that this database will play in the process of geocoding. As you can see from the number of lines in this box, it is possible to specify a number of address reference files. In some cases, the files may each provide reference to one address element that you wish to consider in the coding process; for example, streets may be in one file, city council districts in another, and U.S. congressional districts in another database. If your addresses to be coded contain fields for all of these, you could include this information from multiple sources in the process. Notice that once the database and its role are established, the field map at the bottom of this menu is filled out. In the left-hand column are the specifications from the type of locator we selected, U.S. Streets, such as "house from left," "house to right," and "street name." In the right-hand column are the fields in the database we are using, as shown in Figure 1.11 above, that correspond to those specifications; e.g., "house to left" is equated with L_T_ADD in the table shown in Figure 1.19 for our address database.

Figure 1.19 Specifying the details of the address locator

Click OK and ArcGIS will create an address locator to use in geocoding the addresses we want to locate on the map of Riverside. Figure 1.20 shows the end of the process being successful.

Figure 1.20 A successful creation of an address locator

Now that an address locator has been established, the next step is to bring the data with addresses to geocode into the ArcMap workspace. We can do this by adding these data to the table of contents using the Add Data command (black cross on a light background) as illustrated in Figure 1.7. In many cases such data will be in the form of an Excel spreadsheet, as in Figure 1.21. In order to bring these data into ArcGIS, the file needs to be saved in database format, with an extension of ".dbf." Figure 1.21 illustrates how you can do this in Excel.

FIGURE 1.21 Saving a spreadsheet in database format (DBF) with Excel

Once these data, which contain the locations of youth-involved violent incidents from the Riverside Police Department, are saved in database format, they can be added to the map as a new layer; once they are added to the map, they can be geocoded using the address locator we established previously. Using the Add Data command, the data are added to the map in Figure 1.22.

Figure 1.22 Adding the database format file to the map; the file has the same name as in Figure 1.21 but with the .dbf extension

The next step is to authorize the automatic geocoding of the location column in this file, which as you can see in Figure 1.21 gives the street address where each incident occurred. There are sometimes multiple listings at the same address adjacent to each other in the file; this is because there is a line in the file for each violation of the law recorded in connection with the incident. If one youth assaulted another with a weapon and stole money and a bicycle, there would be multiple entries: one for the assault, one for the bicycle theft, and another for the use of a weapon.

We can begin the process as in Figure 1.23, by clicking on Tools, Geocoding, and Geocode Addresses.

FIGURE 1.23 A step in the geocoding process

Notice that the table of contents window to the left is now on the "Source" tab at the bottom of the window; this occurred automatically when we added the "Jan 2002 …" file with the addresses to be coded. The reason this happens is that the spreadsheet table with the addresses is not displayed on the map, so if you click on the display tab, this file is not listed in the table of contents window.

Once you click on Tools, Geocoding, and Geocode Addresses, a menu comes up asking you to add an address locator; Figure 1.24 shows that you must point to the ArcGIS catalog, select address locators and double-click to bring up the list of available address locators.

FIGURE 1.24 Address locator submenu

You can see there are two listed in Figure 1.25; both are the same, so you can select either one. A different address locator would have a different name after the user name (Rob in this case).

Figure 1.25 Selecting the address locator

Click on Add and you are one step closer to starting the process of geocoding. Figure 1.26 shows what happens next; you can now tell ArcGIS about your address file—in what column is the street name and number given?

Figure 1.26 Specifying the details of the automatic geocoding process

First, you click on the drop down menu to identify the column in the database that gives the addresses—in this case, location (as seen in Figure 1.28). If you want to change the default location and file name for the geocoded layer, click on the file symbol next to the listing under "Output shapefile or feature class." You can also modify the options that control how the automatic geocoding gets accomplished by clicking on the button labeled Geocoding Options. Let’s examine these further in Figure 1.27.

FIGURE 1.27 Geocoding Options submenu

As we discussed above, the automatic process for geocoding addresses will use the information in the address locator to attempt to match the location of each address in the column "location" in the table of youth violent incidents we have asked ArcGIS to code. The addresses being examined were compiled from police department records based on the reports of each police officer in the field—one of the pieces of information officers always record is the address of the incident. However, police officers are human, and they have a lot to do when confronting the scene of an incident, and so the address may be incomplete or abbreviated in a way that the officers understand, but which may not match the address locator database information, which has usually been based on U.S. Census Bureau address collections and verified by Census workers and local officials. The automatic geocoding process creates a score for each address to be matched, from 0 to 100, where 100 would be an exact match to the information in the address locator database. Various modifications or abbreviations in the address to be geocoded can result in a lower score. For example, if in the address locator database a street has a direction North, East, West, or South, such as West 3rd Street, the address locator may expect that the direction is spelled fully; the address recorded by the officer may be in the form of W. 3rd st. This abbreviated version is understood by all officers and observers alike to be the same as West 3rd Street, but the address locator is now less certain of the correspondence, and so it will lower the score a certain amount. Suppose the address was known as 234 Robinson Ave in the address locator, and the officer records 234 Robnson ave instead—the misspelling of the street name would lower the score further. 
The officer might record 2267 Jones Ave., when the street in question is actually Jones Drive—again, if there were no other roadways named Jones in the locator database, the program would lower the score; if there were two such streets, Jones Ave. and Jones Drive, the program might match the wrong one, but more likely, as the address range on Jones Drive includes 2267 and that for Jones Ave. ends at 1945, the score would lower again. You can see on the menu in Figure 1.27 that you can set the sensitivity of the automatic process higher or lower depending on your confidence about the nature of the errors or abbreviations made and any mistakes that may be in the addresses to be coded. The program has a default of 80 for spelling, a minimum match score of 60, and a score of 10 to be considered a candidate. This latter score is for the interactive process that comes after the automatic process has run; more about this below.
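The thresholds just described can be pictured as a simple decision rule. The sketch below is not ArcGIS code: the function name `classify_match`, the use of "greater than or equal to" at each cutoff, and the example scores are assumptions for illustration. Only the default values of 60 (minimum match score) and 10 (minimum candidate score) come from the text.

```python
def classify_match(best_score, min_match_score=60, min_candidate_score=10):
    """Describe how an address would be handled, given its best score (0-100)."""
    if best_score >= min_match_score:
        return "matched"        # geocoded by the automatic process
    if best_score >= min_candidate_score:
        return "candidate"      # offered for review in the interactive process
    return "no candidate"       # too weak to be suggested at all


# Hypothetical scores, invented for illustration only.
for score in (100, 72, 45, 5):
    print(score, classify_match(score))
```

Raising `min_match_score` moves more addresses from "matched" to "candidate", which means more work in the interactive step but fewer unchecked automatic matches.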

You can also specify a table of aliases for your street names if you have such information. You could supply a spreadsheet with a column that lists aliases or alternative names for the streets in the location column we are trying to geocode; such information could increase the score when an initial comparison is lowered by an abbreviation or a misspelling.
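As a rough picture of what an alias table does, the sketch below replaces a known alternative name with the name used in the reference data before any comparison is made. The street names, the dictionary `STREET_ALIASES`, and the function `apply_alias` are all invented for illustration; ArcGIS reads its alias table in its own format.

```python
# Hypothetical alias table: alternative name -> name used in the reference data.
STREET_ALIASES = {
    "OLD MILL RD": "MILL ST",
    "MAIN": "MAIN ST",
}


def apply_alias(street_name, aliases):
    """Return the reference name for a street, or the cleaned input if no alias exists."""
    # Normalize case and spacing so lookups do not fail on trivial differences.
    key = " ".join(street_name.upper().split())
    return aliases.get(key, key)


print(apply_alias("old  mill rd", STREET_ALIASES))
print(apply_alias("Jones Ave", STREET_ALIASES))
```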

Intersections are also a major issue in geocoding of address-based data. The program has three default symbols that it will interpret as indicating an intersection: an ampersand (&), a vertical line (|), and the at sign used in Internet addresses (@). However, you can add additional ones on the line in this submenu, as shown in Figure 1.27. If you add the word "and" as indicated in this command, the program will successfully code all of your intersections written with "and," provided they are actually intersections that the address locator recognizes. Some officers may record an intersection that does not exist according to the address locator, in which case that address will come up for interactive geocoding.
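The effect of adding a connector such as "and" can be sketched as follows. This is an illustration rather than ArcGIS's own parser; the function `split_intersection` and the way word connectors are required to stand between spaces are assumptions of this sketch.

```python
import re

# The three default connector symbols named in the text.
DEFAULT_CONNECTORS = ("&", "|", "@")


def split_intersection(address, connectors=DEFAULT_CONNECTORS):
    """Return (street_a, street_b) if the address reads as an intersection, else None."""
    for connector in connectors:
        if connector.isalpha():
            # A word connector must stand alone, so "Sandy Ln" is not split on "and".
            pattern = r"\s+" + re.escape(connector) + r"\s+"
        else:
            pattern = r"\s*" + re.escape(connector) + r"\s*"
        parts = re.split(pattern, address.strip(), maxsplit=1, flags=re.IGNORECASE)
        if len(parts) == 2 and all(parts):
            return parts[0], parts[1]
    return None


with_and = DEFAULT_CONNECTORS + ("and",)
print(split_intersection("22nd St and Bowman Ave"))            # not recognized by default
print(split_intersection("22nd St and Bowman Ave", with_and))  # recognized once "and" is added
```

Recognizing the connector only identifies the two street names; whether the streets actually cross is still decided by the reference data.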

Another consideration is the address number itself. Many times an address will not successfully code because the address range in the address locator database does not include the number given on the address to be coded. This could happen for a number of reasons, but two main reasons tend to explain these cases: (1) new construction, so that the end of the street was extended, new lots defined, and new housing built after the map was constructed. The further you get from the decennial census year, in this case 2000, the more this can be a problem if you are working with a geographic location that is growing in population and expanding its housing stock, such as Southern California; (2) the address number recorded is in error. If either of these things is true, the automatic process will usually reject the address and bring it up in the interactive process.
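The address range test can be sketched in a few lines. The function `in_address_range` and the range endpoints below are hypothetical, apart from the house number 2267 and the endpoint 1945 used in the earlier Jones Ave. example; real reference data stores separate ranges for the left and right sides of each street segment.

```python
def in_address_range(house_number, range_from, range_to):
    """Return True if the house number falls within a segment's address range."""
    # Ranges may be stored high-to-low depending on how the segment was digitized.
    low, high = sorted((range_from, range_to))
    return low <= house_number <= high


# Invented ranges: Jones Ave. ends at 1945, Jones Drive covers the 2200 block.
print(in_address_range(2267, 1901, 1945))  # Jones Ave.
print(in_address_range(2267, 2201, 2299))  # Jones Drive
```

An address on a street extended after the reference map was compiled would fail this test on every segment, which is why such addresses tend to end up in the interactive process.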

In many cases there will be multiple candidates identified depending on the magnitude of the error or the misspellings. Another option is to match if there is a tie in the score; this is indicated in a check-off box towards the lower left of the Geocoding Options submenu. For example, the address range on the right may include 2236, and on the left 2239, but the address in the data to be coded is 2237. The program may decide that both of the existing addresses in the address locator are candidates, and it may decide they have an equal likelihood of being correct, so there will be a tie. If you check this box, and the score is above the threshold you select for matching, the program will geocode the address at the first location in the address locator of the pair that tied. Maybe someone who owned a big lot at 2239 decided to subdivide their lot and sold half to a developer, who built a new house at 2237. If the tie listed 2239 first, this would be a very accurate coding; if it listed 2236 first, it might still be very close but on the wrong side of the street. The impact of such an error depends on your goals in geocoding and in the larger project for which you are using the geocoded data.
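The tie option can be sketched as a small rule over a list of candidates. The function `resolve_candidates`, the candidate labels, and the score of 75 are hypothetical; the sketch only mirrors the behavior described above, in which a tie is either left for interactive review or settled in favor of the first candidate listed.

```python
def resolve_candidates(candidates, min_match_score=60, match_if_tied=False):
    """Pick a location from (location, score) pairs listed in locator order, or None."""
    if not candidates:
        return None
    best = max(score for _, score in candidates)
    if best < min_match_score:
        return None                      # nothing scores high enough to match
    top = [location for location, score in candidates if score == best]
    if len(top) > 1 and not match_if_tied:
        return None                      # tie: leave for the interactive process
    return top[0]                        # first of the tied candidates wins


# Hypothetical candidates for the recorded address 2237.
tied = [("2239 (left side)", 75), ("2236 (right side)", 75)]
print(resolve_candidates(tied))
print(resolve_candidates(tied, match_if_tied=True))
```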

In the case of the project we are illustrating here, the purpose of geocoding these youth violent incidents is to place each one inside a block group, the unit of analysis shown in Figure 1.6. Once we have succeeded in geocoding each one into a block group, we can then use the population of youths in each block group as measured by the U.S. Census to construct a youth violence rate per 1000 youth. This can in turn be used as a dependent variable in research. We can try to explain the variation in this rate that we are likely to observe across the block groups once we construct the rates. This then allows us to test theories about the causes of youth violence and, if we have made any attempts to intervene to prevent or reduce youth violence in certain areas defined by block groups, we can test over time whether or not these interventions have been successful at reducing youth violence in the block groups where the intervention occurred.
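The rate itself is simple arithmetic: incidents divided by the youth population, multiplied by 1000. A minimal sketch follows; the block group identifiers and counts are invented for illustration, and the function name `rate_per_1000` is an assumption.

```python
def rate_per_1000(incidents, youth_population):
    """Return incidents per 1000 youth, or None where there is no youth population."""
    if youth_population <= 0:
        return None  # a rate is undefined for a block group with no youth
    return 1000 * incidents / youth_population


# Hypothetical block groups: (geocoded incident count, youth population).
block_groups = {"BG-A": (12, 800), "BG-B": (0, 500), "BG-C": (3, 0)}

for block_group, (incidents, youth) in block_groups.items():
    print(block_group, rate_per_1000(incidents, youth))
```

Because the source file holds one line per violation, the counts fed into such a calculation would need to reflect incidents rather than raw rows if the rate is meant to describe incidents.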

So, in this case, if the coding placed the new house on the wrong side of the street, chances are very high that it would still be in the same block group, and thus for the purposes of this study the error is of no consequence. If you were studying whether violence was more likely on the right or the left side of the streets, however, such an error would be of major consequence.

This example shows how careful we should be in adjusting the sensitivity of the automatic procedure. If we lower the sensitivity and the minimum match score too far, the automatic geocoding procedure will make too many errors, and some of those errors, in our case, may place violent incidents in the wrong block group, distorting the rates we observe. In Section 3 of this handbook, when we discuss spatial modeling of data such as these, this topic will come up again and we will discuss different statistical corrections we can apply for this problem. In the case of geocoding, the more accurate we can be the better for every type of mapping or analysis that comes afterwards. So while we hope the automatic geocoding succeeds in finding matches, we should not lower the sensitivity too much so that errors are increased. Any candidate that the automatic process will not match becomes the subject of an interactive geocoding process, where we can in essence check up on the automatic process. It is better to err on the side of having too strict match criteria than too lenient, because if you make the criteria too lenient you are likely to increase the errors made and, because the process is automated, you will never be able to check every geocoded point, especially if you have hundreds or thousands of them. When you make the criteria on the strict side, you may get too many addresses rejected, but you can match these to your satisfaction, with a minimal amount of error, through the interactive process.
