Data Mining and GIS Part 1

One of the challenges of GIS is finding the data you need to make the kinds of maps you would like to make for your project. In this case, data were obtained from the city police department as part of a project designed to evaluate a gang prevention effort (Spergel, 1996). Once the data were obtained, they were converted to the .dbf format as shown above. In the course of the geocoding, we edited some of the entries in the location column so that ArcMap would more easily recognize intersections, and we corrected some data entry errors along the way, such as misspelled street names and incorrect street types (blvd vs. bld). In one case, we corrected an error in the street map database itself: a street was misnamed in the database compared to the actual name in use in the city. These are examples of the process of data mining: finding and preparing data for use in GIS.

To get a sense of how geocoding is done, please follow the examples set out here on the computer yourself. We have designed each example in this topic to be carried out as you go along in the text. Some of you may be taking a course, we hope in a fully computerized "smart" classroom, in which each student or each team of two or three students is in front of a computer while the instructor does each example for the class on a projected screen and the students follow along on their own screens. Perhaps you saw the instructor demonstrate this example, and now you are in a lab class with a computer equipped with ArcMap and the data from the DVD that comes with this topic. Perhaps you are learning GIS on your own for your school work, your research, your thesis, or some other reason. In any case, please find an opportunity to follow the examples in this topic as step-by-step exercises. They were designed for you to follow along on your own computer, with the same data we used, and to check your screen’s appearance against that shown in the examples. Occasionally, at the end of an example, we will also give you an additional example that we do not illustrate directly with these same data, to further illustrate a technique or approach. Our experience is that if you go through this text in this way, learning by doing and checking what you have achieved step by step and screen by screen, you will really learn to be a master of GIS. You will also then be able to adapt your own data and research questions to the techniques you learn in this topic.

You can begin the process by loading the DVD that came with the topic, or by accessing it where your instructor has indicated on your local computer network. Then, using Microsoft Excel, open the spreadsheet shown in Figure 1.28, "Jan 2002 Aggregate Data.dbf."

Now let’s see how these decisions and selections work in an example. Looking at Figure 1.28, we see the street names given in the location column of the data we wish to geocode.

FIGURE 1.28 Data to be geocoded

Notice some of the features of these addresses. For example, the first one is "E Alessandro Blvd"; the "E" is an abbreviation for East, and "Blvd" is an abbreviation for Boulevard. If our address locator is set up to accept "East" and not "E", we have a problem already, and we may not know in advance how boulevard should be abbreviated. Notice also that the intersections are indicated by an ampersand, "&", as in "Jacob Dr & Siegal Ave." We could search the address file to see if any "and" indicators remain, and edit those out of the file. At this point, however, it makes sense to run the automatic geocoding process and see what happens.
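
The kind of cleanup described above can be sketched in a few lines of Python. This is a minimal illustration and not part of ArcMap: the lookup tables and the function name `normalize_address` are assumptions made for the example, and a real address locator defines its own accepted abbreviations.

```python
import re

# Hypothetical lookup tables for illustration; check what your own
# address locator actually accepts before standardizing.
DIRECTIONS = {"EAST": "E", "WEST": "W", "NORTH": "N", "SOUTH": "S"}
STREET_TYPES = {
    "BOULEVARD": "Blvd", "BLVD": "Blvd", "BLD": "Blvd",
    "AVENUE": "Ave", "AVE": "Ave",
    "DRIVE": "Dr", "DR": "Dr",
    "STREET": "St", "ST": "St",
    "COURT": "Ct", "CT": "Ct",
}


def normalize_address(raw: str) -> str:
    """Use '&' for intersections and one abbreviated form for
    directions and street types."""
    # Replace a standalone "and" (any case) with "&".
    text = re.sub(r"\s+and\s+", " & ", raw.strip(), flags=re.IGNORECASE)
    words = []
    for word in text.split():
        key = word.rstrip(".").upper()  # ignore a trailing period
        if key in DIRECTIONS:
            words.append(DIRECTIONS[key])
        elif key in STREET_TYPES:
            words.append(STREET_TYPES[key])
        else:
            words.append(word)
    return " ".join(words)


print(normalize_address("East Alessandro Boulevard"))  # E Alessandro Blvd
print(normalize_address("Jacob Dr and Siegal Ave."))   # Jacob Dr & Siegal Ave
```

A simple word-by-word lookup like this will mishandle streets whose names are themselves direction or type words, so its output is best reviewed before the file is geocoded.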

FIGURE 1.29 Automatic geocoding in process

In Figure 1.29 you can see the display showing geocoding in process; the little magnifying glass searches over the little globe again and again as the process proceeds.

FIGURE 1.30 Report of results from automatic geocoding process

The popup shown in Figure 1.30 appears after the automatic geocoding process is completed, and reports the successes and failures of the procedure. For example, we see that 72 percent, or 5459, of the addresses were matched with a score of more than 80 out of a possible 100. In addition, another 9 percent, or 701 addresses, were matched with a score between 60 and 80, where 60 is the minimum match score set in the Geocoding Options menu shown in Figure 1.27. So a total of 81 percent of the addresses were matched and placed on the map. This is a very good result; experienced geocoders report success rates on the first automated pass anywhere from 20 percent to 80 percent using the stricter criteria and sensitivity we have used in this example. This means, however, that the remaining 1373 addresses, roughly 18 percent, are unmatched and are candidates for interactive geocoding. Figure 1.31 shows the results of the automatic geocoding on the map of Riverside.
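
The percentages in the report can be recomputed from the raw counts given above. The short sketch below uses only those three counts; the variable and function names are chosen for the example.

```python
# Counts taken from the geocoding report described in the text.
matched_high = 5459  # matched with a score above 80
matched_low = 701    # matched with a score between 60 and 80
unmatched = 1373     # left for interactive geocoding

total = matched_high + matched_low + unmatched


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


for label, count in [
    ("score above 80", matched_high),
    ("score 60 to 80", matched_low),
    ("all matched", matched_high + matched_low),
    ("unmatched", unmatched),
]:
    print(f"{label}: {count} of {total} ({share(count, total):.1f}%)")
```

Computed this way, the unmatched share comes to a little over 18 percent. Whole-number percentages in a report of this kind do not always sum to exactly 100.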

FIGURE 1.31 Results of automatic geocoding process

Each dot represents at least one violent incident reported in these data; because more than one incident can occur at an address, a single dot can represent several coded incidents. A pattern is already obvious from the distribution of dots on the map: a central core running through the middle of the city, from the lower left of the map to the upper right, seems to have a large concentration of violent incidents. There is also a secondary concentration along a line running from the central core down and to the right, with a side branch that runs off to the right from that concentration up to the border of the city. A few scattered incidents occur outside these areas of concentration, and some areas at the extremes of the city have no dots at all. This may change after the interactive geocoding, as not all the reported incidents are yet represented. However, we have already learned something about the distribution of these incidents by geocoding them to the extent we have thus far. What is it about those central areas that causes them to have more violent incidents? What is it about the outlying areas that explains the lack of incidents? There are many theories and hypotheses that can be discussed, and many of these can be explored once we have geocoded these data completely and accurately. Mapping these data will allow us to geographically link information about the nature of the populations that live in each block group, and the kinds of economic and social activities that go on there, so that we can develop and empirically examine such possibilities. None of this is possible without successful geocoding, which in this case, as in most cases, requires interactive geocoding to finish the job appropriately.

Example: The Science and Art of Interactive Geocoding

There are many reasons why an address does not geocode automatically. We have already mentioned some of these issues, such as address ranges, misspellings of street names and street types, and, of course, plain old human error. The process of interactive geocoding is one of applying logic, knowledge, and good judgment to each address to place it appropriately on the map. In order to proceed, one needs access to a good independent source of geographic information, such as an atlas or an online service like Mapquest. Such sources may be more up to date than the address locator database, and they also provide the geocoder with another perspective on the map, streets, and address ranges they are working with. One also needs to keep in mind the goal of the geocoding project. In this case, we are trying to place all the violent crime incidents in the U.S. Census block groups in order to compute the rate of violence in each block group. Thus any interactive geocoding decision that moves an incident across the boundary from one block group to another may introduce error through human action. Accuracy is valuable, and we should not undermine it deliberately or carelessly. Let’s see how these ideas work in an actual application of interactive geocoding. However, before we get started, double-click on the street map database in the table of contents window, and then click on Label Features, if you have not already done so. The street labels will be useful in interactive geocoding.
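
The concern about block group boundaries can be made concrete with a small calculation. The sketch below is illustrative only: the block group IDs, populations, and incident assignments are invented for the example, and the rate is expressed per 1,000 residents as one common convention.

```python
from collections import Counter

# Hypothetical data: the block group each geocoded incident fell in,
# and an invented population for each block group.
incident_block_groups = ["BG-A", "BG-A", "BG-B", "BG-A", "BG-C"]
population = {"BG-A": 1500, "BG-B": 2000, "BG-C": 1000}


def rates_per_1000(incidents, population):
    """Return incidents per 1,000 residents for every block group."""
    counts = Counter(incidents)
    return {bg: 1000.0 * counts.get(bg, 0) / pop
            for bg, pop in population.items()}


rates = rates_per_1000(incident_block_groups, population)
print(rates)

# Moving a single incident across a boundary changes two rates at once.
shifted = ["BG-A", "BG-B", "BG-B", "BG-A", "BG-C"]
print(rates_per_1000(shifted, population))
```

With counts this small, one misplaced incident changes the rates of both block groups noticeably, which is why placements near a boundary deserve extra care.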

First, click on Tools, next on Geocoding, and then select Review/Rematch Addresses. The results from your automatic geocoder will appear to the right; move the cursor to highlight that file, and click.

FIGURE 1.32 Beginning an interactive geocoding session

When you click on the file with your automatic geocoding results, the interactive geocoding menu will appear.

FIGURE 1.33 Review/Rematch submenu

You may be asked if you want to edit the database; click on OK and the menu will appear. Make sure that the circle to the left of "Unmatched addresses" is selected, and click on Match Interactively.

FIGURE 1.34 Interactive geocoding, Alessandro Blvd

The screen in Figure 1.34 illustrates the many issues involved in interactive geocoding. At the top of the screen are the first nine unmatched addresses from the automatic procedure. The first candidate is selected and given in the box labeled "Street or Intersection." As you can see, it is 2624 E Alessandro Blvd; to the right of the box labeled "Modify," the address is broken down into the components recognized by the address locator, that is, the number, the directional prefix (E), the name, and the street type suffix (Boulevard abbreviated as BLVD). The next box indicates that the program identified two candidates in the address locator, but neither had a very high score; in fact, both were scored 15 out of 100 on the sensitivity match score. The address number may be the problem, as the address locator has ranges of numbers on either left or right that do not include the number given in this address record. There is a set of low numbers, 124 and down, on the right of the street, and some high address ranges on the left, 6799 to 7099 in this case. This is a case where you need a good paper atlas of the area that can give you an indication of address ranges, or access to an online atlas for the place you are geocoding addresses from. Web-based mapping programs like Mapquest or Yahoo can be useful, as well as a GPS system.

The first thing to do is to see where these candidates are on your map in ArcMap. This can give you an idea of what the problem might be in geocoding this address. To do this, click on the first address in the "candidate" window, and click below on Zoom to: Candidates. Minimize the geocoding window and you will see the map, zoomed in to the exact location that the street database wants to put this address; remember that this may not be the right place for this address, however.

FIGURE 1.35 Candidates for geocoding on the map

The light dot symbol is the candidate address, or where the street database thinks it should go; notice a larger dark symbol by the intersection of Canyon Crest and Alessandro; this is the other candidate for geocoding. If we were to go back to the geocoding menu and zoom to the second candidate, it would appear light on the map and the current light symbol would become dark.

In Figure 1.35, some of the streets appear darker than others; this is because these streets are also block group boundaries—an example is Alessandro or Canyon Crest. It could be the case that selecting one of these candidates might put the crime incident in one block group, and selecting the other candidate might shift the location to another block group. This is a case where accuracy is very important in interactive geocoding.

What does our atlas source tell us? Entering this address as recorded in the online service, or looking up this block in the paper atlas, we can easily see that the street database is considerably off from where the atlas places this address, by about three miles. Instead of intersecting with Canyon Crest or Cannon, the atlas view shows that this address is nearer to the intersection of Alessandro and Sycamore Canyon Blvd. Why would this be the case? As mentioned above, one of the reasons why an address will fail to code is that new construction adds to the possible address ranges along a street. The program knows what address ranges it has, and if it is asked to geocode an address that is outside the known ranges, it attempts to fit the unknown address to the known ranges, but with little confidence that the fit is correct, hence the low score. In Riverside this area was undeveloped until a few years ago, and therefore it is a prime candidate for an area with new address ranges that are not in the 2000 Census database.
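
The range test described here can be expressed as a simple check of whether a house number falls inside any range the street database knows about. This is a sketch of the idea, not ArcMap's actual matching logic; the two ranges are illustrative values loosely based on the Alessandro Blvd example, and `covering_ranges` is a name invented for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def covering_ranges(number: int, ranges: List[Range]) -> List[Range]:
    """Return every known range whose bounds include the house number."""
    return [(lo, hi) for lo, hi in ranges
            if min(lo, hi) <= number <= max(lo, hi)]


# Illustrative ranges only: one low range and one high range.
known = [(100, 124), (6799, 7099)]

print(covering_ranges(2624, known))  # [] : no known range covers 2624
print(covering_ranges(6900, known))  # [(6799, 7099)]
```

An empty result is the situation in which the program can offer only low-scoring candidates, and in which an independent atlas becomes necessary.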

Now that we have updated information about where this address is located, what can we do with that information? We can use our options for interactive geocoding to modify this address to better reflect the information we have at hand. If we go to the window on the left middle of the screen labeled "Street or Intersection," we can see that it displays the address being considered. If we modify the contents of this box to read, "Sycamore Canyon Blvd & E Alessandro Blvd," the placement of this address will be very close to where the atlas shows it should be on the map. Doing this produces a new candidate, as shown in Figure 1.36.

FIGURE 1.36 A modified street address and a new candidate for geocoding

Selecting the new candidate, which has a score of 86, high enough to be matched automatically, and zooming to this location, shows the result of the process of interactive geocoding in Figure 1.37.

FIGURE 1.37 Zooming to the new candidate on the map

This location is literally within a few feet of the location shown in the atlas, and as long as the youth violent incident is within the city limits, a point at this location can fall in only one block group. The only step remaining is to go back to the interactive window and click on the Match button. The other two incidents at the exact same address can be handled in the same way, so that now three additional addresses have been successfully geocoded.

Looking back to Figure 1.36, the next address that is uncoded is on Arlington. Here, the address range given is only one single number away from the address we wish to code—7599 in the range vs. 7600 in the address. Zooming to the candidate shows that while a block group boundary is nearby, there is only one logical place for this address to be on the map. A check of the atlas confirms that this location is near Harold Way, as shown in Figure 1.38, so we can declare another match for this address.

FIGURE 1.38 Candidate for Arlington Ave

The next set of three unmatched addresses is from Chicago Ave; examining the location of the candidates places these incidents well within the boundaries of a block group, so accepting this location will have few consequences for the overall study. That makes another three addresses coded. The next address, on Stonehaven Ct, presents a different kind of geocoding situation.

FIGURE 1.39 An address with no likely candidates: 1439 Stonehaven Ct

In this situation, there are no candidates to examine. This is a situation in which having another atlas that is more up to date is very important. Checking the atlas rules out all kinds of possibilities—that the street does not exist, that the number is way off from the range, that the street is not a "Court" but something else, and so on. So why does this address generate no candidates? A more important question is what can we do about this? It is unclear why the program may not produce any candidates, but we can examine the location in more detail and come up with an approach that is logical and appropriate given the goals of our project. In focusing in on this location in ArcMap (using the magnifying glass with the plus sign icon) we note that Stonehaven intersects with Allendale, and that there are no nearby block group boundaries. What if we modified the address to the intersection of Stonehaven and Allendale? Stonehaven is a very short street, so we could not be too far off, and placing the address at the intersection makes no errors in terms of which block group the incident happened in. Figure 1.40 shows what happens when we modify the address and a candidate is zoomed to.

FIGURE 1.40 A nearby intersection generates a candidate for geocoding

Even in the case of an address that should be coded automatically, and where there are no obvious reasons why it has not been coded automatically, we can use interactive geocoding to rectify the situation. Now we have coded an additional address.

Although you might hope to geocode every address successfully with interactive geocoding, there will be some addresses that are impossible to code. It is possible to force an address onto the map, but this is not always justified given the logic of geocoding, the information you have in hand, and the impact such a decision might have on the ultimate goal of your project. If you end up with 5 percent or fewer addresses uncoded, you have probably done a very thorough job of geocoding, and any bias you introduce by not coding certain addresses is probably minimal. This is particularly the case if those uncoded addresses are roughly randomly distributed across the space you are examining, although the likelihood of repeat crime incidents at the same address would tend to undermine this idea of randomness. This is one of the tensions that interactive geocoding creates, and it is a tradeoff that the geocoder often faces. The choice is among using all available information to geocode more addresses, avoiding the errors that come from assuming too much about an address, and accepting the loss of information when an address cannot be placed on the map. The next address in these data illustrates some of these issues.
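
Two quick checks follow from this discussion: what share of the records remains uncoded, and whether the uncoded records pile up at a few repeated addresses. The sketch below assumes you have the uncoded addresses as a plain list of strings; the sample list and the total of 100 records are invented for the example.

```python
from collections import Counter


def uncoded_summary(total_records, uncoded_addresses):
    """Return the uncoded share (percent) and any addresses that
    appear more than once among the uncoded records."""
    share = 100.0 * len(uncoded_addresses) / total_records
    counts = Counter(uncoded_addresses)
    repeats = {addr: n for addr, n in counts.items() if n > 1}
    return share, repeats


# Hypothetical leftover list after an interactive session.
leftover = [
    "1299 Tyler St",
    "2624 E Alessandro Blvd",
    "2624 E Alessandro Blvd",
    "2624 E Alessandro Blvd",
]

share, repeats = uncoded_summary(100, leftover)
print(f"{share:.1f}% uncoded")
print(repeats)
```

A low uncoded share together with few repeated addresses is more reassuring than a low share alone, since repeats suggest the missing records are not spread randomly across the map.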

As seen in Figure 1.41, the next address also produces no candidates for consideration—1299 Tyler St.

FIGURE 1.41 No candidates for geocoding

By editing the contents of the "Street or Intersection" box to remove the address number, the total range of addresses in the database for Tyler St becomes visible, and a likely reason for the lack of automatic geocoding is revealed: the address given is outside the range of addresses on this street in the database.

FIGURE 1.42 Full address range available for Tyler St

Notice that the lowest address range is for numbers in the 2700s, and the original address was 1299, which is not at all close. If the address to be coded were 2650, and you had 2701 as the lowest number in an address range, it might be reasonable to declare a match. The difference here is too great for such an assumption. We can examine on the map the location of this lowest range by clicking on this range and zooming to the candidate.
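
The judgment being made here, how far outside the known ranges an address may fall before a match stops being reasonable, can be written down as a gap calculation. The ranges and the tolerance of 100 below are assumptions chosen for illustration; the text does not prescribe a specific cutoff.

```python
def gap_to_nearest_range(number, ranges):
    """Return the distance from a house number to the closest known
    range, or 0 if the number lies inside one of them."""
    if not ranges:
        raise ValueError("no address ranges available for this street")
    gaps = []
    for lo, hi in ranges:
        lo, hi = min(lo, hi), max(lo, hi)
        if lo <= number <= hi:
            return 0
        gaps.append(min(abs(number - lo), abs(number - hi)))
    return min(gaps)


# Hypothetical ranges for a street whose lowest known number is 2701.
street_ranges = [(2701, 2799), (2800, 2899)]
TOLERANCE = 100  # assumed cutoff, for illustration only

for number in (2650, 1299):
    gap = gap_to_nearest_range(number, street_ranges)
    verdict = "plausible match" if gap <= TOLERANCE else "too far to assume"
    print(number, gap, verdict)
```

Whatever cutoff you adopt, applying it consistently and recording it makes the interactive decisions easier to defend later.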

FIGURE 1.43 Addresses on Tyler Street, Riverside

The light dot shows the block of Tyler Street with the smallest address number in the range recognized by the database; from Figure 1.42, you can see that this is 2700 Tyler Street. The address to be coded, 1299 Tyler, is quite some distance from this, and it would make no sense to assume that 1299 and 2700 are in the same block group, the ultimate unit of analysis here, and code them as if they were. Figure 1.46 shows a close-up view of this section of the map; this was obtained by saving and exiting the interactive geocoding menu and using the magnifying glass tool to zoom in to the intersection of Tyler Street and Victoria. Next, select the drawing tool from the tool bar at the bottom of the screen, as shown in Figure 1.44, and select the dot option from this collection of drawing tools.

FIGURE 1.44 The drawing tool selection popup menu

To put a dot on the intersection of Tyler and Victoria, as shown in Figure 1.46, place the cursor on the intersection and left-click once; double-click to bring up the properties menu to change the size and color of the dot you just placed on the map, as in Figure 1.45.

FIGURE 1.45 The symbol properties popup

You can see in Figure 1.46 that Tyler Street does not start again to the southeast (see below to learn how to put a compass sign on your map) after dead-ending at Victoria.

FIGURE 1.46 Detailed map of 2700 Tyler Street and surrounding area


Sometimes you cannot make any assumption or adjustment that can make sense of an address, and you have to drop this address from the geocoding process and from the map. Perhaps it was recorded in error, or perhaps this is an address not in the city of Riverside; in either case, this address cannot be coded given the information at hand. Perhaps after an initial geocoding, you could get together with a representative from the organization that provided these data, and show them a list of uncoded addresses. These files could be reexamined to see if any additional information is available which could help resolve this record and allow it to be coded.

In this case, the record can be skipped and the next address be subjected to interactive geocoding. In this way, following the steps described here, the entire list of uncoded addresses can be processed and, in most cases, successfully coded. The next address of interest shows some of the ways you can use the information you have to make a reasonable decision to modify an address and achieve a successful geocode. The interactive geocoding menu for these data in Figure 1.47 shows that there are three possible candidates for 1275 Coronet Dr, an address that did not automatically geocode.

FIGURE 1.47 Interactive geocoding for 1275 Coronet Dr

Highlighting the middle candidate in the bottom of Figure 1.47, clicking on the Zoom to: Candidates button, and minimizing the interactive geocoding menu reveals the three locations of these candidates, shown in Figure 1.48 in the larger symbols. This map also shows that a youth violent incident has already been coded near these three candidates. You can use this information to help make a better decision about how to geocode 1275 Coronet.

FIGURE 1.48 Three candidates for 1275 Coronet and a previously geocoded incident nearby (smaller symbol)

First, maximize the interactive geocoding menu, and click Close on the lower right of the screen (see Figure 1.47). Next, exit the geocoding process by clicking Done on the lower right hand of the Review/Rematch window you now see in front of the map. Now you see the main map, and you should click the selection drop down on the main tool bar; in Figure 1.49, you can see the drop down, and you should click on Select by Attributes.

FIGURE 1.49 Selection tool drop down menu

Selecting by attributes brings up the menu shown in Figure 1.50.

FIGURE 1.50 Selection by attributes

Step 1 Make sure the layer name containing the street database is shown in the Layer box at the top of the menu; if not, click on the drop down and select the proper layer.

Step 2 The large box below the Method drop down (which should be set on "Create a new selection" as shown in Figure 1.50) will give the columns in the attribute table for this layer; you should recognize them from the interactive geocoding menu as shown in Figure 1.47. Double-click on the column labeled "Name" and the column "Name" will appear in the box at the bottom of the menu where you will specify the selection formula.

Step 3 Click on the icon for the equal sign, and place your cursor in the box to the right of the equal sign, and type a single quote, type the name of the street you want to select (in this case, Coronet) with no type of suffix like Street, Dr, Blvd—just the street name itself, and then close with a single quote.

Step 4 Click on the verify button, and the program will tell you if you have specified the formula properly, and it will give you a message if no records would be selected by this formula—in which case, something is wrong; perhaps you misspelled the name or perhaps you left out an equal sign or a quote. In any event, once you have it correctly specified, the message will say, "Expression was successfully verified."

Step 5 Click OK on this window, and click Apply on the lowest set of buttons, and your selection will be made.

Step 6 Click Close, and the main map reappears on the screen. Click on the Selection drop down, and if you have made a selection successfully, you should be able to select the command, "Zoom to selected features"; this should reveal a detailed map of Coronet Dr with one previously geocoded incident shown, as in Figure 1.51.

FIGURE 1.51 Coronet Drive highlighted, along with one previously geocoded incident

Step 7 Select the Identify tool from the tool bar at the top of the window; this is a dark circle with a light "i" inside it. Move the cursor to the previously geocoded incident and right-click; a box in the upper left of the screen will show the address of this incident, as in Figure 1.52.

FIGURE 1.52 Identifying a geocoded incident on Coronet Dr

The address of this previously geocoded youth violent incident, 1161 Coronet Dr, can now help us to decide where to put the address we are trying to interactively geocode, 1275 Coronet Dr.
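
The selection formula assembled in the Select by Attributes steps above is a short comparison of a field against a quoted value. The sketch below builds such an expression as a string and mimics its effect on a small in-memory table. It does not call ArcMap; the double-quoted field name is an assumption about how the dialog writes field names for this data source, and the sample records are invented.

```python
def name_equals_expression(field: str, value: str) -> str:
    """Build an expression of the form "Field" = 'value'."""
    escaped = value.replace("'", "''")  # double any embedded single quote
    return f"\"{field}\" = '{escaped}'"


where = name_equals_expression("Name", "Coronet")
print(where)  # "Name" = 'Coronet'

# Hypothetical street segments, with name and type in separate columns.
segments = [
    {"Name": "Coronet", "Type": "Dr"},
    {"Name": "Tyler", "Type": "St"},
    {"Name": "Coronet", "Type": "Dr"},
]

selected = [s for s in segments if s["Name"] == "Coronet"]
with_suffix = [s for s in segments if s["Name"] == "Coronet Dr"]
print(len(selected), len(with_suffix))  # 2 0
```

The empty second result shows why the street name is typed without its suffix: the type is stored in a separate column, so "Coronet Dr" matches nothing in the Name column.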

Step 1 Navigate back into the interactive geocoding menu, by clicking on Tools, moving the cursor to Geocoding, Review/Rematch Addresses, and finally to Geocoding Result: youth_violence_coded1, as in Figure 1.53; this opens the interactive geocoding window once again.

FIGURE 1.53 Navigating the menus back to the interactive geocoding window

Step 2 Click on Match Interactively to bring up the uncoded address list.

Step 3 Click on 1275 Coronet Dr to bring up the address ranges; click on the first range, and click on Zoom to: Candidates.

You can see that this candidate is near the previously coded address of 1161 Coronet Dr, and is closest to the listed range of addresses on the right side of Coronet Dr. This address is on the left side of the street (even addresses on the right side, odd addresses on the left), but the other two candidates (indicated on the map in Figure 1.54) have ranges that are further away than the first one from 1161 and presumably from 1275 Coronet Dr. A reasonable conclusion is that this location is a good one for 1275 Coronet.

FIGURE 1.54 Candidates for 1275 Coronet Dr

Step 4 Maximize the uncoded address list window, make sure the first candidate is highlighted, and click Match to accept this location for 1275 Coronet Dr.

The process of geocoding is a challenging one, but by following the principles and examples we have shown here, you can successfully geocode the great majority of addresses in almost any data set. As you have seen, some addresses just cannot be coded by any reasonable set of assumptions and procedures, but with some diligence you can expect to successfully and reasonably code 90 percent of the addresses in your database. Most analysts would agree that, to adequately represent your data and to avoid the introduction of biases, you should code at least 90 percent. Any less than that could introduce bias into your data and your maps, as the addresses you are unable to code may share some unknown but systematic characteristic or set of characteristics that makes the events and data associated with the uncoded addresses significantly different from the ones you successfully code. Coding almost all of the addresses you have helps ensure that bias does not enter into your research project as a result of the geocoding procedures you have implemented.

Once the geocoding process has been completed, you will want to save your edits and save the files associated with this map and data, and you may want to export the map for inclusion in a report produced by a word-processing program like Microsoft Word, or perhaps a presentation program like Microsoft PowerPoint. The next example shows how to do this for any maps constructed in ArcMap.

To save your geocoding, you first save the edits you have made to the addresses as you proceeded with the interactive geocoding phase. To do this, follow these steps:

Step 1 Click on the Editor drop down menu, and click on Save Edits, as shown in Figure 1.55. This will save your edits to the street addresses you made during the process of geocoding interactively.

FIGURE 1.55 The Editor drop down menu

Step 2 Click on File > Save As, and select an appropriate name and folder to save your map document in. The file, with extension .mxd, has links to all the component files your map was built from, so that every time you load the .mxd file, all the components that are needed to construct the map as you saved it are accessible.

Step 3 Click on Save to complete the operation, as in Figure 1.56.

FIGURE 1.56 Saving an ArcMap file
