Symbol-Table-Handling Techniques (Compiler Writing) Part 3

Hash Symbol Tables

The best search methods for the symbol-table organizations we have discussed so far have had an average length of search on the order of log₂ n comparisons. In this section we investigate an organization in which the search time is essentially independent of the number of records in the table (i.e., the search time is constant for any n). We begin our discussion by introducing some basic terminology and describing the notion of a hashing function. Next, a number of hashing-function strategies are surveyed and the problem of collision resolution is addressed. Finally, a comparison of hash-table organizations and the other table organizations discussed in the section is presented.

To describe adequately the concept of a hash symbol table, some new terminology must be introduced. First, the notions of name space and address space must be described. The name space (also called identifier space or key space) is the set K of unique variable names that can appear in a program. For example, in FORTRAN, the name space is the set of one- to six-character identifiers, each of which begins with a letter and consists of letters and digits. Some programming languages place no restriction on the length of variable names. Of course, in a particular implementation of a compiler some maximum length must be assumed, and therefore the name space is always finite.

The address space (or table space) A is the set of record locations {1, 2, …, m} in a table. An important measure of space utilization is the load factor, which is defined as the ratio of the number of records to the number of record locations in a table. Therefore, if n records currently reside in a table, the load factor α is n/m.


Formally, a hashing function (or key-to-address transformation) is defined as a mapping H: K → A. That is, a hashing function H takes as its argument a variable name and produces a table address (i.e., location) at which the set of attributes for that variable is stored. The function generates this address by performing some simple arithmetic or logical operations on the name or some part of the name.

Before we describe a number of hashing functions, we should introduce the notion of preconditioning as it relates to the name space K. Each element of K usually contains characters which are numeric, alphabetic, or special (i.e., punctuation or operator types of characters). The individual characters of a name are not particularly amenable to arithmetic or logical operations. The process of transforming a variable's name to a form which can be easily manipulated by a hashing function is often called preconditioning. To illustrate this process, let us consider the problem of preconditioning the name AD#1. One alternative is to encode the letters as the numbers 11, 12, …, 36, the digits as 00, 01, …, 09, and the set of special characters as 37, 38, 39, … . Using this scheme, AD#1 can be encoded as 11143701.

Preconditioning can be handled most efficiently by using the numerically coded internal representation (e.g., ASCII or EBCDIC) of each character in the name. On an ASCII machine, AD#1 can be encoded as 1000001 1000100 0100011 0110001 in binary, or 137,433,521. Expressed in EBCDIC, AD#1 is C1C47BF1 in hexadecimal, or 3,250,879,473. The last two preconditioned results are very large numbers that may not be conveniently represented in a given machine (especially a 16-bit word microcomputer). Note also that our example variable name is only four characters in length. The solution to this size problem is to disregard certain digits of the preconditioned result. In fact, some of the hashing functions to be described perform various types of size-reduction transformations to generate a table address. It is common practice to use one hashing function to map names to an intermediate preconditioned key space and then a second hashing function to map the values in that space to a final table location.
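As a sketch of this idea (the function name and the 7-bit packing parameter are illustrative, not from the text), concatenating character codes into a single integer can be written as:

```python
def precondition(name: str, bits_per_char: int = 7) -> int:
    """Pack a variable name into one integer by concatenating the
    numeric code (here, 7-bit ASCII) of each character."""
    key = 0
    for ch in name:
        key = (key << bits_per_char) | ord(ch)
    return key

# 'AD#1' packs to 1000001 1000100 0100011 0110001 in binary
print(precondition("AD#1"))  # 137433521
```

The same routine with `bits_per_char = 8` would model a byte-oriented (e.g., EBCDIC) machine.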

Let us now examine a number of hashing functions that are applicable to symbol-table handling. For notational convenience, we use the term key throughout the discussion to mean a preconditioned numeric representation of a variable name.

The most widely accepted hashing function is the division method, which is defined as

H(x) = (x mod m) + 1

for divisor m. In mapping keys to addresses, the division method preserves, to a certain extent, the uniformity that exists in a key set. Keys which are closely bunched together or clustered are mapped to distinct addresses. For example, the keys 2000, 2001, …, 2017 would be mapped to the addresses 82, 83, …, 99 if the divisor for the division method is 101. Unfortunately, this preservation of uniformity is a disadvantage if two or more clusters of keys are mapped to the same addresses. For example, if another cluster of keys is 3310, 3311, 3313, 3314, …, 3323, and 3324, then those keys are mapped to addresses 79, 80, 82, 83, …, 92, and 93 by divisor 101, and there are many collisions with keys from the cluster starting at 2000. The reason for this is that keys in the two clusters are congruent modulo 101.

In general, if many keys are congruent modulo d, and m is not relatively prime to d, then using m as a divisor can result in poor performance of the division method. This is shown in the preceding example, where m = d = 101. As another example, if all the keys in a table are congruent modulo 5 and the divisor is 65, the keys are mapped to only 13 different positions. Since it is uncommon for a number of keys to be congruent modulo m, where m is a large prime number, a prime divisor should be used, although research has shown that odd divisors with no factors less than 20 are also satisfactory. In particular, divisors which are even numbers are to be avoided, since even and odd keys would be mapped to odd and even addresses, respectively (assuming that the address space is {1, 2, …, m}). This would be a problem in a table containing predominantly even or predominantly odd keys.
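A minimal sketch of the division method, using the (x mod m) + 1 address convention implied by the examples above (the function name is ours):

```python
def h_division(key: int, m: int = 101) -> int:
    """Division-method hash: maps a preconditioned key to an
    address in {1, ..., m}; m is best chosen prime."""
    return key % m + 1

# The cluster 2000, 2001, ... maps to distinct addresses 82, 83, ...,
# while 3310 collides with that range (address 79, 80, ... overlap).
print(h_division(2000), h_division(2017), h_division(3310))  # 82 99 79
```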

A second hashing function that performs reasonably well is the midsquare hashing method. In this method, a key is multiplied by itself and an address is obtained by truncating bits or digits at both ends of the product until the number of bits or digits left is equal to the desired address length. The same positions must be used for all products. As an example, consider a six-digit key, 113586. Squaring the key gives 12901779396. If a four-digit address is required, positions 5 to 8 could be chosen, giving address 1779. The midsquare method has been criticized by Buchholz (1963), but it has given good results when applied to some key sets (see Lum et al., 1971).
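The example can be sketched as follows (function name and decimal-string handling are ours; in a real implementation the product would be kept at a fixed width so that the same positions apply to every key):

```python
def h_midsquare(key: int, start: int = 5, addr_digits: int = 4) -> int:
    """Midsquare hash: square the key and keep addr_digits decimal
    digits beginning at position `start` (1-based, from the left)."""
    product = str(key * key)
    return int(product[start - 1 : start - 1 + addr_digits])

print(h_midsquare(113586))  # 1779  (113586^2 = 12901779396)
```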

For the folding method, a key is partitioned into a number of parts, each of which has the same length as the required address with the possible exception of the last part. The parts are then added together, ignoring the final carry, to form an address. If the keys are in binary form, the exclusive-or operation may be substituted for addition. There are variations of this technique which can best be illustrated by an example involving the key 187249653. In the fold-shifting method, 187, 249, and 653 are added to yield 89. In the fold-boundary method, the digits of the outermost partitions are reversed, so that 781, 249, and 356 are added, yielding 386. Folding is a hashing function which is useful for compressing multiword keys so that other hashing functions can be used.
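Both folding variants can be sketched directly from the 187249653 example (the function names and decimal-string representation are illustrative assumptions):

```python
def fold_shift(key: str, width: int = 3) -> int:
    """Fold-shifting: split into width-digit parts, add, drop the carry."""
    parts = [int(key[i:i + width]) for i in range(0, len(key), width)]
    return sum(parts) % 10 ** width

def fold_boundary(key: str, width: int = 3) -> int:
    """Fold-boundary: reverse the outermost parts before adding."""
    parts = [key[i:i + width] for i in range(0, len(key), width)]
    parts[0], parts[-1] = parts[0][::-1], parts[-1][::-1]
    return sum(int(p) for p in parts) % 10 ** width

print(fold_shift("187249653"), fold_boundary("187249653"))  # 89 386
```

For binary keys, the additions would simply be replaced by exclusive-or operations, as the text notes.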

Another hashing technique which we will refer to as a length-dependent method is used commonly in symbol-table handling. In this approach, the length of the variable name is used in conjunction with some subpart of the name to produce either a table address directly or, more commonly, an intermediate key which is used, for example, with the division method to produce a final table address. McKeeman (1974) experimented with six different length-dependent functions. The function that produced the best results summed the internal binary representation of the first and last characters and the length of the variable name shifted left four binary places (or equivalently the length multiplied by 16). Therefore, ID#1 becomes 201 + 241 + (4 X 16) = 506 assuming an EBCDIC representation. If we treat 506 as an intermediate key and apply the division method with a divisor of 29, the resulting address is 14.
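The ID#1 computation can be sketched as follows; the small EBCDIC table is an excerpt we supply for the example, and the function name is ours:

```python
# Excerpt of EBCDIC code values needed for the example.
EBCDIC = {"A": 0xC1, "D": 0xC4, "I": 0xC9, "L": 0xD3, "#": 0x7B, "1": 0xF1}

def h_length_dependent(name: str, m: int = 29) -> int:
    """McKeeman-style length-dependent hash: first code + last code +
    16 * length forms an intermediate key; the division method then
    yields a final address in {1, ..., m}."""
    key = EBCDIC[name[0]] + EBCDIC[name[-1]] + 16 * len(name)
    return key % m + 1

print(h_length_dependent("ID#1"))  # 14  (intermediate key 201 + 241 + 64 = 506)
```

Note that any two names with the same first character, last character, and length (e.g., AD#1 and ALL1) produce the same intermediate key, and hence collide.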

Thus far we have described how to perform variable name to address transformations using hashing functions, but we have neglected an important aspect relevant to this process, the problem of colliding records. A hashing function is a many-to-one mapping. That is, the name space K is in general much larger than the address space A onto which K is mapped. For example, a FORTRAN program has a name space of 26 + 26 X 36 + 26 X 36² + ··· + 26 X 36⁵ ≈ 1.6 X 10⁹ names. Typically this space is mapped to the address space of a symbol table containing a few hundred record locations. Obviously in this mapping two names can be transformed into the same address. For example, AD#1 and ALL1 are both mapped to the same location 14 using the length-dependent transformation discussed earlier. Of course, two records cannot occupy the same location, and therefore some method must be used to resolve the collisions that can result. A major part of the remainder of this subsection is devoted to the topic of collision resolution.

To minimize the number of collisions, a hashing function should map the variable names in a program to the address space as uniformly as possible. The question of which hashing techniques provide the best such mapping has been addressed in two empirical studies by Buchholz (1963) and Lum et al. (1971). In the Lum study, the division, folding, and midsquare methods, along with some computationally complex methods such as digital analysis, radix transformation, and algebraic coding, were compared using a variety of key sets. We purposely have not discussed some of the more computationally complex methods. They involve considerable preanalysis of the set of variable names in a program or a great deal of arithmetic. In some of these methods the time to perform an address transformation becomes significant when compared with the small search time needed to resolve a collision. Therefore, in such instances, one does better to use a less sophisticated method that may generate more collisions. It is interesting to note, however, that in both of the studies cited earlier one of the simplest methods, the division method, provided the best overall performance.

There are basically two collision-resolution techniques: open addressing and chaining. Algorithms will be presented for both techniques, and certain variations of the basic methods are mentioned. Whenever possible, we will use the same notation as used previously and introduce new notation only when necessary.

Open addressing. With open addressing, if a variable name x is mapped to a storage location d, and this location is already occupied, then other locations in the table are scanned until a free record location is found for the new record. It is possible that the free record location contains a record that was previously deleted. When a record with name field NAME_i is deleted, NAME_i is set to a special value called DEL, which is not equal to the value of any variable name (e.g., a field of all nines). The locations are scanned according to a sequence which can be defined in many ways. The simplest technique for handling collisions is to use the following sequence:

d, d + 1, …, m − 1, m, 1, 2, …, d − 1

A free record location is always found if at least one is available; otherwise, the search halts after scanning m locations. When a record is looked up, the same sequence of locations is scanned until that record is located, or until an empty (never used) record position is found. In the latter case, the required record is not in the table and the search fails. This method of collision resolution is called linear probing.

The following algorithm performs the table-lookup and insertion operations for a hashed symbol table using linear probing with the sequence d, d + 1, …, m − 1, m, 1, 2, …, d − 1. The record to be inserted or located is identified by a search argument SARG; any attributes to be inserted are contained in a subrecord called DATA, and the type of operation to be performed is indicated by the logical parameter INSERT as described previously in this section. It is assumed that if a record location has never contained a record, the corresponding name field has a special value called DUMMY.

Function OPENLP(INSERT, SARG, DATA). Given the parameters INSERT, SARG, and DATA, this function performs the table-lookup and insertion operations on a hashed symbol table using a linear-probe collision-resolution technique. The ith record location in the table holds a name field NAME_i and a subrecord ATTR_i which can contain the other attributes for a variable. The hashing function H is used to calculate an initial address. If the requested operation is successful, the index of the NAME being inserted or searched for is returned; otherwise a negative value is returned. DUMMY and DEL have predefined values which denote the NAME fields of a record not in use and a record marked for deletion, respectively.

1. [Compute initial hash address] Set i ← d ← H(SARG)
2. [Scan the probe sequence] Repeat steps 3 and 4
3. [Examine location i] If NAME_i = SARG,
   then if INSERT, then Return(−i) (duplicate variable name); otherwise Return(i).
   If NAME_i = DUMMY or (INSERT and NAME_i = DEL),
   then if INSERT, then set NAME_i ← SARG and ATTR_i ← DATA, and Return(i);
   otherwise, if NAME_i = DUMMY, then Return(−i) (search argument not in table)
4. [Move to the next location] Set i ← i + 1. If i > m, then set i ← 1.
   If i = d, then Return(0) (unsuccessful operation)

In step 1 of the function an initial address is calculated—any of the hashing functions discussed previously can be used. Step 3 scans a position i, and if the NAME field and the search argument match, the table location is returned in the case of a lookup operation. For an insertion, the negated index is returned, indicating a duplicate variable-name error. If position i is empty or contains a deleted record, the new record is placed into this location during an insertion operation. If the location is empty and a lookup operation is performed, a negated index is returned, indicating the search argument is not present in the table. Step 4 increments the index i and resets i to 1 if necessary. If i becomes equal to d, its initial value, either no record locations are available or the lookup operation fails. In either case, an unsuccessful operation is indicated by returning an index value of 0.
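The behaviour just described can be rendered as a hypothetical Python sketch (the class and method names are ours; addresses are 1-based as in the text, 0-based internally):

```python
DUMMY, DEL = object(), object()   # sentinels: never-used slot, deleted slot

class OpenLPTable:
    """Hashed symbol table with linear-probe collision resolution."""
    def __init__(self, m, h):
        self.m, self.h = m, h             # table size, hashing function
        self.names = [DUMMY] * m
        self.attrs = [None] * m

    def openlp(self, insert, sarg, data=None):
        d = i = self.h(sarg) - 1          # initial address
        while True:
            name = self.names[i]
            if name == sarg:              # duplicate on insert, hit on lookup
                return -(i + 1) if insert else i + 1
            if name is DUMMY or (insert and name is DEL):
                if insert:
                    self.names[i], self.attrs[i] = sarg, data
                    return i + 1
                return -(i + 1)           # empty slot reached: lookup fails
            i = (i + 1) % self.m          # probe sequence d, d+1, ..., wraps
            if i == d:
                return 0                  # all m locations scanned
```

A lookup for a name that was never inserted stops at the first never-used slot, exactly as the text requires; a lookup passes over DEL slots but an insertion may reuse them.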

Let us look at an example involving open addressing. We assume a hashing function which performs the following mapping:

the name NODE is mapped into 1 the name STORAGE is mapped into 2 the names AN and ADD are mapped into 3

the names FUNCTION, B, BRAND, and PARAMETER are mapped into 9

Assuming that the insertions are performed in the following order:

NODE, STORAGE, AN, ADD, FUNCTION, B, BRAND, and PARAMETER

Figure 8-21 represents the resulting structure with m = 11. The first three keys are each placed in a single probe, but then ADD must go into position 4 instead of 3, which is already occupied. FUNCTION is placed in position 9 in one probe, but B and BRAND take two and three probes, respectively. Finally, PARAMETER ends up in position 5 after eight probes, since positions 9, 10, 11, 1, 2, 3, and 4 are all occupied. A lookup is completed successfully when the record with its name equal to SARG is found, or unsuccessfully if an empty record location is encountered. Since steps 1 and 3 apply equally to insertions and lookups, the number of probes for the lookup operations is identical to those just given for the insertion operations for this example.
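The probe counts above can be checked with a short simulation of the Fig. 8-21 insertions (the hash values are simply taken from the assumed mapping):

```python
H = {"NODE": 1, "STORAGE": 2, "AN": 3, "ADD": 3,
     "FUNCTION": 9, "B": 9, "BRAND": 9, "PARAMETER": 9}
m, table, probes = 11, [None] * 12, {}      # positions 1..11 used

for name in ("NODE", "STORAGE", "AN", "ADD",
             "FUNCTION", "B", "BRAND", "PARAMETER"):
    i, count = H[name], 1
    while table[i] is not None:             # linear probing with wraparound
        i, count = i % m + 1, count + 1
    table[i], probes[name] = name, count

print(probes["B"], probes["BRAND"], probes["PARAMETER"])  # 2 3 8
print(table.index("PARAMETER"))                           # 5
```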

Each time that step 3 of Function OPENLP is executed for either insertion or lookup, one comparison is required. For a table of n records, if all records are stored or retrieved, the number of times that step 3 is executed divided by n is the average length of search (ALOS). Knuth (1973) gives a probabilistic model for analyzing collision-resolution techniques and develops formulas for the expected average length of a successful search (E[ALOS]) in the case of open addressing. The model assumes that each key has probability 1/m of being mapped to each of the m addresses in the table.


Figure 8-21 Collision resolution by using open addressing.

Therefore, there are mⁿ ways of mapping n keys to the address space.

E[ALOS] is dependent on the load factor. If α = n/m is the load factor for n and m as defined previously, then Knuth derives the following formulas:

E[ALOS] ≈ ½(1 + 1/(1 − α))    for a successful search
E[ALOS] ≈ ½(1 + 1/(1 − α)²)   for an unsuccessful search

Table 8-2 gives representative values for these formulas with a number of different load factors. E[ALOS] increases with increasing load factor, since a greater number of collisions is probable as more records are being stored in the table. Note that for a < 0.80, the results are quite good as compared with the search strategies discussed previously. The number of comparisons is proportional to the load factor. This result, however, is based on the key set being uniformly mapped onto the address space.
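The Table 8-2 entries follow directly from Knuth's two linear-probing formulas; a quick numerical check (the function name is ours, and the floating-point results match the table to rounding):

```python
def linear_probe_alos(alpha):
    """Expected average lengths of search for linear probing at load
    factor alpha, returned as (successful, unsuccessful)."""
    s = 0.5 * (1 + 1 / (1 - alpha))
    u = 0.5 * (1 + 1 / (1 - alpha) ** 2)
    return s, u

for a in (0.50, 0.80, 0.95):
    s, u = linear_probe_alos(a)
    print(f"{a:.2f}  {s:.3f}  {u:.3f}")   # reproduces the Table 8-2 rows
```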

The linear-probing method of collision resolution has a number of shortcomings. The first is related to how deletions are performed. The approach that is used consists of having a special table entry for the name field with a value of DEL, which denotes the deletion of that entry. This strategy enables us to search the table properly. For example, assume that the record whose key is FUNCTION in Fig. 8-21 is marked for deletion by assigning the value of DEL to the name field of position 9. Then, if it is desired to retrieve the record with a key value of BRAND, our previous algorithm will still work. The reader may wonder: Why bother to use a special value such as DEL to denote deleted entries? Why not simply mark the deleted entry as empty? The reason is that if this were done in the previous example, the algorithm would find an empty position at location 9, decide that BRAND is not in the table, and proceed to insert it once more.

Table 8-2 E[ALOS] for linear probing

Load factor α    Number of probes
                 Successful    Unsuccessful
0.10             1.056         1.118
0.20             1.125         1.281
0.30             1.214         1.520
0.40             1.333         1.889
0.50             1.500         2.500
0.60             1.750         3.625
0.70             2.167         6.060
0.80             3.000         13.000
0.90             5.500         50.500
0.95             10.500        200.500

This solution to the handling of deletions is tolerable if few deletions are made, as is normally the case when interacting with a symbol table. When many deletions occur, however, the table will contain numerous entries that are marked for deletion, and this may result in extensive search times.

Another shortcoming of the linear-probing method is due to clustering effects, which tend to become severe when the table becomes nearly full. This phenomenon can be explained by considering a trace of Fig. 8-21 that would show the state of the table after each insertion. Such a trace is given in Fig. 8-22. When the first insertion is made, the probability of a new element being inserted in a particular position is clearly 1/11. For the second insertion, however, position 2 is twice as likely to become occupied as any remaining available position; namely, the entry is placed in position 2 if the variable name is mapped into either 1 or 2. Continuing in this manner, on the fifth insertion the new entry is five times as likely to be placed in position 5 as in any remaining unoccupied position. Thus the trend is for long sequences of occupied positions to become longer. Such a phenomenon is called primary clustering.

The primary-clustering problem can be reduced if a different probing method is used. A method which accomplishes this is called random probing. This technique generates a random sequence of positions rather than an ordered sequence, as was the case in the linear-probing method. The random sequence generated must contain every integer between 1 and m exactly once. The table is considered to be full when the first duplicate number is encountered. An example of a random-number generator which generates such a cyclic permutation of numbers consists of the statement

y ← (y + c) mod m

where y is the initial number of the sequence (the generator) and c and m are relatively prime; i.e., their greatest common divisor is 1. For example, assuming that m = 11 and c = 7, this statement starting with an initial value of 3 will generate the sequence 10, 6, 2, 9, 5, 1, 8, 4, 0, 7, and 3. Adding 1 to each element then transforms the sequence into numbers in the desired interval [1, 11].
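The cyclic generator and the shift into the address interval can be sketched as follows (the generator-function name is ours):

```python
def probe_sequence(y, c, m):
    """Generate the cyclic permutation y <- (y + c) mod m, shifted into
    the address interval [1, m]. Requires gcd(c, m) == 1 so that every
    address appears exactly once per cycle."""
    for _ in range(m):
        y = (y + c) % m
        yield y + 1

print(list(probe_sequence(3, 7, 11)))
# [11, 7, 3, 10, 6, 2, 9, 5, 1, 8, 4]  (the raw sequence 10, 6, 2, ... plus 1)
```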

Although random probing reduces primary clustering, clustering can still occur. This situation arises when two keys are hashed into the same value. In such a case, the same sequence or path will be generated for both keys by the random-probe method just discussed. This phenomenon is called secondary clustering.

This clustering phenomenon can be alleviated by double hashing (also called rehashing), a method first described by de Balbine (1968). In this technique, the increment value c is computed by using a second hashing function H2 which is independent of the initial hashing function H1 and which generates a value that is relatively prime to the table size. (Two hashing functions are said to be independent if the probability that a name is mapped to the same address by both functions is on the order of 1/m².) The function H2(k) is used to compute the value of c in the cyclic-permutation formula given earlier for random probing [i.e., (y + c) mod m]. Knuth (1973) suggests selecting m to be prime (when using the division method), and setting H1(k) = 1 + (k mod m) and H2(k) = 1 + (k mod (m − 2)), where k is the key and m is the table size.


Figure 8-22 Trace of insertions using open addressing.

This works particularly well if m and m − 2 are "twin primes" such as 1021 and 1019. A key of 125 generates the following sequence of probes for H1 and H2 as just given, assuming m is 13 (H1 and H2 will have values of 9 and 5, respectively):

9, 1, 6, 11, 3, 8, 13, 5, 10, 2, 7, 12, 4
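Under the stated choice of H1 and H2 (and with residue 0 standing for address m), the probe sequence can be generated as follows; the function name is ours:

```python
def double_hash_probes(key, m):
    """Double hashing: initial address h1 = 1 + key mod m, increment
    c = h2(key) = 1 + key mod (m - 2), stepped as (y + c) mod m."""
    y = 1 + key % m                  # h1: initial probe address
    c = 1 + key % (m - 2)            # h2: probe increment
    seq = [y]
    for _ in range(m - 1):
        y = (y + c) % m
        seq.append(y if y != 0 else m)   # map residue 0 to address m
    return seq

print(double_hash_probes(125, 13))
# [9, 1, 6, 11, 3, 8, 13, 5, 10, 2, 7, 12, 4]
```

Because gcd(c, m) = 1 when m is prime, the sequence visits all m addresses before repeating.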

The average length of search for a double-hashing technique where H1 and H2 are independent is given by the following pair of formulas:

E[ALOS] ≈ (1/α) ln(1/(1 − α))   for a successful search
E[ALOS] ≈ 1/(1 − α)             for an unsuccessful search

Table 8-3 gives a summary of representative values for a double-hashing method. Its performance is certainly better than that obtained for linear probing.

There are three main difficulties with the open-addressing method discussed thus far. First, when trying to locate an open location for record insertion, it is often necessary to examine records that do not have the same initial hash value (i.e., lists of colliding records for different hash values become intermixed). Second, a table-overflow situation cannot be satisfactorily handled using open addressing. If an overflow occurs, the entire table must be reorganized. The overflow problem cannot be ignored in symbol tables for compilers because the table-space requirements can vary significantly depending on the size of the source program. The third problem is the difficulty of physically deleting records, although, as mentioned previously, the deletion of individual records rarely occurs in symbol-table handling. Let us turn to a chained-allocation organization to alleviate some of these problems.

Table 8-3 E[ALOS] for random probing with double hashing

Load factor α    Number of probes
                 Successful    Unsuccessful
0.10             1.054         1.111
0.20             1.116         1.250
0.30             1.189         1.429
0.40             1.277         1.667
0.50             1.386         2.000
0.60             1.527         2.500
0.70             1.720         3.333
0.80             2.012         5.000
0.90             2.558         10.000
0.95             3.153         20.000

Separate chaining. Chaining can be used in a variety of ways to handle overflow records; however, we will concentrate only on the most popular method, called separate chaining. This method involves the chaining of colliding records into a special overflow area which is separate from the prime area (i.e., the area of the table into which records are hashed initially). A separate chain (i.e., linked list) is kept for each set of colliding records, and consequently a pointer field must accompany each record in a prime or an overflow location. Figure 8-23 shows a separate-chaining representation of the variables used earlier in this subsection, assuming a prime area of 11 locations and the same order of declarations. Note that the insertions in a list of colliding records are made in alphabetical order of the variable names. This technique, suggested by Amble and Knuth (1974), has been shown to reduce the lookup time for unsuccessful searches. Since an unsuccessful search determines that a variable should be declared implicitly, this strategy is commonly adopted.

In the following algorithm, named CHAINSL (meaning chaining with separate lists), the three fields in both a prime area record and an overflow record are designated as NAME, ATTR, and LINK. NAME and ATTR have identical roles as in previous algorithms, and LINK contains the address of the next node in the list of colliding records for a given primary location. The end of a list is signified by placing a NULL value in the link field of the last record in the list. The prime and overflow areas of the symbol table can be implemented as two separate tables, or they can be considered as subtables within a single symbol table.

Function CHAINSL(INSERT, SARG, DATA). Given the parameters INSERT, SARG, and DATA as described previously, this algorithm performs the table lookup and insertion operations on a hashed symbol table using the separate chaining collision-resolution technique.


Figure 8-23 Collision resolution with separate chaining.

The ith record location in the prime area of the table holds a name field NAME_i, a subrecord ATTR_i which contains the other attributes for a variable, and a link field LINK_i. In a similar fashion, the index j is used to denote the current overflow record that is being examined. The index p contains the previous value of j, and k is used as a temporary index variable. The hashing function H is used to calculate an initial address. If the requested operation is successful, the index of the NAME being inserted or searched for is returned; otherwise a negative value is returned. DUMMY has a predefined value which denotes the NAME field of a record not in use.


The algorithm performs the insertion and lookup operations by first examining the prime area location, as determined by the hashing function, and then the overflow area if necessary. Note that for explicit declarations the algorithm can be improved by having insertions performed at the front of a list of unordered overflow records. This approach allows for fast insertion; however, it does not guarantee that duplicate declarations will be detected.
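The behaviour of Function CHAINSL can be sketched compactly in Python (our own rendering: each chain is held as an alphabetically ordered Python list rather than an explicit overflow area with LINK fields):

```python
class ChainedTable:
    """Hashed symbol table with separate chaining; colliding records
    are kept in alphabetical order, as suggested by Amble and Knuth."""
    def __init__(self, m, h):
        self.h = h
        self.chains = [[] for _ in range(m)]   # one chain per prime slot

    def chainsl(self, insert, sarg, data=None):
        chain = self.chains[self.h(sarg) - 1]
        pos = len(chain)
        for k, (name, attr) in enumerate(chain):
            if name == sarg:
                return None if insert else attr   # duplicate / found
            if name > sarg:                       # ordered: can stop early
                pos = k
                break
        if insert:
            chain.insert(pos, (sarg, data))       # keep alphabetical order
            return data
        return None                               # unsuccessful lookup

H = {"AN": 3, "ADD": 3, "NODE": 1}                # assumed hash values
t = ChainedTable(11, lambda name: H[name])
t.chainsl(True, "AN", "INTEGER")
t.chainsl(True, "ADD", "REAL")
print(t.chains[2])  # [('ADD', 'REAL'), ('AN', 'INTEGER')]
```

Because the chain is ordered, an unsuccessful lookup can stop as soon as it passes the point where the name would appear, which is the source of the improvement the text mentions.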

It should be observed that the special deletion marker DEL is not considered in Function CHAINSL. On the rare occasions when a deletion may be necessary (e.g., recovering from an error in a declaration), it is assumed that the erroneous record is physically deleted. A deletion operation can be performed easily when a chaining type of collision resolution is used.

Knuth shows that an expected average length of search for separate chaining is given as follows:

E[ALOS] ≈ 1 + α/2       for a successful search
E[ALOS] ≈ α + e^(−α)    for an unsuccessful search

Representative values for this method are given in Table 8-4. It is desirable to make the load factor as small as possible, which can be achieved by making m large. Although additional storage is required to store the links using this resolution technique, its performance and versatility make chaining superior to open addressing for most symbol-table applications. The open-addressing scheme is easier to implement, and because of its efficient utilization of storage, it should be considered when implementing a compiler on a small machine.

Table 8-4 E[ALOS] with separate chaining

Load factor α    Number of probes
                 Successful    Unsuccessful
0.10             1.050         1.005
0.20             1.100         1.019
0.30             1.150         1.041
0.40             1.200         1.070
0.50             1.250         1.107
0.60             1.300         1.149
0.70             1.350         1.197
0.80             1.400         1.249
0.90             1.450         1.307
0.95             1.475         1.337

Before this subsection is concluded, a few remarks are necessary concerning the representation of a symbol table when using a hashing scheme. First, we should observe that the address space A into which a set of variable names is mapped is assumed always to be of some fixed size m. In general, the larger the size of A relative to the number n of table records stored, the better the hashing function performs (i.e., the smaller the load factor α, the better the performance). This is verified in Tables 8-2 to 8-4. To avoid making the symbol table extremely large, and hence expensive in terms of storage, an intermediate table, often called the hash table, is created. Figure 8-24 illustrates how a hash table is used in conjunction with the symbol table, assuming the same variables, hashing function, and collision-resolution technique as in Fig. 8-23.


Figure 8-24 Hashing using a hash table.

In this method all of the table records can be thought of as residing in an overflow area and the prime area (i.e., the hash table) contains only link fields. Therefore, the symbol table, which contains large records, is packed densely and the hash table, which contains records with only a link field, can be made large while still requiring little space overall. It is a relatively straightforward process to adapt a hash table to an open-addressing strategy. In the section to follow, a hash table is of particular importance and is used extensively when discussing symbol tables for block-structured languages.
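One way to realize this arrangement is sketched below (our own naming; 0 plays the role of NULL, and for brevity new records are linked at the front of a chain rather than in alphabetical order):

```python
class HashLinkedSymtab:
    """Dense symbol table plus a separate hash table of link fields only."""
    def __init__(self, hash_size, h):
        self.h = h
        self.heads = [0] * hash_size       # hash table: one link per slot
        self.names, self.attrs, self.links = [None], [None], [None]  # 1-based

    def insert(self, name, attr):
        d = self.h(name) - 1
        self.names.append(name)            # records packed densely
        self.attrs.append(attr)
        self.links.append(self.heads[d])   # new record points at old head
        self.heads[d] = len(self.names) - 1

    def lookup(self, name):
        j = self.heads[self.h(name) - 1]
        while j != 0:                      # follow the chain of links
            if self.names[j] == name:
                return self.attrs[j]
            j = self.links[j]
        return None

st = HashLinkedSymtab(64, lambda s: sum(map(ord, s)) % 64 + 1)
st.insert("NODE", "INTEGER")
print(st.lookup("NODE"))  # INTEGER
```

The hash table can be made large (lowering the effective load factor) at the cost of only one small link field per slot, while the record area grows exactly with the number of declarations.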

We conclude this section with a synopsis of the table organizations discussed. If it is known that only a few variables (i.e., 10 or fewer) are going to appear in a program, an unordered table should be considered. Such situations are rare, however. Adopting a binary-search strategy for an ordered table can improve the lookup operation if more than 25 records occur. Also, ordering the table facilitates making a cross-reference listing. If insertion activity is high, a tree-structured table can provide better overall performance, mainly because of the ease with which insertions can be handled. Tree-structured tables are particularly good if certain properties of a language are present (e.g., six or fewer character names, as in FORTRAN). The best method to use if memory is not at a premium is hashing. An average length of search of between one and three accesses is relatively easy to achieve. The main disadvantage with a hash symbol table is that since the records are not ordered either physically or logically by variable name, such an organization is not conducive to building a cross-reference listing.

It is common practice, when designing a compiler, to include a parameter which allows the user to estimate the size of the symbol-table requirements for a program. A possible strategy is to have the compiler select an appropriate symbol-table organization depending upon the table size requested.

Let us now turn to an examination of symbol-table handling for block-structured programming languages.
