EXTRACTING DATA - HTTP Programming Recipes for Java Bots


Now the entire character encoding has been loaded into the variable named buffer.
The first thing to check is whether the first character is a pound sign (#). If it is,
then this is an ASCII encoding, that is, a numeric code such as &#34;. We parse the number
immediately following the pound sign and return the corresponding character.

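  String b = buffer.toString().trim().toLowerCase();
  // startsWith is safe even if the buffer is empty, unlike charAt(0)
  if (b.startsWith("#"))
  {
    try
    {
      // parse the decimal code that follows the pound sign
      return (char) (Integer.parseInt(b.substring(1)));
    } catch (NumberFormatException e)
    {
      // not a valid number, so fall back to an ampersand
      return '&';
    }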

If the number is invalid and a NumberFormatException is thrown, we return an
ampersand (&). Again, since this is an error, returning an ampersand is the best we can do
when decoding the character.

If it is not an ASCII encoding, we look up the name in the charMap, which was set up
earlier. This gives us the ASCII code for the character. For example, the string "quot" is
mapped to ASCII 34, which is the ASCII code for a double quote.

  } else
  {
    if (charMap.containsKey(b))
      return charMap.get(b);
    else
      return '&';
  }
} else
  return ch;

Finally, if the very first if statement failed, we simply return the character unchanged,
because it did not begin a character encoding.
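The decoding steps described above can be pulled together into one small, self-contained sketch. This is not the book's class: the name EntityDecoderSketch, the method decode, and the four entries placed in the map are assumptions made for illustration, standing in for the charMap mentioned in the text. The method takes only the text found between the ampersand and the semicolon.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityDecoderSketch {
    // Hypothetical stand-in for the charMap described in the text;
    // only a few well-known entries are included here.
    private static final Map<String, Character> CHAR_MAP = new HashMap<>();
    static {
        CHAR_MAP.put("quot", (char) 34); // double quote
        CHAR_MAP.put("amp", '&');
        CHAR_MAP.put("lt", '<');
        CHAR_MAP.put("gt", '>');
    }

    /** Decodes the text between '&' and ';', for example "#65" or "quot". */
    public static char decode(String name) {
        String b = name.trim().toLowerCase();
        if (b.startsWith("#")) {
            try {
                // Numeric form: the decimal code follows the pound sign.
                return (char) Integer.parseInt(b.substring(1));
            } catch (NumberFormatException e) {
                // Invalid number: fall back to an ampersand.
                return '&';
            }
        }
        // Named form: look it up, falling back to an ampersand if unknown.
        Character mapped = CHAR_MAP.get(b);
        return mapped != null ? mapped : '&';
    }

    public static void main(String[] args) {
        System.out.println(decode("#65"));   // prints A
        System.out.println(decode("quot"));  // prints "
        System.out.println(decode("#xyz"));  // prints &
        System.out.println(decode("bogus")); // prints &
    }
}
```

Note that, like the fragment in the text, this sketch only understands decimal codes; a hexadecimal form such as #x41 fails to parse and comes back as an ampersand.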

Reading Characters

The HTML parse class contains a function named read, which is called to read the next
character from an HTML file. The function returns zero if an HTML tag is encountered.
It also decodes any special HTML characters.

The function begins by looking for a less-than sign, which signals the beginning of an
HTML tag. If a less-than sign is found, the parseTag method is called and zero is
returned. The tag that parseTag parsed can then be accessed by calling the getTag function.
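To make the contract of read concrete, here is a minimal sketch of the pattern just described. It is an illustration under several assumptions, not the book's implementation: the class name ReadSketch is hypothetical, the input is a String rather than a file, end of input is signaled with -1, the tag is stored as the raw text between the angle brackets, and the decoding of special HTML characters is left out.

```java
public class ReadSketch {
    private final String html;   // input held in memory for simplicity
    private int pos = 0;         // index of the next unread character
    private String lastTag = ""; // raw text of the most recently parsed tag

    public ReadSketch(String html) {
        this.html = html;
    }

    /** Returns the next character, 0 if a tag was parsed, or -1 at end of input. */
    public int read() {
        if (pos >= html.length()) {
            return -1;
        }
        char ch = html.charAt(pos++);
        if (ch == '<') {
            // A less-than sign starts a tag: parse it and report zero.
            parseTag();
            return 0;
        }
        return ch;
    }

    /** Consumes everything up to and including the closing '>'. */
    private void parseTag() {
        int end = html.indexOf('>', pos);
        if (end == -1) {
            end = html.length(); // unterminated tag: take the rest
        }
        lastTag = html.substring(pos, end);
        pos = Math.min(end + 1, html.length());
    }

    /** Returns the tag parsed by the most recent read() that returned zero. */
    public String getTag() {
        return lastTag;
    }

    public static void main(String[] args) {
        ReadSketch parser = new ReadSketch("Hi<b>!");
        StringBuilder text = new StringBuilder();
        int ch;
        while ((ch = parser.read()) != -1) {
            if (ch == 0) {
                System.out.println("tag: " + parser.getTag()); // prints tag: b
            } else {
                text.append((char) ch);
            }
        }
        System.out.println("text: " + text); // prints text: Hi!
    }
}
```

The caller's loop shows why zero is a convenient return value: ordinary text and tags arrive through the same call, and the caller only needs to ask for the tag when read reports zero.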
