('WWF#', 4), ('WF##', 5), ('F###', 6)}. This representation also
has the advantage that the beginning and ending characters are
represented explicitly using their own unique q-grams, hence capturing
the beginning and ending of the string more accurately.
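
The positional q-grams listed above can be reproduced with a short sketch. It assumes that the string is padded with q - 1 copies of the same special character '#' on each side and that positions are counted from 1; both conventions are inferred from the visible tail of the example, and the function name positional_qgrams is hypothetical.

```python
def positional_qgrams(s, q, pad="#"):
    """Return the positional q-grams of s as a set of (gram, position) pairs.

    Assumptions (not confirmed by the text): q - 1 copies of `pad` are added
    on each side of s, and positions are 1-based.
    """
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {(padded[i:i + q], i + 1) for i in range(len(padded) - q + 1)}


if __name__ == "__main__":
    # Print the grams of the padded string "###WWF###" in positional order.
    for gram, pos in sorted(positional_qgrams("WWF", 4), key=lambda g: g[1]):
        print(pos, gram)
```

Under these assumptions, the last three grams of "WWF" with q = 4 are ('WWF#', 4), ('WF##', 5) and ('F###', 6). The grams that contain padding are what mark the beginning and the end of the string explicitly.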
One can also generalize to sets of grams of various sizes (as opposed
to fixed-length q-grams); for example, variable-length grams, or all
grams of length one up to length q. The larger the gram set, the more
information it captures about the string. The similarity of gram sets
can be assessed using any set similarity function.
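
As an illustration of one such generalization, the sketch below collects all grams of length one up to length q. It uses no padding and keeps only distinct grams; these are simplifying assumptions, and the function name grams_up_to is hypothetical.

```python
def grams_up_to(s, q):
    """Return the set of all distinct substrings of s with length 1..q.

    Assumption: no padding characters are added, and duplicates are dropped.
    """
    return {
        s[i:i + k]
        for k in range(1, q + 1)
        for i in range(len(s) - k + 1)
    }


if __name__ == "__main__":
    # The gram set grows as q increases.
    for q in (1, 2, 3):
        print(q, sorted(grams_up_to("WWF", q)))
```

For "WWF" the set grows from {'F', 'W'} at q = 1 to {'F', 'W', 'WF', 'WW'} at q = 2 and gains 'WWF' at q = 3, so each increase in q adds longer substrings to the representation.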
The advantage of gram sets is that they can be used to evaluate the
similarity of strings at the substring level, where small inconsistencies
or spelling mistakes do not affect the similarity of the gram sets
significantly.
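
A small sketch makes this robustness concrete. It compares padded 3-gram sets with Jaccard similarity, which is only one possible choice of set similarity function; the padding convention and the helper names qgram_set and jaccard are assumptions made for illustration.

```python
def qgram_set(s, q, pad="#"):
    """Set of q-grams of s, padded with q - 1 copies of `pad` on each side."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    correct = qgram_set("similarity", 3)
    typo = qgram_set("similarty", 3)      # one missing character
    other = qgram_set("database", 3)      # unrelated word
    print(round(jaccard(correct, typo), 3))
    print(round(jaccard(correct, other), 3))
```

An exact comparison treats "similarity" and "similarty" as simply different, whereas their gram sets still share 9 of 14 distinct grams. The unrelated word shares none.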
On the other hand, since representing strings as gram sets increases
the representation size of the string, gram sets have the drawback of
significantly larger space requirements, even when hashing is used. In
addition, given the increased representation size, the evaluation of the
respective similarity functions becomes expensive, especially for long
strings.
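
The space overhead can be estimated with a back-of-the-envelope sketch. It assumes one byte per character, q - 1 padding characters on each side (so a string of length n yields n + q - 1 grams), every gram stored even if repeated, and a fixed 4-byte hash value per gram. All of these figures, and the function name storage_bytes, are illustrative assumptions rather than values taken from the text.

```python
def storage_bytes(n, q, hash_bytes=4):
    """Rough storage estimate for a string of length n and its q-gram form.

    Assumptions: 1 byte per character, q - 1 pad characters on each side,
    all grams kept (duplicates included), `hash_bytes` bytes per hashed gram.
    """
    num_grams = n + q - 1
    return {
        "raw": n,                          # the string itself
        "grams": num_grams * q,            # grams stored as text
        "hashed": num_grams * hash_bytes,  # grams stored as hash values
    }


if __name__ == "__main__":
    print(storage_bytes(100, 3))
```

For n = 100 and q = 3 this gives 100 bytes for the raw string, 306 bytes for the grams stored as text and 408 bytes for 4-byte hashes. Hashing fixes the cost per gram, but the number of grams still grows with the length of the string.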