Database Reference
In-Depth Information
affect the similarity of strings. For instance, the Jaccard similarity
between strings 'Doctors Without Borders' and 'Doctor Witout Bor-
der' is zero, even though the two strings are practically the same. In
practice, the most common solution to such problems is, first, to use
stemming on the queries and the data, and second, to correct spelling
mistakes using a dictionary search. Neither of these solutions might be
applicable in certain situations. For example, for mathematical equa-
tions or source code, neither stemming nor a dictionary might always
make sense; for business names, stemming might result in unwanted
side effects (e.g., 'Feed the Children' becomes 'food child' and can result
in many irrelevant answers).
Non-overlapping tokenization results in token sets with storage
requirements equal or smaller than the storage size of the string itself.
The token set can be stored explicitly or, in some cases, hashing can
be used to represent tokens as integers.
3.2 Overlapping Tokens
In overlapping tokenization the idea is to extract all possible substrings
of a given length q of string s . The resulting substrings are more com-
monly referred to as q -grams.
Definition 3.1 (Set of positional
q
-grams). Let s = σ 1 ···
σ ,
σ i
Σ, be a string consisting of characters. The set of positional
q -grams G q ( s ) consists of all q -length substrings of s :
G q ( s )=
{
( σ 1 ···
σ q , 1) , ( σ 2 ···
σ q +1 , 2) ,..., ( σ −q +1 ···
σ ,
q +1)
}
.
For example, let string s = 'UNICEF' and q = 3. The resulting set of
3-grams is G 3 ( s )=
{
}
.
Notice that for strings with length smaller than q the q -
gram set is empty. For simplicity of implementation, when work-
ing with q -grams, the q -grams are extracted over strings extended
with a special beginning and ending character #
('UNI', 1), ('NIC', 2), ('ICE', 3), ('CEF', 4)
Σ. For exam-
ple, for q = 4 the string 'WWF' is extended to '###WWF###'
and G 4 (' WWF ') =
{
('###W', 1), ('##WW', 2), ('#WWF', 3),
 
Search WWH ::




Custom Search