3 String Tokenization
The type of string tokenization is a crucial design choice behind any
string processing framework. There are two fundamental tokenization
approaches: non-overlapping and overlapping tokens. Non-overlapping
tokens are better at capturing the similarity between short or long
query strings and long data strings (or documents) from a relevant
document retrieval perspective. Overlapping tokens are better at
capturing the similarity of strings in the presence of spelling
mistakes and inconsistencies on a sub-token level.
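The contrast between the two approaches can be sketched in a few lines
of Python. This is a minimal illustration, not a method taken from the
text: it assumes whitespace-delimited words as the non-overlapping
tokens, character 3-grams as one common form of overlapping tokens, and
Jaccard set overlap as the similarity measure. The function names
(word_tokens, qgram_tokens, jaccard) are hypothetical.

```python
def word_tokens(s):
    """Non-overlapping tokens: split on whitespace (assumed word boundaries)."""
    return set(s.split())


def qgram_tokens(s, q=3):
    """Overlapping tokens: every substring of length q (assumed q-grams)."""
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Assumed similarity measure: |a & b| / |a | b| over two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


clean = "Doctors Without Borders"
typo = "Doctors Without Bordres"  # two letters transposed in the last word

# Word tokens: the misspelled word matches nothing, so 2 of 4 tokens agree.
word_sim = jaccard(word_tokens(clean), word_tokens(typo))

# 3-grams: only the grams touching the transposition differ (18 of 24 agree).
qgram_sim = jaccard(qgram_tokens(clean), qgram_tokens(typo))

print(word_sim)   # 0.5
print(qgram_sim)  # 0.75
```

Under these assumptions a single transposition costs the word-level
comparison an entire token, while the 3-gram comparison loses only the
few grams that span the error.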
3.1 Non-overlapping Tokens
The most basic instantiation of non-overlapping tokenization is to
consider tokens on word boundaries, sentences or any other natural
boundary depending on the particular domain of strings. For example,
the string s = 'Doctors Without Borders' would be decomposed into the
token set T = {'Doctors', 'Without', 'Borders'}.
The similarity between two strings is evaluated according to the
similarity of their respective token sets. Notice that in this example
minor inconsistencies or spelling mistakes on a word level will
significantly