String Tokenization - Approximate String Processing

3 String Tokenization

The type of string tokenization is a crucial design choice behind any string processing framework. There are two fundamental tokenization approaches: non-overlapping and overlapping tokens. Non-overlapping tokens are better for capturing similarity between short/long query strings and long data strings (or documents) from a relevant document retrieval perspective. Overlapping tokens are better at capturing similarity of strings in the presence of spelling mistakes and inconsistencies on a sub-token level.
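To make the distinction concrete, the following minimal Python sketch tokenizes a string both ways. The function names word_tokens and qgrams are illustrative and do not come from the text. Splitting on whitespace is assumed as the word boundary rule, and fixed-length character q-grams with q = 3 are assumed as one common form of overlapping token.

```python
def word_tokens(s):
    """Non-overlapping tokens: split on whitespace (an assumed boundary rule)."""
    return s.split()


def qgrams(s, q=3):
    """Overlapping tokens: every substring of length q, sliding one character at a time.

    Strings shorter than q are returned as a single token (an assumption of this sketch).
    """
    if len(s) < q:
        return [s] if s else []
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(word_tokens("Doctors Without Borders"))  # ['Doctors', 'Without', 'Borders']
print(qgrams("Doctors"))                       # ['Doc', 'oct', 'cto', 'tor', 'ors']
```

Consecutive q-grams share q - 1 characters, so a single-character error disturbs only a few of them, whereas it changes a word token entirely.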

3.1 Non-overlapping Tokens

The most basic instantiation of non-overlapping tokenization is to consider tokens on word boundaries, sentences or any other natural boundary depending on the particular domain of strings. For example, the string s = 'Doctors Without Borders' would be decomposed into the token set T = {'Doctors', 'Without', 'Borders'}. The similarity between two strings is evaluated according to the similarity of their respective token sets. Notice that in this example minor inconsistencies or spelling mistakes on a word level will significantly
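As a rough illustration of comparing two strings through their word token sets, the sketch below uses Jaccard similarity (intersection size over union size). The text does not name a specific set similarity at this point, so Jaccard is an assumption made here for illustration, and the helper names token_set and jaccard are hypothetical.

```python
def token_set(s):
    """Decompose a string into its set of whitespace-delimited word tokens."""
    return set(s.split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical in this sketch
    return len(a & b) / len(a | b)


clean = token_set("Doctors Without Borders")
typo = token_set("Doctors Witout Borders")  # one word misspelled

print(jaccard(clean, clean))  # 1.0
print(jaccard(clean, typo))   # 0.5
```

With word-level tokens, the single misspelled word no longer matches at all: the intersection drops to two tokens while the union grows to four, so the score falls from 1.0 to 0.5.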
