Database Reference
In-Depth Information
3.9.7
Exercises for Section 3.9
EXERCISE
3.9.1 Suppose our universal set is the lower-case letters, and the order of ele-
ments is taken to be the vowels, in alphabetic order, followed by the consonants in reverse
alphabetic order. Represent the following sets as strings.
(a) {
q, w, e, r, t, y
}.
(b) {
a, s, d, f, g, h, j, u, i
}.
EXERCISE
3.9.2 Suppose we filter candidate pairs based only on length, as in
Section 3.9.3
.
If
s
is a string of length 20, with what strings is
s
compared when
J
, the lower bound on
Jaccard similarity has the following values: (a)
J
= 0
.
85 (b)
J
= 0
.
95 (c)
J
= 0
.
98?
EXERCISE
3.9.3 Suppose we have a string
s
of length 15, and we wish to index its prefix as
in
Section 3.9.4
.
(a) How many positions are in the prefix if
J
= 0
.
85?
(b) How many positions are in the prefix if
J
= 0
.
95?
!
(c) For what range of values of
J
will
s
be indexed under its first four symbols, but no
more?
EXERCISE
3.9.4 Suppose
s
is a string of length 12. With what symbol-position pairs will
s
be compared with if we use the indexing approach of
Section 3.9.5
,
and (a)
J
= 0
.
75 (b)
J
= 0
.
95?
Strings
s
and
t
are both chosen at random from a universal set of 100 elements. Assume
J
= 0
.
9. What is the probability that
s
and
t
will be compared if
(a)
s
and
t
are both of length 9.
(b)
s
and
t
are both of length 10.
EXERCISE
3.9.6 Suppose we use indexes based on both position and suffix length, as in
Section 3.9.6
.
If
s
is a string of length 20, with what symbol-position-length triples will
s
be compared with, if (a)
J
= 0
.
8 (b)
J
= 0
.
9?