Databases Reference
In-Depth Information
A GHT is a balanced tree-based data structure that does not require
periodic rebalancing. In a GHT, predefined hashes of the record key determine
all its possible lookup or insertion locations. The locations where a record can
be inserted or looked up are therefore immutable. To insert or look up a record
in a GHT, the record key is hashed to obtain a position within the root node
(see Figure 2). If the corresponding node position at the root node is empty,
the record is inserted there. If there is a collision, the key is rehashed (using
a different hash function) and an attempt is made to insert the key in the
appropriate subtree of the root node. This process is repeated until an empty
node position is found. If a record cannot be inserted in any existing node of
the tree, a new leaf node is added.
Full-text search (keyword search) is the most convenient way to query un-
structured records such as email bodies and reports. Search engines typically
use inverted indexes for this purpose [9]. As shown in Figure 3(a), an inverted
index comprises a dictionary of terms (i.e., words that appear in documents)
plus a posting list for each term, containing the identifiers of all records con-
taining that term (with additional metadata such as term frequency within
the record, term type, and term position within the record). Queries are an-
swered by scanning the posting lists of the terms in the query. The records
referenced in the posting lists are assigned scores for the query, based on simi-
larity measures (e.g., cosine, Okapi, pivoted, Dirichlet [47, 36]). The scores are
used to rank the records, producing an ordered list of results. Multi-keyword
conjunctive queries can be answered by intersecting the posting lists of the
query terms. To make the intersection fast, an additional index such as a B+
tree is usually kept for each posting list, and a zigzag join is used to perform
the intersection [46].
Query
1
3
11
17
36
Query
Data
Base
Worm
Index
1
3
11
17
36
Data
Base
Worm
Index
3
9
31
3
19
3#Data 3#Base 9#Data 19#Base
31#Data
7
36
7
36
3
3
(a) Ordinary Inverted Index
(b) Inverted Index After Merging
Fig. 3. Inverted indexes. With each keyword, a posting list ofIDsofdocuments
containing that keyword is stored. Each posting list is stored as a separate file on
WORM. After merging, the keyword (or its hash) must also be stored in the posting
list.
For a trustworthy version of inverted indexes, each posting list can be
stored in a separate append-only file on WORM storage [20]. In a traditional
inverted index, updates are processed in batches that involve sorting all the
entries and regenerating the posting lists in their entirety; of course this is not
Search WWH ::




Custom Search