Database Reference
In-Depth Information
The algorithm, as shown in Algorithm 1, takes as input a set of query terms
(denoted as T ) and a data section block (denoted as B ), and identifies as output
a set of data records (denoted as R ). The algorithm first identifies starting leaf
nodes (lines 2-6) by matching each query term with each text node in the Visual
Block tree of the data section. Second, the algorithm forms a data unit group
with each starting leaf node and tries to expand it with leaf nodes that are
horizontally aligned with it (lines 7-13). Third, the algorithm groups leaf nodes
that are not included in data unit groups, and horizontally aligned with each
other, into leaf node groups G l (lines 14-22). Fourth, the algorithm expands each
data unit group with the other data unit group that are horizontally aligned with
it (lines 23-30). Fifth, the algorithm expands each data unit group with other
data unit group and leaf node groups that are vertically adjacent to it (lines
31-42). The algorithm ends when there is no more leaf node group or data unit
group that is vertically adjacent to the expanding data unit group.
To illustrate how the algorithm works, we take the first two data records
shown in Figure 1 as an example. “Plates” has been used as a query term. Two
leaf nodes representing text “Dinner Plates by” are identified as starting leaf
nodes. These two starting leaf nodes are used to initiate two data unit groups.
Those leaf nodes representing the second rows of these two data records form
two leaf node groups. Each data unit group is expanded with a leaf node group
which is vertically adjacent to it. Each extracted data unit group represents a
data record.
6 Experimental Results
We have implemented a prototype in Visual C++ based on our proposed ap-
proach. Every query result page is first parsed by the VIPS into a Visual Block
tree which the prototype takes as input. We have conducted experiments on
a data set of 200 query result pages that are returned from 20 web databases
in the UIUC Web Integration Repository [19]. These web databases are from
5 domains - Books, Jobs, Movies, Music and Hotels. 15 of these pages contain
a single data record. For each web database, 10 result pages are collected after
manually submitting 10 different queries via its query interface. We use two com-
mon measures, recal l and precision , to evaluate the performance of our approach.
Recall is the percentage of the number of data records that have been correctly
extracted over the total number of data records on a result page. Precision is
the percentage of the number of data records that have been correctly extracted
over the total number of data records that have been extracted.
We compare our approach with MDR [3], which is a well known data record
extraction system. We set the similarity threshold for MDR at its recommended
value (60%). Table 1 shows the experimental results of both our approach and
MDR. As we can see from Table 1, our approach has much better experimental
results than MDR in total, and in almost every domain our approach significantly
outperforms MDR. The precision and recall of our approach are both high across
all domains, approaching 100%. Our approach can also extract query result pages
 
Search WWH ::




Custom Search