Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases

Database Reference

In-Depth Information

The algorithm, as shown in Algorithm 1, takes as input a set of query terms

(denoted as T ) and a data section block (denoted as B ), and identifies as output

a set of data records (denoted as R ). The algorithm first identifies starting leaf

nodes (lines 2-6) by matching each query term with each text node in the Visual

Block tree of the data section. Second, the algorithm forms a data unit group

with each starting leaf node and tries to expand it with leaf nodes that are

horizontally aligned with it (lines 7-13). Third, the algorithm groups leaf nodes

that are not included in data unit groups, and horizontally aligned with each

other, into leaf node groups G l (lines 14-22). Fourth, the algorithm expands each

data unit group with the other data unit group that are horizontally aligned with

it (lines 23-30). Fifth, the algorithm expands each data unit group with other

data unit group and leaf node groups that are vertically adjacent to it (lines

31-42). The algorithm ends when there is no more leaf node group or data unit

group that is vertically adjacent to the expanding data unit group.

To illustrate how the algorithm works, we take the first two data records

shown in Figure 1 as an example. “Plates” has been used as a query term. Two

leaf nodes representing text “Dinner Plates by” are identified as starting leaf

nodes. These two starting leaf nodes are used to initiate two data unit groups.

Those leaf nodes representing the second rows of these two data records form

two leaf node groups. Each data unit group is expanded with a leaf node group

which is vertically adjacent to it. Each extracted data unit group represents a

data record.

6 Experimental Results

We have implemented a prototype in Visual C++ based on our proposed ap-

proach. Every query result page is first parsed by the VIPS into a Visual Block

tree which the prototype takes as input. We have conducted experiments on

a data set of 200 query result pages that are returned from 20 web databases

in the UIUC Web Integration Repository [19]. These web databases are from

5 domains - Books, Jobs, Movies, Music and Hotels. 15 of these pages contain

a single data record. For each web database, 10 result pages are collected after

manually submitting 10 different queries via its query interface. We use two com-

mon measures, recal l and precision , to evaluate the performance of our approach.

Recall is the percentage of the number of data records that have been correctly

extracted over the total number of data records on a result page. Precision is

the percentage of the number of data records that have been correctly extracted

over the total number of data records that have been extracted.

We compare our approach with MDR [3], which is a well known data record

extraction system. We set the similarity threshold for MDR at its recommended

value (60%). Table 1 shows the experimental results of both our approach and

MDR. As we can see from Table 1, our approach has much better experimental

results than MDR in total, and in almost every domain our approach significantly

outperforms MDR. The precision and recall of our approach are both high across

all domains, approaching 100%. Our approach can also extract query result pages

Advances in Databases

Search WWH ::

Custom Search

Home