Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases - page 151

with single data records, but MDR cannot. Table 1 shows that our approach has slightly higher precision than recall. Data records are missed for two main reasons. First, some data records do not contain any query terms, so our approach cannot identify the appropriate starting leaf nodes. Second, VIPS sometimes divides a data section into multiple sections, and our approach identifies only the largest section as the data section. Data records are extracted incorrectly for three main reasons. First, some noisy blocks are not removed from the data section because they may contain query terms. Second, VIPS sometimes parses result pages incorrectly, so that some data items are missing from the Visual Block tree. Third, VIPS sometimes fails to give correct block positions, which leads to data units missing from some data records. The performance of MDR degrades as the result pages become more complex; it performs relatively well when extracting data records from tables.
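The second cause of missed records can be illustrated with a minimal sketch. It assumes that "largest" means largest rendered area, which the text here does not specify; the `Block` class, its fields, and the block sizes and record counts are all hypothetical and are not taken from VIPS or from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a visual block produced by page segmentation."""
    name: str
    width: int    # rendered width in pixels (assumed unit)
    height: int   # rendered height in pixels (assumed unit)
    records: int  # number of data records inside the block (made-up value)

    @property
    def area(self) -> int:
        return self.width * self.height


def pick_data_section(blocks):
    """Pick the block with the largest area (assumed meaning of 'largest')."""
    return max(blocks, key=lambda b: b.area)


# One logical data section that the segmenter has split into two blocks.
blocks = [Block("upper", 800, 600, 6), Block("lower", 800, 400, 4)]
chosen = pick_data_section(blocks)

# Records in every block that was not chosen are never examined.
missed = sum(b.records for b in blocks if b is not chosen)
print(chosen.name, missed)
```

In this made-up case the four records in the smaller block are lost, which lowers recall without affecting precision.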

Table 1. Comparison results between our approach and MDR

                      Our Approach          MDR
Domain                Precision  Recall     Precision  Recall
Books                 97.86%     96.76%     40.38%     82.01%
Hotel                 99.20%     98.30%     18.21%     32.68%
Jobs                  99.48%     98.37%     99.62%     67.60%
Movies&Music          100%       98.54%     28.05%     72.46%
Single Record Page    100%       100%       0%         0%
Total                 99.26%     98.11%     38.68%     74.86%
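For readers who want to reproduce figures of this kind, the sketch below computes precision and recall from record counts. It assumes the standard definitions (precision = correctly extracted records / all extracted records; recall = correctly extracted records / all records actually on the pages), which this excerpt does not restate. The function name and the counts are hypothetical and are not the paper's data.

```python
def precision_recall(extracted_correct, extracted_total, actual_total):
    """Return (precision, recall) as fractions in [0, 1].

    Assumed standard definitions:
      precision = correct extractions / all extractions
      recall    = correct extractions / all records actually present
    A zero denominator yields 0.0 by convention in this sketch.
    """
    precision = extracted_correct / extracted_total if extracted_total else 0.0
    recall = extracted_correct / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical counts for illustration only.
p, r = precision_recall(extracted_correct=95, extracted_total=100, actual_total=110)
print(f"precision={p:.2%} recall={r:.2%}")
```

Under these definitions, a record that is never extracted lowers recall only, while a wrongly extracted record lowers precision.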

7 Related Work

Automatic extraction of web query results has attracted a lot of attention in recent years, and several automatic extraction systems have been developed. Earlier works mainly focus on finding repetitive patterns and templates in result pages, e.g., IEPAD [13], RoadRunner [12], DeLa [14] and EXALG [15]. More recent techniques exploit tag structures and visual features, e.g., MDR [3], DEPTA [4, 5], MSE [7], ViNTs [6], ViPER [8], ViDE [16] and [9]. The works that use visual features include ViPER [8], ViNTs [6], MSE [7] and ViDE [16]. Of these, ViDE is the most closely related to our approach. It is the first work that is primarily based on visual features. There are several main differences between ViDE and our approach. First, ViDE clusters data units of the same semantics based on the similarity of their appearances, and then groups appropriate data units from each cluster into data records, whereas our approach uses a proximity-based technique to directly group data units that belong to the same data record. ViDE may cluster data units with different semantics, because neighboring data units in the same data record sometimes do not have distinguishable appearances; such units are clustered together and then grouped into different data records. Second, ViDE uses the positions and sizes of visual blocks to determine if a block is the data section. If multiple blocks are identified as candidate data
