with single data records, but MDR cannot. Table 1 shows that our approach
has slightly higher precision than recall. The main reasons for missing data
records are as follows. First, some data records do not contain any
query terms, so our approach cannot identify the appropriate starting leaf nodes.
Second, VIPS sometimes divides a data section into multiple sections, and
our approach identifies only the largest section as the data section. The main
reasons for extracting data records incorrectly are as follows. First, some noisy
blocks are not removed from the data section because they may contain
query terms. Second, VIPS sometimes parses result pages incorrectly, so that
some data items are missing from the Visual Block tree. Third, VIPS sometimes
fails to give correct block positions, which leads to data units missing
from some data records. The performance of MDR degrades as the complexity of
the result pages increases, and it performs relatively well when extracting
data records from tables.
Table 1. Comparison results between our approach and MDR

                      Our Approach           MDR
Domain                Precision  Recall      Precision  Recall
Books                 97.86%     96.76%      40.38%     82.01%
Hotel                 99.20%     98.30%      18.21%     32.68%
Jobs                  99.48%     98.37%      99.62%     67.60%
Movies&Music          100%       98.54%      28.05%     72.46%
Single Record Page    100%       100%        0%         0%
Total                 99.26%     98.11%      38.68%     74.86%
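
The percentages in Table 1 can be read against the usual definitions of the two metrics. The sketch below assumes those standard definitions (precision is correctly extracted records over all extracted records; recall is correctly extracted records over all records actually present), which this section does not restate. The function name and the counts are hypothetical and are not taken from the experiments.

```python
from typing import Tuple


def precision_recall(correct: int, extracted: int, actual: int) -> Tuple[float, float]:
    """Return (precision, recall) as fractions in [0, 1].

    correct   -- records extracted correctly
    extracted -- all records the extractor returned
    actual    -- all records really present on the result pages
    A zero denominator yields 0.0 instead of raising (an assumption of this sketch).
    """
    precision = correct / extracted if extracted else 0.0
    recall = correct / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(correct=90, extracted=100, actual=120)
print(f"precision={p:.2%} recall={r:.2%}")
```

Under these definitions, a method whose precision exceeds its recall misses more records than it extracts incorrectly.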
7 Related Work
Automatic extraction of web query results has attracted considerable attention
in recent years, and several automatic extraction systems have been developed.
Earlier works mainly focus on finding repetitive patterns and templates in result
pages, e.g., IEPAD [13], RoadRunner [12], DeLa [14] and EXALG [15]. Recent
techniques have focused on exploiting tag structures and visual features, e.g.,
MDR [3], DEPTA [4, 5], MSE [7], ViNTs [6], ViPER [8], ViDE [16] and [9].
The works that use visual features include ViPER [8], ViNTs [6], MSE [7] and
ViDE [16]. ViDE is the most closely related to our approach. It is the first work that
is primarily based on visual features. There are several main differences between
ViDE and our approach. First, ViDE clusters data units of the same semantics
based on similarity between their appearances, and then groups appropriate data
units from each of the clusters into data records. Our approach uses a proximity-
based technique to directly group data units in the same data records. ViDE
may cluster data units with different semantics because sometimes neighboring
data units in the same data record may not have distinguishable appearances,
resulting in them being clustered together and then grouped into different data
records. Second, ViDE uses the positions and sizes of visual blocks to determine
if a block is the data section. If multiple blocks are identified as candidate data
 