Extracting data records from query result pages enables integrating data from
a multitude of web databases to generate value-added web applications, such
as price comparison sites and meta-search engines. Query result pages are
dynamically generated from back-end databases in response to user queries and
encoded in HTML using pre-defined templates or script programs. These pages
are semi-structured and displayed for human use, rather than for processing by
programs. Automatically extracting data records into a structured,
machine-processable form is a very challenging problem.
There has been extensive research on fully automatic approaches [3-16] for
extracting data from query result pages. Those in [3-10] represent the current
technical trend of query result extraction. First, they identify a data section,
which contains a set of data records. Second, they identify data records from the
data section. Finally, they extract data by aligning the corresponding attributes
of different records, producing a relational table [4, 5, 8, 10].
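The final alignment step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each extracted record is already available as a mapping from a tag path to its text; the function name align_records and the sample tag paths are hypothetical and are not taken from any of the cited approaches.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a record maps a tag path (used as the attribute key) to its text.
Record = Dict[str, str]


def align_records(
    records: List[Record],
) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Align corresponding attributes of different records into a table.

    Columns are the tag paths in order of first appearance. A record that
    lacks an attribute gets None in that column.
    """
    columns: List[str] = []
    for record in records:
        for path in record:
            if path not in columns:
                columns.append(path)
    rows = [[record.get(path) for path in columns] for record in records]
    return columns, rows


# Two hypothetical records; the second one has no availability attribute.
records = [
    {"div/a": "Book A", "div/span[1]": "$10", "div/span[2]": "In stock"},
    {"div/a": "Book B", "div/span[1]": "$12"},
]
columns, rows = align_records(records)
```

Each row then has one cell per column, so the result can be loaded directly as a relational table.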
However, the existing approaches to query result extraction have some inherent
limitations. First, web pages are becoming more complex: their tag structures
keep growing in complexity as HTML itself evolves, and other technologies
such as JavaScript and CSS are widely deployed to make result pages more
dynamic. As a result, the visual layout of a result page may differ from its
tag tree or token string representation, and the existing approaches that
rely on such representations may fail. Second, some of the existing approaches
employ a similarity measure on page segments to identify data records. However,
data records may not be extracted correctly if the sibling tree segments of the
same root are not similar to each other. Relying on similarity also makes it
impossible to extract a data record that is the only one in the data section.
Third, most of the existing approaches do not filter out noisy contents. Noisy
contents are any parts of a query result page that are not part of any data
record, e.g. banner advertisements, navigation bars, copyright notices, and
record statistical information. We are most interested in the part of a result
page that contains all the data records and as few noisy contents as possible,
since noisy contents often reduce the accuracy of data record extraction. Thus
it is very important to remove noisy contents before data record extraction.
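The second limitation can be made concrete with a small sketch. It assumes that each sibling segment is flattened to a list of tag names and compared with a generic sequence-similarity ratio from the Python standard library; the 0.8 threshold, the function names, and the sample segments are illustrative assumptions, not the measure used by any of the cited approaches.

```python
from difflib import SequenceMatcher
from typing import List


def tag_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, a, b).ratio()


def similar_siblings(segments: List[List[str]], threshold: float = 0.8) -> List[int]:
    """Return indices of sibling segments that resemble at least one other sibling."""
    hits = []
    for i, segment in enumerate(segments):
        if any(
            i != j and tag_similarity(segment, other) >= threshold
            for j, other in enumerate(segments)
        ):
            hits.append(i)
    return hits


# Hypothetical segments: two records with the same tag structure and one
# noisy segment (e.g. a banner).
records = [["div", "a", "span", "span"], ["div", "a", "span", "span"]]
noise = ["div", "img"]

two_records = similar_siblings(records + [noise])    # both records are found
one_record = similar_siblings([records[0], noise])   # a lone record is missed
```

With two records the pair is detected and the noisy segment is ignored, but when only one record is present there is no similar sibling to match it against, so nothing is returned.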
In this paper, we focus on the problem of data record extraction, that is, given
a query result page that contains a single column of data records, automatically
identify the data section and data records. We propose a novel approach to
overcome the limitations of the existing approaches. First, our approach
transforms a query result page into a Visual Block tree using the VIPS
algorithm [17]; the tree represents a visual partition of the web page. Such a
representation reflects the content structure of the page enforced by visual
cues, so that content-related data items are represented in the same branch of
the Visual Block tree. For example,
Figure 2 shows a visual partition of the result page shown in Figure 1; Figure 3
shows part of the Visual Block tree for visual block b_{1 1 2 2 3 4}. We can
also obtain visual features (e.g., position, width, and height) of each block
in the Visual Block tree.
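A Visual Block tree of this kind can be pictured as a simple recursive data structure. The sketch below is only an assumed in-memory representation for illustration: the class name VisualBlock, its fields, and the block labels are hypothetical and do not reflect the actual output format of the VIPS algorithm.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class VisualBlock:
    """One node of a Visual Block tree with its visual features (assumed fields)."""

    label: str
    x: int        # left position in pixels
    y: int        # top position in pixels
    width: int
    height: int
    children: List["VisualBlock"] = field(default_factory=list)

    def area(self) -> int:
        """Rendered size of the block in square pixels."""
        return self.width * self.height

    def walk(self) -> Iterator["VisualBlock"]:
        """Yield this block and all of its descendants in pre-order."""
        yield self
        for child in self.children:
            yield from child.walk()


# A hypothetical page: a header block and a content block holding one sub-block.
root = VisualBlock("b1", 0, 0, 800, 600, [
    VisualBlock("b1-1", 0, 0, 800, 100),
    VisualBlock("b1-2", 0, 100, 800, 500, [
        VisualBlock("b1-2-1", 0, 100, 800, 120),
    ]),
])
labels = [block.label for block in root.walk()]
```

Keeping the position and size on every node means that later steps can reason about the rendered layout without going back to the HTML tag tree.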
Second, our approach identifies the data section by exploiting the sizes of the
visual blocks of the result page, and counting the occurrences of query terms