the noisy blocks. Second, we propose an approach for identifying data records
based on an observation that the data units of a data record are visually aligned
with and close to each other, and that they are distant from the data units of
the other data records. By grouping data units in this way, our approach does
not miss data records that are dissimilar to the other data records, and it
can extract even a single data record from a query result page.
The rest of this paper is organized as follows. Section 2 presents web page
representation, the problem definition and an overview of our approach. Sections
3-5 describe our approaches for identifying data sections, removing noisy blocks
and identifying data records, respectively. Experimental results are given in Section 6. Section
7 discusses related work. Section 8 concludes the paper.
2 Fundamentals and Overview
In this section, we first introduce Visual Block trees and give a formal definition
of the rendering box model of web pages based on the Visual Block tree, which is
the basis of our approach. We then define the problem of data record extraction
and present an overview of our approach.
2.1 Visual Representation of Query Result Pages
The content of a query result page is typically organized into different regions
(e.g., advertisements, a menu bar, sponsor links and query results) to make it
easy for people to read. Each region contains semantically related content. Visual cues
(e.g., lines, spaces, font sizes and background colours) can be used to distinguish
regions from each other. To make use of visual features for data record extraction,
we employ the VIPS [17] algorithm to represent a query result page as a
Visual Block tree. The root of the tree represents the entire page and each node
represents a rendering box (a visual block) on the page. A leaf node represents a
block containing a basic semantic unit that cannot be further decomposed, e.g.,
a piece of text or an image. Node a is an ancestor of node b if the block that a represents
contains the block that b represents on the page. The blocks represented by
nodes at the same level of the tree do not overlap. The order of the child nodes
with the same parent follows the order of the blocks they represent on the page,
i.e., top-down, left-right. For example, Figure 2 shows the visual block layout
produced by the VIPS algorithm for the query result page shown in Figure 1.
Here, b_1 represents the body of the page, b_1_1_2_1 represents the block
containing the category links on the page, b_1_2 contains the website information
and b_1_1_2_2_3_4 contains all data records, denoted b_1_1_2_2_3_4_1 to b_1_1_2_2_3_4_10.
Figure 3 shows part of the Visual Block tree for b_1_1_2_2_3_4.
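The properties of the Visual Block tree described above (containment between ancestor and descendant blocks, and top-down, left-right ordering of siblings) can be illustrated with a minimal Python sketch. This is not the VIPS algorithm and not the authors' implementation; the class name VisualBlock, its fields, the path-style block labels and the pixel coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VisualBlock:
    """One node of a Visual Block tree: a rendering box on the page (hypothetical model)."""
    block_id: str                    # illustrative path-style label, e.g. "1_1_2"
    x: float                         # left edge of the box
    y: float                         # top edge of the box
    width: float
    height: float
    content: Optional[str] = None    # text or image reference, set on leaf blocks
    children: List["VisualBlock"] = field(default_factory=list)

    def contains(self, other: "VisualBlock") -> bool:
        """True if this block's box encloses the other block's box."""
        return (self.x <= other.x
                and self.y <= other.y
                and self.x + self.width >= other.x + other.width
                and self.y + self.height >= other.y + other.height)

    def add_child(self, child: "VisualBlock") -> None:
        """Attach a child and keep siblings in top-down, left-right order."""
        if not self.contains(child):
            raise ValueError("a child block must lie inside its parent")
        self.children.append(child)
        self.children.sort(key=lambda b: (b.y, b.x))

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List["VisualBlock"]:
        """Leaf blocks in page order (depth-first over the ordered children)."""
        if self.is_leaf():
            return [self]
        result: List["VisualBlock"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result


# Toy page: a main area above a footer; the main area has two columns.
page = VisualBlock("1", 0, 0, 800, 600)
footer = VisualBlock("1_2", 0, 500, 800, 100, content="site info")
main = VisualBlock("1_1", 0, 0, 800, 500)
page.add_child(footer)   # inserted out of order on purpose
page.add_child(main)
main.add_child(VisualBlock("1_1_2", 200, 0, 600, 500, content="results"))
main.add_child(VisualBlock("1_1_1", 0, 0, 200, 500, content="links"))

print([b.block_id for b in page.leaves()])  # ['1_1_1', '1_1_2', '1_2']
```

Sorting siblings by (y, x) is one simple way to realize the top-down, left-right order; a real segmentation would also have to guarantee that sibling blocks do not overlap, which this sketch does not check.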
2.2 Overview of Our Approach
Given the Visual Block tree of a query result page, first we identify a visual block
that contains all the data records and treat it as the data section. Second, we
 