Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases

the noisy blocks. Second, we propose an approach for identifying data records

based on an observation that the data units of a data record are visually aligned

with and close to each other, and that they are distant from the data units of

the other data records. By grouping data units in such a way, our approach does

not miss any data record that is not similar to the other data records, and our

approach can extract a single data record from a query result page.
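The grouping idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes each data unit is reduced to a vertical position and a height, it ignores alignment, and it uses only a hypothetical gap threshold (max_gap) to decide whether two units are "close". The names DataUnit and group_by_vertical_gap are made up for illustration.

```python
from typing import List, NamedTuple


class DataUnit(NamedTuple):
    """A rendered piece of text or an image, reduced to its vertical extent."""
    text: str
    top: float
    height: float


def group_by_vertical_gap(units: List[DataUnit], max_gap: float) -> List[List[DataUnit]]:
    """Group units whose vertical distance to the current group is at most max_gap.

    Assumption: proximity alone stands in for the visual cues described above.
    """
    ordered = sorted(units, key=lambda u: u.top)
    groups: List[List[DataUnit]] = []
    group_bottom = 0.0
    for unit in ordered:
        if groups and unit.top - group_bottom <= max_gap:
            # Close to the current group: treat as part of the same record.
            groups[-1].append(unit)
            group_bottom = max(group_bottom, unit.top + unit.height)
        else:
            # Distant from the previous units: start a new record.
            groups.append([unit])
            group_bottom = unit.top + unit.height
    return groups


units = [
    DataUnit("Title A", top=0, height=20),
    DataUnit("Price A", top=22, height=20),
    DataUnit("Title B", top=70, height=20),
    DataUnit("Price B", top=92, height=20),
]
records = group_by_vertical_gap(units, max_gap=10)
single = group_by_vertical_gap([DataUnit("Only title", top=0, height=20)], max_gap=10)
```

Because the grouping relies on distance rather than on similarity between records, a page with a single record still yields one group in this sketch.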

The rest of this paper is organized as follows. Section 2 presents web page

representation, the problem definition and an overview of our approach. Sections

3-5 describe our approaches for identifying data sections, removing noisy blocks

and identifying data records. Experimental results are given in Section 6. Section

7 discusses related work. Section 8 concludes the paper.

2 Fundamentals and Overview

In this section, we first introduce Visual Block trees and give a formal definition

of the rendering box model of web pages based on the Visual Block tree, which is

the basis of our approach. We then define the problem of data record extraction

and present an overview of our approach.

2.1 Visual Representation of Query Result Pages

The content of a query result page is typically organized into different regions

(e.g., advertisements, menu bars, sponsor links and query results) to make the

page easy for people to use. Each region contains semantically related content. Visual cues

(e.g., lines, spaces, font sizes and background colours) can be used to distinguish

regions from each other. To make use of visual features for data record extraction,

we employ the VIPS [17] algorithm to represent a query result page as a

Visual Block tree. The root of the tree represents the entire page and each node

represents a rendering box (a visual block) on the page. A leaf node represents a

block containing a basic semantic unit that cannot be further decomposed, e.g.,

a text or image. Node a is an ancestor of node b if the block that a represents

contains the block that b represents on the page. The blocks represented by

nodes at the same level of the tree do not overlap. The order of the child nodes

with the same parent follows the order of the blocks they represent on the page,

i.e., top-down, left-right. For example, Figure 2 shows the visual block layout

produced by the VIPS algorithm for the query result page shown in Figure 1.

In Figure 2, b1 represents the body of the page, b1_1_2_1 represents the block

containing the category links on the page, b1_2 contains the website information

and b1_1_2_2_3_4 contains all data records, denoted as b1_1_2_2_3_4_1 to b1_1_2_2_3_4_10.

Figure 3 shows part of the Visual Block tree for b1_1_2_2_3_4.
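The properties of a Visual Block tree described in this subsection can be sketched as a small data structure. This is only an illustration under stated assumptions: it does not run VIPS, the class and field names (VisualBlock, left, top, width, height) are hypothetical, and the non-overlap check is applied to siblings only rather than to every node on a level.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """A node of a Visual Block tree: one rendering box on the page."""
    name: str
    left: float
    top: float
    width: float
    height: float
    children: List["VisualBlock"] = field(default_factory=list)

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def contains(self, other: "VisualBlock") -> bool:
        """True if this block's box encloses the other block's box."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)

    def overlaps(self, other: "VisualBlock") -> bool:
        """True if the two boxes share a region of positive area."""
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)

    def add_child(self, child: "VisualBlock") -> None:
        """Attach a child, keeping siblings in top-down, left-right order."""
        if not self.contains(child):
            raise ValueError(f"{child.name} is not inside {self.name}")
        if any(child.overlaps(sibling) for sibling in self.children):
            raise ValueError(f"{child.name} overlaps a sibling block")
        self.children.append(child)
        self.children.sort(key=lambda b: (b.top, b.left))

    def is_leaf(self) -> bool:
        """A leaf holds a basic unit such as a text or an image."""
        return not self.children


def is_ancestor(a: VisualBlock, b: VisualBlock) -> bool:
    """True if b is reachable from a by following child links."""
    return any(c is b or is_ancestor(c, b) for c in a.children)


# Hypothetical page: a results block holding two stacked records.
page = VisualBlock("page", 0, 0, 800, 600)
results = VisualBlock("results", 0, 100, 800, 400)
record_2 = VisualBlock("record_2", 0, 300, 800, 200)
record_1 = VisualBlock("record_1", 0, 100, 800, 200)
page.add_child(results)
results.add_child(record_2)
results.add_child(record_1)
```

Although record_2 is added first, the children of results end up ordered by position on the page, which mirrors the top-down, left-right ordering of child nodes.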

2.2 Overview of Our Approach

Given the Visual Block tree of a query result page, first we identify a visual block

that contains all the data records and treat it as the data section. Second, we
