Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases


Extracting data records from query result pages enables integrating data from

a multitude of web databases to build value-added web applications, such

as price comparison sites and meta-search engines. Query result pages are

dynamically generated from back-end databases in response to user queries and

encoded in HTML using pre-defined templates or script programs. These pages

are semi-structured and displayed for human use, rather than for processing by

programs. Automatically extracting data records into a structured,

machine-processable form is a challenging problem.

There has been considerable research on fully automatic approaches [3-16] for

extracting data from query result pages. Those in [3-10] represent the current

technical trend of query result extraction. First, they identify a data section,

which contains a set of data records. Second, they identify data records from the

data section. Finally, they extract data by aligning the corresponding attributes

of different records, producing a relational table [4, 5, 8, 10].
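The final alignment step can be illustrated with a minimal sketch. It assumes that each record has already been extracted as a mapping from attribute name to value, which is a strong simplification: the approaches in [4, 5, 8, 10] must infer which values correspond to each other. The function name `align_records` and the sample data are hypothetical.

```python
from typing import Dict, List


def align_records(records: List[Dict[str, str]]) -> List[List[str]]:
    """Align the attributes of all records into a table.

    The first row is the header; each later row holds one record,
    with an empty string where a record lacks an attribute.
    """
    # Collect column names in order of first appearance.
    columns: List[str] = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)

    table = [columns]
    for record in records:
        table.append([record.get(name, "") for name in columns])
    return table


# Made-up records; the second one has an attribute the first one lacks.
records = [
    {"title": "Book A", "price": "10.00"},
    {"title": "Book B", "author": "Smith", "price": "12.50"},
]
table = align_records(records)
for row in table:
    print(row)
```

Each column of the resulting table holds the corresponding attribute of every record, which is the relational form the alignment step aims for.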

However, the existing approaches to query result extraction have some inherent

limitations. First, web pages are becoming more complex: their tag structures

keep growing in complexity as HTML itself evolves, and other

technologies like JavaScript and CSS are widely deployed to make result pages

more dynamic. As a result, the rendered layout of a result page may differ from what its

tag tree or token string representation suggests, and the existing approaches that

rely on such representations may fail. Second, some of the existing approaches

employ a similarity measure on page segments to identify data records. However,

data records may not be extracted correctly if the sibling tree segments of the

same root are not similar to each other. For the same reason, a data section that

contains only a single data record cannot be handled, since there is no sibling

record to compare it with. Third, most of the existing approaches do

not filter out noisy content, that is, any part of a query result

page that is not part of any data record, e.g. banner advertisements, navigation

bars, copyright notices, and record statistics. We are most interested

in the part of a result page that contains all the data records and little noisy

content, because noisy content often reduces the accuracy of data record extraction. It is

therefore important to remove noisy content before data record extraction.
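The second limitation can be made concrete with a small sketch. It is not the similarity measure of any cited approach: it assumes each sibling segment is flattened into a list of tag names and compares the lists with Python's `difflib.SequenceMatcher`. The function names, the threshold of 0.8, and the sample segments are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def tag_similarity(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, a, b).ratio()


def find_record_candidates(siblings: List[List[str]], threshold: float = 0.8) -> List[int]:
    """Return indices of siblings that resemble at least one other sibling."""
    hits = []
    for i, segment in enumerate(siblings):
        if any(
            i != j and tag_similarity(segment, other) >= threshold
            for j, other in enumerate(siblings)
        ):
            hits.append(i)
    return hits


# Three sibling segments with nearly identical tag structure.
similar = [
    ["div", "a", "span", "span"],
    ["div", "a", "span", "span"],
    ["div", "a", "span", "span", "img"],
]
# Two records rendered with different markup.
dissimilar = [
    ["div", "a", "span"],
    ["table", "tr", "td", "td", "td"],
]

print(find_record_candidates(similar))       # all three siblings qualify
print(find_record_candidates(dissimilar))    # no sibling qualifies
print(find_record_candidates(similar[:1]))   # a lone record has nothing to match
```

In this sketch, records with dissimilar markup and a data section with a single record both yield no candidates, which are the two failure cases described above.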

In this paper, we focus on the problem of data record extraction, that is, given

a query result page that contains a single column of data records, automatically

identify the data section and data records. We propose a novel approach to overcome

the limitations of the existing approaches. First, our approach transforms

a query result page into a Visual Block tree using the VIPS algorithm [17]; the tree

represents a visual partition of the web page. Such a representation reflects the

content structure of the page as indicated by visual cues, so that related data

items are represented in the same branch of the Visual Block tree. For example,

Figure 2 shows a visual partition of the result page shown in Figure 1; Figure 3

shows part of the Visual Block tree for visual block b 1 1 2 2 3 4. We can also obtain

the visual features (e.g., position, width, and height) of each block in the Visual

Block tree.
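A tree of visual blocks carrying these features could be represented as in the sketch below. This is an illustrative data structure only, not the output format of VIPS [17]: the class name `VisualBlock`, its fields, the block labels, and the pixel coordinates are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class VisualBlock:
    """One node of a visual block tree (hypothetical structure)."""
    label: str
    x: int        # left edge, in pixels
    y: int        # top edge, in pixels
    width: int
    height: int
    children: List["VisualBlock"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height

    def walk(self) -> Iterator["VisualBlock"]:
        """Yield this block and all its descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Made-up page: a banner above a results area that holds two records.
record_1 = VisualBlock("b1_2_1", x=0, y=100, width=800, height=120)
record_2 = VisualBlock("b1_2_2", x=0, y=220, width=800, height=120)
banner = VisualBlock("b1_1", x=0, y=0, width=800, height=100)
results = VisualBlock("b1_2", x=0, y=100, width=800, height=240,
                      children=[record_1, record_2])
page = VisualBlock("b1", x=0, y=0, width=800, height=340,
                   children=[banner, results])

labels = [block.label for block in page.walk()]
largest_child = max(page.children, key=lambda block: block.area)
print(labels)
print(largest_child.label, largest_child.area)
```

Here the two records sit in the same branch (under the results block), mirroring the property that related data items share a branch, and each block exposes its position and size for later processing.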

Second, our approach identifies the data section by exploiting the sizes of the

visual blocks of the result page, and counting the occurrences of query terms
