Extracting Data Records from Query Result
Pages Based on Visual Features
Daiyue Weng, Jun Hong, and David A. Bell
School of Electronics, Electrical Engineering and Computer Science,
Queen's University Belfast, Belfast BT7 1NN, UK
{dweng01,j.hong,da.bell}@qub.ac.uk
Abstract. Web databases contain large amounts of structured data
that are accessible only via their query interfaces. Query results are
presented in dynamically generated web pages, usually in the form of
data records, for human use. The problem of automatically extracting
data records from query result pages is critical for web data integration
applications, such as comparison shopping sites and meta-search engines.
A number of approaches to query result extraction have been proposed.
As the structures of web pages become more complex, these approaches
start to fail. Query result pages usually also contain other types of
information in addition to query results, e.g., advertisements and navigation
bars. Most existing approaches do not remove such irrelevant
content, which may affect the accuracy of data record extraction. We
have observed that query results are usually displayed in regular visual
patterns and terms used in a query often re-appear in query results. We
propose a novel approach that makes use of visual features and query
terms to identify the data section and extract data records from it. We
also use several content and visual features of visual blocks in a data
section to filter out noisy blocks. The results of our experiments on a large
set of query result pages in different domains show that our proposed
approach is highly effective.
1 Introduction
The volume of structured data on the Web has been increasing enormously. Such
data are usually returned from back-end databases in response to specific user
queries, and presented in the form of data records in query result pages. Access
to web databases is via their query interfaces (usually HTML query forms) only.
In the literature, the contents of web databases are usually referred to as the Deep
Web. A recent study [20] estimates that the number of web databases that are
'hidden' on the Web is on the order of 10^5 and continues to grow rapidly.
Many e-commerce sites are supported by web databases.
In general, the majority of query result pages are list pages, each of which
contains a number of data records arranged in columns, with each row in a
column representing a data record. For example, Figure 1 shows a list page from
cooking.com, which has a single column containing 10 data records about plates.
 