Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases


sections, it chooses the one with the smallest size as the data section. Our approach

counts the occurrences of query terms in candidate blocks to select the real

data section, which makes our approach more robust. Third, ViDE identifies noisy

blocks by checking whether the blocks are aligned to the left of a data section, but

it may not remove all the noisy blocks. Our approach evaluates the importance

of blocks within the section based on content and visual features, which improves

the removal of noisy blocks.
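
The query-term counting idea can be illustrated with a minimal sketch. It is not the authors' implementation: the Block class, the function names, and the tie-break on block area are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical visual block: its rendered text and its area in pixels."""
    text: str
    area: int


def count_query_terms(block, query_terms):
    """Count case-insensitive occurrences of every query term in the block text."""
    text = block.text.lower()
    return sum(text.count(term.lower()) for term in query_terms)


def select_data_section(candidates, query_terms):
    """Return the candidate with the most query-term occurrences.

    Ties are broken in favour of the larger block; this tie-break is an
    assumption of the sketch, not a rule stated in the text.
    """
    return max(
        candidates,
        key=lambda b: (count_query_terms(b, query_terms), b.area),
    )


candidates = [
    Block("Home | Login | Help", 5000),
    Block("Java book 1 ... Java book 2", 90000),
    Block("Sponsored: java", 2000),
]
best = select_data_section(candidates, ["java"])
```

Here the second block is selected because it contains the query term more often than the others, regardless of how small a competing block happens to be.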

Our algorithm for grouping data units of a data record is inspired by the work

of Gatterbauer and Bohunsky [1, 2] on extracting web tables. Our approach

instead extracts data records from query result pages that have more complex

content structures. Although our approach also uses alignment and adjacency

techniques, our alignment definition is much simpler than the one in [1, 2]. Our

approach also uses query terms when grouping data units.
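
As a rough illustration of adjacency-based grouping, the sketch below merges data units into one record when the vertical gap between them is small. It is a simplified stand-in under stated assumptions: the DataUnit class, the max_gap threshold, and the purely vertical notion of proximity are hypothetical, and the alignment definition and the use of query terms mentioned above are not modelled.

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """Hypothetical data unit with the vertical extent of its bounding box."""
    text: str
    top: int
    bottom: int


def group_into_records(units, max_gap=10):
    """Group units whose vertical gap to the current record is at most max_gap.

    max_gap (in pixels) is an assumed threshold chosen for this sketch.
    """
    records = []
    current_bottom = None
    for unit in sorted(units, key=lambda u: u.top):
        if current_bottom is not None and unit.top - current_bottom <= max_gap:
            # Close enough to the record being built: extend it.
            records[-1].append(unit)
            current_bottom = max(current_bottom, unit.bottom)
        else:
            # Gap too large (or first unit): start a new record.
            records.append([unit])
            current_bottom = unit.bottom
    return records


units = [
    DataUnit("Title A", 0, 20),
    DataUnit("Price A", 22, 40),
    DataUnit("Title B", 80, 100),
    DataUnit("Price B", 102, 120),
]
records = group_into_records(units)
```

With these sample coordinates the four units fall into two records of two units each, because the 40-pixel gap between "Price A" and "Title B" exceeds the assumed threshold.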

8 Conclusions

In this paper, we present an automatic approach for extracting data records

from query result pages. Our approach first uses the sizes of visual blocks and

the occurrences of query terms in visual blocks to identify the data section. It

then groups data units in the data section that are in close proximity into data

records. It also uses content and visual features of visual blocks to evaluate their

importance and to filter out noisy blocks. Our work can be part of a web data

integration system that interacts with multiple web databases, e.g., e-commerce

web sites. Our experimental results show that our proposed approach is highly

effective. In future work, we will develop algorithms for aligning data units in the

extracted data records so that data units of the same attribute can be aligned

into the same column of the query result table.

References

1. Gatterbauer, W., Bohunsky, P., Herzog, M., Krüpl, B., Pollak, B.: Towards

Domain-Independent Information Extraction from Web Tables. In: WWW 2007,

pp. 71-80 (2007)

2. Gatterbauer, W., Bohunsky, P.: Table Extraction Using Spatial Reasoning on the

CSS2 Visual Box Model. In: AAAI 2006, pp. 1313-1318 (2006)

3. Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. In: KDD

2003, pp. 601-606 (2003)

4. Zhai, Y., Liu, B.: Web Data Extraction Based on Partial Tree Alignment. In:

WWW 2005, pp. 76-85 (2005)

5. Zhai, Y., Liu, B.: Structured Data Extraction from the Web Based on Partial Tree

Alignment. IEEE Trans. on Knowl. and Data Eng. 18(12), 1614-1628 (2006)

6. Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully Automatic Wrapper

Generation for Search Engines. In: WWW 2005, pp. 66-75 (2005)

7. Zhao, H., Meng, W., Yu, C.: Automatic Extraction of Dynamic Record Sections

from Search Engine Result Pages. In: VLDB 2006, pp. 989-1000 (2006)
