sections, it chooses the one with the smallest size as the data section. Our approach
instead counts the occurrences of query terms in the candidate blocks to select the real
data section, which makes it more robust. Third, ViDE identifies noisy
blocks by checking whether they are aligned to the left of a data section, but
this test may not remove all of the noisy blocks. Our approach evaluates the importance
of the blocks within the section based on content and visual features, which improves
the removal of noisy blocks.
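As a rough illustration of selecting the data section by query-term occurrences, the following Python sketch scores each candidate block by how often the query terms appear in its text. The names Block, count_query_terms and select_data_section are hypothetical, and breaking ties by block area is an assumption made for this sketch rather than a rule stated in the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate visual block (hypothetical structure for this sketch)."""
    name: str
    text: str
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height


def count_query_terms(text, query_terms):
    """Count case-insensitive whole-word occurrences of the query terms."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in query_terms}
    return sum(1 for token in tokens if token in wanted)


def select_data_section(candidates, query_terms):
    """Pick the block with the most query-term hits.

    Assumption: ties are broken in favour of the larger block.
    """
    return max(
        candidates,
        key=lambda b: (count_query_terms(b.text, query_terms), b.area),
    )


candidates = [
    Block("navigation", "Home Books Cameras Help", 200, 600),
    Block("results", "Canon camera 499. Nikon camera 529. Sony camera 610.", 600, 900),
    Block("footer", "Contact us. Privacy policy.", 800, 50),
]
best = select_data_section(candidates, ["camera"])
print(best.name)
```

In this toy input the navigation block contains "Cameras" but not the exact term "camera", so only the results block scores above zero. A real system would likely need stemming or fuzzier matching, which this sketch omits.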
Our algorithm for grouping the data units of a data record is inspired by the work
of Gatterbauer and Bohunsky [1, 2] on extracting web tables. In contrast, our approach
extracts data records from query result pages, which have more complex
content structures. Although our approach also uses alignment and adjacency
techniques, our definition of alignment is much simpler than the one in [1, 2]. Our
approach also uses query terms in the process of grouping data units.
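The adjacency idea behind grouping data units can be sketched in Python as follows. This is a deliberately simplified, one-dimensional illustration and not the algorithm of this paper: it ignores alignment and query terms, and the DataUnit structure, the group_into_records function and the max_gap threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class DataUnit:
    """A rendered data unit with its vertical extent in pixels (hypothetical)."""
    text: str
    top: int
    bottom: int


def group_into_records(units, max_gap=10):
    """Group units whose vertical gap to the current record is at most max_gap.

    Assumption: records are stacked vertically, so a large vertical gap
    marks the boundary between two records.
    """
    records = []
    current_bottom = None
    for unit in sorted(units, key=lambda u: u.top):
        if current_bottom is not None and unit.top - current_bottom <= max_gap:
            # Close enough to the current record: extend it.
            records[-1].append(unit)
            current_bottom = max(current_bottom, unit.bottom)
        else:
            # Gap too large (or first unit): start a new record.
            records.append([unit])
            current_bottom = unit.bottom
    return records


units = [
    DataUnit("Title B", 80, 100),
    DataUnit("Title A", 0, 20),
    DataUnit("Price A", 24, 40),
    DataUnit("Price B", 104, 120),
]
records = group_into_records(units)
print([[u.text for u in record] for record in records])
```

Here the 4-pixel gaps inside each record fall below the threshold, while the 40-pixel gap between "Price A" and "Title B" starts a new record. A fixed pixel threshold is fragile in practice, which is one reason additional cues such as alignment and query terms are useful.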
8 Conclusions
In this paper, we present an automatic approach for extracting data records
from query result pages. Our approach first uses the sizes of visual blocks and
the occurrences of query terms in visual blocks to identify the data section. It
then groups data units that are in close proximity within the data section into data
records. It also uses content and visual features of visual blocks to evaluate their
importance and to filter out noisy blocks. Our work can be part of a web data
integration system that interacts with multiple web databases, e.g., e-commerce
web sites. Our experimental results show that our proposed approach is highly
effective. In future work, we will develop algorithms for aligning data units in the
extracted data records so that data units of the same attribute can be aligned
into the same column of the query result table.
References
1. Gatterbauer, W., Bohunsky, P., Herzog, M., Krupl, B., Pollak, B.: Towards
Domain-Independent Information Extraction from Web Tables. In: WWW 2007,
pp. 71-80 (2007)
2. Gatterbauer, W., Bohunsky, P.: Table Extraction Using Spatial Reasoning on the
CSS2 Visual Box Model. In: AAAI 2006, pp. 1313-1318 (2006)
3. Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. In: KDD
2003, pp. 601-606 (2003)
4. Zhai, Y., Liu, B.: Web Data Extraction Based on Partial Tree Alignment. In:
WWW 2005, pp. 76-85 (2005)
5. Zhai, Y., Liu, B.: Structured Data Extraction from the Web Based on Partial Tree
Alignment. IEEE Trans. on Knowl. and Data Eng. 18(12), 1614-1628 (2006)
6. Zhao, H., Meng, W., Wu, Z., Raghavan, V., Yu, C.: Fully Automatic Wrapper
Generation for Search Engines. In: WWW 2005, pp. 66-75 (2005)
7. Zhao, H., Meng, W., Yu, C.: Automatic Extraction of Dynamic Record Sections
from Search Engine Result Pages. In: VLDB 2006, pp. 989-1000 (2006)