Fig. 4. The query interface of cooking.com
candidate blocks is identified as the data section. For example, after applying the
second constraint to the candidate data sections, b 1 1 2 2 3 4, as shown in Figure
3, is identified as the data section.
4 Removing Noisy Blocks
The identified data section usually contains noisy blocks on the top and bottom
of the section, and data records in the middle of the section with no noisy blocks
on either the left or right of the records. Noisy blocks are the ones that are in a
data section but are not part of any data record [16], such as data record numbers
(e.g., “Items (1-15) of 15” in Figure 1). We observe that a data record typically
contains images, descriptive text, and links, and occupies a significant area on
the page. For example, each data record shown in Figure 1 contains
the image, name, and model of a product, along with one or more links to detailed
information about a specific model, and the rectangle of each data record is clearly
noticeable. Based on this observation, we evaluate the importance of each first-level child block
within the data section using five features of the block's content:
ImgNum (the number of images in the block), LinkNum (the number of links
in the block), LinkTextLen (the anchor text length of the block), TextLen (the
text length of the block), and Area (the rendering area of the block).
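
As a rough illustration of how these five features could feed the weighted-sum importance score that this section defines, the following Python sketch scores a child block against its data section. Several details are assumptions rather than the authors' method: the names `Block`, `FEATURES`, and `importance` are hypothetical, normalization is taken to mean dividing each feature by the corresponding total for the whole data section, all five features are assumed to enter the sum with one weight each, and the equal weights are placeholders, not values from the text.

```python
from dataclasses import dataclass

# Field names mirror the five content features: ImgNum, LinkNum,
# LinkTextLen, TextLen, and Area (the Python names are hypothetical).
FEATURES = ("img_num", "link_num", "link_text_len", "text_len", "area")


@dataclass
class Block:
    img_num: int        # number of images in the block
    link_num: int       # number of links in the block
    link_text_len: int  # anchor text length of the block
    text_len: int       # text length of the block
    area: float         # rendering area of the block


def importance(block: Block, section: Block, weights: tuple) -> float:
    """Weighted sum of features, each normalized by the section total.

    Assumption: normalization divides a child's feature value by the
    value of the same feature for the whole data section.
    """
    score = 0.0
    for name, weight in zip(FEATURES, weights):
        total = getattr(section, name)
        value = getattr(block, name)
        # A feature absent from the whole section contributes nothing.
        score += weight * (value / total if total else 0.0)
    return score


# Placeholder weights (assumed equal; the text gives no values here).
WEIGHTS = (0.2, 0.2, 0.2, 0.2, 0.2)

section = Block(img_num=4, link_num=8, link_text_len=100,
                text_len=400, area=1000.0)
record = Block(img_num=1, link_num=2, link_text_len=25,
               text_len=100, area=250.0)
counter = Block(img_num=0, link_num=0, link_text_len=0,
                text_len=20, area=50.0)  # e.g. an "Items (1-15) of 15" line

print(importance(record, section, WEIGHTS))
print(importance(counter, section, WEIGHTS))
```

Under these assumptions a block with images, links, and a large rendering area scores well above a short text-only line such as a record counter, which is the kind of separation the feature set is meant to provide.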
These content features are provided by the Visual Block tree and are normalized
across the whole data section block. The importance of a child block is defined
as ImBlk = w_1 × ImgNum + w_2 × LinkNum + w_3 × LinkTextLen + w_4 ×
 