Fig. 3. Part of the Visual Block tree for b 1 1 2 2 3 4
in them. This is based on the following two observations: the data section in a
result page, which contains all the data records, usually occupies a significant
area of the result page; query terms often re-appear in the data records. We use
these two observations to identify the data section. For example, block b 1 1 2 2 3 4
in Figure 2 is likely to be the data section since it occupies a large portion of the
result page in Figure 1 and contains a number of occurrences of query terms
(e.g., “Accent Plates”). To get query terms, we make use of the query interface
and assume that the result pages are generated in response to the queries made
via the interface.
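The two observations above can be sketched in code. The following is a minimal illustration, not the method as implemented by the authors: the `Block` class, the `find_data_section` function, the top-down descent strategy and the `min_area_ratio` value of 0.3 are all assumptions introduced here. It walks down a block tree and keeps following the child that still covers a large share of the page and contains the most query term occurrences.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical stand-in for a node of a Visual Block tree."""
    name: str
    area: float                      # rendered area of the block
    text: str                        # all text contained in the block
    children: list = field(default_factory=list)


def term_count(text, terms):
    """Count case-insensitive occurrences of every query term in text."""
    lowered = text.lower()
    return sum(lowered.count(term.lower()) for term in terms)


def find_data_section(root, page_area, terms, min_area_ratio=0.3):
    """Descend while some child is both large and rich in query terms.

    min_area_ratio is an assumed cut-off for "a significant area".
    """
    current = root
    while True:
        candidates = [
            child for child in current.children
            if child.area / page_area >= min_area_ratio
            and term_count(child.text, terms) > 0
        ]
        if not candidates:
            return current
        current = max(candidates, key=lambda c: term_count(c.text, terms))


# Toy page: a small header, a large body with two records, a footer.
records = [
    Block("record_1", 70, "Accent Plates blue"),
    Block("record_2", 70, "Accent Plates red"),
]
body = Block("body", 700, "Accent Plates blue Accent Plates red", records)
header = Block("header", 100, "Search: Accent Plates")
footer = Block("footer", 50, "Contact us")
page = Block("page", 1000, header.text + " " + body.text, [header, body, footer])

section = find_data_section(page, page.area, ["Accent Plates"])
print(section.name)
```

In this toy page the header also mentions the query term, but it is too small to qualify, and the individual records are too small to be mistaken for the whole data section.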
The identified data section often contains noisy blocks. Data records are
obviously richer in content than noisy blocks and usually contain one or more
links or images. To filter out noisy blocks, we use a vector of content and visual
features to characterize each block within the data section. These features provide
statistical information about the text, block area, links and images in the block.
The overall importance of a data record block should be higher than that of a noisy
block. We set an importance threshold so that any block whose importance falls
below the threshold is identified as a noisy block and removed.
For example, as shown in Figure 1, there are ten data record blocks, while the
block containing information about the data records (“Items (1 - 15) of 15”) is
identified as a noisy block and removed.
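The filtering step can be illustrated as follows. This is only a sketch under stated assumptions: the text does not give the importance formula, so the weighted sum of normalized features, the `WEIGHTS` values, the threshold of 0.5 and the feature numbers in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    """Hypothetical feature vector for one block in the data section."""
    name: str
    text_length: int
    area: float
    num_links: int
    num_images: int


# Assumed weights; the source does not specify how features are combined.
WEIGHTS = {"text_length": 0.4, "area": 0.2, "num_links": 0.2, "num_images": 0.2}


def importance(block, maxima):
    """Weighted sum of features, each scaled by the largest value seen."""
    score = 0.0
    for feature, weight in WEIGHTS.items():
        largest = maxima[feature]
        if largest > 0:
            score += weight * getattr(block, feature) / largest
    return score


def remove_noisy_blocks(blocks, threshold=0.5):
    """Keep only blocks whose importance reaches the (assumed) threshold."""
    if not blocks:
        return []
    maxima = {f: max(getattr(b, f) for b in blocks) for f in WEIGHTS}
    return [b for b in blocks if importance(b, maxima) >= threshold]


# Invented feature values for two record blocks and one summary line.
blocks = [
    BlockFeatures("record_1", text_length=120, area=9000, num_links=2, num_images=1),
    BlockFeatures("record_2", text_length=110, area=9000, num_links=1, num_images=1),
    BlockFeatures("items_summary", text_length=20, area=1500, num_links=0, num_images=0),
]

kept = remove_noisy_blocks(blocks)
print([b.name for b in kept])
```

With these invented numbers, the short summary block has no links or images and little text, so it scores well below the threshold and is dropped, while both record blocks are kept.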
Third, we observe that each data record contains semantically related data
units of a data object, which reside in the leaf nodes of the Visual Block trees,
and are visually aligned with and adjacent to each other. Our approach identifies
data records purely by using the rendering boxes of the leaf nodes in the data
section to infer their alignment and proximity. For example, the data units of
each data record shown in Figure 1 are aligned with each other, in close proximity
and relatively far away from the data units of the other data records. Thus we
can group data units based on their positional information with each group
representing a data record.
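The grouping idea can be sketched with a deliberately simplified rule. The code below is an assumption-laden illustration rather than the approach described here: it considers only vertical gaps between rendering boxes and ignores horizontal alignment, and the `Box` class, the `group_by_vertical_gap` function and the `max_gap` value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical rendering box of a leaf node (a data unit)."""
    label: str
    left: float
    top: float
    width: float
    height: float

    @property
    def bottom(self):
        return self.top + self.height


def group_by_vertical_gap(boxes, max_gap=10.0):
    """Group boxes top to bottom; a large vertical gap starts a new group.

    max_gap is an assumed proximity threshold in the same units as the boxes.
    """
    groups = []
    group_bottom = None
    for box in sorted(boxes, key=lambda b: (b.top, b.left)):
        if group_bottom is None or box.top - group_bottom > max_gap:
            groups.append([box])          # far from the previous group
            group_bottom = box.bottom
        else:
            groups[-1].append(box)        # close enough: same data record
            group_bottom = max(group_bottom, box.bottom)
    return groups


# Two toy records, each with a title and a price stacked closely together.
units = [
    Box("title_A", 100, 0, 200, 20),
    Box("price_A", 100, 22, 80, 15),
    Box("title_B", 100, 70, 200, 20),
    Box("price_B", 100, 92, 80, 15),
]

grouped = group_by_vertical_gap(units)
print([[box.label for box in group] for group in grouped])
```

Each resulting group stands for one candidate data record. A fuller version would also compare left edges to check alignment before merging boxes.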
In summary, we make the following contributions. First, we propose an approach
for identifying data sections based on the visual features of the blocks and
re-occurrences of query terms in them. Based on the content and visual features
of visual blocks, our approach for removing noisy blocks can eliminate most of
 