remove any noisy blocks in the data section. Third, we group leaf nodes of the
Visual Block tree into data records based on the positions of their corresponding
visual blocks. The approach takes as input a query result page from a specific
web database, and produces as output a set of data records.
3 Identifying Data Sections
We identify a data section as a node in the Visual Block tree, which represents
a rectangular box in the result page that contains all the data record blocks and
as few noisy blocks as possible.
We observe that the size of a data section is usually large relative to the size
of the whole page. For example, as shown in Figure 1, the data section that
contains all the plate products occupies a relatively large area. To exploit this
observation, we first select the blocks that satisfy an area ratio constraint: the
ratio between the size of the block and the size of the whole page must be greater
than a threshold $T_{dr}$ ([16]), which can be trained from sample result pages.
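The area ratio constraint can be illustrated with a minimal sketch. The `Block` class, the pixel dimensions, and the threshold value `T_DR = 0.4` below are assumptions made for illustration only; the text states only that the threshold is trained from sample result pages.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rectangular visual block (hypothetical type; sizes assumed in pixels)."""
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def area_ratio(block: Block, page: Block) -> float:
    """Ratio between the size of a block and the size of the whole page."""
    return block.area / page.area


def satisfies_area_constraint(block: Block, page: Block, t_dr: float) -> bool:
    """True when the block's area ratio is strictly greater than the threshold."""
    return area_ratio(block, page) > t_dr


# Illustrative numbers only, not taken from the source.
page = Block(width=1000, height=2000)
product_list = Block(width=800, height=1500)  # covers 60% of the page
side_banner = Block(width=150, height=400)    # covers 3% of the page
T_DR = 0.4                                    # assumed threshold value

print(satisfies_area_constraint(product_list, page, T_DR))
print(satisfies_area_constraint(side_banner, page, T_DR))
```

With these assumed values, the large product list passes the constraint and the small banner does not.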
The method for identifying data sections takes the root node of the Visual
Block tree as input and returns a set of candidate data section blocks. Blocks
at higher levels of the Visual Block tree occupy larger portions of the result
page, so their area ratios are well above the threshold, but they also contain
more noisy blocks than blocks at lower levels of the Visual Block tree.
The algorithm selects candidate data section blocks in a depth-first fashion. It
traverses the Visual Block tree from the root, and identifies those blocks that
satisfy the area ratio constraint but none of whose child blocks satisfies it.
These blocks thus contain fewer noisy blocks. For example, after applying the
area constraint, we can identify $b_{11}$, $b_{112}$, $b_{1122}$, $b_{11223}$ and $b_{112234}$ as
candidate data sections.
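The depth-first selection can be sketched as follows. This is only one reading of the rule, namely that a block becomes a candidate when it satisfies the area ratio constraint and none of its child blocks does. The `VisualBlock` class, the block names, the areas, and the threshold are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """Node of a Visual Block tree (hypothetical minimal representation)."""
    name: str
    area: float
    children: List["VisualBlock"] = field(default_factory=list)


def candidate_data_sections(root: VisualBlock, page_area: float,
                            t_dr: float) -> List[VisualBlock]:
    """Collect, depth-first, the blocks that pass the area ratio constraint
    while none of their child blocks passes it."""
    candidates: List[VisualBlock] = []

    def passes(block: VisualBlock) -> bool:
        return block.area / page_area > t_dr

    def visit(block: VisualBlock) -> None:
        if passes(block) and not any(passes(c) for c in block.children):
            candidates.append(block)
        for child in block.children:  # pre-order, depth-first traversal
            visit(child)

    visit(root)
    return candidates


# Illustrative tree: areas are given as a percentage of the page.
records = [VisualBlock(f"record{i}", 15) for i in range(1, 4)]
results = VisualBlock("results", 60, records)
main = VisualBlock("main", 80, [VisualBlock("sidebar", 10), results])
root = VisualBlock("page", 100, [VisualBlock("header", 5), main])

print([b.name for b in candidate_data_sections(root, page_area=100, t_dr=0.4)])
```

In this toy tree, `page` and `main` pass the constraint but each has a child that also passes, so only `results` is returned.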
The candidate data sections are then examined to determine the actual data
section. To do this, we make use of the query terms submitted through the query
interface. A query interface exposes the attributes of the web database schema
to the user and usually consists of a set of input elements, e.g., text boxes, radio
buttons, check boxes and selection lists. Each input element is associated with
an attribute ([18]). For example, “Dinnerware”, “Plates”, “Royal Doulton” and
“$25 to $50” are query terms used for input elements associated with attributes
“Category”, “Product type”, “Brand” and “Price” of the query interface, as shown
in Figure 4. We observe that query terms often re-appear in the data records.
For example, the data records shown in Figure 1 are in response to the query
shown in Figure 4. We can see that the text nodes of each data record contain
the occurrences of query terms “Plates” and “Royal Doulton”.
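One way to make this observation concrete is to count how often each query term occurs in the text of a block and to sum the counts. The sketch below assumes case-insensitive substring matching, which the source does not specify, and the two block texts are invented for illustration; only the query terms come from the example above.

```python
from typing import Dict, List


def term_frequencies(block_text: str, query_terms: List[str]) -> Dict[str, int]:
    """Count non-overlapping, case-insensitive occurrences of each query term.

    Substring matching is an assumption of this sketch, not the source's method.
    """
    text = block_text.lower()
    return {term: text.count(term.lower()) for term in query_terms}


def total_occurrences(block_text: str, query_terms: List[str]) -> int:
    """Sum of the per-term frequencies for one block."""
    return sum(term_frequencies(block_text, query_terms).values())


query_terms = ["Dinnerware", "Plates", "Royal Doulton", "$25 to $50"]

# Invented block texts, for illustration only.
results_text = ("Royal Doulton Dinner Plates $32.00 "
                "Royal Doulton Salad Plates $27.50")
navigation_text = "Home Dinnerware Sign in Basket"

blocks = {"results": results_text, "navigation": navigation_text}
scores = {name: total_occurrences(text, query_terms)
          for name, text in blocks.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Here the block holding the records collects more query term occurrences than the navigation block, so it would be preferred.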
The frequency of each query term in a candidate block reflects the importance
of the candidate block. The more query terms occur in a block, the more likely
the block is the data section. Given a set of query terms $q_i$
for $i = 1, 2, \ldots, n$,
and a candidate block, the importance of the block is measured as $R = \sum_{i=1}^{n} f_i$,
where $f_i$ represents the frequency of query term $q_i$ in the candidate block. The
block that has the maximum number of occurrences of query terms among all the