Extracting Data Records from Query Result Pages Based on Visual Features - Advances in Databases

remove any noisy blocks in the data section. Third, we group leaf nodes of the

Visual Block tree into data records based on the positions of their corresponding

visual blocks. The approach takes as input a query result page from a specific

web database, and produces as output a set of data records.

3 Identifying Data Sections

We identify a data section as a node in the Visual Block tree, which represents

a rectangular box in the result page that contains all the data record blocks and

as few noisy blocks as possible.

We observe that the size of a data section is usually large relative to the size

of the whole page. For example, as shown in Figure 1, the data section that

contains all the plate products occupies a relatively large area. To utilize the

observation, we first select those blocks, each of which satisfies a constraint that

the ratio between the sizes of the block and the whole page is greater than a

threshold $T_{dr}$ ([16]), which can be trained from sample result pages.

The method for identifying data sections first takes the root node of the Visual

Block tree as input. It returns a set of candidate data section blocks. The blocks

at higher levels of the Visual Block tree occupy bigger portions of the result page

so their area ratios are much higher than the threshold, but such blocks will certainly

contain more noisy blocks than the ones at lower levels of the Visual Block tree.

The algorithm selects candidate data section blocks in a depth-first fashion. It

traverses the Visual Block tree from the root, and identifies those blocks that

satisfy the area ratio constraint but none of whose child blocks does.

These blocks thus contain fewer noisy blocks. For example, after applying the

area constraint, we can identify b1_1, b1_1_2, b1_1_2_2, b1_1_2_2_3 and b1_1_2_2_3_4 as

candidate data sections.
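
The depth-first selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Block` class, the `candidate_sections` function and the `deepest_only` flag are hypothetical names, and the sketch assumes that every child block lies inside its parent, so a subtree can be pruned as soon as its root fails the area ratio constraint. With `deepest_only=True` only blocks with no qualifying child are kept, as the rule above states. With `deepest_only=False` every qualifying block on the path is kept, which gives a nested list of candidates like the one in the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical visual block: a rectangle with nested child blocks."""
    name: str
    width: float
    height: float
    children: List["Block"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def candidate_sections(root: Block, threshold: float,
                       deepest_only: bool = True) -> List[Block]:
    """Collect blocks whose area ratio to the whole page exceeds `threshold`.

    The root block is assumed to cover the whole result page.
    """
    page_area = root.area
    candidates: List[Block] = []

    def satisfies(block: Block) -> bool:
        return block.area / page_area > threshold

    def visit(block: Block) -> None:
        if not satisfies(block):
            # Assumption: children are no larger than their parent,
            # so nothing below this block can satisfy the constraint.
            return
        qualifying = [c for c in block.children if satisfies(c)]
        if not (deepest_only and qualifying):
            candidates.append(block)
        for child in qualifying:
            visit(child)

    visit(root)
    return candidates
```

The threshold passed in stands for the trained value; any concrete number used when calling the function is an assumption for illustration only.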

Candidate data sections are further considered to determine the real data

section. To do this we make use of query terms that are used in queries over query

interfaces. A query interface exposes the attributes of the web database schema

to the user and usually consists of a set of input elements, e.g., text boxes, radio

buttons, check boxes and selection lists. Each input element is associated with

an attribute ([18]). For example, “Dinnerware”, “Plates”, “Royal Doulton” and

“$25 to $50” are query terms used for input elements associated with attributes

“Category”, “Product type”, “Brand” and “Price” of the query interface, as shown

in Figure 4. We observe that query terms often re-appear in the data records.

For example, the data records shown in Figure 1 are in response to the query

shown in Figure 4. We can see that the text nodes of each data record contain

the occurrences of query terms “Plates” and “Royal Doulton”.
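
To make this observation concrete, the sketch below counts how often each query term occurs in the text of a block. It is only an illustration under stated assumptions: the function names `term_frequencies` and `total_occurrences` are hypothetical, the block is represented by its concatenated text, and matching is done as case-insensitive substring search, which the source does not specify.

```python
import re
from typing import Dict, Iterable


def term_frequencies(block_text: str, query_terms: Iterable[str]) -> Dict[str, int]:
    """Count non-overlapping, case-insensitive occurrences of each query term."""
    return {
        term: len(re.findall(re.escape(term), block_text, flags=re.IGNORECASE))
        for term in query_terms
    }


def total_occurrences(block_text: str, query_terms: Iterable[str]) -> int:
    """Sum the per-term counts for one block."""
    return sum(term_frequencies(block_text, query_terms).values())
```

A block whose text mentions the query terms many times gets a high total, while a navigation bar or advertisement block that mentions none of them gets zero.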

The frequency of each query term in a candidate block reflects the importance

of the candidate block. The more query terms occur in a block, the more likely

the block is the data section. Given a set of query terms $q_i$ for $i = 1, 2, \ldots, n$,

and a candidate block, the importance of the block is measured as $R = \sum_{i=1}^{n} f_i$,

where $f_i$ represents the frequency of query term $q_i$ in the candidate block. The

block that has the maximum number of occurrences of query terms among all the
