TextLen + $w_5$ × Area, where $w_1$, $w_2$, $w_3$, $w_4$ and $w_5$ are real numbers such that
$w_1 + w_2 + w_3 + w_4 + w_5 = 1$, and $0 \le \mathit{ImBlk} \le 1$. LinkTextLen and TextLen are
considered the most important features for differentiating data record blocks
from noisy blocks. When the ImBlk of a block is greater than the given threshold
θ, it is very likely that the block is a data record. Otherwise, the block is taken as a
noisy block. The threshold can be trained using sample pages.
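The thresholding rule can be illustrated with a minimal sketch. It assumes that every feature has already been normalized to [0, 1] and that the weights are non-negative, so that the score stays in [0, 1]; the function names are hypothetical. Only three of the five feature names are visible in this excerpt, so the example uses just those three with made-up weights and values.

```python
from typing import Mapping


def block_importance(features: Mapping[str, float],
                     weights: Mapping[str, float]) -> float:
    """Weighted sum of normalized block features (a sketch of ImBlk)."""
    if set(features) != set(weights):
        raise ValueError("features and weights must use the same keys")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(w < 0.0 for w in weights.values()):
        raise ValueError("this sketch assumes non-negative weights")
    if any(not 0.0 <= v <= 1.0 for v in features.values()):
        raise ValueError("features must be normalized to [0, 1]")
    return sum(weights[name] * features[name] for name in features)


def is_data_record(features: Mapping[str, float],
                   weights: Mapping[str, float],
                   theta: float) -> bool:
    """A block is taken as a data record when its score exceeds theta."""
    return block_importance(features, weights) > theta


# Illustrative values only; they are not taken from the source.
example_weights = {"LinkTextLen": 0.4, "TextLen": 0.4, "Area": 0.2}
example_features = {"LinkTextLen": 0.2, "TextLen": 0.8, "Area": 0.5}
example_score = block_importance(example_features, example_weights)
```

In practice the threshold `theta` would be fitted on labeled sample pages rather than chosen by hand.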
5 Grouping Data Units of Data Records
A data record represents a data object retrieved from a web database and consists
of multiple data units that are semantically related. Data units are represented
as leaf nodes on the Visual Block tree, and they are visually aligned with and
adjacent to each other on query result pages. Figure 1 shows an example in
which the data units of each record appear as such aligned, adjacent leaf
nodes. To
identify data records, our approach first identifies leaf nodes that are part of a
data record and can be used as starting points for grouping other data units of
the record. Given a starting point, our approach first groups data units that are
horizontally aligned with it to form a data unit group based on the positions of
the visual blocks of the corresponding leaf nodes. It then groups data units that
are horizontally aligned with each other to form leaf node groups. Finally, our
approach progressively expands each data unit group with other data unit groups
and leaf node groups that are vertically adjacent to it until there is no vertically
adjacent group. Each data unit group thus corresponds to a data record.
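The three grouping steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: the starting points are passed in as given, the pixel tolerances `tol` and `max_gap` are assumed parameters, the same-subtree condition is omitted, and data unit groups are expanded only with leaf node groups, not with each other.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Leaf:
    name: str
    left: int
    top: int
    width: int
    height: int

    @property
    def bottom(self) -> int:
        return self.top + self.height


def rows(leaves: Sequence[Leaf], tol: int = 3) -> List[List[Leaf]]:
    """Group leaves whose top positions are within tol pixels of each other."""
    out: List[List[Leaf]] = []
    for leaf in sorted(leaves, key=lambda l: (l.top, l.left)):
        if out and abs(out[-1][0].top - leaf.top) <= tol:
            out[-1].append(leaf)
        else:
            out.append([leaf])
    return out


def gap(g1: Sequence[Leaf], g2: Sequence[Leaf]) -> int:
    """Vertical distance between the bounding boxes of two groups."""
    top1, bot1 = min(l.top for l in g1), max(l.bottom for l in g1)
    top2, bot2 = min(l.top for l in g2), max(l.bottom for l in g2)
    return max(top2 - bot1, top1 - bot2, 0)


def group_records(leaves, starts, tol=3, max_gap=10):
    all_rows = rows(leaves, tol)
    # Rows containing a starting point become data unit groups.
    seeds = [r for r in all_rows if any(l in starts for l in r)]
    # All other rows are leaf node groups waiting to be absorbed.
    pool = [r for r in all_rows if not any(l in starts for l in r)]
    records = []
    for seed in seeds:
        group = list(seed)
        grew = True
        while grew:  # expand until no vertically adjacent group is left
            grew = False
            for row in list(pool):
                if gap(group, row) < max_gap:
                    group.extend(row)
                    pool.remove(row)
                    grew = True
        records.append(group)
    return records


# Two records (title, price, snippet) followed by a distant footer.
t1, p1, s1 = Leaf("t1", 0, 0, 200, 15), Leaf("p1", 210, 1, 50, 15), Leaf("s1", 0, 20, 260, 15)
t2, p2, s2 = Leaf("t2", 0, 100, 200, 15), Leaf("p2", 210, 100, 50, 15), Leaf("s2", 0, 120, 260, 15)
footer = Leaf("footer", 0, 300, 260, 15)
records = group_records([t1, p1, s1, t2, p2, s2, footer], starts=[t1, t2])
record_names = [[leaf.name for leaf in record] for record in records]
```

Here the footer is never absorbed because it lies far below both records, so it stays outside every data unit group.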
Definition 1. (Block and group positions) - We use the coordinates of the
top-left corner, the height and the width of the visual block of a data unit to
determine its left, right, top and bottom positions. Furthermore, we use the left
position of the leftmost node of a node group as the left position of the group,
the top position of the topmost node as the top position of the group, the right
position of the rightmost node as the right position of the group, and the bottom
position of the bottommost node as the bottom position of the group.
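Definition 1 translates directly into a small bounding-box computation. The sketch below assumes screen coordinates in which y grows downward; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, NamedTuple


class Bounds(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


@dataclass(frozen=True)
class VisualBlock:
    x: int       # x coordinate of the top-left corner
    y: int       # y coordinate of the top-left corner
    width: int
    height: int

    def bounds(self) -> Bounds:
        """Left, top, right and bottom positions of a single block."""
        return Bounds(self.x, self.y, self.x + self.width, self.y + self.height)


def group_bounds(blocks: Iterable[VisualBlock]) -> Bounds:
    """Positions of a node group: the extremes over its member blocks."""
    all_bounds = [block.bounds() for block in blocks]
    if not all_bounds:
        raise ValueError("a group must contain at least one block")
    return Bounds(
        left=min(b.left for b in all_bounds),
        top=min(b.top for b in all_bounds),
        right=max(b.right for b in all_bounds),
        bottom=max(b.bottom for b in all_bounds),
    )


example_group = group_bounds([VisualBlock(10, 20, 100, 15), VisualBlock(120, 22, 40, 30)])
```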
Definition 2. (Horizontal alignment) - We say that two leaf nodes, a and b,
are horizontally aligned with each other if they have similar top positions.
Furthermore, we say that two node groups, a and b, are horizontally aligned with
each other if they have similar top positions.
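The definition does not say how close two top positions must be to count as similar. A minimal sketch, assuming a pixel tolerance (the `tolerance` parameter and its default are assumptions, not values from the source):

```python
from typing import Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height)


def nodes_aligned(a: Box, b: Box, tolerance: int = 3) -> bool:
    """Two leaf nodes are aligned when their top positions nearly match."""
    return abs(a[1] - b[1]) <= tolerance


def groups_aligned(group_a: Sequence[Box], group_b: Sequence[Box],
                   tolerance: int = 3) -> bool:
    """Two groups are aligned when their group top positions nearly match.

    The top position of a group is that of its topmost node.
    """
    top_a = min(box[1] for box in group_a)
    top_b = min(box[1] for box in group_b)
    return abs(top_a - top_b) <= tolerance
```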
Definition 3. (Vertical adjacency) - We say that two leaf nodes, a and b, are
vertically adjacent, if the distance between the bottom position of a and the top
position of b, or the distance between the top position of a and the bottom position
of b (vertical distance) is less than a given number of pixels (in close proximity).
Furthermore, we say that two node groups, a and b, are vertically adjacent, if
the shortest vertical distance between the nodes in a and b is less than a given
number of pixels (in close proximity), and the nodes in the two groups are on
the same subtree.
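Definition 3 can be sketched as follows. The pixel limit `max_gap`, the `subtree` field, and the reading of "on the same subtree" as "all nodes share one subtree identifier" are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Sequence


@dataclass(frozen=True)
class Node:
    top: int
    bottom: int
    subtree: str  # hypothetical identifier of the enclosing subtree


def vertical_distance(a: Node, b: Node) -> int:
    """Smaller of the two edge-to-edge distances named in Definition 3."""
    return min(abs(b.top - a.bottom), abs(a.top - b.bottom))


def nodes_adjacent(a: Node, b: Node, max_gap: int = 10) -> bool:
    return vertical_distance(a, b) < max_gap


def groups_adjacent(group_a: Sequence[Node], group_b: Sequence[Node],
                    max_gap: int = 10) -> bool:
    """Groups are adjacent when their closest nodes are within max_gap
    pixels and every node belongs to the same subtree."""
    same_subtree = len({n.subtree for n in (*group_a, *group_b)}) == 1
    shortest = min(vertical_distance(a, b)
                   for a, b in product(group_a, group_b))
    return same_subtree and shortest < max_gap
```

The subtree check keeps blocks that happen to be close on screen, but belong to different parts of the page, from being merged into one record.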
 