
 
ExtractInMultiLine(NodeSet) { 
  LineSets ← divideByLine (NodeSet); 
  for each LineSet in LineSets { 
    <TITLE, LINK> ← extractTitleLink(LineSet); 
    Push(ResultList, <TITLE, LINK, pubDate>); 
  } 
} 
InfoExtract(M, <r, c, n>, j, TPL) {
  TR ← total row number of M
  for (i = r; i < r+n; i++) {
    TN ← getTimeNode(R[i]);
    TBN ← getNode(M[i, j]);
    pubDate ← getTime(TN, TPL[i]);
    NodeSetB ← searchInBorder(TBN);
    if (NodeSetB ≠ NULL) {
      isInSameLine ← checkPostion(NodeSetB);
      if (isInSameLine == TRUE) {
        <TITLE, LINK> ← extractTitleLink(NodeSetB);
        Push(ResultList, <TITLE, LINK, pubDate>);
      }
      else {
        ExtractInMultiLine(NodeSetB);
      }
    }
    else {
      NodeSetL ← searchInLine(TBN);
      if (NodeSetL ≠ NULL) {
        <TITLE, LINK> ← extractTitleLink(NodeSetL);
        Push(ResultList, <TITLE, LINK, pubDate>);
      }
      else {
        if (r+i ≠ TR-1) {
          NextTBN ← getNode(M[i+1, j]);
        }
        else {
          NextTBN ← detectSearchBorder;
        }
        AreaSet ← searchArea(TBN, NextTBN);
        ExtractInMultiLine(AreaSet);
      }
    }
  }
}
in the j-th column of M are the same or not. If the 
values are not the same, splitByValues(<r, j, n>) will 
segment the section <r, j, n> into k sub-sections, in 
each of which the values in the j-th column are the same. 
When each sub-section contains only 1 row, the 
segmentation process will be stopped and we can 
extract the information items in the current section. 
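
The segmentation step can be illustrated with a minimal sketch. It assumes that M is available as a list of rows, that a section <r, j, n> covers rows r to r+n-1, and that each sub-section is a contiguous run of rows with equal values in column j. The function name split_by_values and the cell values in the example are hypothetical placeholders, not taken from the system described here.

```python
from itertools import groupby


def split_by_values(M, r, j, n):
    """Split section <r, j, n> into contiguous sub-sections whose rows
    share the same value in column j.

    Returns a list of (start_row, j, row_count) triples.
    """
    sections = []
    # groupby only merges adjacent rows, so each group is a contiguous run.
    for _, run in groupby(range(r, r + n), key=lambda i: M[i][j]):
        run = list(run)
        sections.append((run[0], j, len(run)))
    return sections


# Hypothetical matrix: one row per time node, one column per tree level.
M = [
    ["div", "ul", "li"],
    ["div", "ul", "li"],
    ["div", "p", "a"],
    ["div", "p", "a"],
    ["div", "p", "b"],
]
sections = split_by_values(M, 0, 1, 5)
```

In this example the values in column 1 change once, so the five rows are split into two sub-sections of two and three rows.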
Although HTML pages containing the time 
pattern have diverse contents and structures, they 
can be classified into two types in terms of the 
layout. In the first type, each news item has an 
individual release time; the page shown in 
figure 2(a) is a typical example. The page in figure 
2(b) is an example of the second type, in which 
multiple news items follow every release time. The 
algorithm in figure 6 describes the details of 
information extraction based on the structure and 
layout analysis. 
Figure 6: Algorithm for information item extraction (the InfoExtract listing above) 
Figure 7: Information Extraction in multiple lines (the ExtractInMultiLine listing above) 

getTimeNode(R[i]) returns the time node TN 
corresponding to the i-th row of M. getNode(M[i, j]) 
returns a node TBN corresponding to M[i, j], which 
defines the border of the current TN. getTime extracts 
the time information from TN based on the 
corresponding time pattern stored in TPL and outputs 
pubDate in a standard format such as ‘Tue, 18 Jan 
2005 07:27:42 GMT’. searchInBorder searches and 
outputs all <a> nodes under TBN to a node set 
NodeSetB. checkPostion checks if all the <a> nodes 
in a set are presented in the same line in a browser or 
not. For the list-oriented information, each item is 
usually displayed in an individual line. This is an 
important layout feature. The line presentation relies 
on the DOM tree structure and specific tags such as 
<ul>, <li>,  <tr>,  <p>,  <div> and <br>, which 
cause a new line in the display. extractTitleLink uses 
heuristic rules to select the href attribute of a suitable 
<a> node and a proper title text in the current line 
as the title and link in RSS feeds. searchInLine 
searches  <a> nodes in the line in which TBN is 
presented, and outputs to a set NodeSetL. 
ExtractInMultiLine, described in figure 7, extracts 
information items from an <a> node set in which the 
nodes are displayed in multiple lines. divideByLine 
is used to divide a node set into multiple sub-sets, in 
each of which all the nodes are displayed in the same line. 
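
The line-based grouping can be approximated with a short sketch. It is a simplification under stated assumptions: it works on an HTML string rather than on DOM node sets, it treats only the tags listed above as line breaks, and it ignores any CSS that changes the layout. The names AnchorLineSplitter, divide_by_line and check_position are hypothetical stand-ins for divideByLine and checkPostion, not the original implementation.

```python
from html.parser import HTMLParser

# Tags named in the text as causing a new line in the display.
LINE_BREAK_TAGS = {"ul", "li", "tr", "p", "div", "br"}


class AnchorLineSplitter(HTMLParser):
    """Collect <a> nodes and group them by approximate display line."""

    def __init__(self):
        super().__init__()
        self.lines = [[]]   # each line is a list of (text, href) pairs
        self._in_a = False  # True while inside an open <a> element
        self._href = None
        self._text = []

    def _new_line(self):
        # Start a new line only if the current one already holds anchors.
        if self.lines[-1]:
            self.lines.append([])

    def handle_starttag(self, tag, attrs):
        if tag in LINE_BREAK_TAGS:
            self._new_line()
        elif tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            text = "".join(self._text).strip()
            self.lines[-1].append((text, self._href))
            self._in_a = False
        elif tag in LINE_BREAK_TAGS:
            self._new_line()

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)


def divide_by_line(html):
    """Return the anchors of an HTML fragment as a list of lines."""
    parser = AnchorLineSplitter()
    parser.feed(html)
    parser.close()
    return [line for line in parser.lines if line]


def check_position(html):
    """Return True if all anchors fall on a single display line."""
    return len(divide_by_line(html)) <= 1


fragment = (
    '<ul><li><a href="/a">First</a> <a href="/a#c">comments</a></li>'
    '<li><a href="/b">Second</a></li></ul>'
)
lines = divide_by_line(fragment)
```

Here the two anchors inside the first <li> stay on one line and the anchor in the second <li> forms its own line, so a title and link can be chosen per line.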
For some pages, like the example in figure 2(b), 
we detect the position of two adjacent TBNs and 
search target nodes between them by searchArea. 
For the last TBN in M, however, there is no next TBN 
to serve as the end border, so detectSearchBorder is 
used to decide the end border of the search. In general, 
the structure of each section is similar, so we can use 
the structure of the preceding section to deduce the 
current end border. ResultList can then be easily 
translated to an RSS format. 
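
As an illustration of this last step, the sketch below serializes a result list into a minimal RSS 2.0 document with the Python standard library. It assumes that each entry is a (title, link, pubDate) triple whose pubDate is a timezone-aware UTC datetime; the function name result_list_to_rss, the channel fields and the example URLs are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def result_list_to_rss(result_list, channel_title, channel_link):
    """Serialize (title, link, pub_date) triples as an RSS 2.0 string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    # Assumption: the channel title doubles as the required description.
    ET.SubElement(channel, "description").text = channel_title
    for title, link, pub_date in result_list:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
        # usegmt=True needs a UTC datetime and yields the 'GMT' suffix.
        ET.SubElement(item, "pubDate").text = format_datetime(
            pub_date, usegmt=True
        )
    return ET.tostring(rss, encoding="unicode")


result_list = [
    (
        "Example headline",
        "http://example.com/news/1",
        datetime(2005, 1, 18, 7, 27, 42, tzinfo=timezone.utc),
    ),
]
rss_xml = result_list_to_rss(result_list, "Example feed", "http://example.com/")
```

The pubDate element produced for this entry has the same form as the standard format quoted earlier.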
After recognition of all the items in a section, we 
can decide the complete border of this section. In 
some pages, such as the page in figure 2(a), each 
section has a category title for summarizing content 
in the section, which corresponds to the category in 
the RSS item. The category data is usually presented 
in a line above and adjacent to the first item of the 
section, and contained in continuous text nodes on 
the left part of the line. If the category is presented in an 
image, we can use a similar method to check the alt 
attribute of the appropriate <img> node. If 
necessary, we can also extract this information 
automatically. 
The idea of the time pattern discovery can be 
easily extended to mine other distinct format 
patterns, such as price patterns, which can be used to 
extract pairs of the product name and price from 
pages on e-commerce sites. 
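
A possible shape of such an extension is sketched below. It is only an illustration under strong assumptions: the price pattern is a hand-written regular expression rather than a discovered one, each input line is assumed to contain the product name followed by the price, and the names PRICE_PATTERN and extract_name_price are hypothetical.

```python
import re

# Hand-written stand-in for a discovered price pattern: a currency
# symbol, digits with optional thousands separators, optional cents.
PRICE_PATTERN = re.compile(r"(?:\$|€|£)\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def extract_name_price(lines):
    """Return (product_name, price) pairs from lines of display text."""
    pairs = []
    for line in lines:
        match = PRICE_PATTERN.search(line)
        if not match:
            continue
        # Assumption: the name is the text to the left of the price.
        name = line[:match.start()].strip(" -:\t")
        if name:
            pairs.append((name, match.group()))
    return pairs


lines = [
    "USB cable - $4.99",
    "Laptop stand: $1,299.00",
    "Free shipping on all orders",
]
pairs = extract_name_price(lines)
```

Lines without a price match are skipped, in the same way that nodes without a time pattern are not treated as time nodes.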
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS