Method for inset detection in document layout analysis
6377704

Abstract

The present invention is a method for detecting insets in the structure of a document page so as to further complement the document layout and textual information provided in an optical character recognition system. A system employing the present method preferably includes a document layout analysis system wherein the inset detection methodology is used to extend the capability of an associated character recognition package to more accurately recreate the document being processed.

Claims

I claim:

Description

METHOD FOR INSET DETECTION IN DOCUMENT LAYOUT ANALYSIS
const RECOMP_UNITS MAX_FRAME_JOINT_GAP = 70;
const RECOMP_UNITS MIN_FRAME_HITE = 40;
BOOL LA::AnalyzeRestOfFullFrame (const LARgn * const pRgnTop,
LAHRulingRgn * const pHRuleTop)
{
INT32 RefNum2, RefNum3, RefNum4;
LAHRulingRgn *pHRuleBot;
LAVRulingRgn *pVRuleLeft, *pVRuleRight;
RECOMP_UNITS fTop, fBot, fLeft, fRight;
fRight = pRgnTop->GRight( );
fLeft = pRgnTop->GLeft( );
fTop = pRgnTop->GBot( );
//for each HRule, find a right connecting lower vrule
for (pVRuleRight = RgnsL.GFirstVRuling (RefNum2);
pVRuleRight; pVRuleRight = RgnsL.GNextVRuling (RefNum2))
{
if ((ABS (pVRuleRight->GTop( ) - fTop) < MAX_FRAME_JOINT_GAP)
&& (ABS (pVRuleRight->GRight( ) - fRight) <
MAX_FRAME_JOINT_GAP))
{
//Right is reset to curr Vrule
fRight = pVRuleRight->GRight( );
fBot = pVRuleRight->GBot( );
//find a connecting bot left hrule
for (pHRuleBot = RgnsL.GFirstHRuling (RefNum3);
pHRuleBot; pHRuleBot = RgnsL.GNextHRuling (RefNum3))
{
if (((pHRuleBot->GTop( ) - fTop) >= MIN_FRAME_HITE)
&& (ABS (pHRuleBot->GBot( ) - fBot) <
MAX_FRAME_JOINT_GAP)
&& (ABS (pHRuleBot->GRight( ) - fRight) <
MAX_FRAME_JOINT_GAP))
{
//Bot is reset to curr Hrule
fBot = pHRuleBot->GBot( );
fLeft = pHRuleBot->GLeft( );
//find a connecting left top vrule
for (pVRuleLeft = RgnsL.GFirstVRuling (RefNum4);
pVRuleLeft; pVRuleLeft = RgnsL.GNextVRuling
(RefNum4))
{
if (((fRight - pVRuleLeft->GLeft( )) >=
MIN_FRAME_HITE)
&& (ABS (pVRuleLeft->GLeft( ) - fLeft)
<MAX_FRAME_JOINT_GAP)
&& (ABS (pVRuleLeft->GTop( ) - fTop)
<MAX_FRAME_JOINT_GAP)
&& (ABS (pVRuleLeft->GBot( ) - fBot)
<MAX_FRAME_JOINT_GAP))
{
//Left is reset to curr Vrule
fLeft = pVRuleLeft->GLeft( );
MakeARuleFrame (fTop, fBot, fLeft, fRight,
pHRuleTop, pHRuleBot,
pVRuleLeft, pVRuleRight);
return TRUE;
}
}
break;
}
}
break;
}
}
return FALSE;
}
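The corner-joining tests in the listing above can be illustrated with a minimal, self-contained sketch. The `Rect` struct, the helper names, and the 0.1 mm unit size are assumptions made for illustration only; they are not the LARgn classes of the original system.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for RECOMP_UNITS; one unit is assumed to be 0.1 mm.
using RecompUnits = std::int32_t;

constexpr RecompUnits kMaxFrameJointGap = 70;  // MAX_FRAME_JOINT_GAP, about 7 mm
constexpr RecompUnits kMinFrameHeight = 40;    // MIN_FRAME_HITE, about 4 mm

// Convert whole millimetres to units under the 0.1 mm assumption.
inline RecompUnits fromMillimetres(std::int32_t mm) {
    assert(mm >= 0);
    return mm * 10;
}

// Illustrative bounding box; not the original region class.
struct Rect {
    RecompUnits top, left, bottom, right;
};

// True when two coordinates are close enough to count as the same corner.
inline bool joins(RecompUnits a, RecompUnits b) {
    return std::abs(a - b) < kMaxFrameJointGap;
}

// First test in the listing: a vertical ruling must start near the bottom
// edge of the top region and sit near that region's right edge.
inline bool joinsTopRight(const Rect& topRegion, const Rect& vRule) {
    return joins(vRule.top, topRegion.bottom)
        && joins(vRule.right, topRegion.right);
}

// Second test in the listing: the bottom ruling must lie far enough below
// the frame top for the frame to have a usable height.
inline bool tallEnough(RecompUnits frameTop, const Rect& bottomRule) {
    return (bottomRule.top - frameTop) >= kMinFrameHeight;
}
```

A caller would chain such tests in the same order as the nested loops: right ruling, then bottom ruling, then left ruling.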
The measurement unit RECOMP_UNITS is preferably on the order of 0.1 mm, so "const RECOMP_UNITS MAX_FRAME_JOINT_GAP=70;" refers to a distance of approximately 7 millimeters. Alternate units, such as resolution-based measurements (e.g., pixels or a relative scale based upon the original document size), may be employed as well for the dimensions indicated herein. The AnalyzeRestOfFullFrame function listed above calls MakeARuleFrame, which creates a Rule Frame (a list of the four rulings that make up a frame) and records the coordinates on the page of the resulting box.

Another specific type of box that is considered is a text region in reverse video (a region where the foreground is brighter than the background, which requires prior recognition of the reverse video region), with three rulings forming a box below the reverse video. In such cases, the box is assumed to have its top at the top of the area of reverse video background, and is otherwise treated like other boxes. Thus, for each reverse video text region that is not a header, footer, or caption, the same C++ function is called, and if that function returns true, the text region is classified as a frame inset. The Rule Frame extends to the top of that text region.

Subsequent to completing the analysis of full frames, the step of finding pictures and text in full frames, step 514, is executed. For each frame created earlier by the MakeARuleFrame operation, the following test is carried out to determine which text regions are within the frame:
const RECOMP_UNITS WITHIN_FRAME_MARGIN = 30;
BOOL LAFrameRgn::WithinFrame (const RECOMP_UNITS iTop,
const RECOMP_UNITS iLeft,
const RECOMP_UNITS iBot,
const RECOMP_UNITS iRight) const
{
RECOMP_UNITS iXMid = (iLeft + iRight + 1) / 2;
RECOMP_UNITS iYMid = (iTop + iBot + 1) / 2;
if ((iTop >= (GTop( ) - WITHIN_FRAME_MARGIN))
&& (iLeft >= (GLeft( ) - WITHIN_FRAME_MARGIN))
&& (iBot <= (GBot( ) + WITHIN_FRAME_MARGIN))
&& (iRight <= (GRight( ) + WITHIN_FRAME_MARGIN))
&& (iXMid > GLeft( ))
&& (iXMid < GRight( ))
&& (iYMid > GTop( ))
&& (iYMid < GBot( )))
{
return TRUE;
}
else
{
return FALSE;
}
}
where GLeft( ), GRight( ), GTop( ), and GBot( ) return the coordinates of the frame. The input arguments iTop, iLeft, iBot, and iRight are the coordinates of the smallest box that encloses all the text in the region. For each text region that is not a header, footer, or caption, the find frame insets operation, step 520, is executed as follows:
BestFrameSize = ARBITRARY_LARGE_RECOMP_UNIT;
pBestFrame = NIL;
for (pCurrFrame = RgnsL.GFirstFrame (RefNum2); pCurrFrame;
pCurrFrame = RgnsL.GNextFrame (RefNum2))
{
if (pCurrFrame->WithinFrame (pText))
{
if (pCurrFrame->GArea( ) < BestFrameSize)
{
pText->SetTextType (InsetRegion);
pBestFrame = pCurrFrame;
BestFrameSize = pCurrFrame->GArea( );
}
}
}
if (pBestFrame)
{
pBestFrame->AddInsetWithin (pText->GRgnId( ));
}
The above code fragment finds the smallest frame that encloses each eligible text region, and declares the region to be an inset of that frame, using the AddInsetWithin function. Each frame now includes a list of any insets that are a part of it. If the frame also contains a picture, the text might actually be a caption, and not an inset, but for the purposes of the present embodiment captions and insets are treated the same, and this distinction does not have to be made. It will be appreciated that a caption can often be associated with a figure, and that the format of the caption may have different requirements than the format of an inset. A further description of the methods employed to find captions can be found in U.S. patent application Ser. No. 08/652,766.

There is one remaining condition that could cause the text within a frame to not be called an inset. In pseudo-code:
For each frame:
For each text region initially considered an inset of that frame (region A):
For each other text region also initially considered an inset of that frame (region B):
if both text regions are wider than 19 mm and region A is to the side of region B, then no text regions in that frame are classified as Frame Insets.
where the state of "being to the side of" is determined using the following code:
const INT32 MAX_PERCENT_OFF_CENTER_SIDE = 30;
const RECOMP_UNITS VERT_OVRLAP_MARGIN = 35;
BOOL LA::bToSideOf (const LARgn *pRgnA, const LARgn
*pRgnB)
{
if ((pRgnA->GBot( ) < (pRgnB->GTop( ) +
VERT_OVRLAP_MARGIN))
|| (pRgnB->GBot( ) < (pRgnA->GTop( ) +
VERT_OVRLAP_MARGIN))
|| PercentOffCenter (*pRgnA, *pRgnB) <
MAX_PERCENT_OFF_CENTER_SIDE)
{
return FALSE;
}
else
{
return TRUE;
}
}
INT32 LA::PercentOffCenter (const LARgn &ROuter, const
LARgn &RInner)
const
{
RECOMP_UNITS ExcessLeft, ExcessRight,
InnerWidth, ExcessDiff;
ExcessLeft = RInner.GLeft( ) - ROuter.GLeft( );
ExcessRight = ROuter.GRight( ) - RInner.GRight( );
InnerWidth = RInner.GWidth( );
ExcessDiff = ABS (ExcessLeft - ExcessRight);
return ((ExcessDiff * 100) / InnerWidth);
}
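The frame-inset steps described above (the containment test, the choice of the smallest enclosing frame, and the side-by-side veto) can be sketched together as follows. This is a minimal illustration under stated assumptions: `Rect`, `withinFrame`, `smallestEnclosingFrame`, and `toSideOf` are hypothetical names, widths are assumed to be computed as right - left + 1, and the region-list iteration of the original is replaced by a `std::vector`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

// Illustrative bounding box; not the original LARgn class.
struct Rect {
    RecompUnits top, left, bottom, right;
    RecompUnits width() const { return right - left + 1; }
    std::int64_t area() const {
        return static_cast<std::int64_t>(width()) * (bottom - top + 1);
    }
};

constexpr RecompUnits kWithinFrameMargin = 30;        // WITHIN_FRAME_MARGIN
constexpr RecompUnits kVertOverlapMargin = 35;        // VERT_OVRLAP_MARGIN
constexpr std::int32_t kMaxPercentOffCenterSide = 30; // MAX_PERCENT_OFF_CENTER_SIDE

// Containment with a tolerance on each edge, plus a strict midpoint test.
bool withinFrame(const Rect& frame, const Rect& text) {
    const RecompUnits xMid = (text.left + text.right + 1) / 2;
    const RecompUnits yMid = (text.top + text.bottom + 1) / 2;
    return text.top >= frame.top - kWithinFrameMargin
        && text.left >= frame.left - kWithinFrameMargin
        && text.bottom <= frame.bottom + kWithinFrameMargin
        && text.right <= frame.right + kWithinFrameMargin
        && xMid > frame.left && xMid < frame.right
        && yMid > frame.top && yMid < frame.bottom;
}

// Index of the smallest frame that contains the text, or -1 if none does.
int smallestEnclosingFrame(const std::vector<Rect>& frames, const Rect& text) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(frames.size()); ++i) {
        if (withinFrame(frames[i], text)
            && (best < 0 || frames[i].area() < frames[best].area())) {
            best = i;
        }
    }
    return best;
}

// How far the inner region is from being horizontally centred on the outer
// one, as a percentage of the inner region's width.
std::int32_t percentOffCenter(const Rect& outer, const Rect& inner) {
    assert(inner.width() > 0);
    const RecompUnits excessLeft = inner.left - outer.left;
    const RecompUnits excessRight = outer.right - inner.right;
    return (std::abs(excessLeft - excessRight) * 100) / inner.width();
}

// Regions are side by side when they overlap vertically by more than the
// margin and are clearly not centred on one another.
bool toSideOf(const Rect& a, const Rect& b) {
    if (a.bottom < b.top + kVertOverlapMargin
        || b.bottom < a.top + kVertOverlapMargin
        || percentOffCenter(a, b) < kMaxPercentOffCenterSide) {
        return false;
    }
    return true;
}
```

Returning an index rather than mutating the region keeps the sketch free of side effects; the original instead marks the text as an inset and records it on the chosen frame.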
Subsequently, step 530, described below, is executed to find credit insets. Credit insets are typically author credits, found at the bottom of an article. For the purposes of this system, only text region insets within the spacing Parameter-D of the bottom of the page, and within the spacing Parameter-E of the left of the page, are considered as possible credit insets, where "Parameter-#" denotes a predetermined or programmable variable. Those are the credit insets that would be output in a fixed location. Other article credits might flow with the end of the article, and therefore would not be considered an inset. Credit insets have less than Parameter-F lines of text, and have a ruling above them that overlaps the X-direction coordinates of the bounding box of the text region by at least Parameter-G percent. In addition, the ruling of a credit inset is within Parameter-H distance of the top of the text region, its width is below Parameter-I, and the height of the text region is typically less than Parameter-J. The following code accomplishes the various calculations and analyses of the find credit insets operation using parameter values that were established empirically:
const RECOMP_UNITS MAX_RULE_DIST = 70;
const RECOMP_UNITS MAX_RULE_OVERLAP_DIST = 30;
const RECOMP_UNITS MAX_CREDIT_INSET_RANGE = 175;
const INT32 MIN_RULE_PRCNT_OVRLP = 90;
const INT32 MAX_RULE_PRCNT_WIDTH = 150;
const INT32 MAX_RULE_PRCNT_PAGE_WIDTH = 35;
const INT32 PERCENT_DOWN_PAGE_CREDIT_INSET = 75;
const INT32 PERCENT_RIGHT_PAGE_CREDIT_INSET = 25;
const INT32 MAX_LINES_FOR_CREDIT_INSET = 6;
const INT32 MAX_CREDIT_HEIGHT = 500;
void LA::DetectCreditInsets( )
{
INT32 RefNum, RefNum2;
LATextRgn *pText;
LAHRulingRgn *pHRule;
RECOMP_UNITS TextTop, TextBot, TextLeft, TextRight, Overlap;
RECOMP_UNITS MinOvrlpX, MaxX;
for (pText = RgnsL.GFirstText (RefNum);
pText; pText = RgnsL.GNextText (RefNum))
{
if (pText->bCanBeSpecial( ))
{
TextBot = pText->GBot( );
if ((pText->OrderingData
&& (pText->OrderingData->GNumRgnsBelow( ) == 0))
&& (TextBot
> (PageData.GTopMostNonSpclText( )
+ (((PageData.GBotMostNonSpclText( )
- PageData.GTopMostNonSpclText( ))
* PERCENT_DOWN_PAGE_CREDIT_INSET)
/100)))
&& (pText->GLeft( )
<(PageData.GLeftMostNonSpclText( )
+ (((PageData.GRightMostNonSpclText( )
- PageData.GLeftMostNonSpclText( ))
* PERCENT_RIGHT_PAGE_CREDIT_INSET)
/100)))
&& (pText->GNumLines( ) <=
MAX_LINES_FOR_CREDIT_INSET))
{
TextTop = pText->GTop( );
TextBot = pText->GBot( );
TextLeft = pText->GLeft( );
TextRight = pText->GRight( );
MinOvrlpX = ((pText->GWidth( ) * MIN_RULE_PRCNT_OVRLP) +
50)
/100;
MaxX = (((TextRight - TextLeft + 1) * MAX_RULE_PRCNT_WIDTH)
/100);
MAX_EQ (MaxX, ((PageData.GUsedWidth( ) *
MAX_RULE_PRCNT_PAGE_WIDTH)
/100));
for (pHRule = RgnsL.GFirstHRuling (RefNum2);
pHRule; pHRule = RgnsL.GNextHRuling (RefNum2))
{
if (!pHRule->GbIsPageFrameRuling( )
&& !pHRule->GbIsNonPageFrameRuling( )
&& ((TextTop - pHRule->GBot( )) <=
MAX_RULE_DIST)
&& (pHRule->GTop( ) <= (TextTop +
MAX_RULE_OVERLAP_DIST))
&& ((TextBot - pHRule->GBot( )) <=
MAX_CREDIT_INSET_RANGE))
{
Overlap = (MIN (pHRule->GRight( ), TextRight)
- MAX (pHRule->GLeft( ), TextLeft)) + 1;
//for HRule, GWidth returns right - left + 1
if ((Overlap >= MinOvrlpX)
&& (pHRule->GWidth( ) <= MaxX))
{
if (pText->GHeight( ) <=
MAX_CREDIT_HEIGHT)
{
pText->SetTextType (InsetRegion);
CallSetObjectBorderFrame (plaObj,
pText->GRgnId( ),
TRUE, FALSE, FALSE, FALSE);
//set Credit Inset
break;
}
}
}
}
}
}
}
}
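The ruling test at the heart of DetectCreditInsets can be isolated as a small pure function. The sketch below is an illustration under stated assumptions: `Rect` and `rulingMarksCreditInset` are hypothetical names, MAX_EQ is assumed to raise MaxX to the larger of its two arguments, and the page-position and line-count checks that precede the ruling loop are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

// Illustrative bounding box; widths and heights are assumed to be inclusive.
struct Rect {
    RecompUnits top, left, bottom, right;
    RecompUnits width() const { return right - left + 1; }
    RecompUnits height() const { return bottom - top + 1; }
};

constexpr RecompUnits kMaxRuleDist = 70;              // MAX_RULE_DIST
constexpr RecompUnits kMaxRuleOverlapDist = 30;       // MAX_RULE_OVERLAP_DIST
constexpr RecompUnits kMaxCreditInsetRange = 175;     // MAX_CREDIT_INSET_RANGE
constexpr std::int32_t kMinRulePercentOverlap = 90;   // MIN_RULE_PRCNT_OVRLP
constexpr std::int32_t kMaxRulePercentWidth = 150;    // MAX_RULE_PRCNT_WIDTH
constexpr std::int32_t kMaxRulePercentPageWidth = 35; // MAX_RULE_PRCNT_PAGE_WIDTH
constexpr RecompUnits kMaxCreditHeight = 500;         // MAX_CREDIT_HEIGHT

// True when a horizontal ruling sits just above the text, covers most of its
// width, and is not much wider than the text (or a share of the page).
bool rulingMarksCreditInset(const Rect& text, const Rect& rule,
                            RecompUnits usedPageWidth) {
    assert(text.width() > 0 && usedPageWidth > 0);
    // Rounded 90 percent of the text width.
    const RecompUnits minOverlap =
        (text.width() * kMinRulePercentOverlap + 50) / 100;
    // Assumption: MAX_EQ keeps the larger of the two limits.
    const RecompUnits maxRuleWidth =
        std::max((text.width() * kMaxRulePercentWidth) / 100,
                 (usedPageWidth * kMaxRulePercentPageWidth) / 100);
    const bool ruleIsJustAbove =
        (text.top - rule.bottom) <= kMaxRuleDist
        && rule.top <= text.top + kMaxRuleOverlapDist
        && (text.bottom - rule.bottom) <= kMaxCreditInsetRange;
    if (!ruleIsJustAbove) {
        return false;
    }
    const RecompUnits overlap =
        std::min(rule.right, text.right) - std::max(rule.left, text.left) + 1;
    return overlap >= minOverlap
        && rule.width() <= maxRuleWidth
        && text.height() <= kMaxCreditHeight;
}
```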
Having described in detail the operations associated with step 500, attention is now turned to the detailed explanation of the operations accomplished in step 700 for identification of center insets, column insets, and stray (or non-column) insets. In a preferred embodiment, these functions are called after the recomposition operation (step 600) has identified sections, reading order, columns, and column widths (e.g., as described in U.S. patent application Ser. No. 08/652,766). However, after the inset classification functions are complete, the recomposition functions that identify reading order, columns, and column widths are preferably rerun to treat each newly found inset in a similar manner as the recomposition function treats captions. The identification of those insets improves the results of reading order, column determination, and column width determination.

Referring to FIG. 5, in a preferred embodiment, one or more functions may be called before the inset classification steps. In particular, the function bInsetLikeFont( ), step 710, tests whether the text region has font and/or reverse video characteristics that are common for insets. A text region is considered to have an inset-like font if the most commonly used font in the region has at least one of the following attributes: bold, italic, reverse video, or a height of at least MIN_TIMES_AV_PTSIZE_FOR_INSET_SIZE=1.5 times the average font height on the page. A further discussion of font attributes can be found in U.S. Pat. No. 5,668,891, issued to Fan et al. on Sep. 16, 1997.

A second function that is preferably invoked before inset classification determines if the text region has main text to the left or right of it, step 720. The pseudocode for such a function is as follows:
for each text region on the page (called "TextA")
if the text region is wider than 19 mm
MAX_TEXT_NOT_TO_SIDE_OVERLAP = 1.5 mm
EffectiveTop = TextA's top + MAX_TEXT_NOT_TO_SIDE_OVERLAP;
EffectiveBot = MAX (TextA's bottom - MAX_TEXT_NOT_TO_SIDE_OVERLAP, EffectiveTop);
for each other text region on the page (called "TextB")
if TextB's top is less than or equal to EffectiveBot
and TextB's bottom is greater than or equal to EffectiveTop
and TextB is wider than 19 mm
then
if TextB's left is left of TextA's left
then TextA is considered to have main text to its left
if TextB's right is right of TextA's right
then TextA is considered to have main text to its right
If text regions meeting the above conditions are found both to the left and to the right of this text region, it is marked as such, so that a call to "there-is-main-text-to-the-left-and-right" for this text region would return true.

Center insets (FIG. 6) are found within columns of main text flow in step 730. In a preferred embodiment, only center insets that span more than one column are found. Although it is possible to find center insets entirely within a column, insets in the middle of a single column are often falsely identified. To be classified as a center inset, a text region must span at least two columns and have an inset-like font (as previously described). Specifically, for each text region on the page that is not a header, footer, caption, or previously identified inset, the region is classified as a center inset if it is wider than 19 mm, has an inset-like font, has main text to its left and right, and satisfies the This-Text-In-Section-Can-Be-A-Center-Inset requirements (code follows).
This-Text-In-Section-Can-Be-A-Center-Inset ( )
const INT32 MAX_CENTER_INSET_COL_OVERLAP_PERCENT = 90;
const INT32 MIN_CENTER_INSET_COL_OVERLAP_PERCENT = 15;
const INT32 MAX_CENTER_INSET_ORIG_2_COL_OVERLAP_PERCENT = 60;
const INT32 MAX_CENTER_INSET_COL_OVRLAP_PCNT_DIFFS = 20;
BOOL LA::bTextInSectCanBeCenterInset (const LATextRgn *pText)
//the section that pText is found, note all references to "pText", indicate a
//section of text, as found by segmentation
//where if there are less than 2 columns, the boolean returns False.
TextLeft = left edge of pText
TextRight = right edge of pText
bFoundLeftCol = FALSE
PercentLeftColOverlap = 0
PercentRightColOverlap = 0
//for each column in the section
ColLeft = the left edge of that column
ColRight = the right edge of that column
if ((TextLeft < ColLeft) && !bFoundLeftCol)
{//started outside of any col
return FALSE;
}
if ((TextRight <= ColRight) && (TextRight >= ColLeft))
{//text ends in this col
if ((TextLeft >= ColLeft) && !bFoundLeftCol)
{//in only one col
return FALSE;
}
else
{
PercentRightColOverlap
= ((TextRight - ColLeft + 1) * 100)
/(ColRight - ColLeft + 1);
if ((PercentRightColOverlap >
MAX_CENTER_INSET_COL_OVERLAP_PERCENT)
|| (PercentRightColOverlap <
MIN_CENTER_INSET_COL_OVERLAP_PERCENT)
|| (ABS (PercentRightColOverlap -
PercentLeftColOverlap)
> MAX_CENTER_INSET_COL_OVRLAP_PCNT_DIFFS))
{//might still end in another col also.
continue;
}
else
{
return TRUE;
}
}
}
if ((TextLeft >= ColLeft) && (TextLeft <= ColRight))
{//text starts in this col
bFoundLeftCol = TRUE;
PercentLeftColOverlap
= ((ColRight - TextLeft + 1) * 100)
/(ColRight - ColLeft + 1);
if ((PercentLeftColOverlap >
MAX_CENTER_INSET_COL_OVERLAP_PERCENT)
|| (PercentLeftColOverlap <
MIN_CENTER_INSET_COL_OVERLAP_PERCENT))
{
return FALSE;
}
}
} //if the above function ran through each column without returning, it will
//return FALSE now.
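The column-overlap pseudocode above can be made runnable with a few simplifications. In the sketch below the `Column` struct and the function name are hypothetical, the columns of the section are assumed to be supplied left to right in a `std::vector`, and the text region is reduced to its left and right edges.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

// Illustrative column extent; columns are assumed to be ordered left to right.
struct Column {
    RecompUnits left, right;
};

constexpr std::int32_t kMaxOverlapPercent = 90;  // MAX_CENTER_INSET_COL_OVERLAP_PERCENT
constexpr std::int32_t kMinOverlapPercent = 15;  // MIN_CENTER_INSET_COL_OVERLAP_PERCENT
constexpr std::int32_t kMaxOverlapDiff = 20;     // MAX_CENTER_INSET_COL_OVRLAP_PCNT_DIFFS

// True when the text starts part-way into one column, ends part-way into a
// later column, and intrudes into both by a similar percentage.
bool textCanBeCenterInset(RecompUnits textLeft, RecompUnits textRight,
                          const std::vector<Column>& cols) {
    if (cols.size() < 2) {
        return false;
    }
    bool foundLeftCol = false;
    std::int32_t percentLeft = 0;
    for (const Column& col : cols) {
        const RecompUnits colWidth = col.right - col.left + 1;
        assert(colWidth > 0);
        if (textLeft < col.left && !foundLeftCol) {
            return false;  // started outside of any column
        }
        if (textRight <= col.right && textRight >= col.left) {
            if (textLeft >= col.left && !foundLeftCol) {
                return false;  // lies in only one column
            }
            const std::int32_t percentRight =
                ((textRight - col.left + 1) * 100) / colWidth;
            if (percentRight > kMaxOverlapPercent
                || percentRight < kMinOverlapPercent
                || std::abs(percentRight - percentLeft) > kMaxOverlapDiff) {
                continue;  // mirrors the pseudocode: keep scanning columns
            }
            return true;
        }
        if (textLeft >= col.left && textLeft <= col.right) {
            foundLeftCol = true;
            percentLeft = ((col.right - textLeft + 1) * 100) / colWidth;
            if (percentLeft > kMaxOverlapPercent
                || percentLeft < kMinOverlapPercent) {
                return false;
            }
        }
    }
    return false;
}
```

With three columns each 1000 units wide, a region that takes the right 40 percent of the first column and the left 40 percent of the second passes, while one that takes 40 percent of the first and 90 percent of the second fails the difference limit.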
Another type of center inset, with somewhat different conditions, may also be found. For each text region on the page:
if the text region is wider than 19 mm,
and it contains at least MIN_MULTI_COL_INSET_NUM_LINES = 2 lines of text,
and it has an inset-like-font (defined above),
and there is a text region above it, lined up with it and at least 19 mm in width,
and there is a text region below it, lined up with it and at least 19 mm in width,
and the-Text-In-Section-Can-Be-MultiColInset is TRUE,
then it is classified as a center inset. The following is the code for Text-In-Section-Can-Be-MultiColInset ( ).
const RECOMP_UNITS MULTI_COL_RANGE_MARGIN = 25;
BOOL LA::bTextInSectCanBeMultiColInset (const LATextRgn *pText)
//the section that pText is found, note all references to "pText", indicate a
//section of text, as found by segmentation
//if there are less than 2 columns, return False.
TextLeft = left edge of pText
TextRight = right edge of pText
bFoundLeftCol = FALSE
//for each column in the section
ColLeft = the left edge of that column
ColRight = the right edge of that column
if (((TextLeft + MULTI_COL_RANGE_MARGIN) < ColLeft)
&& !bFoundLeftCol)
{ //started outside of any col
return FALSE;
}
if (TextRight <= (ColRight + MULTI_COL_RANGE_MARGIN))
{ //text ends in this col
if ((TextLeft >= ColLeft) && !bFoundLeftCol)
{ //in only one col
return FALSE;
}
else
{
if (!bColumnTextAboveAndBelow (pText))
{
return FALSE;
}
return TRUE;
}
}
if ((TextLeft <= (ColLeft + MULTI_COL_RANGE_MARGIN))
&& ((TextLeft + MULTI_COL_RANGE_MARGIN) >=
ColLeft))
{ //text starts in this col
bFoundLeftCol = TRUE
}
} //if it ran through each column without returning, it will return
//FALSE now
where bColumnTextAboveAndBelow (pText) does the following:
const RECOMP_UNITS MAX_COL_TO_INSET_RANGE = 200;
const INT32 MIN_LINES_FOR_COL = 6;
const INT32 MAX_PERCENT_OF_INSET_WIDTH_IS_COL = 90;
BOOL LA::bColumnTextAboveAndBelow (const LATextRgn
*pText)
{
/* end C++ code */
BOOL bFoundCol = FALSE;
if there is a text region above it, lined up with it, and at least 19 mm
in width, and there is a text region below it, lined up with it of
at least 19 mm in width,
then MaxColWidth = ((pText's width * MAX_PERCENT_OF_INSET_WIDTH_IS_COL) /100);
//Since the inset for a multi column case, even if just the 1-1 col type, has to be
//wider than the columns above and below it:
for each text region, TextB, immediately below pText
if TextB is wider than 19 mm
and TextB has at least MIN_LINES_FOR_COL lines
and TextB is narrower or equal to MaxColWidth
and (TextB's top - pText's bottom) <= MAX_COL_TO_INSET_RANGE
then
bFoundCol = TRUE;
break;
else
continue
if, after checking each TextB bFoundCol is still False
then return FALSE;
bFoundCol = FALSE;
pTextUpperRight = NIL;
for each text region, TextB, immediately above pText
if TextB is wider than 19 mm
and TextB has at least MIN_LINES_FOR_COL lines
and TextB is narrower or equal to MaxColWidth
and (pText's top - TextB's bottom) <= MAX_COL_TO_INSET_RANGE
and there is no ruling between pText and TextB
then
if ((!pTextUpperRight)
|| (pTextB->GRight( ) > pTextUpperRight->GRight( )))
{
pTextUpperRight = pTextB;
}
bFoundCol= TRUE;
if, after checking each TextB bFoundCol is still False
then return FALSE;
else
if (pTextUpperRight->GIsTextContinues( ) == False)
{
//Ruled out inset (might not be anyway)
//since upper right text doesn't continue
return FALSE;
}
return TRUE;
}
The function pTextUpperRight->GIsTextContinues( ) checks whether a particular text region is likely to have ended in a hard return (in which case it returns FALSE), or whether it likely continues somewhere else on the page (in which case it returns TRUE). In further identifying the center insets, the following operations may be employed (and may also be used in the recomposition operations previously described). Additional data is collected as follows: Rights is an array of the distance from the beginning of the page to the end of each line in the text region; Lefts is an array of the distance from the beginning of the page to the beginning of each line in the text region; NumLines is the number of text lines in the text region; AvLeft is the mean of all the Lefts for this text region; and AvRight is the mean of all the Rights for this text region.
const INT32 MIN_PERCENT_LINE_WIDTH_TO_CONTINUE = 94;
// Now determine if last line continues on same page.
if ((Rights[NumLines - 1] - Lefts[NumLines - 1] + 1)
< (((AvRight - AvLeft + 1)
* MIN_PERCENT_LINE_WIDTH_TO_CONTINUE)
/100))
{
SetTextContinues (False);
}
else
{
SetTextContinues (True);
}
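The last-line test above can be exercised in isolation. The following sketch is an assumption-laden illustration: `textLikelyContinues` is a hypothetical name, the line edges are passed as vectors rather than read from a text region, and the means are computed with integer division, which the original does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

constexpr std::int64_t kMinPercentLineWidthToContinue = 94;

// Returns true when the last line is nearly as wide as the average line,
// which suggests the text ran out of room rather than ending on a hard return.
bool textLikelyContinues(const std::vector<RecompUnits>& lefts,
                         const std::vector<RecompUnits>& rights) {
    assert(!lefts.empty() && lefts.size() == rights.size());
    const auto numLines = static_cast<std::int64_t>(lefts.size());
    // Integer means of the line start and end positions (an assumption).
    const std::int64_t avLeft =
        std::accumulate(lefts.begin(), lefts.end(), std::int64_t{0}) / numLines;
    const std::int64_t avRight =
        std::accumulate(rights.begin(), rights.end(), std::int64_t{0}) / numLines;
    const std::int64_t lastWidth = rights.back() - lefts.back() + 1;
    const std::int64_t threshold =
        ((avRight - avLeft + 1) * kMinPercentLineWidthToContinue) / 100;
    return lastWidth >= threshold;
}
```

Note that the short last line also lowers the average, so the test is slightly more lenient than comparing against the other lines alone.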
It will be appreciated that additional "rules" may be employed to improve or alter the robustness of the identification algorithm presented herein. For example, the step of finding center insets (730) may also include additional steps such as:
(i) determining whether the text region 730r has columnar text regions 730a, 730b or 730c positioned to at least one side thereof;
(ii) determining whether the text region 730r spans at least two columns 730a-730c;
(iii) determining whether a width dimension (730w) of the text region is greater than a predefined width (I);
(iv) identifying a left-hand column 730a where a left edge of the text region is located, then determining the portion of the left-hand column's width (732) that the text region comprises and determining that the text region is a center inset whenever the portion is greater than a predetermined value (A) and less than a second predetermined value (B);
(v) identifying a right-hand column where a right edge of the text region is located, then determining the portion of the right-hand column's width (734) that the text region comprises and determining that the text region is a center inset whenever the portion is greater than the predetermined value (A) and less than the predetermined value (B);
(vi) determining that a column portion difference value, representing the difference between the portion of the left-hand column's width and the portion of the right-hand column's width, is less than a predetermined value (C); determining whether the text region has a number of text lines at least equal to a predetermined value (F);
(vii) determining whether there is another text region above and vertically aligned with said text region, and having a width greater than the predefined width (I);
(viii) determining whether there is another text region below and vertically aligned with said text region, and having a width greater than the predefined width (I);
(ix) determining whether there is a first column having a left edge, and whether a left edge of said text region is located within a predetermined distance (D1) of the left edge of the first column;
(x) determining whether there is a second column having a right edge, and whether a right edge of said text region is located within a predetermined distance (D2) of the right edge of the second column;
(xi) determining whether there is columnar text above and below the text region, wherein the columnar text regions above and below the text region are each lined up with the text region, and where each columnar text region includes at least a predetermined number of lines (G), and each columnar text region has a width that is less than a predetermined percentage (H) of the width of the text region, and wherein each columnar text region is closer than a predetermined vertical distance (I) from the text region; or
(xii) determining that a text region immediately above the text region is likely to continue on the same page.

The process of determining if a columnar text region interrupted by a center inset (e.g., region 730b) is likely to continue on the same page would preferably include the steps of (a) determining the width of the last line of the text region above (736), (b) determining the width of the average line of that text region (also equal to 736 in FIG. 7, but not necessarily so), and (c) determining that it is unlikely to continue when the width of the last line is less than a predetermined percentage (K) of the width of the average line of that text region.

After finding center insets, the find column insets process is initiated at step 740, where columns off to one side of the main text flow (e.g., inset 742 in FIG. 8) are identified. In a preferred embodiment, only the leftmost and rightmost columns within a section are considered to be possible column insets, as consideration of other columns would likely lead to a higher incidence of false identification as a column inset.
Thus, there must be at least two, and preferably at least three, columns in order for a section of the image to have a column inset. Leftmost and rightmost columns are considered to be column insets if they meet column width requirements. For purposes of explanation, the tests will be characterized for either a leftmost or rightmost column, but one skilled in the art of page layout analysis will appreciate that the tests may be applied to the leftmost and/or rightmost columns of a page section. In one test, if the leftmost or rightmost column is less than Parameter-O percent of the average width of the other columns, and has less than Parameter-P percent of the average number of lines of the other columns, it may be a column inset. Furthermore, Parameter-O and Parameter-P are preferably different if the column has an inset-like font. With const INT32 MIN_COLS_FOR_COL_INSET=3; the following operations are executed for each section on the page:
/* Column Insets are only detected for sections that have at least
MIN_COLS_FOR_COL_INSET columns,
as found by the recomposition operation */
FirstColWid = 0
SecondColWid = 0
PenultimateColWid = 0
LastColWid = 0
LastColNumLines = 0
bFirstColBoldItRev = FALSE
bLastColBoldItRev = FALSE
AvRightColWidths = 0
AvLeftColWidths = 0
AvRightColNumLines = 0
AvLeftColNumLines = 0
for each column in section
ColLeft is the column left edge
ColRight is the column right edge
ColWidth is the column width
if this is the first column in the section
then
FirstColWid = ColWidth;
FirstColNumLines = the number of lines in this column
bFirstColBoldItRev = TRUE if this column's most common
font is Bold, Italic, or if it is in Reverse video
else
AvRightColWidths += ColWidth;
AvRightColNumLines += number of lines in this column
if this is the second column in the section
then
SecondColWid = ColWidth;
SecondColNumLines = number of lines in this column
if this is the second to last column in this section
then
PenultimateColWid = ColWidth;
PenultimateColNumLines = number of lines in this column
if this is the last column in the section
LastColWid = ColWidth;
LastColNumLines = number of lines in this column
bLastColBoldItRev = TRUE if this column's most common
font is Bold, Italic, or if it is in Reverse video
if this is not the last column in the section
AvLeftColWidths += ColWidth;
AvLeftColNumLines += number of lines in this column
// now all the columns have been checked
AvLeftColWidths /= (NumCols - 1);
AvRightColWidths /= (NumCols - 1);
AvLeftColNumLines /= (NumCols - 1);
AvRightColNumLines /= (NumCols - 1);
// now check if the first column is an inset
ThisColWid = FirstColWid
NextColWid = SecondColWid
OppositeColWid = LastColWid
AvColWid = AvRightColWidths
ThisColNumLines = FirstColNumLines
NextColNumLines = SecondColNumLines
AvColNumLines = AvRightColNumLines
bThisBoldItRev = bFirstColBoldItRev
as specifically represented by the function bIsColInset below:
const INT32 MAX_BIR_PERCENT_INSET_COL_WID = 75;
const INT32 MAX_BIR_PERCENT_INSET_NUM_LINES = 65;
const INT32 MAX_PERCENT_INSET_COL_WID = 65;
const INT32 MAX_PERCENT_INSET_NUM_LINES = 30;
if ((bThisBoldItRev
&& ((ThisColWid < MIN (NextColWid, OppositeColWid))
&& ((ThisColWid * 100)
< (MIN (AvColWid, NextColWid)
* MAX_BIR_PERCENT_INSET_COL_WID))
&& ((ThisColNumLines * 100)
< (MIN (AvColNumLines, NextColNumLines)
* MAX_BIR_PERCENT_INSET_NUM_LINES))))
|| (!bThisBoldItRev
&& ((ThisColWid < MIN (NextColWid, OppositeColWid))
&& ((ThisColWid * 100)
< (MIN (AvColWid, NextColWid)
* MAX_PERCENT_INSET_COL_WID))
&& ((ThisColNumLines * 100)
< (MIN (AvColNumLines, NextColNumLines)
* MAX_PERCENT_INSET_NUM_LINES)))))
{
return TRUE;
}
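The two-branch condition above differs only in which pair of percentage limits applies, so it can be restated compactly. The sketch below is one possible equivalent form, not the original function: `ColumnStats` and `isColumnInset` are hypothetical names, and the inputs correspond to the ThisCol, NextCol, OppositeCol, and Av values gathered in the preceding pseudocode.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

// Illustrative bundle of the per-column values gathered by the pseudocode.
struct ColumnStats {
    RecompUnits thisWidth;      // ThisColWid
    RecompUnits nextWidth;      // NextColWid
    RecompUnits oppositeWidth;  // OppositeColWid
    RecompUnits avgWidth;       // AvColWid
    std::int32_t thisLines;     // ThisColNumLines
    std::int32_t nextLines;     // NextColNumLines
    std::int32_t avgLines;      // AvColNumLines
    bool boldItalicOrReverse;   // bThisBoldItRev
};

// An outer column is a candidate inset when it is narrower than its
// neighbours and clearly smaller than the other columns in both width and
// line count; inset-like fonts get looser limits.
bool isColumnInset(const ColumnStats& s) {
    assert(s.thisWidth >= 0 && s.thisLines >= 0);
    const std::int32_t widthPercent = s.boldItalicOrReverse ? 75 : 65;
    const std::int32_t linesPercent = s.boldItalicOrReverse ? 65 : 30;
    return s.thisWidth < std::min(s.nextWidth, s.oppositeWidth)
        && s.thisWidth * 100 < std::min(s.avgWidth, s.nextWidth) * widthPercent
        && s.thisLines * 100 < std::min(s.avgLines, s.nextLines) * linesPercent;
}
```

Selecting the limits first and applying one set of comparisons keeps the three tests in a single place, which makes it harder for the bold and non-bold branches to drift apart.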
Next, bIsColInset is called again for the last column, with the following values, to determine whether any insets are present within the column:
ThisColWid = LastColWid
NextColWid = PenultimateColWid
OppositeColWid = FirstColWid
AvColWid = AvLeftColWidths
ThisColNumLines = LastColNumLines
NextColNumLines = PenultimateColNumLines
AvColNumLines = AvLeftColNumLines
bThisBoldItRev = bLastColBoldItRev

It will be appreciated by those skilled in the art of document analysis that various modifications and alterations to the above-described identification techniques may be employed based upon empirical knowledge about the documents. For example, as with any of the parameter-based tests, it is possible to utilize a programmable variable for the parameter so that one could simply alter the parameter value to obtain improved performance. It is also contemplated that the programmable variable may be adjusted automatically by the system after an analysis of a plurality of "test" (exemplary) documents so as to determine typical parameter values.

Lastly, stray or non-column insets are detected at step 750. It may be preferable to perform a page recomposition operation before initiating the stray inset detection operation, because stray insets are other bits of text, not properly part of the text flow and not previously identified as an inset. If the rules that determined columns did not include a column where such text is located, the text is considered to be a stray inset. Furthermore, stray insets must be Parameter-Q distance outside of the outermost column if they do not have an inset-like font, or their midpoint can be within Parameter-R (which may be a positive or negative value) of the column end if they do have an inset-like font. The width of a stray inset, multiplied by Parameter-S, must be less than the width of the columns of the closest section. To identify stray insets, for each section on the page:
const INT32 MAX_LINES_FOR_OTHER_NON_COL_TEXT = 4;
const RECOMP_UNITS MAX_CENTER_OVERLAP_SIDE_INSET = 120;
const INT32 MAX_NON_COL_INSET_PERCENT_WITHIN_COL = 80;
const INT32 MAX_SIDE_INSET_WIDTH_PERCENT_OF_COLS = 80;
ColsLeft = start of first column for this section
ColsRight = end of last column of this section
for each text region in this section that meets this condition
((TextWidth * 100)
<= ((ColsRight - ColsLeft + 1)
* MAX_SIDE_INSET_WIDTH_PERCENT_OF_COLS))
if ((!pText->GFullColWidth( ))
&& ((pText->GRight( ) < ColsLeft)
|| (pText->GLeft( ) > ColsRight)))
{
pText->SetTextType (InsetRegion);
}
else if ((!pText->GFullColWidth( )
|| pText->GNumLines( )
<= MAX_LINES_FOR_OTHER_NON_COL_TEXT)
&& (((pText->AverageOfLeftAndRightEdges( )
< (ColsLeft + MAX_CENTER_OVERLAP_SIDE_INSET))
&& (((pText->GRight( ) - ColsLeft + 1) * 100)
<= (TextWidth
* MAX_NON_COL_INSET_PERCENT_WITHIN_COL)))
|| ((pText->AverageOfLeftAndRightEdges( )
>= (ColsRight - MAX_CENTER_OVERLAP_SIDE_INSET))
&& (((ColsRight - pText->GLeft( ) + 1) * 100)
<= (TextWidth
* MAX_NON_COL_INSET_PERCENT_WITHIN_COL))))
&& (pText->bInsetLikeFont (PageData.GAvFontPtSizeTimes5( ))))
{
pText->SetTextType (InsetRegion);
}
where pText->GFullColWidth( ) is true if pText is wider than 19 mm, and pText->GNumLines( ) is the number of lines of text in pText.
const INT32 MAX_LINES_FOR_OTHER_NON_COL_TEXT = 4;
const RECOMP_UNITS MAX_CENTER_OVERLAP_SIDE_INSET = 120;
const INT32 MAX_NON_COL_INSET_PERCENT_WITHIN_COL = 80;
const INT32 MAX_SIDE_INSET_WIDTH_PERCENT_OF_COLS = 80;
BOOL LA::FindStrayInsets (const LASect * const pSect,
const INT32 SectNum,
const INT32 * const pColDimsPairs)
{
BOOL rValue = FALSE;
INT32 RefNum;
LATextRgn *pText;
RECOMP_UNITS ColsLeft = pColDimsPairs[0];
RECOMP_UNITS ColsRight, TextWidth, TextSumX;
if (pSect->GNumCols( ) < 1)
{
return rValue;
}
ColsRight = pColDimsPairs[(pSect->GNumCols( ) * 2) - 1];
for (pText = RgnsL.GFirstText (RefNum);
pText; pText = RgnsL.GNextText (RefNum))
{
TextWidth = pText->GWidth( );
if ((!pText->bCanBeSpecial( ))
|| !(pText->GSectNum( ) == SectNum)
|| ((TextWidth * 100)
> ((ColsRight - ColsLeft + 1)
* MAX_SIDE_INSET_WIDTH_PERCENT_OF_COLS)))
{
continue;
}
TextSumX = pText->GSumX( );
if ((!pText->GFullColWidth( ))
&& ((pText->GRight( ) < ColsLeft)
|| (pText->GLeft( ) > ColsRight)))
{
pText->SetTextType (InsetRegion);
rValue = TRUE;
}
else if ((!pText->GFullColWidth( )
|| pText->GNumLines( )
<= MAX_LINES_FOR_OTHER_NON_COL_TEXT)
&& ((((TextSumX / 2)
< (ColsLeft + MAX_CENTER_OVERLAP_SIDE_INSET))
&& (((pText->GRight( ) - ColsLeft + 1) * 100)
<= (TextWidth
* MAX_NON_COL_INSET_PERCENT_WITHIN_COL)))
|| (((TextSumX / 2)
>= (ColsRight -
MAX_CENTER_OVERLAP_SIDE_INSET))
&& (((ColsRight - pText->GLeft( ) + 1) * 100)
<= (TextWidth
* MAX_NON_COL_INSET_PERCENT_WITHIN_COL))))
&& (pText->bInsetLikeFont (PageData.GAvFontPtSizeTimes5( ))))
{
pText->SetTextType (InsetRegion);
rValue = TRUE;
}
}
return rValue;
}
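The per-region decision inside FindStrayInsets can also be expressed as a pure function, which makes the individual rules easier to test. The sketch below rests on several assumptions: `TextRegion` and `isStrayInset` are hypothetical names, GSumX is taken to be the sum of the left and right edges (so that TextSumX / 2 is the horizontal midpoint), the 19 mm full-column threshold is expressed as 190 units of 0.1 mm, and the font test is supplied as a precomputed flag.

```cpp
#include <cassert>
#include <cstdint>

using RecompUnits = std::int32_t;  // assumed to be 0.1 mm per unit

constexpr std::int32_t kMaxLinesForOtherNonColText = 4;
constexpr RecompUnits kMaxCenterOverlapSideInset = 120;
constexpr std::int32_t kMaxNonColInsetPercentWithinCol = 80;
constexpr std::int32_t kMaxSideInsetWidthPercentOfCols = 80;
constexpr RecompUnits kFullColWidth = 190;  // 19 mm under the unit assumption

// Illustrative summary of a text region; not the original LATextRgn class.
struct TextRegion {
    RecompUnits left, right;
    std::int32_t numLines;
    bool insetLikeFont;
};

bool isStrayInset(const TextRegion& t, RecompUnits colsLeft,
                  RecompUnits colsRight) {
    assert(t.right >= t.left && colsRight >= colsLeft);
    const RecompUnits width = t.right - t.left + 1;
    const bool fullColWidth = width > kFullColWidth;
    // Regions nearly as wide as the whole column span are never stray insets.
    if (width * 100 > (colsRight - colsLeft + 1) * kMaxSideInsetWidthPercentOfCols) {
        return false;
    }
    // Narrow text lying entirely outside the columns qualifies outright.
    if (!fullColWidth && (t.right < colsLeft || t.left > colsRight)) {
        return true;
    }
    // Otherwise the region must be narrow or short, sit near one outer edge
    // with at most 80 percent of its width inside the columns, and have an
    // inset-like font.
    const RecompUnits mid = (t.left + t.right) / 2;
    const bool narrowOrShort =
        !fullColWidth || t.numLines <= kMaxLinesForOtherNonColText;
    const bool nearLeftEdge =
        mid < colsLeft + kMaxCenterOverlapSideInset
        && (t.right - colsLeft + 1) * 100 <= width * kMaxNonColInsetPercentWithinCol;
    const bool nearRightEdge =
        mid >= colsRight - kMaxCenterOverlapSideInset
        && (colsRight - t.left + 1) * 100 <= width * kMaxNonColInsetPercentWithinCol;
    return narrowOrShort && (nearLeftEdge || nearRightEdge) && t.insetLikeFont;
}
```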
Practically, the "rules" employed in the stray inset determination operations may include one or more of the following: (a) determining whether the text region has less than a predetermined number of lines (e.g., 5 lines); (b) determining a distance from a left edge of a leftmost column in a text region to a right edge of a rightmost column in the text region, wherein the text region is a part of a document having columnar regions, and identifying as a stray inset a text region having a width less than a predetermined percentage (S) of the distance; (c) determining whether the width of the text region is narrower than a predefined width (I); (d) determining whether a portion of the text region lies outside of a columnar space defined by the outermost edges of outermost columns within a section of the document; (e) whenever a portion of the text region lies within the columnar space, determining if the text region has less than a predetermined number of lines (T), and if so, characterizing the text region as a stray inset; or (f) whenever a portion of the text region lies within the columnar space, determining that the width of the text region is less than a predetermined fraction of the width of the columnar space.

It is, therefore, apparent that there has been provided, in accordance with the present invention, a method and apparatus for page layout analysis, including a method for detecting and classifying insets within various portions of a document page. It will be appreciated by those skilled in the art that various implementations and modifications to the above-described embodiment may be accomplished to meet the requirements of alternative embodiments. Such natural extensions or modifications of the method described herein for inset identification are included within the spirit and scope of the invention as defined by the appended claims.
