Document page analyzer and method5848184Abstract Apparatus and method are provided which determine the geometric and logical structure of a document page from its image. The document image is partitioned into regions (both text and non-text) which are then organized into related "article" groups in the correct reading order. The invention uses image-based features, text-based features, and assumptions based on knowledge of expected layout, to find the correct reading order of the text blocks on a document page. It can handle complex layouts which have multiple configurations of columns on a page and inset material (such as figures and inset text blocks). The apparatus comprises two main components, a geometric page segmentor and a logical page organizer. The geometric page segmentor partitions a binary image of a document page into fundamental units of text or non-text, and produces a list of rectangular blocks, their locations on the page in points (1/72 inch), and the locations of horizontal rule lines on the page. The logical page organizer receives a list of text region locations, rule line locations, associated ASCII text (as found from an OCR) for the text blocks, and a list of text attributes (such as face style and point size). The logical page organizer groups appropriately the components (both text and non-text) which comprise a document page, sequences them in a correct reading order and establishes the dominance pattern (e.g., find the lead article on a newspaper page). Claims What is claimed is: Description A portion of the disclosure of this patent document contains material which is subject to copyright protection. The owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
TABLE 1
______________________________________
Cluster the blocks by starting x value:
››A, B!, ›C, D, E!, ›F, G!, ›H, I!, ›J!!
Construct all possible column chains:
B E F H
B D H
B D I J
B C
B E G J
B F F I J
______________________________________
Choose the chain corresponding to the largest area:
B D H
______________________________________
The method is sufficiently general that it does not matter whether all of the columns are the same width. A basic method for creating the geometric structure tree of a document page is shown in FIGS. 29, 30, 31 and 32. The generalization and improvement of this process is shown in FIGS. 26, 27 and 28. With regard to operation of tree generating module 34, we will first discuss creation of the geometric structure tree where a page contains multiple sets of column bounds. FIG. 26 shows sequence of operation of module 34. For example, consider a page from a single-articled document such as the lead page of an article from a technical journal, such as that illustrated in FIG. 2. It is readily apparent from FIG. 2 that constraint (1) of the section of this application entitled "Background of the Invention" does not hold because the partitioning into columns in the region where the author block and abstract are located is not the same as that lower where the running text begins. Still, in each region there is some column-based layout. This means that the page can be viewed as a composition of horizontal bands, each of which has internal column consistency. Thus, a geometric structure tree can be constructed within each band, after which the individual trees can be merged, by treating each as a child of the same "root", to form a single tree for the page. The results of the X-Y tree partitioning are used to specify the column zones (those horizontal bands in which there is internal column consistency) of the page, as shown in FIG. 7. The problem is then reduced to determining these bands or zones. As discussed above, segmentor 12 utilizes X-Y partitioning followed by RLSA and connected components. X-Y partitioning is used initially to set local parameters in guiding RLSA and connected components. The outputs of these respective processes can be characterized as being a coarse-grained segmentation and a fine-grained segmentation, respectively. An additional benefit of this sequential method is realized here when, as shown in FIG. 7, the resulting coarse blocks fall into column zones. Now, consider the case of pages with non-columnaligned inset blocks, wherein constraint (2) of the section of this application entitled "Background of the Invention" does not apply. Sequence of operation is shown in FIGS. 27 and 28. Modern layouts, particularly in popular periodicals, often use inset blocks which are not bilaterally column aligned, i.e., the inset block does not adhere to a column boundary on at least one side. Even The Wall Street Journal, whose front page format is otherwise regular, embeds on its front page portraits which span only partway across a column and which have text flowing around them. Several examples of inset blocks are illustrated in FIG. 13, which shows a typical magazine page layout in which several examples of inset blocks are shown; as illustrated, such blocks need not be constrained to a single column. An OCR has difficulty segmenting this sort of scenario, not just because block outlines overlap, but rather because they do so in a way which is somewhat arbitrary and not compatible with producing the kind of block list needed for tree construction. Rather than try to coerce the existing tree building algorithm to deal with non-column-aligned objects, the present invention instead utilizes an overlay plane for blocks which are not column-aligned. Separate trees are formed for the base plane and the overlay plane. These trees are then merged. All blocks on the page are partitioned into those which are column aligned and those which are not. A geometric structure tree is then formed for each partition. The resulting trees are merged by taking each node from the overlay tree and inserting it into the base tree. In accomplishing this tree merger, it is assumed that an inset block belongs to the same topical entity (an article in the case of a newspaper) as its surrounding text. Further, an inset block is deemed to be of secondary importance relative to that text, so the present invention makes its node the youngest child of the text's parent node. This process of tree formation and merger is illustrated in FIG. 9. Employing the overlay concept in tandem with the bounding box method of determining block edges accomplished by merger 24 (described above) not only facilitates tree construction over this wider domain, but allows the user to do so with the minimal number of objects to be processed. One comparison example is illustrated in FIGS. 10 and 11. In FIG. 11, there are only 3 segments: A and C which are each half of the total area, and B, which is overlaid on A. Under the uniplanar paradigm illustrated in FIG. 10, in general the fragmentation occurring in each inset block would mean (for this example) two more blocks to handle. This is true regardless of the category of (non-column-aligned) inset block. An enumeration of categories based on the dimensions of propriety and degree of alignment is shown in FIG. 12. The classification of FIG. 12 relies on the following definitions. An inset block (or group) is proper if it is contained within a single column; it is improper if it spills over column boundaries. In most cases this corresponds to the arithmetic notion of proper and improper fractions; however, as shown in part (f) of FIG. 12, an inset block which encroaches on two host blocks, but whose width does not exceed that of either, is still considered improper. Module 36 performs logical transformation on the geometric structure tree produced by module 34. FIG. 33 shows sequence of operation of module 34. For example, the present invention can be implemented as follows. Document images can be captured using a Xerox.TM. Imaging Systems' Model 1700 scanner. Optical character recognition can be performed with Xerox.TM. Imaging Systems' ScanWorX.TM. OCR software. Analyzer 10 can for example be implemented on a Sun.TM. SPARCstation.TM. with the Unix.TM. operating system. The scanning can be at for example 150 to 300 dpi (dots per inch). Gray level images can be binarized, and all binary images can be reduced to 75 dpi before geometric segmentation. The low-level, pixel-based operations required for geometric page segmentor 12 can be implemented in the C computer language, and the high-level manipulations required for logical page organizer 14 can be implemented in the Prolog computer language. Quintus.TM. Prolog has a C interface which allows Prolog to call C and vice-versa. Analyzer 10 can be implemented in software. An example of software, in the respective languages described in the preceding paragraph, implementing the various modules of FIG. 1 is given in the following pages and set forth as source code in the attached software appendix. FIGS. 14-33 are the flowcharts for this software. The respective caption for each of FIGS. 14-33 identifies the respective software represented by each such flowchart. The operation of the software will be explained below with reference to FIGS. 14-33. FIG. 14 is a flowchart of the main program for which the corresponding code can be found in the appendix as the process main.c. Step 100, as shown in FIG. 14, detects and removes horizontal rule lines. FIG. 15 is a flowchart for the Detect and Remove Lines step 100 shown in FIG. 14. Step 100 corresponds to source code which can be found in the appendix as the process lines.c. At step 1100, shown in FIG. 15, the first row of the image is selected. It is then determined at step 1105 whether the last row has been processed. If all of the rows have been processed, the horizontal rule lines are record at step 1135. If all of the rows have not been processed, the pixel run length for the selected row is determined at step 1110. The number of pixels in the run length is compared, at step 1115, with a specified value T. If the number of pixels is greater than T, the pixels are eliminated, at step 1130, from the run and it is determined, at step 1120, if the end of the selected line has been reached. If the number of pixels is not greater than T, it is determined, at step 1120, if the end of the selected line has been reached. If the end of the line has not been reached, the find pixels step 1110 is repeated. If the end of the line has been reached, the next row is selected at step 1125 and it is determined again, at step 1105, if all of the rows have been processed. Referring to FIG. 14, after the processing for step 100 is completed, the XY Tree Segmentation step 101 is performed. FIG. 16 is a flowchart for step 101 shown in FIG. 14. Step 101 corresponds to source code xytree.c in the appendix. The flowchart shown in FIG. 16 illustrates how a tree representation of the document is produced by finding nested rectangles in the image. At each step of the recursive process of image decomposition, a rectangle is subdivided into smaller rectangles by making cuts along a predetermined direction. Each node in the tree corresponds to a rectangle, i.e. a segmented block at the current level. Successors of a node correspond either to a set of rectangles obtained by horizontal partitions of the parent rectangle, or to a set of rectangles obtained by vertical partitions. The location of the gap and the width of the gap are determined using a Hough transform. If a gap is present, a cut is made through the block at that position. This is first done for all horizontal gaps, then for all vertical gaps, forming a grid of rectangles as children of the parent nodes. The algorithm stops for blocks which have a width less than 400 pixels and a height less than 90 pixels. At step 1200, shown in FIG. 16, a root node is first established. It is then determined, at step 1205, if the maximum number of segmentation levels has been reached. The number of levels that the X-Y tree contains is controlled by a predetermined set of parameters which depends on how finely the document needs to be segmented. Processing is completed when it is determined, at step 1205, that the maximum number of levels has been reached. The Hough transformation is computed, at step 1210, if the maximum number of levels has not been reached. Based on the computed Hough transform, the identified white space is cut through, at step 1215, to partition the rectangle. Then, at step 1220, it is determined if the segmentation produced any children nodes. If children nodes are produced, the segmentation processing is advanced to the next level at step 1225. Then, it is again determined if the maximum number of levels has been reached at step 1205. If no children nodes are produced, then it is determined, at step 1230, if there are any sibling nodes. If there are no sibling nodes step 101, shown in FIG. 14, is terminated. If sibling nodes are present, then, at step 1235, the next sibling node is selected and the Hough transform is computed for the selected sibling node at step 1210. Next, at step 102 of FIG. 14, ink pixels in the image are smoothed. Each region, i.e. node, (as determined by the XY Tree Segmentation 101) is considered separately. FIG. 17 is a flowchart for the Run-Length Smoothing step 102 shown in FIG. 14. Step 102 corresponds to source code rlsa.c in the appendix. The process shown in FIG. 17 smooths ink pixels in the image. At step 1300, shown in FIG. 17, it is first determined if the last region has been processed. If the last region has not been processed, the horizontal factor, i.e. the horizontal threshold, is determined at step 1320. The horizontal threshold is computed by populating a histogram of the distances between consecutive non-ink/ink transitions in the image. The distance with the largest number of transitions is used as the horizontal threshold. At step 1325, any ink pixels within a block which are separated by the horizontal threshold or fewer background pixels along a row of the image are smoothed by changing all the background pixels values to ink. At step 1330, the next region is selected. After the last region has been smoothed horizontally, the entire resultant image is smoothed vertically at step 1305. Any ink pixels which are separated by the vertical threshold background pixels or fewer along a column of the image are smoothed by changing all the background pixels to ink pixels. The vertical threshold distance is computed in a similar manner to the horizontal threshold, but using vertical non-ink/ink transitions from the smoothed horizontal image. The resultant image is smoothed horizontally, at step 1315, after the horizontal factor, horizontal threshold, has been determined at step 1310. The smoothed image is then used as an input to the Connected Component Analysis step 103 shown in FIG. 14. FIG. 18 is a flowchart for the Connected Component Analysis step 103 shown in FIG. 14. FIG. 18 corresponds to source code segment.c and conn.sub.-- comp.c in the appendix. The flow chart in FIG. 18 finds all of the connected components in the region under consideration. A connected component is a region in which all pixels have at least one other pixel in the region as one of its eight-connected neighbors. At step 1400, shown in FIG. 18, it is determined if the end of the image has been reached. If the end of the image has not been reached, it is determined, at step 1415, if the pixel has a neighbor. If the pixel does not have a neighbor, then it is determined, at step 1430, if the number of neighbors for the pixel is greater than 1. If the number of pixels is greater than 1, the pixel is labelled with the same label as one of the neighbors at step 1440. Then, at step 1445, an equivalency is setup between the neighbors. If the number of neighbors is less than 1, a new label is created for the pixel at step 1435. If, at step 1415, there is a match with a neighbor, the pixel is assigned the same label as the neighbor. After steps 1420, 1435, and 1445 a next pixel from the image is selected and it is determined, at step 1400, if the end of the image has been reached. Once the end of the image has been reached, at step 1405, the pixels are relabelled, at step 1445, based on the equivalences. Then, at step 1410, each connected component is represented by a bounding box: four coordinates which represent an enclosing rectangle. FIG. 19 is a flowchart for the Merging step 104 shown in FIG. 14. Step 104 corresponds to source code which can be found in the appendix as the process merge.c. Step 104 searches for overlapping bounding boxes determined by connected component analysis step 103 which are part of the same headline. A headline is a region with a few lines of text. All overlapping bounding boxes are merged into a single bounding box. At step 1500, shown in FIG. 19, the first bounding box, i.e. region, is retrieved. It is then determined, at step 1505, if the last region has been processed. If not, then the character height, the number of text lines, is computed at step 1510. Based on the character height, the block is classified or not classified, at step 1515, as a headline. Counting the number of INK runs in the projected image estimates the number of text lines. If the number of text lines in a block is less a specified value, it is considered a potential headline block. At step 1520, the next region is selected and step 1505 is repeated. After all of the regions have been processed, the first headline is retrieved at step 1525. Then, at step 1530, it is determined if the last headline has been processed. If the last headline has not been processed, then, at step 1535, all of the horizontal neighbor headline blocks of the selected headline block are found and merged. A horizontal neighbor is found by testing for any vertical overlap of the horizontal projections of the bounding boxes of the two bounding boxes in question. If the two bounding boxes overlap horizontally, then the two bounding boxes are tested to determined if the vertical overlap is less than 50% of the smaller block width. If the two bounding boxes pass both these tests, they are merged into one block by computing the bounding box of the combined boxes. These bounding boxes are fine-grained blocks. Next, characters in the regions are optically recognized at step 105. FIG. 20 is a flowchart for the Optical Character Recognition step 105 shown in FIG. 14. Step 105 corresponds to source code file ocr.c. At step 2010, shown in FIG. 20, it is first determined if the last region has been processed. Processing is terminated when the last region has been processed. If all of the regions have not been processed, then, at step 2020, the ASCII text in the region is found and then written, at step 2030, to a file. Then, at step 2040, the next region is selected and step 2010 is repeated. Once character recognition is completed, the block features of the ASCII text are determined at step 106. FIG. 21 is a flowchart for the Find Block Feature step 106 shown in FIG. 14. Step 106 corresponds to source code which can be found in the appendix as process feature.pl. Each block of the ASCII text is read from the file and processed. First, as shown in FIG. 21, it is determined, at step 2110, if all of the blocks have been processed. This procedure is terminated once all of the blocks have been processed. Otherwise, the OCR file, the file containing the ASCII text, is read at step 2120. For each block of ASCII text read, the representative block point size is determined, at step 2130, by ascertaining from its contents the point size which applies to the largest number of characters in the block. Next, the representative block face style, for example, `plain`, `italic`, and `bold`; is similarly determined at step 2140. Then, at step 2150, the block line count is computed by counting the number of line characters in the file. Finally, at step 2160, the representative block alignment (one of `left`, `right`, `both`, or `neither`) is determined. For those lines which have a left margin, the lines are deemed to be left-justified if the total count of lines in multi-line runs, as would occur for a paragraph, is at least 50% of the total count of lines whether in multi-line runs or in single-line runs. A similar computation is made for those lines which have a right margin. In each case, the margin must exceed a threshold. If the text is found to be both left-and right-justified, then the alignment is deemed `both`. If the justification is exactly one of `left` or `right`, then the alignment is `left` or `right`, respectively. Otherwise, the alignment is `neither`. Once the block features have been established, the page features are determined at step 107. FIG. 22 is a flowchart for the Find Page Features step 107 shown in FIG. 14. Step 107 corresponds to source code found in the software appendix as process classify.sub.-- blocks.pl. First, at step 2210, it is determined if all of the blocks have been processed. If all of the blocks have not been processed, then, at step 2220, it is determined if the current block is a meaningful text block. A block is a meaningful text block if its features will contribute to the page level computation. Blocks which have been classified as `rule line`, `non text`, or `garbage` where at least 5/8 of the characters are non-alphanumeric, are not considered meaningful text blocks. For those blocks which are meaningful text blocks, the product of point size and character count is accumulated by point size for all point sizes found in these blocks. The point size with the largest sum of products is selected as the representative page point size. Once all of the blocks have been processed, the point size for the page having the largest score is selected at step 2240. Next, at step 108, the column boundaries are determined for all fine-grained blocks. Blocks which are produced by Geometric Page Segmentor 12 which corresponds to the data produced from merge overlapping bounding boxes 24 are considered fine-grained blocks. FIG. 23 is a flowchart for the Determine Column Boundaries step 108 shown in FIG. 14. Step 108 corresponds to source code which can be found in the software appendix as the process chain.pl. First, at step 2300, the fine-grained blocks are clustered using a value x, height value, within some tolerance. Then, at step 2310, the blocks are subclustered by block width. All possible page-spanning column chains are then constructed at step 111. A column chain is a set of blocks such that each block has as its left boundary the same x value as another block's right boundary and vice-versa, within some tolerance. The first block has only the same right boundary as another block and the last block has only the same left boundary as another block. A page-spanning column chain is one whose horizontal extent is the width of the printed material on the page. Each chain is then scored, at step 2320, by replacing each block with the sum of the areas of all blocks in its subcluster. The chain with highest score is selected at step 2330 and the column boundaries it defines are extracted at step 2340. The page spanning chains generated at step 111 are produced in accordance with FIG. 24 which is a flowchart for the Construct All Possible Page-spanning column chains step 111 shown in FIG. 23. FIG. 24 corresponds to source code which can be found in the software appendix in the process chain.pl. In FIG. 24, N is the total cluster count, M is the maximum first link index, and L is a link or subcluster. The first subscript for L is a cluster index and the second subscript for L identifies a subcluster within that cluster. First, at step 2400, the total number N of clusters generated at steps 2300 and 2310, shown in FIG. 23, is determined. The maximum first link index M is determined, at step 2410, by dividing N by 2. At step 2420, a link or subcluster Lij which is less than or equal to M is chosen. Lij becomes the seed or initial link for constructing the chain. The remainder of the chain is generated by inductive step 112. FIG. 25 is a flowchart of the Inductive step 112 shown in FIG. 24. FIG. 25 corresponds to source code which can be found in the software appendix as the process chain.pl. In FIG. 25, a link L is a subcluster. The first subscript of L is a cluster index and the second subscript for L is for a subcluster within that cluster. Let the current link subcluster be denoted by Lij and let the remaining clusters be denoted by Lrs where r>i. First, if there are clusters remaining, as determined at step 2500, the chain construction process proceeds to step 2505 to choose a subcluster Lmn where m is greater than i. Otherwise, construction of the current chain is completed and the current chain is tested to see whether it spans the page at step 2530. If the current chain does span the page, as determined at step 2530, it is retained at step 2545. Otherwise the current chain is discarded at step 2535. At step 2540, if more seed links exist, the next seed is chosen at step 2550 and designated the current seed at step 2555. Returning to step 2505, chain construction is explained below. A subcluster Lmn is chosen, at step 2505, from the remaining clusters Lrs, where r>i and where the current subcluster is Lij. Then, at step 2510, it is determined whether Lmn abuts Lij on the right within some tolerance. If it does, then Lij becomes the new current link at step 2525 and the loop returns to step 2500 to determined whether there are any more subclusters. If, at step 2520, it is determined Lmn does not abut Lij, then the next subcluster is chosen at step 2520 and step 2505 is repeated. After the processing at step 108 is completed, a geometric structure tree is created at step 109. FIG. 26 is a flowchart for the Create Geometric Structure Tree step 109 shown in FIG. 14. FIG. 26 corresponds to source code which can be found in the software appendix as processes col.sub.-- bands.pl and build.sub.-- geom.sub.-- tree.pl. At step 2600, shown in FIG. 26, the coarse blocks, from XY Tree Segmentation 12, are grouped, at step 2600, into horizontal bands. The fine-grained blocks are assigned, at step 2610, to the coarse blocks in which they lie. Next, for each horizontal band a geometric structure tree is determined for the fine blocks in the horizontal band at step 113. Then, the respective geometric structure trees are merged at step 2620. FIG. 27 is a flowchart for the Determine Geometric Structure Tree For Fine Blocks In Each Horizontal Band shown in FIG. 26. FIG. 27 corresponds to source code which can be found in the software appendix as the processes inset.sub.-- boxes.pl and build.sub.-- geom.sub.-- tree.pl. First, the fine blocks are partitioned, at step 114, into fine blocks which are column-aligned and those which are not column-aligned. Partitioning is performed by testing each fine block with respect to column alignment in accordance with the flow chart shown in FIG. 28. Those blocks which are column-aligned are said to lie in the base plane and those blocks which are non-column-aligned are said to lie in an overlay plane. A geometric structure tree is then created, at steps 115, for the column-aligned blocks and the non-column-aligned blocks. The resulting trees are merged, at step 2700, by inserting the overlay tree into the base tree. In accomplishing this tree merger, it is assumed that an inset block (i.e., a non-column-aligned block) belongs to the same topical entity (an article in the case of a newspaper) as its surrounding text. Further, an inset block is deemed to be of secondary importance relative to that text, so the present invention makes its node the youngest child of the text's parent node. This process of tree formation and merger is also illustrated in FIG. 9. FIG. 28 is a flowchart for the Partition Blocks Into Column-Aligned, Non-Column-Aligned step 114 shown in FIG. 27. FIG. 28 corresponds to source code which can be found in the software appendix as the process inset.sub.-- boxes.pl. In FIG. 28, "Aligns left" is defined as the block's left edge aligns with the left edge of some column within some tolerance; "Aligns right" is defined as the block's right edge aligns with the right edge of some column boundary within some tolerance; "Column complement to right" is defined as the region between a block's right edge and the closest column boundary further right; "Column complement to left" is defined as the region between a block's left edge and the closest column boundary further left; and "Whitespace region" is defined as a region which does not intersect with any block. A block to be examined at step 2800 is a column aligned block if: (1) the block aligns left and aligns right as determined at steps 2805 and 2835, respectively; (2) the block aligns left and the column complement right is whitespace as determined at steps 2810 and 2840, respectively; (3) the block aligns left and any block which intersects the column complement right is not wholly contained in the same column as determined at steps 2815 and 2845, respectively; (4) the block aligns right and the column complement left is whitespace as determined at steps 2820 and 2850, respectively; (5) aligns right and any block which intersects the column complement left is not wholly contained by the same column as determined at steps 2825 and 2855, respectively; or (6) the column complement left is whitespace and the column complement right is whitespace for the block as determined at steps 2830 and 2860, respectively. If the block does not meet one of the six criteria above, the block is non-column-aligned, an inset block. FIG. 29 is a flowchart for the Create Geometric structure Tree For Column-Aligned and Create Geometric structure Tree For Non-Column-Aligned steps 115 shown in FIG. 27. The flow chart in FIG. 29 corresponds to source code in the software appendix in the process build.sub.-- geom.sub.-- tree.pl. As shown in FIG. 29, at step 116, a root node is created and the tree is then generated inductively at step 117. The root node is created in accordance with FIG. 30 which is a flowchart for the Create Root Node step shown in FIG. 29. FIG. 30 corresponds to source code which can be found in the software appendix in procedure build.sub.-- geom.sub.-- tree.pl. At step 3000, a synthetic block is created which spans the top of the page and has zero height. The synthetic block is generated using the page size provided at step 3010. The synthetic block is assigned to be the root node at step 3020. FIG. 31 is a flowchart for the Generate Tree Inductively step 117 shown in FIG. 29. FIG. 31 corresponds to source code which can be found in the software appendix in the procedure build.sub.-- geom.sub.-- tree.pl. As shown in FIG. 31, the children of the node are found at step 118. The current node, the root node, for which children are found is provided from step 3100. Then, if it is determined, at step 3110, that there are no more children, the process halts. Otherwise, each child in turn becomes the current node, at step 3120, and the process starts anew at step 118. The children nodes at step 118, shown in FIG. 31, are generated in accordance with FIG. 32 which is a flowchart for the Find This Node's Children step shown in FIG. 31. FIG. 32 corresponds to source code which can be found in the software appendix in the process build.sub.-- geom.sub.-- tree.pl. First, at step 3200, it is determined if there are any blocks remaining. If there are blocks remaining, a block is selected, at step 3205, as a candidate block and it is determined, at step 3215, if the candidate block is below the current node. If it is determined that the block is below the node, then at step 3220 the span of the block's column is determined. If the candidate block is not below the node, then step 3200 is repeated. Next, it is determined, at step 3230, if the column span of the block is a subspan of the current node. If it is a subspan, then, at step 3240 it is determined if other children have the same column span as column span of the block. If there are no existing children with the same column span, then a new children list is started with the column span of the block at step 3245 and processing proceeds to step 3200. Otherwise, it is determined, at step 3250, if the candidates block is contiguous with existing children for this column span. If it is contiguous, then the candidate block is added to the list of children, at step 3445, and processing determines if there is another block at step 3200. If there are existing children of the same column span and the candidate block is not contiguous, then processing proceeds directly to step 3200. When all blocks have been processed, the highest child in each list is selected at step 3255. If it is determined, at step 3260, that the a highest child is not dominated by a node more local to it than the current parent node, it and its siblings are retained, at step 3270, as a child node. Otherwise the node is discarded at step 3265. A node N2 is dominated by a node N1 if N1 is above N2 and N2's column span does not exceed that of N1. FIG. 33 is a flowchart for the Logical Transformation step shown in FIG. 14. FIG. 33 corresponds to code in source code xform.sub.-- geom.sub.-- tree.pl and build.sub.-- geom.sub.-- tree.pl. A logical transformation is accomplished by applying in turn, a set of transformation rules. The sequence of rule application is Rule S, Rule A, Rule B, Rule C, and Rule D. The transformed tree is then written out to a file. The rule specifications are as follows: Rule S Reorder the children of a node so that child nodes which are groups delimited by separator lines become the "youngest" children of the node. This allows inset boxes of text or photos to be isolated so that the text which flows around them can be connected. This could be described as a "textist" philosophy because it allows text portions to dominate (be the elder children) non-text portions such as photos. This rule is to be applied before Rules A, B, C and D. Rule A If the first element of the second node in the pair is a body, then remove it from the second node and append it to the first node. Continue until a head is encountered or the list is exhausted. If the list is exhausted, remove the branch leading to it from the tree. When processing has been completed for a given pair, apply it to the pair comprised of the second node from the previous pair and the next (non-null) node. In other words, given two adjacent (in a depth-first ordering) nodes, if the second starts with a body block, remove it and append it to the first. Do this until a head block is reached. In practice this follows the flow of an article which "snakes" upward to the next column. Rule B If the last element of the first node in the pair is a head, then remove the first element from the second node in the pair and append it to the first node. Continue as long as the elements being appended are heads and then take one more node--or until the list is exhausted. If the list is exhausted, remove the branch leading to it from the tree. When processing has been completed for a given pair, apply it to the pair comprised of the second node from the previous pair and the next (non-null) node. In other words, given two adjacent (in a depth-first ordering) nodes whose parent is not root, if the first ends with a head, then remove the first element of the second node and append it to the first. Do this until a body block is moved. In practice this connects body blocks in the next column with a head which falls at the bottom of the current column. Rule C Given a node in which an element other than the first is a "head" block, partition the node into two nodes such that all the elements preceding the head block comprise the first node and the rest comprise the second. The two new nodes will be siblings which together will replace the node from which they sprang. Descendants of the original node will belong to the second partition. Apply the rule recursively to the second new node until the original node is completely partitioned into sublists each beginning with a "head", possibly excepting the first. In other words, if a node contains two (or more) sequences beginning with a head block, then they are likely to be individual articles or sections. Split them. Rule D Because the other rules (rule A, rule B, and rule C) will already have been applied, we should only encounter a "head" sequence at the front of the node. If we do find such a sequence, then partition the node into the "head" sequence and a child node of all the "body" blocks. As mentioned above, analyzer 10 uses assumptions based on knowledge of expected layouts. These assumptions, knowledge and layout are for Western-language document pages. Such pages are read from the top down, and from left to right; however, this rule could be adjusted if, for example, Hebrew-language documents, that are read from the top down but from right to left are expected instead. Also, a headline is assumed to be above its article and as wide (same span) as its article. A picture caption is assumed to be below the picture to which it refers. A body block is assumed to go with the immediately preceding head blocks. The children of a node of a tree are sets of blocks each of which sits beneath the node, is contiguous, has the same column span, has a column span which is a subspan of that for the node, has a lead child which is the highest block in the set, and is not dominated by any block more local than the node. Rule lines denote article or paragraph boundaries. Columns are a key organizing principle. A page may have multiple sets of column configurations. After reading down, the reading order will snake back up and to the right, if a head block or a block with a wider column space is encountered. If reading goes to the right, then the reading order moves up to the parent of the node from which the reading is continuing. Inset blocks are logically related to the surrounding material and are subordinate to it. Articles near the top of the page or which span a wide number of columns are considered more important than articles that do not. Some of the many advantages of the invention should now be readily apparent. For example, a novel document page analyzer has been provided which is capable of geometrically segmenting an image of the page and logically organizing the result of such segmentation. The present invention automates the process of assembling the text into the correct reading order for a given document page, even for complex layouts. The present invention uses image-based features, text-based features and assumptions based on layout knowledge to find the correct reading order of the blocks on a document page. The present invention can handle complex layouts which have multiple configurations of columns on a page and inset material (such as figures and inset text blocks). "Inset material" here refers to blocks which are not column-aligned on at least one side. Obviously, many modifications and variations of the present invention are possible in light of the above teachings. It is therefore to be understood that the foregoing embodiments are presented by way of example only and that, within the scope of the appended claims and equivalents thereto, the invention may be practiced otherwise than as specifically described. ##SPC1##
|
Same subclass Same class Consider this |
||||||||||
