Method of and apparatus for character recognition through related spelling heuristics5465309Abstract A method and apparatus for improving an OCR process for recognizing unidentified characters through the creation of sets of like unidentified characters in a scanned image, deducing what the unidentified characters are with a spell checking procedure and replacing only those unidentified characters which are unambiguously determined. The order for identifying characters is strategically performed so the unidentified characters which are easier to recognize are processed before those that are more difficult to recognize. As characters become identified, relationships defined by corresponding sets allow replacing the same character in related but different textual components. As characters become recognized, other characters become more recognizable by virtue of belonging to the same textual component, such as a word, or to a related textual component such as another word containing the same character. Claims What I claim is: Description FIELD OF THE INVENTION
______________________________________
EXAMPLE OF OCR RESOLUTION TABLE ENTRIES;
Place Holder #
Character Position
Candidate List
______________________________________
1 Buffer Byte Offset 58
`m`, `n`, `y`, `u`
and `v`
2 Buffer Byte Offset 80
`g`, `j`, `g` or `y`
3 Buffer Byte Offset 87
`m`, `n`, `y`, `u`
and `v`
4 Buffer Byte Offset 88
`g`, `j`, `q` or `y`
5 Buffer Byte Offset 91
`t` and `f`
6 Buffer Byte Offset 96
`g`, `j`, `q` or `y`
7 Buffer Byte Offset 125
`g`, `j`, `q` or `y`
8 Buffer Byte Offset 158
`t` and `f`
9 Buffer Byte Offset 231
`a` and `c`
10 Buffer Byte Offset 284
`m`, `n`, `y`, `u`
and `v`
______________________________________
Returning again to block 42, the base information illustrated in the above resolution table, is used in subsequent processing to determine the correct characters. The output stream simply contains dummied characters which will be subsequently replaced. It is important to note that the processing described by the blocks proceeding block 42, may be utilized in conjunction with various current recognition improvement techniques. When characters are not recognized satisfactorily, blocks 42 and the following blocks may be employed. Integrating the present invention with current OCR techniques will also enhance the accuracy of the OCR resolution table. Referring again to block 42, Character Match (CM) Sets are created from the OCR Resolution Table. Each set contains records from the OCR Resolution Table which have equivalent candidate lists and which are identified to be substantially the same character. CM Sets created for the example as the result of block 42 are:
______________________________________
Sets defined by Place Holder # List
______________________________________
1, 3, 10
2, 7
4, 6
5, 8
______________________________________
Note that despite 2, 4, 6 and 7 having the same candidate list, block 42 compares the underlying characters and finds they are substantially different. The same OCR comparison techniques existing in the current art is utilized to make this decision. Processing then continues to block 43, from block 42, where the list of sets are sorted from smallest candidate list to largest candidate list. This permits recognizing the simple challenges before the difficult ones. It will be shown that recognizing the simple cases first, makes recognizing the more difficult cases easier. Block 43 then produces the following priority process ordered list:
______________________________________
Ordered Set List
______________________________________
1) 5, 8
2) 9
3) 2 and 7
4) 4 and 6
5) 1, 3 and 10
______________________________________
If multiple sets contain the same number of candidates, then the sets are ordered by maximum number of references to least number of references within the ordered candidate list count position, as shown by set 1 and set 2. If multiple sets, with the same number of candidates, contain the same number of references, then the maximum length word of any reference in a set takes priority over the set with a less referenced maximum length word. This is shown by sets 3 and 4. If sets with the same number of references contain the same maximum length, then various embodiments will select various priorities, including a random priority, without departing from the spirit and scope of the present invention. From block 43, processing proceeds to block 44 which initializes a STATUS2CHECK variable to CORRECTLY SPELLED. The STATUS2CHECK variable is used for processing each pass on the CM Sets. Status refers to the return status from a spell checker dictionary/textual component dictionary such as that found in an editor product. The first pass determines the correct characters by the maximum correctly spelled words for a set. When the first pass has replaced determined characters into the associated words, other passes are performed to continue determining other characters. When a pass has determined no characters, the next meaningful status is used to determine unidentifiable characters. In this fashion, multiple passes with the same status elaborate characters. Subsequent multiple pass processing with other status continue to determine characters not identified by previous multiple pass processing. From block 44, control proceeds to block 46 which initializes an OUTPUT.sub.13 CHANGED variable to FALSE. Block 46 defines the top of an iterative loop for processing multiple passes on the current CM Set list. Multiple passes on CM sets are performed until no characters are determined by a pass. Then, block 48 checks if all sets have been processed. If this is the first execution of block 48 and all characters were recognized by block 32 (i.e. no CM set list as the result of blocks 42 and 43), block 48 passes to block 80 in FIG. 2, on to block 82, and then on to block 84 where the completed output stream is provided to the underlying application. If in block 48, one or more sets exist, then none of the sets are yet processed and processing continues to block 50. For subsequent iterations of block 48, block 48 defines the top of an iterative loop for processing each set in the current list of CM Sets. If all sets are not yet processed within the execution iteration started at block 46, then block 48 proceeds to block 50. Block 50 retrieves the next set from the list of CM Sets. Then, block 52 checks if all place holder references have been processed from the current set. Block 52 defines the top of an iterative loop for processing every place holder reference within a particular set. If in block 52, all place holder references have not been processed from the current set, then block 54 retrieves the next place holder reference and continues to block 56 where the associated textual component (i.e. word) is retrieved from the output stream. The associated word is retrieved by using the Character Position field from the corresponding record in the OCR Resolution Table. Then, block 58 retrieves the candidate list from the same corresponding record in the OCR Resolution Table. Block 58 flows to block 60 where a check is made for other unresolved place holders within the word retrieved from the output stream. When a word is retrieved from the output stream, the special 0 valued character space is sought by block 60. Previous passes on this word may have already identified the character and replaced it in the output stream, thereby overwriting the special zeroed placeholder. Thus, block 60 only looks for unresolved (i.e. presence of 0 indicators) characters in the word. If in block 60, other unresolved place holders exist in the current word, then the corresponding OCR Resolution Table records are retrieved in block 62 for each associated unresolved character candidate list. Block 62 then continues to block 64 where every combination of characters is tried from all candidate lists of the current word and status is appropriately tabulated. If, in block 60, there are no other unresolved characters in the word, then processing continues directly to block 64. It will be shown, in following examples, that the word may have already been determined by previous passes, in which case the singular status will be logged in block 64. The current word may contain one or more unknown characters, each of which may be denoted as `SET i: Character[j,k]` such that i is the SET for which the character belongs to, j is the current place holder number within the set and k is the current character to try from the candidate list associated with the place holder. With reference now to the above example, upon the first execution of block 64, the following status is tabulated when using, for example, a WordPerfect (TM) 5.0 spell check algorithm. One skilled in the art recognizes that any spell dictionary may be used, and preferably it is one that maximizes words found in the particular document type being scanned.
__________________________________________________________________________
WORD STATUS
__________________________________________________________________________
SET 1 : Char[5,f]
=> with g in #6
familg
1 hit
with j in #6
familj
1 hit
with q in #6
familq
1 hit
with y in #6
family
CORRECTLY SPELLED
SET 1 : Char[5,t]
=> with g in #6
tamilg
3 hits
with j in #6
tamilj
2 hits
with q in #6
tamilq
2 hits
with y in #6
tamily
8 hits
__________________________________________________________________________
Block 64 flows to block 52 for processing the next place holder member in the currently processed set. If in block 52, all place holders have been processed from the current set, processing continues to FIG. 2 block 66. With reference now to the above example, status tabulation which would have occurred upon first execution of block 66 in FIG. 2 would be:
__________________________________________________________________________
WORD STATUS
__________________________________________________________________________
SET 1 : Char[5,f]
=> with g in #6
familg
1 hit
with j in #6
familj
1 hit
with q in #6
familq
1 hit
with y in #6
family
CORRECTLY SPELLED
SET 1 : Char(5,t)
=> with g in #6
tamilg
3 hits
with j in #6
tamilj
2 hits
with q in #6
tamilq
2 hits
with y in #6
tamily
8 hits
SET 1 : Char[8,f]
=> with f in #6
filed CORRECTLY SPELLED
SET 1 : Char[8,t]
=> with t in #6
tiled CORRECTLY SPELLED
__________________________________________________________________________
It should be noted that in block 62, execution was not necessary for place holder number 8 of SET 1 because there were no other unrecognized characters in the associated textual component. Referring now to FIG. 2, block 66 checks if any words contain the status indicated by the variable STATUS2CHECK. The first execution of block 66 implies the STATUS2CHECK variable is set to CORRECTLY SPELLED. Examination of tabulated status shows three words determined to be correctly spelled within the set. If in block 66, one or more entries have the sought status, then processing continues to block 68 for a conflict validation. A conflict would result if more than one word within a placeholder trial contains the same status while using a different character. Then, block 69 validates if all placeholders for the set are in conflict. If they are, then processing continues back to FIG. 1 (block 48) for the next set in the list. If there is one or more placeholders not in conflict, then processing continues to block 70. In the example, the word `family` is the only correctly spelled word within placeholder 5. However, there are two correctly spelled words within placeholder 8. Placeholder 8 is a conflict and is marked as such. Block 69 flows to block 70 where unresolved characters of non-conflicting words with the sought status are placed into the output stream. In the example, the word `family` is produced in the output stream. No direct action is performed on placeholder 8 because there was a conflict. Block 70 flows to block 72 where placeholder 5 is removed from SET 1 because it is resolved. It should be noted that the character `y` in family was indirectly processed and makes subsequent processing of SET 3 easier. Block 74 then checks for any other placeholders contained in the set. If there were no other placeholders, then the set is deleted in block 76 and processing flows to block 78. If, in block 74 other unresolved placeholders still exist, then block 77 is processed. In the example, the first execution of block 74 determines that placeholder 8 still exists, therefore processing continues to block 77. Block 77 then replaces the determined character of the current set in the other placeholder references. In the example, the character `f` is placed in the output at placeholder's 8 referenced character position. It should be noted that if the word contained other unidentified characters up to this point, the other characters would be more easily determined now that the `f` character was replaced into one of the unknown positions of the word. Block 78 then sets the OUTPUT.sub.13 CHANGED variable to TRUE thereby indicating one or more changes were made to the output stream. The output stream of the example now looks like: Dear Sir(s), I a 1 unhappy with the lon 2 wait 3 4 family has had to endure concernin 7 the hurricane insurance claim filed to you. We've been living in tents for the last month and still h 9ve no indication of being helped with our claim. My e 10ployer just . . . With the first set processed, the ordered set list now looks like:
______________________________________
Ordered Set List
______________________________________
1) 8
2) 9
3) 2 and 7
4) 4 and 6
5) 1, 3 and 10
______________________________________
Block 78 then returns to block 48 of FIG. 1. Referring now back to FIG. 1, block 48 starts the processing iteration for the next set as previously described. During the second set processing, the first encounter of block 66 in FIG. 2 would produce the following status conditions as described above.
__________________________________________________________________________
WORD STATUS
__________________________________________________________________________
SET 2 :
Char[9,a]
=> with a in #9
have CORRECTLY SPELLED
Char[9,c]
=> with c in #9
hcve 3 hits
__________________________________________________________________________
Starting at block 48 for set 2, blocks 50, 52, 54, 56, 58, 60, 64, 52, 66, 68, 69, 70, 72, 74, 76 and 78 are processed respectively as previously described. Block 78 then returns back to FIG. 1 block 48 for another iteration on the set list. No conflicts existed within set 2 and the output stream is replaced with the character `a` to form the word `have`. Placeholder 9 was the only placeholder for set 2, therefore block 76 deleted the set from the current set list. The output stream of the example now looks like: Dear Sir(s), I a 1 unhappy with the lon 2 wait 3 4 family has had to endure concernin 7 the hurricane insurance claim filed to you. We've been living in tents for the last month and still have no indication of being helped with our claim. My e 10ployer just . . . With the second set processed, the ordered set list now looks like:
______________________________________
Ordered Set List
______________________________________
1) 8
3) 2 and 7
4) 4 and 6
5) 1, 3 and 10
______________________________________
With reference to FIG. 1, block 48, another set is found in the current list, namely set 3. Processing proceeds to block 50 where set 3 is retrieved from the list. Processing then continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 2. Processing continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 7. Processing then continues to block 66 of FIG. 2, through block 52. With reference now to FIG. 2 block 66, upon encounter of block 66, the following status would be logged:
__________________________________________________________________________
WORD STATUS
__________________________________________________________________________
SET 3 : Char[2,g]
=> with a in #1
long CORRECTLY SPELLED
SET 3 : Char[2,j]
=> with a in #1
lonj 12 hits
SET 3 : Char[2,q]
=> with a in #1
lonq 10 hits
SET 3 : Char[2,y]
=> with a in #1
lony 30 hits
SET 3 : Char[7,g]
=> concerning
CORRECTLY SPELLED
SET 3 : Char[7,j]
=> concerninj
2 hits
SET 3 : Char[7,q]
=> concerning
2 hits
SET 3 : Char[7,y]
=> concerniny
2 hits
__________________________________________________________________________
Block 66 proceeds to block 68 for conflict determination. Block 69 then checks for all entries in conflict. The fact that either placeholder contained no conflict means processing moves to block 70. Block 70 replaces `g` in the output stream for both placeholders. Block 72 removed the placeholders from the set. Block 74 determines the set does not contain other placeholders and therefore deletes set 3 in block 76. Block 78 then sets the indicator for the output stream being changed. At this point, the output stream looks like: Dear Sir(s), I a 1 unhappy with the long wait 3 4 family has had to endure concerning the hurricane insurance claim filed to you. We've been living in tents for the last month and still have no indication of being helped with our claim. My e 10ployer just . . . With the third set processed, the ordered set list now looks like:
______________________________________
Ordered Set List
______________________________________
1) 8
4) 4 and 6
5) 1, 3 and 10
______________________________________
With reference now back to FIG. 1, block 48, another set is found in the current list, namely set 4. Processing proceeds to block 50 where set 4 is retrieved from the list. Processing then continues through blocks 52, 54, 56, 58, 60, 62 and 64, respectively for placeholder 4. Processing continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 6. Note that in block 64 for placeholder 6, there are no unresolved characters because placeholder 6 was resolved by an earlier pass. With reference now to FIG. 2 block 66, upon encounter of block 66, the following status would be logged (# implies some number greater than or equal to 1):
__________________________________________________________________________
WORD STATUS
__________________________________________________________________________
SET 4 : Char[4,g]
=> with m in #3
mg CORRECTLY SPELLED
with n in #3
ng # hits
with w in #3
yg # hits
with u in #3
ug # hits
with v in #3
vg # hits
SET 4 : Char[4,j]
=> with m in #3
mj # hits
with n in #3
nj # hits
with w in #3
yj # hits
with u in #3
uj # hits
with v in #3
vj # hits
SET 4 : Char[4,q]
=> with m in #3
mq # hits
with n in #3
nq # hits
with w in #3
yq # hits
with u in #3
uq # hits
with v in #3
vq # hits
SET 4 : Char[4,y]
=> with m in #3
my CORRECTLY SPELLED
with n in #3
ny CORRECTLY SPELLED
with w in #3
yy # hits
with u in #3
uy # hits
with v in #3
vy # hits
SET 4 : Char[6,R]
=> all resolved
family
CORRECTLY SPELLED
__________________________________________________________________________
Block 66 would proceed to block 68 where there is a placeholder with no conflicts, namely placeholder 6. In fact, it contained no unresolved characters at all. Block 69 then flows to block 70 where there are no unresolved characters. Block 72 removes placeholder 6 from the set. Then, block 72 flows to block 77 through block 74 where the `y` character is substituted into placeholder 4. Block 78 then flows back to FIG. 1 block 48 as previously described. At this point, the output stream looks like: Dear Sir(s), I a 1 unhappy with the long wait 3y family has had to endure concerning the hurricane insurance claim filed to you. We've been living in tents for the last month and still have no indication of being helped with our claim. My e 10ployer just . . . With set 4 processed the Ordered set list now looks like:
______________________________________
Ordered Set List
______________________________________
1) 8
4) 4
5) 1, 3 and 10
______________________________________
With reference now back to FIG. 1 block 48, another set is found in the current list, namely set 5. Processing proceeds to block 50 where set 5 is retrieved from the list. Processing then continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 1. Processing continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 3. Processing then continues through blocks 52, 54, 56, 58, 60 and 64, respectively for placeholder 10. Note that in block 64 during placeholder 3 processing, there is a character which was resolved from a previous pass. Referring now back to FIG. 2 block 66, upon encounter of block 66, the following status would be logged:
______________________________________
WORD STATUS
______________________________________
SET 5 : Char[1,m]
=> am CORRECTLY
SPELLED
SET 5 : Char[1,n]
=> an CORRECTLY
SPELLED
SET 5 : Char[1,y]
=> ay 53 hits
SET 5 : Char[1,u]
=> au CORRECTLY
SPELLED
SET 5 : Char[1,v]
=> av 27 hits
SET 5 : Char[3,m]
=> my CORRECTLY
SPELLED
SET 5 : Char[3,n]
=> ny CORRECTLY
SPELLED
SET 5 : Char[3,y]
=> yy 38 hits
SET 5 : Char[3,u]
=> uy 42 hits
SET 5 : Char[3,v]
=> vy 15 hits
SET 5 : Char[10,m]
=> employer CORRECTLY
SPELLED
SET 5 : Char[10,n]
=> enployer 1 hits
SET 5 : Char[10,y]
=> eyployer 2 hits
SET 5 : Char[10,u]
=> euployer 2 hits
SET 5 : Char[10,v]
=> evployer 1 hits
______________________________________
From block 66, processing proceeds to block 68 where all placeholders except placeholder 10 are found to be conflict. Block 69 then flows to block 70. Block 70 replaces `m` in the output stream for placeholder 10. Block 72 removes placeholder 10 from the set. Block 74 determines the set contains other placeholders and therefore block 74 flows to block 77. Block 77 substitutes the `m` character for placeholder 3 and placeholder 4. Block 78 then sets the indicator for the output stream being changed. At this point, the output stream looks like: Dear Sir(s), I am unhappy with the long wait my family has had to endure concerning the hurricane insurance claim filed to you. We've been living in tents for the last month and still have no indication of being helped with our claim. My employer just . . . With the last set in the list processed, the ordered set list now looks like:
______________________________________
Ordered Set List
______________________________________
1) 8
4) 4 and 6
5) 1 and 3
______________________________________
One skilled in the art will appreciate that the word `my` never would have been successfully recognized if the word `family` had not been recognized first. The word `my` may have been interpreted to be `me`, `ma`, etc. by current art processes. The word `my` may have been interpreted to be `mg` by the present invention if the sets are not ordered properly. Block 78 proceeds back to block 48 in FIG. 1 which then determines if all the sets in the current list have been processed. Block 80 in FIG. 2 then executes. The OUTPUT.sub.13 CHANGED flag is set to true because one or more characters were replaced in the output stream during the list iteration. Block 80 proceeds to block 46 in FIG. 1 where the current list is again processed. Blocks 46 through block 80 will continue to be executed as previously described until no output is processed to the output stream and the OUTPUT.sub.13 CHANGED flag remains FALSE. After the second pass on the current list, the ordered list will have been reduced to an empty list. The OUTPUT.sub.13 CHANGED flag remains true for one more pass on the current list. Upon completion of the third pass, at block 80, if the OUTPUT.sub.13 CHANGED flag is not TRUE, then block 82 determines if there are any sets remaining for processing. If there are no more sets, then block 84 passes the successfully completed output stream to the underlying application and processing completes. If one or more sets do exist, then processing continues to block 86 in FIG. 3. With reference now to block 86 in FIG. 3, if STATUS2CHECK equals CORRECTLY.sub.13 SPELLED, then the STATUS2CHECK variable is set to a new status (i.e. 1.sub.13 HIT) and all remaining sets in the list are reprocessed as heretofore described. There may be cases where the OCR process in block 32 did not provide the correct character in the candidate list. A case may also exist where the unrecognized character is a typo. The present invention deals with such a problem by reprocessing sets with a lesser but desirable status to check for when performing spell checks. This allows identifying characters in textual components even when the OCR resolution table candidates are incorrect for one reason or another. If, in block 86, the STATUS2CHECK variable is already set to 1 HIT, then processing continues to block 88. It is very important to note that many status types may be implemented in FIG. 3 such that a current list processed until OUTPUT.sub.13 CHANGED remains FALSE may be reprocessed with many types of status. CORRECTLY.sub.13 SPELLED and 1 HIT are the most desirable examples and are therefore shown. Many varying descending levels of status may be implemented without departing from the spirit and scope of the invention. Should one or more sets still not be successfully processed, then the present invention relies on alternative techniques to make a final judgement. In the example, if placeholder 10 did not exist in SET 5, then SET 5 could not have been reduced because either `n` or `m` would work for a correctly spelled word. Assuming no existence of placeholder 10, block 88 would flow to block 100 which would retrieve SET 5 placeholder 1. Block 102 would then select the first candidate in the candidate list because it was the most confident character (priority order). Block 104 would then update the output stream correspondingly and processing would continue back to block 88 for the next unresolved placeholder. If in block 88, all remaining placeholders were processed, then block 108 would provide the output stream to the underlying application and processing would stop. It is important to note that actual processing of a scanned image may produce many placeholders within a set, thereby making character resolution more accurate. One skilled in the art will appreciate that the examples above were simple and merely served to explain processing. More uncertainty of a character during an OCR process simply provides a larger candidate list which is processed. It should be noted that grouping placeholders within sets allows identifying which characters are the same but are not yet recognized. Having this relationship facilitates recognizing words containing the same character which was recognized in other words. Simply having the relationship of belonging to the same word allows recognizing characters of words wherein other characters in the same word are recognized. The OCR resolution table may also contain references to characters which are already resolved by block 32. This allows grouping characters having similar candidates with known characters in order to provide a lead in resolving unknown characters. Blocks 60, 62 and 64 operate independently of characters already resolved. Turning now to FIG. 4, an Optical Character Recognition system 200 is shown where the operation of the present invention is carried out. The Optical Character Recognition system 200 consist of a processor 202 containing a CPU 204 and memory 206. Attached to the processor 202 is a scanner 220 for creating an output stream. In addition, a keyboard 216 is provided for entry of data by a user. Permanent storage is provided by hard disk storage 208 along with removable storage in the form of a floppy disk device 210. Program information or data may be inputted to the OCR system by a floppy disk 212 A display device 218 is also provided to permit a user to view information and data processing within the OCR system. While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
