System and method for improved string matching under noisy channel conditions6687697Abstract Described is a system and method for improving string matching in a noisy channel environment. The invention provides a method for identifying string candidates and analyzing the probability that the string candidate matches a user-defined string. In one implementation, a find engine receives a query string, converts an image file into a textual file, and identifies each instance of the query string in the textual file. The find engine identifies candidates within the textual file that may match the query string. The find engine refers to a confusion table to help identify whether candidates that are near matches to the query string are actually matches to the query string but for a common recognition error. Candidates meeting a probability threshold are identified as matches to the query string. The invention further provides for analysis options including word heuristics, language models, and OCR confidences. Claims We claim: Description FIELD OF THE INVENTION
TABLE 1
-log p(R.vertline.s)
s .fwdarw. R (edit cost)
am .fwdarw. arn 1.074
en .fwdarw. ea 0.956
en .fwdarw. e, n 4.400
nt .fwdarw. at 1.013
end .fwdarw. ead 0.708
end .fwdarw. eud 2.508
men .fwdarw. rnea 0.858
me .fwdarw. me, 1.211
In one embodiment, a set of pairs (S, T) are obtained, where S is a word from the ground-truth file and T is the corresponding noisy OCR word. In one example, a syntactic signature method is used to extract correct matches from the comparison to achieve a set of pairs (S, T) where S is a word from the "known" text file and T is the corresponding noisy OCR word. In one embodiment, for each (S, T) pair, the greatest common substrings between S and T are found, where s represents the original string. The substrings are then used to derive an initial set of possible edits. In this embodiment, the initial set of possible edits is then expanded using up to a given number of characters on either side of the edit. In one example, the initial set of possible edits is expanded up to three (3) characters of context on either side of the edit. In this embodiment, the expansion is performed for both the ground-truth word and the OCR word. For each edit s.fwdarw.R in the expanded set, the overall frequency of the edit is obtained. Additionally, the frequency of all other edits based on s and the total frequency of s in the corpus (i.e., the ground-truth file) is obtained as well. Values of p(R.vertline.s) can then be determined. Values for p(s) may be calculated as well. In one embodiment, these values are obtained via Bayes Theorem. In this embodiment, the value of the posterior probability p(s.vertline.R) (i.e., the most useful edits) can be obtained from p(R.vertline.s). An edit cost can also be calculated as c=-log p(R.vertline.s). The most useful edits, defined as the edits having the highest values of p(R.vertline.s).multidot.p(s), can then be determined. In this way, a comparison between a known text (e.g., the ground-truth) and OCR'd text yields a corresponding database of characters and the associated probabilities that a character may represent one or more other characters. This data is then stored in confusion table 170. In another embodiment, probabilities for other classes of errors may be calculated as well in the training phase. Such possibilities include but are not limited to P.sub.insert, P.sub.delete, P.sub.subst, and represent the probability of an insertion of a character, deletion of a character, and substitution of a character respectively (see FIG. 5 and discussion). FIG. 4 is a logical flow diagram generally illustrating a process for string matching under noisy channel conditions. In describing FIG. 4, reference is made to the system described in conjunction with FIG. 3. Process 400 enters at starting block 410, where find engine 310 has received a query string from the user, and OCR engine 330 has created document text file 160 from document image file 150. The process begins at block 420 where find engine 310 performs a fast approximate string match on data in document text file 160. The fast approximate string match is illustrated in detail in FIG. 5 and described below. Briefly described, and referring to FIG. 3, the fast approximate string match involves find engine 310 utilizing a standard approximating string matching routine on data located in the document text file 160. The standard approximating string matching routine creates candidate data and stores the candidate data in the candidate list 350. Each candidate in the candidate list 350 essentially represents either a direct match or a near match to the query string, where any near match differs from the query string by less than some predetermined threshold. In one embodiment, candidate data is created using a standard approximate string matching routine with a "generous" maximum error threshold. In one example, the standard approximate string matching routine utilized is a method developed by G. Myers and disclosed in "A fast bit-vector algorithm for approximate pattern matching based on dynamic programming," Proc Combinatorial Pattern Matching 98, 1-13, Springer-Verlag, 1998. In another embodiment, candidate data is created using a standard approximate string matching routine with a "generous" maximum error threshold as detailed in method 420 of FIG. 5. At block 430, find engine 310 performs a noisy channel cost analysis on candidate data located in candidate list 350. The noisy channel cost analysis is illustrated in detail in FIG. 6 and described below. Briefly described, and referring also to FIG. 3, the noisy channel cost analysis involves find engine 310 comparing each string of candidate data to the query string to identify if a candidate would be an exact match to the query string but for a likely error during the OCR translation process. Each candidate is assigned a cost associated with substitutions that can be made from the confusion table to make the candidate an exact match to the query string. If the cost of a candidate meets a given threshold value V.sub.TH, that candidate is identified as a match to the query string. In one embodiment, search engine 310 performs the noisy channel cost analysis and the fast approximate string matching concurrently. In other words, candidate data produced by the fast approximate string match is passed to the noisy channel cost analysis as it is created. In another embodiment, the search engine 310 may complete the fast approximate string matching prior to beginning the noisy channel cost analysis. In this alternative embodiment, the fast approximate string matching completes processing candidate data and stores the candidate data in candidate list 350, and then the noisy channel cost analysis is performed on the candidate list 350 At block 440, find engine 310 may perform optional adjustments to the cost associated with each candidate in the candidate list 350. The optional adjustments are illustrated in detail in FIG. 7 and described below. Briefly described, and also referring to FIG. 3, the optional adjustments may involve performing additional comparisons using additional criteria. The optional adjustments are performed after the noisy channel cost analysis is performed. Optional adjustments may include but are not limited to word heuristics, language models, OCR confidence data, and the like. At block 450, find engine 310 determines which, if any, candidates satisfy the query string. The determination is illustrated in FIG. 8 and described below. Briefly described, and referring to FIGS. 1 and 3, the determination involves find engine 310 comparing each string of candidate data to a defined threshold value V.sub.TH. If the comparison meets the threshold, it results in that string of candidate data being identified as matching the query string. At block 460, processing ends. At this point find engine 310 has identified those strings of candidate data that have met the threshold value V.sub.TH and are identified as matching the query string. FIG. 5 is a logical flow diagram illustrating in greater detail a process 420 (FIG. 4) for performing a fast approximate string match (e.g., parsing) on data in document text file 160. In describing FIG. 5, reference is made to the system described in conjunction with FIG. 3. Method 420 enters at starting block 510 where OCR engine 330 has created a document text file 160 from image data in a document image file 150. Additionally, while creating the text data for document text file 160, confidence data is also created for and stored in OCR confidence table 340. At decision block 520, find engine 310 determines if an error rate K has been provided. In one embodiment, error rate K indicates the average number of errors per query string to be permitted in a match candidate. In this embodiment, an error is considered to be an insertion, deletion, or substitution of a single character relative to the original string. Therefore, larger values of K may increase recall slightly but may use more computation time, due to more candidates being examined. In this embodiment, using a value of K=0 is equivalent to exact matching. In one example, a value of K=5/8 is used. This value may represent a user-defined value or a default value. In this example, utilizing a value of K=5/8 allows an average of 5 character errors for an 8 letter word. If an error rate K is provided, the process advances to block 527. Alternatively, the process advances to block 523, where the error rate K is set to a default value. At block 527, the error rate K is set to the value provided. In one embodiment, error rate K is a user-defined term. In this embodiment, query dialog box 320 may allow a user to specify an error rate K rather than using the default value. Other embodiments might allow various error rates K to be used in different situations. For example, an embodiment might allow multiple defaults based on the type of document received, the median of transmission of the document reception, and the like. At block 530, method 420 determines a maximum edit distance to be used for the fast approximate string match. In one embodiment, the maximum edit distance is equal to the product of the error rate and the length of the query string. At block 540, find engine 310 searches the document text and identifies match candidates based on a standard approximate string matching routine. In one embodiment, match candidates are those strings that contain enough matching characters to fall within the error rate K threshold. At block 550, find engine 310 stores the match candidates in candidate list 350. FIG. 6 is a logical flow diagram illustrating in greater detail a process for performing a noisy channel cost analysis on candidate data located in candidate list 350. In describing FIG. 6, reference is made to the system and process described in conjunction with FIGS. 1 and 3. Method 430 enters at starting block 605 where find engine 310 prepares to perform a noisy channel cost analysis on candidate data previously parsed from document text file 160 (see FIG. 5 and discussion). In one example, query dialog box 320 receives a query term "amendment" from the user. In this example, method 430 receives candidate data generated by OCR engine 330, for example the term "arneadme,nt." The noisy channel cost analysis involves using a set of confusion data of typical string-to-string edits as discussed above. Each string-to-string edit in the set of confusion data has an associated probability p(s.vertline.R), where s represents the original string and R represents a corresponding erroneous OCR string partition. In one embodiment, the value of the probability is obtained via Bayes Theorem. In this embodiment, and ignoring the constant denominator, p(s.vertline.R) is given by p(R.vertline.s).multidot.p(s). Taking the negative logarithm of p(R.vertline.s) yields an edit cost c of the string-to-string edit. Given a confusion set C of m entries: {s.sub.1.fwdarw.R.sub.1, s.sub.2.fwdarw.R.sub.2, . . . , S.sub.m.fwdarw.R.sub.m } which have corresponding edit costs {c.sub.1, c.sub.2, . . . , c.sub.m }, a query term/string Q, and a candidate T in the OCR text, the probability that Q matches T can be calculated. At block 610, find engine 310 retrieves a candidate from the candidate list 350. At block 620 find engine 310 partitions the query string into n substrings {Q.sub.1, Q.sub.2, . . . , Q.sub.n }. A substring is a portion of the query string for which there is a corresponding set of characters in the candidate. That is, for each Q.sub.i, there is a corresponding set of characters T.sub.i in the candidate data. If there are multiple possibilities of substrings to be partitioned, each of the multiple possibilities is termed an OCR string partition. In one example, the query string "amendment" can be partitioned into several OCR string partitions, such as: 1. am.vertline.end.vertline.me.vertline.nt 2. a.vertline.men.vertline.d.vertline.me.vertline.nt At block 630, find engine 310 determines one or more sets of characters T.sub.i from the candidate where each set of characters is used to compare to each substring of the query string. In one embodiment, find engine 310 utilizes confusion table 170 to identify the sets of characters T.sub.i that qualify. At block 640, find engine 310 retrieves the set of characters T.sub.i with the lowest edit cost for each substring of the query string. In one embodiment, find engine 310 retrieves the set of characters T.sub.i with the lowest edit cost from confusion table 170. Thereafter, one of the following possibilities is satisfied for each Q.sub.i : 1. Q.sub.i maps without error to its counterpart T.sub.i, with probability P.sub.correct (Q.sub.i). 2. Q.sub.i has an entry in the confusion set such that it maps to T.sub.i according to the entry s.sub.j.fwdarw.R.sub.j, with probability c.sub.i. 3. Q.sub.i maps to some set of characters T.sub.i, but this mapping is not in the confusion set and therefore is modeled by a series of single character insertions, deletions, or substitutions. The probabilities of these operations may vary for individual characters. In one embodiment, the overall probabilities are denoted as p.sub.insert (Q.sub.i), p.sub.delete (Q.sub.i), and p.sub.subst (Q.sub.i) respectively. In one example, the following values can be used: p.sub.insert (Q.sub.i)=0.1 p.sub.subst (Q.sub.i)=0.1 p.sub.delete (Q.sub.i)=0.01 p.sub.correct (Q.sub.i)=0.9 In this example, the set of characters T.sub.i with the lowest edit cost for each substring of OCR string partition 1 (above) retrieved from Table 1 is: am.fwdarw.arn=1.074 end.fwdarw.ead=0.708 me.fwdarw.me,=1.211 nt.fwdarw.nt=0.105 Additionally, in this example, the set of characters T.sub.i with the lowest edit cost for each substring of OCR string partition 2 (above) retrieved from Table 1 is: a.fwdarw.a=0.105 men.fwdarw.rnea=0.858 d.fwdarw.d=0.105 me.fwdarw.me,=1.211 nt.fwdarw.nt=0.105 At block 650, find engine 310 calculates a total edit cost C.sub.1 for the first OCR string partition. In one embodiment, find engine 310 calculates the total edit cost for the OCR string partition by summing the negative logarithms of the probability that each substring Q.sub.n is the set of characters T.sub.i. The total edit costs c.sub.i can be calculated by identifying the most likely of all possible partitions. The most likely of all partitions is identified by denoting a set of possible partitions Q by Part(Q), assuming the transformations are independent. The result may be expressed as follows: p(Q.vertline.T)=arg max.sub.D.epsilon.Part(Q).PI.Q(i).epsilon.D p(Q.sub.i.fwdarw.T.sub.i) When the term p(Q.sub.i.fwdarw.T.sub.i) is expanded in terms of probability for the possibilities of the above equation, an expression for the total edit cost C.sub.Total is as follows: C.sub.Total =arg min.sub.D.epsilon.Part(Q).SIGMA.Q(a).epsilon.D -log p.sub.correct (Q.sub.a)+.sub..SIGMA.Q(b).epsilon.D -log c.sub.b +.sub..SIGMA.Q(c).epsilon.D -log p.sub.insert (Q.sub.c)+.sub..SIGMA.Q(d).epsilon.D -log p.sub.delete (Q.sub.d)+.sub..SIGMA.Q(e).epsilon.D -log p.sub.subst (Q.sub.e). In one example, find engine 310 calculates the total edit cost C.sub.1 for OCR string partition 1 as: C.sub.1 =-log p(am.fwdarw.arn)-log p(end.fwdarw.ead)-log p(me.fwdarw.me,)-log p.sub.correct (nt.fwdarw.nt) therefore, C.sub.1 =1.074+0.708+1.211+0.105=3.098 At block 660, find engine 310 stores the total edit cost C.sub.1 for the first OCR string partition. In one embodiment, find engine 310 stores the total edit cost C.sub.1 for the first OCR string partition in total edit cost table 360. At decision block 670, find engine 310 determines if there are additional OCR string partitions to calculate the total edit cost C.sub.m for. If there are additional total edit costs C.sub.m to calculate, the process advances to block 675. Alternatively, the process advances to block 680. At block 675, find engine 310 retrieves the next OCR string partition. In one embodiment, find engine retrieves the lowest edit cost for each substring of the OCR string partition from confusion table 170. The process then advances to blocks 650 and 660 where the total edit cost C.sub.2 for the second OCR string partition is calculated and subsequently stored in total edit cost table 360. In one example, find engine 310 calculates the total edit cost C.sub.2 for OCR string partition 2 as: C.sub.2 =-log p.sub.correct (a.fwdarw.a)-log p(men.fwdarw.rnea)-log p.sub.correct (d.fwdarw.d)-log p(me.fwdarw.me,)-log p.sub.correct (nt.fwdarw.nt) therefore, C.sub.1 =0.105+0.858+0.105+1.211+0.105=2.384 When there are no more OCR string partitions to calculate total edit costs C.sub.m for, the process advances to block 680. At block 680, find engine 310 identifies the lowest total edit cost C.sub.L of the total edit costs C.sub.m produced. In one embodiment, find engine 310 compares each of the total edit costs C.sub.m to each other. In one example, find engine 310 would compare the first total edit cost (C.sub.1) with the second total edit cost C.sub.2 (3.098 vs. 2.384 respectively). In this example, the second total edit cost C.sub.2 would be identified as the lowest total edit cost C.sub.L and returned to the total edit cost table 360. In another embodiment, block 680 is eliminated and all total edit costs C.sub.m remain in the total edit cost table 360. The process then advances to block 690. At block 690, the process returns to block 430 of method 400 (see FIG. 4 and discussion). FIG. 7 is a logical flow diagram illustrating in greater detail optional processes for performing additional processing on total edit costs C.sub.m calculated using method 430 (see FIG. 6 and discussion). In describing FIG. 7, reference is made to the system and process described in conjunction with FIGS. 1, 3, 4, and 6. Method 440 enters at starting block 710, where find engine 310 is preparing to perform one or more optional processes on candidate data located in candidate list 350 to refine the edit cost of each candidate. In one embodiment, find engine 310 has just received candidate data and the associated total edit costs C.sub.m. At block 720, find engine 310 performs optional process word heuristics to increase the accuracy of method 400. In one embodiment, optional process word heuristics includes steps that identify position-based probabilities that reflect the importance that a match be close to a complete word or word prefix. In one example, the candidate data may be tested for either punctuation or white-space at the start and end of a match. In this example, the total edit costs C.sub.m might be reduced by 0.25 for a word prefix match. Similarly, the total edit costs C.sub.m might be reduced by 0.50 for an entire word match. At block 730, find engine 310 performs optional process language models to increase the accuracy of method 400. In one embodiment, optional process language models include steps to calculate a rough confidence estimate based on simple language models. In one example, optional process language models utilizes a frequency table of English bigrams (utilizing other language bigrams if another language were in use) that provide either a "low" or "high" confidence estimate to any query strings containing at least one "rare" bigram or none, respectively. At block 740, find engine 310 performs optional process OCR confidence data to increase the accuracy of method 400. In one embodiment, optional process OCR confidence data includes steps that may increase edit costs, in the noisy channel model, according to a region's confidence value. In this embodiment, the region's confidence value is based on the word-level confidence data provided from OCR confidence table 340. The OCR confidence data may be produced when OCR engine 330 creates document text file 160 from document image file 150. In another embodiment, find engine 310 may provide an indeterminate confidence value for a word. In this embodiment, find engine 310 may utilize optional process language model to provide an estimate. In yet another embodiment, optional process OCR confidence data may utilize character-level confidence data. Character-level confidence data may provide an increase in accuracy but with a corresponding increase in file size. At block 750, the process returns to block 440 of method 400 (see FIG. 4 and discussion). FIG. 8 is a logical flow diagram illustrating a process 450 for determining if a match candidate is a valid match. In describing FIG. 8, reference is made to the system and process described in conjunction with FIGS. 3 and 4. Method 450 enters at block 810 where total edit costs C.sub.m for candidate data have been determined. At decision block 820, find engine 310 compares each total edit cost C.sub.m value to a threshold value V.sub.TH. In one example, the threshold value V.sub.TH is preset to 0.300. Continuing with the above example, the query string Q is "amendment" that includes nine characters, and assumes that the threshold value V.sub.TH is 2.700. In this example, the second total edit cost C.sub.2, having a value of 2.384, would be considered a valid match. In this example, the first total edit cost C.sub.1, having a value of 3.098, would not be considered a valid match. In another embodiment, the threshold value V.sub.TH can be modified by one or more optional adjustments implemented during method 440 of method 400 (see FIGS. 4 and 7 and accompanying discussion). At block 830, if the candidate data is determined to be an invalid match, the candidate data is returned, along with its associated total edit cost C.sub.m, to the candidate list 350 and total edit cost table 360 respectively. The process then advances to block 850. At block 840, if the candidate data is determined to be a valid match, find engine 310 presents the matching candidate to the user. In one embodiment, each candidate identified as a match to the user's query term is highlighted or otherwise made known to the user. For example, each candidate in the candidate list 350 may be indexed to a location in the document image file 150. Each candidate identified as a match may be presented to the user by "highlighting" the area in the document image file 150 corresponding to the matching candidate. Of course, the matching candidates could be presented to the user as each match is identified, after the matches are identified, or somewhere in between. The process then advances to block 850, where the process returns to block 450 of method 400 (see FIG. 4 and discussion). The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.
|
Same subclass Same class Consider this |
||||||||||
