Method for correcting spelling using error frequencies5572423Abstract An improved method for correcting spelling errors in text wherein candidate expressions for replacing a misspelled word are assigned probability functions. The misspelled word can be replaced automatically with the candidate expression having the highest probability function or candidate expressions can be displayed to a user in rank order of their probability functions for the user to make a choice. The probability function for a candidate expression is based on (1) the probability of occurrence of the candidate expression appearing in text and/or (2) the probability of occurrence of the particular typographical modification needed to convert the candidate expression into the misspelled word. Claims I claim: Description FIELD OF THE INVENTION
______________________________________
actress deletion of "t"
cress insertion of "a"
caress reversal of "ca"
access substitution of "r" for "c"
across substitution of "e" for "o"
acres insertion of "s"
______________________________________
As noted above, the candidate "actress" can be mistyped as "acress" by deleting the "t" between "ac" and "ress". "Cress" can also be mistyped as "acress" by inserting an initial "a". The other candidates are similarly derived as noted above. The table below illustrates this candidate generation step of the process in more detail--including also the character position where the modification occurs. Positions "0" through "5" refer respectively to the first through sixth characters of the misspelled word "acress".
TABLE A
______________________________________
Misspelled
Correction Posi- Trans-
Word Candidate Transform tion formation
______________________________________
acress actress delete t after c
2 deletion
acress cress insert a after @
0 insertion
acress caress reverse ca 0 reversal
acress access substitute r for c
2 substitution
acress across substitute e for o
3 substitution
acress acres insert s after e
4 insertion
acress acres insert s after s
5 insertion
______________________________________
The above table illustrates that the misspelled word "acress" can be generated from "acres" by insertion of the letter "s" in character position 4 of "acres" or insertion of the letter "s" in character position 5 of "acres ". The insertion and transformation operations are implemented in a straightforward way by nondeterministically trying all possibilities. Therefore, n dictionary accesses are required to see if the n(th) letter in the misspelled word may have been inserted erroneously. The deletion operation is more complicated since there are 26 letters that might have been deleted in n+1 positions. In order to reduce the number of dictionary accesses, I use a precomputed deletion table which maps words missing a letter to corrections. A sample table for the word "great" follows:
TABLE B
______________________________________
Deletion
Key Correction
______________________________________
grea t 4
gret a 3
grat e 2
geat r 1
reat g 0
______________________________________
Again, correction positions "0" through "4" correspond to characters "1" through "5" of the word "great". The deletion table is also useful in checking for substitutions. For each correction candidate, step 13 in FIG. 1 derives a first probability function, Pr(correction), indicating the probability of that candidate being used in the text. The function Pr(correction) is estimated by the expression (freq(correction)+1)/N, where freq(correction) is the number of times that the word "correction" appears in the large text corpus and where N is the total number of words in the corpus. The corpus used in my illustrative embodiment contained some 32 million words. A large table (not shown because it has four hundred thousand entries) contains each word in the vocabulary and stores data reflecting the number of times each such word appeared in the corpus. As described in Table C below, a second probability function, Pr(typo/correction), is also derived for each correction candidate and reflects the probability of occurrence of the particular typographical modification (i.e., deletion, insertion, reversal or substitution) used to generate the misspelled word from the candidate expression. For this probability function, (1)del[x,y] is the number of times that the character y was deleted after the character x in the "training" corpus (i.e., xy typed as x), (2) add[x,y] is the number of times that y was inserted after x (i.e., x typed as xy); (3) sub[x,y] is the number of times that y (from the correct word) was typed as x (i.e., y typed as x); (4) rev[x,y] is number of times that xy was reversed (i.e., xy typed as yx); and (5) chars[x,y] and chars[x] are the number of times that xy and x respectively appeared in the corpus. The computation of this probability function will be described hereinafter with respect to the misspelled word "acress".
TABLE C
__________________________________________________________________________
del [cor.sub.p-1, cor.sub.p ]/chars[cor.sub.p-1, cor.sub.p
if deletion
add [cor.sub.p-1, typo.sub.p ]/chars[cor.sub.p-1 ]
if insertion
Pr(typo.vertline. cor) =
sub [typo.sub.p, cor.sub.p ]/chars[cor.sub.p ]
if substitution
rev [cor.sub.p, cor.sub.p+1 ]/chars[cor.sub.p, cor.sub.p+1
if reversal
__________________________________________________________________________
FIGS. 2 through 5 are matrices which respectively reflect single character deletion, insertion, substitution, and reversal data. The data in the matrices was derived by examining the corpus for misspelled words, identifying the correct word and then updating the appropriate matrix to reflect the typing error that was detected. For example, if the word "receive" was incorrectly typed as "recieve" (i.e., "ei" was reversed) in the corpus, then, with reference to FIG. 5 (the matrix for reversals), the number "60" found at the intersection of y=i and x=e would be updated by 1 (i.e., 60+1). The other three matrices were generated in the same manner to reflect character deletions, insertions and substitutions. The combination existence information chars[x,y] and chars[x] in the above Table C are found in FIGS. 6 and 7 respectively. FIG. 6 shows the number of times the character x was followed by character y in the corpus, and FIG. 7 shows the number of times each character was observed in the corpus. The special symbol "@" found in FIGS. 2, 3, 6 and 7 is attached to the beginning of each word and refers to a null string. Thus, in FIG. 3, for example, the cell circled "Example B" refers to the insertion of the letter "a" after the beginning of a word. After the second probability number Pr (typo.vertline.cor) is derived by reference to the appropriate matrix in accordance with Table C, an overall probability, which is a function of the first (Pr(correction)) and second (Pr(typo.vertline.cor)) probability numbers, is derived for each candidate word (step 15, FIG. 1 ). In this illustrative embodiment, the first and second probability numbers are multiplied together to generate the overall probability. Others skilled in the art may elect to weigh the two probabilities differently in generating the overall probability and, due to memory, processing or other constraints, may elect to use one or other by itself as the overall probability. Once the overall probabilities have been generated for all candidate expressions, the expressions can be rank ordered and sorted to meet the intended use. The misspelled word can be replaced, with or without human intervention, by the candidate with the highest probability, or the candidate expressions may be proffered as suggested corrections in an ordered list in descending order of probability (step 16, FIG. 1). The derivation of probability figures in my illustrative embodiment will now be described in detail with respect to the example in Table A above. Table D below illustrates how the first, second and overall probability figures are generated for each of the seven candidate expressions discussed previously for the misspelled word "acress". With reference to Example A (e.g., "actress"), the probability (Pr(cor)) of the word "actress" occurring in the text (i.e., Pr(actress)=(freq(actress) +1)/N) is derived by adding 1 to 1343, which is number of times "actress" appeared in the corpus (not shown), and then dividing the sum by 32,000,000, the number of words in the corpus. Since the deletion transformation is used to generate the misspelled word "acress" from the candidate expression "actress", the second probability figure (Pr(type.vertline.cor))is derived by reference to the deletion portion of Table C above. More specifically, the second probability figure (Pr(type.vertline.cor)) for "actress" (i.e., Pr(acress.vertline.actress)=del[c,t]/chars[c,t])), which reflects the likelihood of deleting t after c, is derived with reference to the deletion table in FIG. 2 and the characters table in FIG. 6 where x=c and y=t. FIG. 2 shows that this character modification occurred 55 times in the corpus, and FIG. 6 shows that the character combination "ct" occurred 470573 times in the corpus (see circled cell in each of these figures labeled "Example A"). Thus, in Table D, Pr(acress.vertline.actress)=55/470573. The overall probability in Table D for "actress" is derived by multiplying together (and thereby giving equal weight to) the two calculated probability figures (e.g., Pr(actress) x Pr(acress.vertline.actress)) resulting in an "overall probability" of 49.times.10.sup.-10.
TABLE D
__________________________________________________________________________
Overall
(.times.10.sup.-10)
Normalized
Example
Typo
Cor Transform
Pr(cor) Pr(typo.vertline.cor)
Probability
Score
__________________________________________________________________________
A acress
actress
delete t after c
1344/32,000,000
55/470,573
49 37%
B acress
cress
insert a after @
1/32,000,000
46/32,158,917
.about.0
0%
C acress
caress
reverse ca
5/32,000,000
1/581,247
.about.0
0%
D acress
access
substitute r for c
2281/32,000,000
1/4,747,178
.about.0
0%
E acress
across
substitute e for o
8437/32,000,000
93/10,234,105
24 18%
F acress
acres
insert s after e
2880/32,000,000
417/13,128,963
29 22%
G acress
acres
insert s after s
2880/32,000,000
205/6,009,135
31 23%
__________________________________________________________________________
The probability figures in Example B (e.g., "cress") in Table D are similarly generated. Pr(cress)=(freq(cress)+1)/N or 0+1/32,000,000. The "0" indicates that "cress" was not found in the corpus. With reference to the "insertion" equation in Table C, Pr(acress.vertline.cress)=add[@,a]/chars[@] or 46/32,158,917 since FIG. 3 (the insertion matrix) indicates that "a" was inserted 46 times after "@", the special character denoting the beginning of a word, and FIG. 7 indicates that "@" (beginning of word) was observed 32,158,917 times. The other examples C-G in Table D are self-explanatory. FIGS. 2-7 identify, by designation to each example by the use of circled cells, the derivation of the numbers found in the Table D (with the exception of freq(correction)). The "normalized scores" in Table D, derived by simply averaging all the overall probability numbers, indicate that (1) the words "cress", "caress", and "access" are implausible since each of their overall probabilities was nearly zero, (2) the words "actress", "across", and "acres" are plausible with overall probabilities of 37%, 18%, and 45% (i.e, 22%+23%) respectively, and (3) "acres" is the most likely candidate expression followed by "actress". It, of course, is understood that one skilled in the art could utilize the inventive concepts discussed in a wide variety of other embodiments without departing from the spirit and scope of my invention. For example, my invention works equally well with character based foreign languages, such as, for instance, Spanish, French and German.
|
Same subclass Same class Consider this |
||||||||||
