Method and apparatus for detecting error strings in a text5715469Abstract A system for checking the spelling of words and character strings without the need for a stored dictionary of words and the memory required thereby. The system selects an error-free string and modifies it according to one or more rules which change the error-free string to a possible error string. The rules creating the possible error string can modify the error-free string by predictable character manipulation to yield usual and common errors of the character string. The frequency of occurrence of both the error and error-free strings within the text are determined. These frequencies are compared to each other and, based upon the comparison, the system decides whether the possible error string is an actual error string. The system can use modifying rules which are psychological or technically related to the computer system or operator, and rules which correspond to errors common with specialized input methods, such as character and speech recognition. Claims Having thus described the invention, what is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Rule R.sub.1 :
Transposition of two successive letters.
Example:
f.sub.11 =
"Olmypischen"(1)
S.sub.1 =
"Olympischen"(875)
.alpha..sub.11 =
1164625
Rule R.sub.2 :
Omission of a letter occurring at least twice.
Example:
f.sub.22 =
"Prasidumssitzung"(1)
S.sub.2 =
"Prasidiumssitzung"(7)
.alpha..sub.22 =
40824
Rule R.sub.3 :
Omission of a letter occurring at most once.
Example:
f.sub.33 =
"Diziplinen"(1)
S.sub.3 =
"Disziplinen"(89)
.alpha..sub.33 =
118549
Rule R.sub.4 :
Doubling of a letter.
Example:
f.sub.44 =
"Baskettball"(2)
S.sub.4 =
"Basketball"(179)
.alpha..sub.44 =
89500
Rule R.sub.5 :
Replacement of a letter.
Example:
f.sub.55 =
"Golopprennbahn"(1)
S.sub.5 =
"Galopprennbahn"(34)
.alpha..sub.55 =
93296
Rule R.sub.6 :
Insertion of a letter not previously occurring
in the word.
Example:
f.sub.66 =
"Wiederanspfiff"(1)
S.sub.6 =
"Wiederanpfiff"(47)
.alpha..sub.66 =
103259
Rule R.sub.7 :
Insertion of a letter previously occurring in
the word.
Example:
f.sub.77 =
"Abloseseumme"(1)
S.sub.7 =
"Ablosesumme"(91)
.alpha..sub.77 =
157248
Rule R.sub.8 :
Doubling of an incorrect letter, here:
LEFT-HAND neighbor.
Example:
f.sub.88 =
"Spvvg"(4)
S.sub.8 =
"Spvgg"(142)
.alpha..sub.88 =
4435
Rule R.sub.9 :
Doubling of an incorrect letter in a word,
here: RIGHT-HAND neighbor.
Example:
f.sub.99 : =
"Sperrwerfen"(1)
S.sub.9 : =
"Speerwerfen"(19)
.alpha..sub.99 =
25289
Rule R.sub.10 :
INSTEAD of the desired letter,
RIGHT-HAND neighbor was pressed.
Example:
f.sub.10 10 =
"erfolgteich"(1)
S.sub.10 =
"erfolgreich"(290)
.alpha..sub.10 10 =
385990
Rule R.sub.11 :
IN ADDITION TO the desired letter,
RIGHT-HAND neighbor was pressed; insertion
BEFORE intended letter.
Example:
f.sub.11 11 =
"Cjhristian"(1)
S.sub.11 =
"Christian"(175)
.alpha..sub.11 11 =
127575
Rule R.sub.12 :
IN ADDITION TO the desired letter,
RIGHT-HAND neighbor was pressed; insertion
AFTER intended letter.
Example:
f.sub.12 12 =
"Verletzunmg"(1)
S.sub.12 =
"Verletzung"(153)
.alpha..sub.12 12 =
153000
Rule R.sub.13 :
INSTEAD OF the desired letter,
LEFT-HAND neighbor was pressed.
Example:
f.sub.13 13 =
"Problene"(1)
S.sub.13 =
"Probleme"(290)
.alpha..sub.13 13 =
148480
Rule R.sub.14 :
IN ADDITION TO the desired letter,
LEFT-HAND neighbor was pressed; insertion
BEFORE intended letter.
Example:
f.sub.14 14 =
"Hoffnungstragwer"(1)
S.sub.14 =
"Hoffnungstrager"(18)
.alpha..sub.14 14 =
73728
Rule R.sub.15 :
IN ADDITION TO desired letter,
LEFT-HAND neighbor was pressed; insertion
AFTER intended letter.
Example:
f.sub.15 15 =
"Qualkifikation"(1)
S.sub.15 =
"Qualifikation"(255)
.alpha..sub.15 15 =
560235
Rule R.sub.16 :
Capitalization error on FIRST letter.
Example:
f.sub.16 16 =
"olympiastadion"(1)
S.sub.16 =
"Olympiastadion"(5)
.alpha..sub.16 16 =
13720
Rule R.sub.17 :
Capitalization error on SECOND letter.
Example:
f.sub.17 17 =
"SChwalbach"(1)
S.sub.17 =
"Schwalbach"(38)
.alpha..sub.17 17 =
38000
Rule R.sub.18 :
Omission of a double letter, leaving only
single letter.
Example:
f.sub.18 =
"Etapensieger"(1)
S.sub.18 =
"Etappensieger"(37)
.alpha..sub.18 18 =
81289
Rule R.sub.19 :
Doubling of a doubled letter, thus tripling it.
Example:
f.sub.19 19 =
"UdSSSR"(1)
S.sub.19 =
"UdSSR"(740)
.alpha..sub.19 19 =
92500
______________________________________
The rules R.sub.j are optimally selected when essentially only those variants best corresponding to the observed error types are generated in step 11. Here, the following rules have proven themselves: rule R.sub.1 (transposition of two successive letters: from "abcba", "bacba", "acbba", "abbca" and "abcab"), rule R.sub.2 (omission of a letter occurring at least twice, i.e., omission of individual letters, but only those which otherwise occur at least once: from "abcba", "bcba", "acba", "abca", and "abcb", and rule R.sub.7 (insertion of individual letters, but only those which otherwise occur at least once: from "abc", "aabc", "abac", "abca", "babc", "abbc", "abcb", "cabc", "acbc", "abcc", but not "abdc" or the like). Rule R.sub.2 serves primarily to simulate a possible psychological error source. Omissions of letters occur very easily during manual entry but are more difficult to find during copy editing if the omitted letter occurs again in the string--because it is then not "missed" so much. On the other hand, rules R.sub.10 to R.sub.15 serve to simulate technical deficiencies of the entry method employed--in this case a keyboard. The technical deficiency of the keyboard makes itself evident in this example in ergonomically unfavorable formation of the keys, so that adjacent keys are frequently pressed by mistake. A further possible rule is the replacement of optically similar letters in the error-free string, e.g., replacement of "c" by "e". In a word processing system according to the invention, use of this rule can simulate error sources arising from technical deficiencies--such as insufficient resolution of the screen used to display the text. In a character recognition system in accordance with the invention, this and other rules can simulate technical deficiencies of the system for automated optical character recognition, since optically similar letters are often not correctly recognized by such systems. In the same way, in a system for automatic recording of dictation, in accordance with the invention, technical deficiencies in the associated speech recognition system can be simulated. Applying the corresponding rules, phonetically similar letters are transposed, e.g., "m" with "n", since speech recognition systems often produce such errors. Of course, the rules mentioned can apply not only to words but also to strings of any construction. In calculating the value .alpha..sub.ij in step 12, a dictionary--based method can be used in addition. The possible error string f.sub.ij is then additionally checked using the dictionary--based method. If the string f.sub.ij is contained in the dictionary, i.e., if it is a valid string G.sub.i, this would initially indicate that the possible error string f.sub.ij is not in error. However, this is in no way certain, since an error in the corresponding error-free string S.sub.i can by chance also lead to a valid string Gi, i.e., the possible error string f.sub.ij can occur in the dictionary as valid string G.sub.i in addition to being an error string F.sub.i. Of course, as noted, a certain probability exists that a possible error string f.sub.ij occurring as a valid string G.sub.i in the dictionary is not an actual error string F.sub.i. This can be taken into account in the calculation of .alpha..sub.ij by modifying the value .alpha..sub.ij from equation (2) if the string f.sub.ij is a valid string G.sub.i. The modification can be done by multiplying the value from equation (2) by a factor between 0 and 1. The factor 0 here signifies that a valid string G.sub.i is defined unconditionally as error-free. In this case, however, an advantageous characteristic of the inventive method would be lost, namely consideration of the context. Using the inventive method, the word "director" in a manual for data processing was determined to be a possible error variant of the word "directory", although the word "director" is valid. Taking context into account is implicit in the inventive method, since the frequencies H(.sub.Si) and H(f.sub.ij) are compared with each other. The factor is advantageously chosen to be significantly greater than 0. The calculation of the value .alpha..sub.ij in step 12 can be further influenced by a method for automated learning. The method for automated learning assigns a factor .delta..sub.j (B) to an applied rule R.sub.ij. The factor .delta..sub.j (B) is variable and can be influenced on the one hand by the user and on the other hand by the type of hardware used. If the application of a rule R.sub.j leads with above average frequency to finding an error in the text, the method for automated learning assigns the rule R.sub.j a corresponding factor .delta..sub.j (B) greater than 1. In the opposite case, the method for automated learning assigns the rule a factor less than 1. The value .alpha..sub.ij determined in step 12 from equation (2) is thus additionally multiplied by the factor .delta..sub.j (B) associated with the applied rule R.sub.ij, so that the different success probabilities of the rules R.sub.j are considered in the calculation of .alpha..sub.ij. The rules R.sub.j can be sorted according to their factors .delta..sub.j (B) such that the rules R.sub.j with the highest probability of success, to which a relatively large factor .delta..sub.j (B) is assigned, are applied first in step 11. If the inventive method is conducted fully automatically, i.e., without displaying the detected errors as suggestions to the user, the definition in step 14 is the key for the method of automated learning. If a suggestion is given to the user, his acceptance of a string proposed as being in error is the key for the method of automated learning and thus for determining the factors .delta..sub.j (B). The method for automated learning can, for example, be implemented with a neural network, possibly in conjunction with an expert system. By using a system for automated learning, a user- and/or hardware-specific calibration can be implemented. For example, the transposition of "y" and "z", such as in "Szstem", can be expected only with those users who continually switch between German and American keyboards, but not with authors of newspaper copy, who generally work with only one type of keyboard. Since there are also corresponding word pairs which do not represent errors, for example "Holy" and "Holz", it is useful to consider such transpositions as possible errors only if they are reasonable for the application area. A hardware-related type of error that can be allowed for through the method of automated learning is, for example, the inadvertent simultaneous depression of two adjacent keys on the keyboard, such as in "Sysrtem". The probability of this type of error will depend on the keyboard used--in particular its action point and any generation of an acoustical signal when pressing a key. Furthermore, the method of automated learning can also allow for the use, prior to the inventive method, of other spell-check methods which detect certain error types with difficulty. Those rules R.sub.j which simulate these error types then are assigned a particularly heavy weighting via the factor .delta..sub.j (B). The user- and/or hardware-specific calibration can also result by direct entry of the user- and/or hardware-specific weighting factors .delta..sub.j (B). The factors .delta..sub.j (B) associated with a specific user, specific hardware, or a specific combination of user and hardware, can then be stored in separate data sets. If the user and/or hardware changes, the current set of factors .delta..sub.j (B) is replaced by the set of factors .delta..sub.j (B) associated with the new user and/or the new hardware, so that the latter set becomes the current one. The current set of factors .delta..sub.j (B) serves to weight the values obtained from equation (2) in step 12. The value .alpha..sub.ij is thus obtained by multiplying the value obtained from equation (2) with the factor .delta..sub.j (B) associated with the applied rule R.sub.j. The current set of factors .delta..sub.j (B) thus obtained can also serve as a set of initial values for the factors .delta..sub.j (B) for the method of automated learning, so that the method can start off with user- and/or hardware-specific weighting factors .delta..sub.j (B), which can then be further optimized automatically. If the user and/or hardware changes, the optimized set of factors .delta..sub.j (B) can be stored for later use as initial values. In addition, it is beneficial to provide an exception table, in which frequent word pairs such as form/from or three/there are stored. Proper names can also be stored in this table, e.g., Helmut/Hellmut or Hausmann/Haussmann, which could also arise from typographical errors, so that these words are not regarded as possible error strings in step 12. For a possible error string f.sub.ij generated in step 11, a check is made whether this string f.sub.ij is present in the exception table. If so, the next step executed will be 15 rather than 12. FIG. 2 shows the flow diagram of a second preferred embodiment of the invention. In step 20, the frequency H(Z.sub.i) of each string Z.sub.i occurring in the text is first determined. In this case, each unbroken sequence of letters and/or any other characters, depending on the application, can be defined as a string Z.sub.i. In step 21, the occurring strings Z.sub.i and their corresponding frequencies H(Z.sub.i) are stored pairwise in a table. In step 22, the condition H(Z.sub.i)>.delta. is tested. The value .delta. is a threshold value for the frequency H(Z.sub.i), above which the corresponding string Z.sub.i is defined to be an error-free string S.sub.i. If, therefore, the frequency H(Z.sub.i) of a specific string Z.sub.i exceeds the threshold value .delta., this specific string Z.sub.i is defined as an error-free string S.sub.i. The basis for this is that a string occurring relatively often in a text is with high probability an error-free string or a correctly spelled word of the applicable language. If the condition H(Z.sub.i)>.delta. in step 22 is not satisfied, the next step executed is 23, in which the index i is incremented by one. In the subsequent step 22, the condition H(Z.sub.i+1)>.delta. is tested for another string. If the condition H(Z.sub.i)>.delta. is satisfied by a string Z.sub.i, step 24 is executed next. In step 24, the corresponding string Z.sub.i is defined as an error-free string S.sub.i. The subsequent steps 11, 12, 13, 14, 15 correspond to the steps of the embodiment discussed with reference to FIG. 1. Step 24 thereby replaces the function of step 10 in the first embodiment, namely the selection of a specific error-free string S.sub.i. All possible variations previously discussed with respect to the first embodiment are also possible in the second embodiment. After completing the search for error strings F.sub.i of the string S.sub.i defined as error-free in step 14, the condition i =i.sub.max is tested in step 25. If index i has reached the maximum value i.sub.max, all strings Z.sub.i occurring in the text have been examined, so that the process is terminated in step 27. If the condition i=i.sub.max is not yet satisfied, the index i is incremented by 1 in step 26, and in step 22 the condition H(Z.sub.i+1)>.delta. is again checked for another string Z.sub.i Step 12 for computing the value .alpha..sub.ij can advantageously be carried out by obtaining the frequency H(f.sub.ij) from the table stored in step 21, so that the calculation is accelerated. If a possible error string f.sub.ij is not present in the table, its frequency is 0. In this case, step 15 can be executed without further evaluation, so that another rule R.sub.j can be applied to generate another possible error string. The result used in step 14 can be used for automatic correction, as previously discussed with reference to FIG. 1. It can be beneficial, however, to store all results obtained in step 14 and, after executing step 27, sort them by the corresponding values of .alpha..sub.ij. The user is then presented with a result list from which he can accept or reject individual results for automatic correction. Since the list is sorted by the values .alpha..sub.ij, the most reliable results are shown first. If the threshold value .beta. was selected relatively large, however, this procedure is not necessary, since in general practically all results obtained in step 14 can be used, so that an automatic correction can take place immediately without user intervention. To limit execution time of the process, e.g., because only a certain amount of computing time is available, the method can be terminated prematurely if a defined number of errors have already been found or a certain portion of computing time has been expended. To accelerate the process, the generation of possible error strings f.sub.ij can be controlled such that all rules R.sub.j are applied only if the frequency H(.sub.Si) for the error-free string S.sub.i associated with a possible error string f.sub.ij is high. In general, this expenditure will be worthwhile only if the frequency H(.sub.Si) is very high. A high H(.sub.Si) frequency implies a large statistical sample set, so that the reliability of the result in step 14 increases. In the case of lower H(.sub.Si) frequencies, the set of rules R.sub.j used for detecting an error string F.sub.i associated with an error-free string S.sub.i can be limited accordingly, so that steps 11 to 15 are executed faster overall. If prior to executing step 22 the table generated in step 21 is sorted, e.g., in alphabetical order, further acceleration results. The search of the table for a possible error string f.sub.ij for calculating the value .alpha..sub.ij in step 12 can then be carried out as a binary search. The binary search method is well-known, e.g., from Donald E. Knuth, "The Art of Computer Programming," Vol. 3, Section 6.2.1, Algorithm B, Addison-Wesley Publishing Company, 1973. In FIG. 3, a further possibility for storing the table generated in step 21 is shown. The tree structure depicted in FIG. 3 is generally described in the literature as a "linked tree", of. Franklin Mark Liang, "Word Hyphenation by Computer", Department of Computer Science, Stanford University, August 1983, pp. 11ff. and the references cited therein; de la Briandais, Rene, "File searching using variable length keys," Proc. Western Joint Computer Conf. 15, 1959, pp. 295-298; and Fredkin, Edward, "Tree memory," CACM 3, September 1960, pp. 490-500. In this example, the tree consists of nodes 30, whereby each node 30 contains entries 31 through 34. Entry 31 contains a letter or symbol, entry 32 contains the frequency H(Z.sub.i) of the corresponding string Z.sub.i, entry 33 is a pointer to a child if--present--of node 30, and entry 34 is a pointer to a sibling --if present--of node 30. Entry 32 in a node 30 is nonzero if the string from the highest level of the tree to the node 30 occurs in the text. An example is shown in FIG. 3 on the basis of a text that contains only the words "Festung", "Feuer", "Rauch", "Frieden", and "Fest", whereby the word "Feuer" occurs twice and the word "Fest" occurs three times in the text. The remaining words each occur only once in the text. This type of storing of the table in step 21 has the advantage of requiring less storage space and providing added acceleration to the process. The structure of the "linked tree" can occur in parallel with determining the individual strings Z.sub.i and their frequencies, so that subsequent sorting is unnecessary. The applicable algorithm has been specified by Knuth (Donald E. Knuth, "The Art of Computer Programming," Addison-Wesley Publishing Company, 1973, Section 6.2.2, pp. 422 ff., particularly Algorithm T). FIG. 4 shows an embodiment of a computer system in accordance with the invention. The computer system comprises storage means 1 for storing the text to be checked; storage means 12 for storing the frequencies H(Z.sub.i), or, in other words, for storing the table or tree structure established in step 21 (cf. FIG. 2 and FIG. 3); storage means 4 for storing rules R.sub.j used in step 11 for generating the possible error strings f.sub.ij (cf. FIG. 1 and FIG. 2); and processor means 2 for process control. The processor means 2 employ the frequency H(.sub.Si) of the error string F.sub.i for detecting the error string F.sub.i. The storage means 1, 4, 12 and the processor means 2 are interconnected via a bus 15, so that the processor means can access the different storage means. In this embodiment, the processor means contain storage means 3 for storing a frequency H(f.sub.ij) needed to compute the value .alpha..sub.ij in step 12; means 5 for modifying an error-free string S.sub.i in accordance with a rule R.sub.j, whereby a possible error string f.sub.ij can be generated according to step 11; means 6 for determining the frequency H(f.sub.ij); means 7 for comparing the frequencies H(f.sub.ij) and H(.sub.Si); means 8 for associating the possible error string f.sub.ij with the error string F.sub.i : means 11 for determining the frequency H(Z.sub.i) of various strings Z.sub.i in the text; and comparison means 13 for comparing the threshold value .delta. with a frequency H(Z.sub.i). The means 3, 5, 6, 7, 8, 11, 13 are interconnected via a processor-internal bus 16. The means 3, 5, 6, 7, 8, 11, and 13 Contained in the processor means 2, as well as bus 16, need not be discrete electronic components but can rather be generated via appropriate programming of processor 2. Such a program suitable for implementing the inventive method will interact with the control program of the computer system in a well-known manner such that the computer system assumes the configuration shown in FIG. 4. The means 6 for determining the frequency H(f.sub.ij) interact via the bus 16 and 15 with means 12 such that the desired frequency H(f.sub.ij) can be derived from means 12, if this frequency is stored there. If there is no entry for the possible error string f.sub.ij in the table stored in means 12, the frequency H(f.sub.ij) is zero. The determination of the frequency is needed to calculate the value .alpha..sub.ij in step 12. The means 7 for comparing the frequencies H(.sub.Si) and H(f.sub.ij) contain computing means 9 for computing the value .alpha..sub.ij in accordance with the computing rule .PHI..sub.ij (H(f.sub.ij),H(.sub.Si))=.alpha..sub.ij (1) This corresponds to the comparison of the frequencies H(.sub.Si) and H(f.sub.ij) carried out in step 12 in computing the value .alpha..sub.ij. The means 8 for associating the possible error string f.sub.ij to the error string F.sub.i contains means 10 for storing the threshold value .beta. for a comparison with the value .alpha..sub.ij. The value .alpha..sub.ij determined for comparison by means 7 is transferred via bus 16 to associating means 8. Associating means 8 process the value .alpha..sub.ij in accordance with steps 13 and 14. The means 11 for determining the frequency H(Z.sub.i) interact with the means 1 to identify individual strings Z.sub.i in the text and to calculate the corresponding frequencies H(Z.sub.i), in accordance with step 20. The comparison means 13 include means 14 for storing the threshold value .delta.. The comparison means 13 interact with means 11 to define those strings Z.sub.i as error-free strings S.sub.i whose frequency H(Z.sub.i) exceeds the threshold value .delta.. Using appropriate control by a program 17, the computer system in accordance with the invention can thereby carry out the procedure of FIG. 1 and. FIG. 2. The program can be stored in the means 17 for program control, whereby the means 17 for program control interact with the processor means 2 via bus 15. Using the computer system in accordance with the invention, the sports sections of the "Frankfurter Rundschau" newspaper for 1988 were examined. The corresponding text consists of 1,671,136 words, of which 77,745 are unique. The computer system calculated 5,849 possible error strings f.sub.ij, of which 643 are actual error strings F.sub.i. The rules R.sub.j indicated in Table 1 were applied, whereby the application of rules R.sub.2 and R.sub.3 alone resulted in detecting 295 different actual error strings F.sub.i. Of course, many modifications of the foregoing description of the present invention are possible without departing from the spirit of the invention. Further, portions of the present invention can be used to advantage without the corresponding use of other parts of the description. Accordingly, the foregoing description of the present invention should be considered as merely illustrative of the present invention and not in limitation thereof.
|
Same subclass Same class Consider this |
||||||||||
