Method and system for expanding document retrieval information7010519Abstract An apparatus for expanding a character string. The character string that is entered to search includes a character string dividing device to divide the entered character string into partial character strings. A referencing device is provided to reference a similarity table. The similarity table stores in advance similar partial character strings, each of the similar partial character strings being derived from each of the partial character strings by changing at least one of the characters of each string to a different character which is similar in shape. An expansion device is provided to combine similar partial character strings into expanded words and store them in table. Claims What is claimed is: Description BACKGROUND OF THE INVENTION And the logical sum set of these expanded words are used as the search condition. However, the expanded word "i6" which is a combination of the fifth candidate character and the fifth candidate character is considered to have a very low possibility of contributing to the improvement of the search accuracy. Thus, by taking advantage of the characteristic fact that "a combination of candidate characters that have low emergence probabilities in single characters results in a further reduction in their emergence probabilities," those candidate characters making no contributions to the improvement of the search accuracy can be eliminated. Thus, the expanded words for the word "lo" may be as shown in FIG. 15. The search accuracy of the search that is made using 15 expanded words listed above is hardly degraded when compared with that of the search using 25 expanded words generated from all combinations of candidate characters. The reason for this will be explained in the following example. For each of the characters "l" and "o", it is assumed that the first candidate character has an emergence probability of ½, second candidate character has ¼, third candidate character has ⅛, fourth candidate character has 1/16, fifth candidate character has 1/32, and subsequent candidate characters have 1/32. Then, combining the candidate characters and calculating the multiplied emergence probabilities results in: By adopting the character strings in the upper left half of these expanded words, it is possible with a probability of ¼+⅛×2+ 1/16×3+ 1/32×4+ 1/64×5= 57/64≈90% to prevent the words from escaping the search. Hence, removing the candidate characters with low emergence probabilities from the search has almost no adverse effects on the search precision. Next, the effect of arranging the candidate characters in the form of n-character strings will be explained by taking a word " ##CHR2## ) The expanded word " ##CHR3## ) as shown in FIG. 5. The search accuracy of the search that is made using 15 expanded words listed above is hardly degraded when compared with that of the search using 25 expanded words generated from all combinations of candidate characters. The reason for this will be explained in the following example. For each of the characters " ##CHR4## ", it is assumed that the first candidate character has an emergence probability of ½, second candidate character has ¼, third candidate character has ⅛, fourth candidate character has 1/16, fifth candidate character has 1/32, and subsequent candidate characters have 1/32. Then, combining the candidate characters and calculating the multiplied emergence probabilities results in: " ##CHR5## " 1/024 By adopting the character strings in the upper left half of these expanded words, it is possible with a probability of ¼+⅛×2+ 1/16×3+ 1/32×4+ 1/64×5= 57/64≈90% to prevent the words from escaping the search. Hence, removing the candidate characters with low emergence probabilities from the search has almost no adverse effects on the search precision. An example similarity table generated as described above is shown in FIG. 15. By selecting candidate characters based on the emergence probabilities of n-character strings, it is possible to narrow the candidate characters in the similarity table down to only those with high emergence probabilities. This can reduce the number of candidate characters contributing to the improvement of search accuracy. When character strings contain less than n characters, it is possible to generate an m-character-based similarity table (m<n) so that candidate m-character strings can be picked up for generating the expanded words. Described above is the explanation about the similarity table 109. Now, registration processing in the document retrieval system will be described by referring to FIG. 6. In registering a document, a paper document to be registered is set in the scanner 103 (step 2000). The system control program 201 accepts a command from the keyboard 101 to start the document registration control program 202 (step 2001). The document registration control program 202 first starts the scanner control program 203 to extract image data from the paper document set in the scanner 103 and outputs it to the work area 216 (step 2002). Next, the document registration control program 202 starts the OCR control program 204 to perform character-recognition using the image data in the work area 216 as an input, extract text data and output it to the work area 216 (step 2003). Finally the document registration control program 202 starts the document registration program 205 to associate with each other identifiers of the text data and the image data, both read into the work area 216. Index data for use in the search operation is generated from the text data. Then, the text data and the image data are stored as text data 106 and image data 107, respectively, in the magnetic disk 104 (step 2004). This embodiment may be applied to not only the configuration where image data is taken in from a paper document by the scanner but also a configuration where image data is input directly from a facsimile via communication lines. This is the registration processing in the document retrieval system. Next, the search processing in the document retrieval system will be explained by referring to FIG. 7. In the search operation, when a search condition equation is entered from the keyboard 101, the system control program 201 starts the expansion control program 206 (step 2010). Next, the expansion control program 206 activates the expanded word generation program 207 to generate a plurality of expanded words for the input search character string and outputs them to the work area 216 (step 2011). Next, the expansion control program 206 starts the search condition equation generation program 211 to expand the expanded words into a search condition equation, which represents a logical sum (OR) set of the expanded words read into the work area 216, and output the search condition equation to the system control program 201 (step 2012). Next, the system control program 201 starts the search control program 212 which takes in the search condition equation. Then this control program successively starts the search condition equation analyzing program 213 and the text search program 214 to perform a text search according to the search condition equation (step 2013). As a final step, the result of search is output to the system control program 201 (step 2014). The system of this invention may comprise, as shown in FIG. 1, an expanded word generation program (208-210), a work area 216 in which to store the expanded words, a similarity table 109, a CPU 102, and other programs 108. This is because the search operations at and following the step 2013 may be executed by a separate device. Next, detailed procedure of the expanded word generation program 207 will be explained by referring to FIG. 8. The expanded word generation program 207 starts the search character string dividing program 208 to divide the entered search character string into n-character (n≧2) partial search strings (step 2020). Then, the similarity table referencing program 209 is executed to reference candidate characters in the n-character-based (n≧2) similarity table 109 described above for each of the divided partial character strings and store the candidate characters in the work area 216 (step 2021). Next, the search character string expanding program 210 is executed to read out the candidate characters for each of the partial character strings from the work area 216 and combine them to generate a plurality of expanded words (step 2022). This is the procedure performed by the expanded word generation program 207 in the document retrieval system. What has been described above concerns the search processing in the document retrieval system. Now, document display processing in this document retrieval system will be described by referring to FIG. 9. When displaying a user-specified document from the search result, the user specifies a document he or she wants displayed (step 2030). Then, the system control program 201 starts the display program 215 which displays the text data 106 stored in the magnetic disk 104 (step 2031). At this time it is checked whether the display of the image data is specified (step 2032) and, if so, the associated image data 107 in the magnetic disk 104 is displayed (step 2033). The search method described above will be detailed in the following by taking " ##CHR6## " are looked up in the similarity table of FIG. 6. When a search character string " ##CHR7## are read into the work area. Next, these candidate characters for the partial character strings are combined to generate expanded words as shown in Table 1.
Performing the search according to the logical sum (OR) condition that includes any one of the expanded words shown above (" ##CHR14## ") can reduce the possibility of the search character string eluding the search. In the case of such a long search character string, the word expansion is based on the partial character strings of a predetermined length and candidate characters with low emergence probabilities are excluded from the similarity table. This procedure results in a search using 15×15=225 expanded words as opposed to 5×5×5×5=625 expanded words used in the conventional search method. That is, the expanded words generated by using the similarity table containing only the candidate characters with high emergence probabilities can be made substantially smaller in number than those which are generated by combining all candidate characters as in the conventional technique, while maintaining the search precision. This in turn allows a significant reduction in the search time. As described above, this embodiment generates a 2-character-based similarity table. Hence, when the search character string or search term is made up of an even number of characters, it is possible to divide the search term into two-character strings, refer to the similarity table and combine candidate character strings to generate expanded words. Next, example processing for a case where the search term consists of an odd number of characters (three or more characters) will be explained. When the search term is made up of an odd number of characters (three or more characters), the search term is divided into a character string of the first three characters and a remaining character string consisting of a fourth and subsequent characters (even number of characters). Then, the first three characters are divided into a two-character string made up of the first and the second character and a two-character string made up of the second and the third character. The character string of the fourth and the remaining characters is divided into 2-character strings. These divided character strings are checked against the similarity table. The expansion processing for the fourth and the subsequent characters is similar to that described in the previous case of " ##CHR15## " will be described. First, the 3-character search term " ##CHR16## " made up of the second and the third character. Then, by referring to the similarity table shown in FIG. 5, candidate character strings for " ##CHR17## ". For " ##CHR18## " are extracted. As a final step, among the candidate character strings expanded from " ##CHR19## ". More specifically, those candidate character strings for " ##CHR20## ". Combining these two candidate sets results in 25 expanded words. Similarly, those candidate character strings for " ##CHR21## ". Combining these two candidate sets results in 16 expanded words. Further, those candidate character strings for " ##CHR22## ". Combining these two candidate sets results in nine expanded words. Further, those candidate character strings for " ##CHR23## ". Combining these two candidate sets results in four expanded words. As a final step, those candidate character strings for " ##CHR24## ". Combining these two candidate sets results in one expanded word. It is seen from the above that the expanded words for the 3-character search term " ##CHR25## " that are generated by this embodiment are reduced in number to 25+16+9+4+1=55 from 5×5×5=125 generated by the conventional method. While in the above example the detailed processing flow has been described for the 3-character search term " ##CHR26## ", it is also possible in the case of a 5-character search term to use 2-character expanded words for a character string of the last two characters in combination with the 3-character expanded words described above. Further, it is obvious that for search terms consisting of an odd number of characters equal to or larger than seven, the similar processing can be used to reduce the number of expanded words. Further, in the search method described above, let us consider another case where "lock" is used as a search character string. In this example, the word expansion is based on 2-character strings and candidate characters for "lo" and "ck" are looked up in the similarity table of FIG. 15. When a search character string "lock" is entered, the expanded word generation processing is first carried out. In the expanded word generation, the search character string "lock" is divided into 2-character partial character strings "lo" and "ck". Next, the candidate characters for "lo" are looked up in the similarity table and (lo, Io, !o, 1o, io, lO, IO, !O, 1O, l0, I0, !0, lQ, IQ, l6) are read into the work area. Similarly, the candidate characters for "ck" (ck, Ck, Gk, ek, qk, cK, CK, GK, eK, ch, Ch, GH, cb, Cb, cR) are read into the work area. Next, these candidate characters for the partial character strings are combined to generate expanded words shown in Table 2.
Performing the search according to the logical sum (OR) condition that includes any one of the expanded words shown above ("lock" or "loCk" or "loGk" or "loek" or "loqk" or . . . or "16cR") can reduce the possibility of the search character string eluding the search. In the case of such a long search character string, the word expansion is based on the partial character strings of a predetermined length and candidate characters with low emergence probabilities are excluded from the similarity table. This procedure results in a search using 15×15=225 expanded words as opposed to 5×5×5×5=625 expanded words used in the conventional search method. That is, the expanded words generated by using the similarity table containing only the candidate characters with high emergence probabilities can be made substantially smaller in number than those which are generated by combining all candidate characters as in the conventional technique, while maintaining the search precision. This in turn allows a significant reduction in the search time. As described above, this embodiment generates a 2-character-based similarity table. Hence, when the search character string or search term is made up of an even number of characters, it is possible to divide the search term into two-character strings, refer to the similarity table and combine candidate character strings to generate expanded words. Next, example processing for a case where the search term consists of an odd number of characters (three or more characters) will be explained. When the search term is made up of an odd number of characters (three or more characters), the search term is divided into a character string of the first three characters and a remaining character string consisting of the fourth and subsequent characters (even number of characters). Then, the first three characters are divided into a two-character string made up of the first and the second character and a two-character string made up of the second and the third character. The character string of the fourth and the remaining characters is divided into 2-character strings. These divided character strings are checked against the similarity table. The expansion processing for the fourth and the subsequent characters is similar to that described in the previous case of "lock" and thus its explanation is omitted. Here, an example of expansion for a three-character search term of "log" will be described. First, the 3-character search term "log" is divided into a 2-character string "lo" made up of the first and the second character and a 2-character string "og" made up of the second and the third character. Then, by referring to the similarity table shown in FIG. 15, candidate character strings for "lo" are extracted, which are "lo", "Io", "!o", "1o", "io", "lO", "IO", "!O", "1O", "l0", "I0", "!0", "lQ", "IQ", "16". For "og", candidate characters "og", "Og", "0g", "Qg", "6g", "o8", "O8", "08", "Q8", "oq", "Oq", "0q", "o9", "O9", "o7" are extracted. As a final step, among the candidate character strings expanded from "lo", those whose second character matches the first character of "og" are picked up. Among the candidate character strings expanded from "og", those whose first character matches the second character of "lo" are picked up. These two sets of extracted candidate character strings are combined to generate expanded words for the 3-character string "log". More specifically, those candidate character strings for "lo" whose second character is "o" are "lo", "Io", "!o", "1o" and "io". Those candidate character strings for "og" whose first character is "o" are "og", "o8", "oq", "o9" and "o7". Combining these two candidate sets results in 25 expanded words. Similarly, those candidate character strings for "lo" whose second character is "O" are "lO", "IO", "!O" and "1O". Those candidate character strings for "og" whose first character is "O" are "Og", "O8", "Oq" and "O9". Combining these two candidate sets results in 16 expanded words. Further, those candidate character strings for "lo" whose second character is "0" are "l0", "I0" and "!0". Those candidate character strings for "og" whose first character is "0" are "0g", "08" and "0q". Combining these two candidate sets results in nine expanded words. Further, those candidate character strings for "lo" whose second character is "Q" are "lQ" and "IQ". Those candidate character strings for "og" whose first character is "Q" are "Qg" and "Q8". Combining these two candidate sets results in four expanded words. As a final step, those candidate character strings for "lo" whose second character is "6" are "l6. Those candidate character strings for "og" whose first character is "6" are "6g". Combining these two candidate sets results in one expanded word. It is seen from the above that the expanded words for the 3-character search term "log" that are generated by this embodiment are reduced in number to 25+16+9+4+1=55 from 5×5×5=125 generated by the conventional method. While in the above example the detailed processing flow has been described for the 3-character search term "log", it is also possible in the case of a 5-character search term to use 2-character expanded words for a character string of the last two characters in combination with the 3-character expanded words described above. Further, it is obvious that for search terms consisting of an odd number of characters equal to or larger than seven, the similar processing can be used to reduce the number of expanded words. Although the process of generating a similarity table shown in FIG. 3 uses 2-character learning data that combines all character codes, it is also possible to use as the learning data 2-character strings that combine those character codes having high frequencies of use. In that case, for a 2-character string not included in the similarity table, the search character string itself can be used as a candidate character string for expansion. The first embodiment has been described above. In the search allowing for OCR recognition errors, this embodiment can reduce the possibility of the search character string escaping the search and realize a high-precision search in a practical search time. Next, a second embodiment of the present invention will be described. In the first embodiment, the n-character-based similarity table is checked and those character strings with low probability of contributing to the improvement of the search accuracy are removed from the word expansion. This realizes a fast search even when the search character string is long. With this method, however, when the search character string specified in the document search is short, expanding the character string into candidate words for which the character string is likely to be mistaken and then performing a search using the expanded words can increase undesired results (hereinafter referred to as search noise). For example, in the case of a search character string of " ##CHR27## ". Hence, the search noise increases, degrading the search accuracy. In addition to the processing done by the first embodiment, the second embodiment has a step of checking whether or not the word expansion is done in the length of the entered search character string and switching between different expansion methods according to the check result. This arrangement offers the effect of reducing the search noise when the search character string is short. FIG. 10 shows a configuration of the second embodiment. The second embodiment has basically the same configuration as the first embodiment, except that a expansion method switching program 300 is added to the expansion control program 206. The document registration method is similar to that of the first embodiment and is not described here. The search method will be explained by referring to FIG. 11. When a search condition equation is entered from the keyboard 101 to perform a search, the system control program 201 starts the expansion control program 206 (step 3000). The expansion control program 206 first starts the expansion method switching program 300 to take in the length of the entered search character string (step 3001). Then, the length of the entered search character string is checked (step 3002). If the length does not exceed a predetermined length, the word expansion is not performed and the program proceeds to (step 3005) while maintaining the entered search condition equation. If the predetermined length is exceeded, the program moves to (step 3003). The expansion control program 206 starts the expanded word generation program 207 to generate a plurality of expanded words from the entered search character string and output them to the work area 216 (step 3003). Next, the expansion control program 206 starts the search condition equation generation program 211 to expand the expanded words into a search condition equation, which represents a logical sum (OR) set of the expanded words read into the work area 216, and output the search condition equation to the system control program 201 (step 3004). Next, the system control program 201 starts the search control program 213 which takes in the original search condition equation or the output search condition equation. Then this control program successively starts the search condition equation analyzing program 212 and the text search program 214 to perform a text search according to the search condition equation (step 3005). As a final step, the result of search is output to the system control program 201 (step 3006). The search processing in the document search system of this embodiment has been described above. The search method described above will be detailed by taking " ##CHR28## " as an example search character string. In this example, a predetermined length value for the search character string expansion decision is set to 1. When the search character string " ##CHR29## ". In this way, when the search character string is short, the search character string is not subjected to the expansion processing as it has been in the conventional technique. This can prevent the search result from including those documents which contain character strings of different meanings, thus reducing the search noise. Further, not only can this embodiment preset the predetermined length value for the expansion decision, it can also adjust the predetermined length value freely during the search. It is also possible to change the predetermined length value for the expansion decision according to the kind of characters, for example, to 1 character when the search character string is an ideogram or character such as kanji or Chinese character and to 2 characters when it is a phonogram or letter such as alphabet. The second embodiment has been described above. In the search allowing for OCR recognition errors, this embodiment can perform a highly accurate search without increasing the search noise when the search character string is short in length. Next, a third embodiment of this invention will be described. In addition to the processing performed by the first embodiment, the third embodiment generates entry characters in the similarity table by extracting a part of combinations of all character codes. This produces an effect of reducing a file capacity of the similarity table. That is, in the first embodiment, the entry characters in the similarity table are generated from the learning data consisting of all character code combinations. In that case, assuming that there are a total of about 8,000 Japanese character codes and that each entry character has 10 candidate characters, the 2-character string similarity table has the following capacity: Third embodiment on the other hand stores in the 2-character string similarity table only those character strings with high probability of being used as the search character string, thereby realizing a reduction in the capacity of the similarity table. This third embodiment is basically similar to the first embodiment, except that, while the first embodiment generates the n-character (n≧2) candidate strings by combining all character codes, this embodiment generates candidate character strings from only the main character combinations likely to be used as the search character string. Hence, because there are entry characters that are not contained in the similarity table, exception processing is added to the similarity table referencing program 209. It is noted here that the major character combinations used in this embodiment are contemplated to be the combinations of first level characters. Now, a expansion processing procedure using the similarity table of this embodiment, i.e., a new procedure added to the expanded word generation program 207, will be described by referring to FIG. 12. The expanded word generation program 207 initiates the search character string dividing program 208 to divide the entered search character string into partial character strings each consisting of a predetermined number n of characters (n≧2) (step 3000). Next, the similarity table referencing program 209 is executed to scan the similarity table 109 to see if the partial character strings are among the entry characters in the table (step 3001). A check is made as to whether the corresponding entry characters exists in the table (step 3002). If so, the candidate characters are stored in the work area 216 (step 3003). If not, the partial character strings themselves are stored in the work area 216 (step 3004). As a final step, the search character string expanding program 210 is executed to read the candidate characters for each partial character string or the partial character strings from the work area 216 and combine them to generate a plurality of expanded words (step 3005). This is the procedure of the expansion processing in the document retrieval system. Next, the file capacity of the similarity table containing major character combinations according to this embodiment will be explained. It is assumed that there are a total of about 3,000 first level characters and each entry character has 10 candidate characters. Then, the file capacity of the similarity table is: Further, this embodiment may extract not only the combinations of first level characters but combinations of characters found in corpuses including newspaper and various literatures and further narrow down the combinations of junctural characters as languages. The third embodiment has been described above. Because, in the search allowing for OCR recognition errors, this embodiment narrows the entry characters in the similarity table down to the combinations of major characters likely to be used as the search character string, the file capacity of the similarity table can be reduced significantly. When looking up the search character string in the n-character-based similarity table, the method of the third embodiment has been described not to incorporate those character strings not listed in the n-character-based similarity table into a group of candidate character strings for use in generating the expanded words. It is also possible to prepare in advance an m-character-based similarity table (m<n) for parallel use with the n-character-based similarity table consisting of major character strings so that, for a character string not listed in the n-character-based similarity table, a reference can be made to the m-character-based similarity table to generate expanded words. Next, a fourth embodiment of the present invention will be described. In the first to third embodiment, the word expansion processing and the search processing are independent of each other. In the fourth embodiment, the search processing is expanded to include the word expansion processing. FIG. 13 shows a configuration of this embodiment. Unlike the preceding embodiments, the fourth embodiment is characterized in that, when performing a search, the search control program 212 also controls the word expansion processing. Further, since the expanded word generation from the search character string is performed in the search processing, the search condition equation generation program 211 for generating a new search condition equation is not required. As described above, in a search performed on text data containing character recognition errors, which are produced by the OCR during the character-recognition operation on an image document, the number of expanded words is reduced by removing those n-character candidate words with low emergence probabilities from the similarity table and generating the expanded words using this table. This arrangement can realize a search in a practical search time while maintaining a high search precision. It will be further understood by those skilled in the art that the foregoing description has been made on embodiments of the invention and that various changes and modifications may be made in the invention without departing from the spirit and scope the appended claims.
|
Same subclass Same class Consider this |
||||||||||||||||||||||||||||||||||||||||
