Character-string retrieval system and method5655129Abstract In a TRIE dictionary, it is to retrieve a character string including a wild card and a normal expression at high speed. By extract a substring of M characters from the start of a word character string of length L, making a backward TRIE having the end of the substring as a route node with the aid of the substring, deciding a prefix portion from a substring in which the number of input characters has been decided to be small, and retrieving an original TRIE, both the necessary cost of space (dictionary size) and the retrieval cost (retrieval time) are balanced. Claims I claim: Description FIELD OF THE INVENTION
TABLE 1
______________________________________
Number of branches in each hierarchy of the TRIE dictionary:
Hierarchy number
Average number of branches
______________________________________
1 44.0
2 31.5
3 7.9
4 1.6
______________________________________
Therefore, in the TRIE dictionary, it must be ensured that neither the number of branches n[i] of an input nor the number of branches E(N[i]) in the dictionary increase. In consideration of the above-described point, the present invention, by extracting a substring of M characters from the start of a word character string of length L, making a backward TRIE having the end of the substring as a route node with the aid of the substring, deciding a prefix portion from a substring in which the number of input characters has been decided to be small, and then retrieving an original TRIE, is intended to balance both the necessary cost of space (dictionary size) and the retrieval cost (retrieval time). We now disclose a computer-based method for retrieving TRIE dictionaries, comprising the steps of: (a) constructing a forward TRIE dictionary from a plurality of character strings and storing the resulting forward TRIE dictionary in a computer memory or media; (b) constructing a fixed-length backward TRIE dictionary for each of the left substrings of each of said plurality of character strings which are the constituents of said forward TRIE dictionary, said fixed-length backward TRIE dictionary beginning with the last character of said left substring and ending with the first character of said left substring; (c) inputting a candidate character string lattice; (d) when said candidate character string lattice comprises M columns where M represents an integer number, calculating the work quantity for backward TRIE dictionary retrieval from a column k where k represents an integer number, for each of k=1, M, and thereby determining said column k in which said cost is minimum; (e) in response to the determination of said column k in said step (d), retrieving said backward TRIE dictionary made in said step (b) and having a length of k, with a left substring of said candidate character string lattice up to said column k, and storing the retrieval results; and (f) retrieving said forward TRIE dictionary based on said retrieval results stored in said step (e) and displaying or saving the retrieval results. BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1(A) and 1(B) are illustrations used to explain forward and backward TRIE dictionaries, respectively; FIG. 2 is an illustration used to explain processes by which a TRIE dictionary is made; FIG. 3 is a block diagram showing the functional construction of a candidate character string lattice dictionary retrieval system of the present invention; and FIG. 4 is a flowchart diagram showing an algorithm of the TRIE dictionary retrieval with respect to the candidate character string lattice. DETAILED DESCRIPTION OF THE INVENTION An embodiment of the present invention will be described with reference to FIGS. 1 to 4. FIG. 3 shows the overall function configuration and the data flow of a candidate character string lattice dictionary retrieval apparatus of the present invention. A candidate character string lattice, which is an input, is first input from an input device 307 such as a keyboard (not shown) and a magnetic disk device which are connected, for example, to a personal computer PS/55 (trademark of IBM). In accordance with the control of the input means 301, which is a control program residing in the main storage of a computer, the candidate character string lattice input by the input device 307 is stored as a candidate character string lattice 308 in a predetermined region in the main storage of the computer by the candidate character string lattice storage means 302, which is a control routine in the main storage of the computer. Thereafter, the candidate character string lattice storage means 302 provides the candidate character string lattice 308 for reference, or transfers it, when necessary. As shown in FIG. 3, through an input device 307 connected to a permanent storage means such as a magnetic disk (not shown), the input means 301 also stores the particular content of the files stored on the magnetic disk in a candidate character string lattice storage 308, when necessary. The retrieval work quantity estimating means 303 which is also a control routine in the main storage of the computer calculates a retrieval starting position in which the work quantity of dictionary retrieval is expected to become small from data on the average number of branches 309 which is obtained when the candidate character string lattice and a TRIE dictionary are made and which is held in a permanent storage means such as a magnetic disk (not shown). This retrieval work quantity estimating means 303 and the concept of the average number of branches are the gist of the present invention, and they will be described later in greater detail. If a TRIE dictionary selecting means 304 which is a control routine in the main storage of the computer selects a TRIE dictionary corresponding to the calculated starting position, the TRIE dictionaries 310 and 311 which are control routines in the main storage of the computer will be retrieved by the TRIE dictionary retrieving means 305 which is a control routine in the main storage of the computer. The retrieval result 312 is written to a predetermined region of the main storage or a permanent storage means such as a magnetic disk (not shown). The retrieval result outputting means 306 which is also a control routine in the main storage of the computer transmits the retrieval result 312 to an output device 313 such as a CRT device connected to the computer. The output device 313 displays the retrieval result 312 to users. Assume now that "?tsushita" is input. According to above-described Table 1, for this input, 44 branches exist at the first character, and it is necessary to perform the retrieval of the second character, i.e., to trace a child link and perform a character comparison for all of the branches. However, if "tashitsu?" is retrieved in backward, it can be expected that the number of branches is greatly reduced. This is a most extreme example, but it is obvious that, if the retrieving order is selected so that a product of the number of characters on the input side and an expected value of the number of branches on the dictionary side (E(N[i]).times.n[i]) becomes small, a TRIE dictionary can be retrieved at high speed. If TRIE dictionaries corresponding to all of the arbitrary orders are made, since the dictionary size becomes too large, backward TRIE is to be made from the end of a substring comprising the first M characters of a word, as shown in FIG. 1(B). [Making of a Forward TRIE Dictionary] The making of the TRIE structure will be described with reference to FIG. 2. It is now assumed that the TRIE structure shown in FIG. 1(A) has already been made and a word "masumoto" which does not exist in a dictionary is additionally registered. First, the "ma" which is the first character of the word to be added is retrieved by sequentially tracing the sibling links of the first character aggregation 21 connected to a child link 20 of a route node. Since "ma" is found at a node 22, a child link 23 of that node is traced and the next character "su" is retrieved. That is, the process of adding a word is identical to the process of retrieving a dictionary, as long as there exists a node that coincides with the first substring of a word to be added. In this case, the second character "su" exists at a node 24, a child link 25 of that node is traced, and the third character "mo" is compared with the next character aggregation. However, in the character aggregation connected to the child link 25 of "su," only "da" exists and therefore the retrieval fails. A dictionary retrieval is to be ended at the time it is found that a corresponding word does not exist but, in the case of a word addition, a sibling link 26 is added to the character aggregation in which the retrieval failed, and there is made a node 27 corresponding to the character "mo" which did not exist. Then, a child link 28 extends from the node 27, and a node 29 corresponding to the following character "to" is to be made. If the word further includes characters, the characters will be taken out one by one and the extension of a child link and the making of a node will be repeated until the word is ended. A dictionary of the same structure is obtained by repeating such additional registration for each word of a given word aggregation. That is, in an initial stage, a word is registered in a dictionary in which only a route node exists, and the operation of adding the next word to the obtained TRIE dictionary is repeated for the number of words. The TRIE structure can cope with all of the registered words by increasing the number of nodes each time a word is registered. The forward TRIE dictionary 310 which was made in the above-described manner and in which a word is added when necessary, is permanently and rewritably stored in a readable/writable storage medium connected to a computer, such as a magnetic disk storage device (not shown) or an optical magnetic disk device (DASD) (not shown). As shown in FIG. 3, at the time of actual retrieval, the forward TRIE dictionary 310 is accessed by the TRIE dictionary selecting means 304. [Making of Backward TRIE Dictionary] The making of the backward TRIE dictionary is identical in principle to the making of the forward TRIE dictionary but, in the making of the backward TRIE dictionary it is first necessary to obtain reverse forms of character strings for all of the left substrings for each word. For the word "Matsushita," the left substrings and the reverse forms of character strings are obtained as shown in Table 2.
TABLE 2
______________________________________
Reverse form of
Length Left substring
character string
______________________________________
2 Matsu Tsuma
3 Matsushi Shitsuma
4 Matsushita Tashimatsu
______________________________________
The reverse form of a character string such as this is made for all of the words, and a backward TRIE dictionary is made from the strings of the same length. In FIG. 1(B), there is shown only a 3-character backward TRIE dictionary corresponding to the subtree (word aggregation) shown in FIG. 1(A). However, in fact, the dictionary is constructed from the reverse form of character strings of the 3-character length taken out of all of name words such as a 3-character reverse form of the substring "shiketa" taken out of "takeshita." Thus, in addition to an original TRIE dictionary, the above-described backward TRIE dictionary is made for each left substring of 1 character to M characters. The average number of branches of each TRIE dictionary is recorded as data on the average number of branches 309 on a permanent recording medium such as an magnetic disk and an optical magnetic disk. More particularly, for the forward (E(N[i]), the number of columns counted from the first character is recorded, and, for the backward (E(Nb[j]), the number of columns counted from the first character of a reverse form of character string, i.e, the end of a left substring, is recorded. At the time of retrieval execution, the total of work quantities is assumed to be based on a value of the data 309. Exactly speaking, E(Nb[j]) can be considered to be different depending on the based first substring, but an average value of these can be used for simplicity. A further description will be made as to the average number of branches. If it is assumed that, for example, 100 branches extend from a route node, the average number of branches in hierarchy 1 will be 100/1 =100. In order to obtain the average number of branches in hierarchy 2, it is calculated how many branches extend from each of the 100 branches. All of the calculated branches are added up, and the sum is divided by 100. In this manner, calculating the average number of branches is possible up to the arbitrary hierarchy that the TRIE structure allows. It is to be noted that the number of branches is a value determined by only the TRIE structure constructed. The backward TRIE dictionary 311 which was made in the above-described manner and in which a word is added when necessary, is permanently and rewritably stored on a readable/writable storage medium connected to a computer, such as a magnetic disk storage device (DASD) (not shown) or an optical magnetic disk device (not shown). As shown in FIG. 3, at the time of actual retrieval, the backward TRIE dictionary 311 is selectively accessed by the TRIE dictionary selecting means 304. It is to be noted that a plurality of the backward TRIE dictionaries 311 exists, unlike the forward TRIE dictionary 310, and each of the dictionaries is made individually for each substring of the Mth character of a word to be retrieved (M=1, 2, . . . ) and is added. [Retrieval] A retrieval of the dictionaries made in the above-described manner will hereinafter be described in detail. In the first stage of the retrieval, the portion in which the number of branches F[i] is small is retrieved. The case that a word length is completely indeterminate will be described later but, if it is assumed that the word length is more than M characters, (a) An estimated value F[i] of the number of the branches is obtained for each of i=1, . . . , M by Equation (2), and a work quantity w[1] as the forward TRIE is retrieved by the normal method, is estimated by the following Equation. ##EQU3## (b) For k=2, . . . , M, the backward TRIE dictionary is retrieved from column k to narrow the extent of the prefix portion and, after the prefix portion has been replaced with the backward TRIE dictionary, the work quantity (w[k]) as the forward TRIE is retrieved is estimated by the following Equation: ##EQU4## where Fb[i] is calculated with Equation (2) in which N[i] was replaced with Nb[i]. Also, F[k-l] as F[k] is calculated and is replaced with Fb[l]. Further, the number of branches concerning the replaced prefix portion is smaller than at least Fb[l], since it is obtained when the backward TRIE dictionary is traced. Hence, that number of branches is estimated as an upper limit by Fb[i]. Further, it is to be noted that, in Equation (2) which Equation (3) quotes, the data 309 (FIG. 3) on the number of branches which are calculated at the time the forward TRIE structure and the backward TRIE structure have been constructed and which are stored on the disk, are used as E(N[i]) and E(Nb[i]). More particularly, E(N[i]) and E(Nb[i]) are the average number of branches in the ith hierarchy where i represents an integer number. (c) Among w[i] (i=1, . . . ,M), the smallest one is obtained, and i at that time is made im. (d) A backward TRIE is retrieved from the column im, and thereafter, if necessary, a forward TRIE is searched and a dictionary retrieval is performed. That is, it is the gist of the present invention to evaluate the work quantity as the TRIE is retrieved from the column ith, with the sum of branches at each point in time, and to start searching from a column in which the work quantity becomes minimum. For example, in the case of a character string lattice being [maarakaku]tsu[shimi][a-ko], the number of candidates is 5, 1, 2, and 10. If w[i] where i=1, . . . ,4 is calculated according to Equation (3), w[i] becomes 8.5, 7.8, 5.0, and 22.2 in sequence, and w[i] becomes the minimum value when i=3. Therefore, a backward TRIE is retrieved from the third character, and a forward (i.e., normal) TRIE retrieval is to be performed for each substring decided as a result of the backward TRIE retrieval. In the case of ma?[shimi][tara], the second character is a wild card "?," and the number of candidates becomes 50 if it is assumed that it is a katakana name including only a clear sound. Therefore, if the same calculation is performed, w[i] will be 38.8, 110.0, 76.8, and 22.8, so that the maximum efficiency can be expected if the backward TRIE is retrieved from the tail character. If the backward TRIE is retrieved, a sublattice comprising only candidates in which there is the possibility of a success in retrieval is obtained from among sublattices comprising the mi characters extracted from the left portion of an input. In other words, only if a prefix portion is inspected, a sublattice that no longer exists in a dictionary is to be excluded. Then, a forward TRIE is retrieved again for the remaining sublattices, and the result is obtained. It is a matter of course that, in the case of im=1, since the forward TRIE is retrieved from the beginning, retrieving the forward TRIE dictionary repeatedly is not needed. The processes described above are shown in FIG. 4. More particularly, the character retrieval processing according to the present invention is started in step 401 of FIG. 4. In step 402, a retrieval character string that may include a wild card and a normal expression are input by the user operating a keyboard and the like. In step 403, 1 is stored for a control variable im and an arbitrary value is stored for another variable minw. The arbitrary value (shown as "Large" in FIG. 4) is much greater than a value that w[k] where k is an integer greater than 1 can normally have. Step 410 bears the most important process of the present invention, and in step 410, steps 411, 412, and 413 are executed as to the individual columns of the input retrieval character string (for example, in the case of a retrieval character string such as [maarakaku]tsu[shimi][a-ko], the [maarakaku], tsu,[shimi], and [a-ko] are an individual column). More particularly, as to k=1, in step 411 the work quantity w[l] is calculated with the above-described Equation (3), based on the average number of branches of the forward TRIE dictionary in the first column and on the number of candidates on the candidate character string lattice in the first column. Next, in step 412, the w[l] thus calculated is compared with the variable minw defined in step 403. Since the variable minw is selected to be a greater value by definition, the determination in step 412 is always affirmative with respect to k=1. Consequently, in step 413, the work quantity w[l] is stored for the variable minw and 1 is stored for the control variable im. Next, the same processing is also performed for k=2. Since the work quantity w[l] has already been stored for the variable minw, step 413 is executed only when w[2]<w[l]. When the processing is completed in the above-described manner for k=1 . . . M, a value of k where w[k] is the minimum is to be stored for the control variable im. Then, in step 420, the forward TRIE dictionary 310 or backward TRIE dictionary 311 corresponding to the control variable im calculated in step 410 is selected by the TRIE dictionary selecting means 304 (FIG. 3). In step 421, the backward TRIE dictionary selected according to the value of the control variable im is retrieved, so that a character string (aggregation) of length im indicating that the retrieval succeeded is stored in a S. In step 422, it is determined whether the control variable im is 1. If im is 1, since it is meant that the forward TRIE dictionary 310 has been selected and the normal TRIE dictionary retrieval has been performed, in step 424 the result of the retrieval is immediately displayed and in step 425 the processing is completed. If the determination in step 422 is negative, i.e., the control variable im is more than 2, since the backward TRIE dictionary 311 corresponding to the length of im has been selected, the character string is partially retrieved only. Therefore, in step 423, the forward TRIE dictionary 310 is retrieved from the first column for each character string of the length im listed as S. If the forward TRIE dictionary 310 is thus retrieved and the retrieval is completed in step 423, in step 424 the result of the retrieval will be displayed and in step 425 the processing will be completed. It is to be noted that, since a word length has been assumed to be greater than M, if a retrieval is started from the column im, it is obvious that a word shorter in word length than im cannot be retrieved. It is also to be noted that, in the case of a word being shorter than M, it is necessary to gradually reduce M to a smaller value and to repeat the above-described operation or retrieve using the a normal method. [Advantage of the invention] The retrieval time will be evaluated first. Because the retrieval is performed from the first column with the normal method, a work quantity T.sub.org is obtained from Equations (1) and (2). ##EQU5## where f[0]=1 and L represents an average word length. If a value of this equation with respect to an input of "?XXX" where X represents a character that is not a wild card, on the one hand, is calculated from the number of branches in Table 1 and the number of character categories (in this example, the number of clear sound kana is 50), the result will be obtained as shown in Table 3.
TABLE 3
______________________________________
Work quantity with respect to input "?XXX" (e.g., "?tsushita")
Branch Work quantity
Branch
f[i - 1] T[i] f[i]
______________________________________
i = 1 1 1 .times. 44
44
i = 2 44 44 .times. 31.5
0.63 .times. 44
i = 3 27.72 27.72 .times. 7.9
27.72 .times. 0.158
i = 4 4.38 4.38 .times. 1.6
______________________________________
Total: About 1577
On the other hand, because the retrieval is started from column k, a work quantity T.sub.improved is the sum of the retrieval time of a word having a length of more than L (work quantity of the backward TRIE and, if necessary, the forward TRIE) and the work quantity as words shorter than L are retrieved in sequence from the start or end. If E(N[i])=E(Nb[i]) for simplicity, the work quantity T.sub.improved will be obtained as follows: ##EQU6## where F'[i] represents the number of branches after they have been narrowed with the backward TRIE, and min(x,y) represents a smaller value of x and y. If, as for the input "?XXX," is calculated from Table 2 and under the same condition, it will be obvious that the retrieval from the end is the most effective, and 269.4 will be obtained. In this case, 1577/269.4=5.85 and it is therefore understood that the retrieval according to the present invention is six times faster than the above-described conventional method. Since the method according to the present invention becomes more effective when a value of E(N[i]) is more rapidly reduced with respect to an increase in i, the present invention can be expected to be more effective in a kanji dictionary and the like. Next, the cost of space required by the method of the present invention will be evaluated. In the method of the present invention, in addition to the forward TRIE which is an original dictionary, it is necessary to have the backward TRIE for each of n=2, . . . , M. The cost of space can be estimated with the aid of the average number of branches E(N[i]). ##EQU7## Therefore, a capacity necessary for the backward TRIE dictionary will become as follows: ##EQU8## When M=4, this value is about 29900 and 2.7 times (=(29900+17520)/17520), as compared to an original TRIE having a value of 17520. Accordingly, the cost of space is acceptable enough. While in the embodiment the backward TRIE dictionary has been constructed by a binary tree, which is the most standard data structure, in other words, a child link and a sibling link, it may also be constructed by other data structures having equivalent information such as a double array. The double array is described in "High-Speed Digital Retrieval Algorithm by Double Array," Electronic Communication Society Thesis Journal, J71-D, 9, pp. 1592-1600 (1987). In addition, the constant term of the above-described Equation (3) is arbitrary and, for example, the term of Fb[l] in Equation (3) may be omitted.
|
Same subclass Same class Consider this |
||||||||||
