Document retrieval using fuzzy-logic inference5706497Abstract The results of a full-text, document search by a character string search processor are treated as vector patterns whose elements become a term match grade by use of a membership function of the term match frequency. The closest pattern to the query pattern is found by the similarity between the query pattern and each of the filed sample patterns. The similarity is calculated by use of fuzzy-logic. The similarity is ranked in order of similarity magnitude, thereby reducing the search time. The search time can be shortened by categorizing the filed patterns by term set and similarity to a cluster center pattern. If the cluster center patterns are stored, the closest cluster address can be inferred by fuzzy logic inference from the match between the query document and the term set or the similarity of the query to the cluster center. Claims What is claimed is: Description FIELD OF THE INVENTION
TABLE 1
______________________________________
T1 T2 T3 T4 T5 T6 T7 T8
______________________________________
Dq .fwdarw.
Gq = 0.0 0.5 0.7 0.9 0.0 0.6 0.0 0.9 .fwdarw.
QPM
Dj .fwdarw.
G1 = 0.9 0.8 0.7 0.5 0.0 0.0 0.6 0.7 .fwdarw.
FPM
G2 = 0.7 0.5 0.9 0.9 0.0 0.5 0.0 0.6 .fwdarw.
FPM
G3 = 0.0 0.7 0.0 0.5 0.0 0.8 0.9 0.0 .fwdarw.
FPM
G4 = 0.0 0.7 0.6 0.5 0.9 0.8 0.0 0.5 .fwdarw.
FPM
G5 = 0.5 0.7 0.0 0.6 0.8 0.0 0.9 0.7 .fwdarw.
FPM
G6 = 0.9 0.6 0.7 0.0 0.9 0.8 0.0 0.0 .fwdarw.
FPM
G7 = 0.8 0.0 0.8 0.5 0.7 0.0 0.6 0.7 .fwdarw.
FPM
G8 = 0.0 0.5 0.8 0.0 0.7 0.8 0.9 0.9 .fwdarw.
FPM
______________________________________
The query pattern Gq is stored in QPM 70 and the filed patterns Gj, j=1,2, . . . are stored in FPM 20 in FIG. 5. FPM requires a large memory capacity while QPM requires a smaller memory capacity. Each memory has associated address switches 74, 76 and 78 to provide either a term address (TiA) from TiA-CNT 80, a document address (DjA) from DjA-CNT 82 or a specified address from the host computer 12 through the I/O bus 84. Thus, QPM 70 stores the term match grade Giq at each term at each term address TiA. FPM 20 stores term match grade Gij at the ith term address TiA and the jth filed pattern address DjA. Each memory also has a data switch 72 to select either the QPM 70 or FPM 20 to store the output of MFM 24 through the R/W data ports of the QPM 70 and FPM 20. Particularly, in the write (W) mode of memory operation, the outputs of TiA-CNT 80 and DjA-CNT 82 are used as memory write addresses. FIG. 6 is a schematic block diagram of a pattern similarity calculator PSC 26 using a fuzzy coincidence logic circuit FCL 90 to calculate the similarity between the term match grade patterns Gq and Gj, j=1,2, . . . m stored in QPM 70 and FPM 20. In the pattern similarity calculator PSC 26, the pattern similarity between Gq and Gj is calculated as follows: Sjq=1-.SIGMA..sub.i (Giq.crclbar.Gij)(Gij.crclbar.Giq)}/n (13) where Giq.crclbar.Gij means a bounded difference of Giq and Gij, that is, the bounded difference becomes Giq-Gij when Giq is larger than Gij, otherwise it becomes zero. The (Giq.crclbar.Gij)(Gij.crclbar.Giq) means an absolute value of the difference between the term match grades Giq and Gij. Thus, fuzzy coincidence logic FCL 90 can be realized with a subtractor 901 to form (Gij-Giq), and subtractor 902 to form (Giq-Gij), a difference polarity detector 903, data switches 904, 905 and adder 906 to form the bounded difference (Giq.crclbar.Gij)(Gij.crclbar.Giq) and subtractor 907 to calculate Eq. (13). The pattern similarity accumulator 92 is used to form the average of the outputs of the FCL 90. The accumulator (PSA) 92 includes a subtractor 921, divider 922, adder 923, counter 924, and register 925. Assume that the average of i-1 coincidence grades Si, i=1,2, . . . ,i-1 coming from FCL 90 is expressed by ASi-1=.SIGMA.i Si-1/i-1 and a new input coincidence grade is expressed by Si. The average of Si is calculated recursively as follows: ##EQU2## where Si=1-(Giq.crclbar.Gij)(Gij.crclbar.Giq)}. The subtractor 921 calculates the similarity of the input Si and the average ASi-1. The similarity is sent to the divider 922 to be divided by i in the binary counter 924. Then, the content of divider 922 is added to the average ASi-1 at the adder 923. The result is stored in the register 925, which is equal to Eq. (14). The counter 924 is incremented whenever Si is input together with a clock signal or Mi=1. In general, recursive calculation is convenient because the calculation process is not complicated. Finally, when all term addresses have been scanned, the contents of register 925 is transferred to the register 94. The set signal is provided from the TiA-CNT 80 equal to count n. Referring to FIG. 6, in order to permit a search of the filed pattern by only search terms contained in the query, the output of FGMCLU can be masked by the polarity Mi of term match grade Giq in the query pattern: Mi=SGN(Giq) (15) where a signum function means that if A>0 then SGN(A)=1 and otherwise SGN(A)=0. Thus, only when Mi is 1 is the output of FCL 90 added to the register content in the register 925 of the accumulator (PSA) 92. This makes it possible to calculate the similarity with respect to only the search terms contained in the query document. Of course, if the masking by Mi is not required, Mi can always be kept to 1 by the CLK switch 96 being connected to the CLK generator 98. Tables 2 and 3 show the term coincidence grade based oil the conventional fuzzy coincidence logic (GiqGij)(Giq'Gij') and the new fuzzy coincidence logical 1-(Giq.crclbar.Gij)(Gij.crclbar.Giq). These tables show that the latter conforms well with an intuitive sense, when compared with the former.
TABLE 2
______________________________________
Giq.backslash.Gij
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
______________________________________
0.0 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0
0.1 0.9 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1
0.2 0.8 0.8 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.2 0.2
0.3 0.7 0.7 0.7 0.7 0.6 0.5 0.4 0.3 0.3 0.3 0.3
0.4 0.6 0.6 0.6 0.6 0.6 0.5 0.4 0.4 0.4 0.4 0.4
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.6 0.4 0.4 0.4 0.4 0.4 0.5 0.6 0.6 0.6 0.6 0.6
0.7 0.3 0.3 0.3 0.3 0.4 0.5 0.6 0.7 0.6 0.7 0.7
0.8 0.2 0.2 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.8 0.8
0.9 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
1.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
______________________________________
FIG. 7 is a schematic diagram of a ranker FIFO circuit for a closest pattern search. For example, if coincidence logic is applied to the example of Gq and Gj, j=1,2, . . . ,8 shown in Table 1, the following Sjq, j=1,2, . . . ,8 can be calculated by using the equation (13). S1q=(0.1+0.7+1.0+0.6+1.0+0.4+0.4+0.8)/8=4.8/8=0.60 S2q=(0.3+1.0+0.8+1.0+1.0+0.9+1.0+0.7)/8=6.7/8=0.84 S3q=(1.0+0.8+0.3+0.6+1.0+0.8+0.1+0.1)/8=4.7/8=0.59 S4q=(1.0+0.8+0.9+0.6+0.1+0.8+1.0+0.6)/8=5.8/8=0.72 S5q=(0.5+0.8+0.3+0.7+0.2+0.4+0.1+0.8)/8=3.8/8=0.47 S6q=(0.1+0.9+1.0+0.1+0.1+0.8+1.0+0.1)/8=4.1/8=0.51 S7q=(0.2+0.5+0.9+0.6+0.3+0.4+0.4+0.8)/8=4.1/8=0.51 S8q=(1.0+1.0+0.9+0.1+0.3+0.8+0.1+1.0)/8=5.2/8=0.65 S2q is the highest value 0.84. Then, it is possible to check the "closeness" of Dq and D2 by the contents. Certainly, the Gq and G2 are very similar as shown below. Gq=0.0 0.5 0.7 0.9 0.0 0.6 0.0 0.9 G2=0.7 0.5 0.9 0.9 0.0 0.5 0.0 0.6.fwdarw.Sq2=0.84 It is possible to conclude that the described method of determining similarity is reasonable, because Gq is similar to G2 when Sq2 is large. This is a good example of the use of fuzzy logic for determining the similarity of documents. In order to find the maximum pattern similarity between the query Dq and filed documents Dj, the similarity Sjq, j=1,2, . . . ,m is input to the ranker FIFO shown in FIG. 7. That is, after the similarities between Dq and Dj, j=1,2, . . . ,m are sequentially input to the ranker FIFO 28, the similarities and their addresses are stored in sequence of magnitude from the greatest similarity to the least similarity in the ranker FIFO 28. The ranker FIFO comprises an array of registers which are called address buffer (Add-Buf)/similarity buffers (Sim-Buf), comparators (Comp), similarity register (Sim-Reg), address register (Add-Reg) and data setting CLK generator (SetCLK) and down/up shifting CLK generator (DsfCLK). In FIG. 7, when the input similarity is greater than the similarities stored in all Sim-buffers 281, the input similarity is written in the top Sim-buffer after the Add-/Sim-buffer contents is down shifted. Then, the content of Add-Reg 282 is latched in the Add-Buff 283. When the input similarity is smaller than either of the buffer contents, the input is not stored in the buffers 281. When the input similarity is between the (k-1)th and the kth Sim-buffer's contents, the similarity and the contents of Add-Reg 282 are written in the kth Add-buffers 283 and the kth Sim-buffers 281, respectively, after the kth to the nth Add-/Sim-buffers' contents are down-shifted. The clock generator (DsfCLK) 284 produces the timing signals to down-shift the contents of Add-Bff 283 and Sim-Bff 281 through AND gate 285 when the comparators 286 generate a signal "1" to show that the input similarity is greater than the content of Sim-Bff 281. Alternatively, the clock generator (SetCLK) 287 activates the conditional AND gates 288 to show the comparator position where the input similarity is between the (k-1)th and kth Sim-buffer's contents. Finally, by driving the DsfCLK in the reverse direction, the contents stored in the buffers are shifted up to be output after N similarities are ranked in the array of the Add/Sim-buffers. The corresponding document file addresses are output in the same order as the similarities. The highest ranked similarity and the corresponding address can be checked from the top position Sim-Bff 620 and Add-Bff 610 without the upshift clock after N similarities are ranked in the array of the Add/Sim-buffers. In addition, during the ranking process, the ranker FIFO can output the HD (highest detection) signal. It is useful to know which input similarity has the highest value. The ranked similarity output process is performed as described above. That is, the addresses of filed patterns similar to the query are ranked in the same order as the similarity, as shown in Table 4.
TABLE 4
______________________________________
Output similarities
______________________________________
Sq1 = 0.60
Sq2 = 0.84 Sq3 = 0.59 Sq4 = 0.72
Sq5 = 0.47
Sq6 = 0.51 Sq7 = 0.51 Sq8 = 0.65
______________________________________
Ranked similarities
______________________________________
Sq2 = 0.84 .fwdarw.
Sq4 = 0.72 .fwdarw.
Sq8 = 0.65 .fwdarw.
Sq1 = 0.60 .fwdarw.
Sq3 = 0.59 .fwdarw.
Sq7 = 0.51 .fwdarw.
Sq6 = 0.51 .fwdarw.
Sq5 = 0.47
______________________________________
(the ranked addresses forms a sequence of 2 4 8 1 3 7 6 5)
The contents stored in the ranker FIFO can be output by using the up-shift clock. Although the address j corresponds to the greatest similarity and Sjq becomes the closest document address for the query document Dq, if the ranker FIFO contains 16 buffer registers, the 16 closest document addresses are output in order of decreasing similarity magnitude. These addresses are output from the ranker FIFO to the host computer 12. Using the FPM and the ranker FIFO, the closest document address search time can be reduced to the search time necessary to compare the query Gq with all the filed patterns Gj, j=1,2, . . . ,m stored in the FPM. When the comparison time is 1 .mu.sec, the time is n.multidot.m .mu.sec, where m and n are the numbers of filed patterns and search terms. When m and n are 10.sup.6 and 10.sup.3, it becomes 10.sup.3 sec. Furthermore, the preparation time of FPM will range to m.multidot.s .mu.sec, where s is the size of each document. When s is 10.sup.5 bytes, m.multidot.s .mu.sec becomes 10.sup.5 sec (28 hours). Although the time is very long, this preparation is needed only when the filed documents or the search term sets are updated. In order to reduce the time for performing comparisons between each query pattern and every filed patterns, filed patterns in the FPM are categorized. This categorization is done when the documents are filed. Thus, the following two layer categorization rules are assumed. First, the document is categorized in the ith document area, if it matches a set of search terms for the lth area documents and does not match a set of search terms for the non-lth area filed documents. Second, several cluster center documents are chosen in each document area. Then, the similarity between the other filed documents and each cluster center candidate document is calculated. If the similarity to the kth cluster center document is large, the filed document is categorized in the kth cluster, otherwise it is not categorized into the kth cluster. When filed documents are stored as filed patterns in the FPM based on these categorization rules, the query document can be classified in the cluster to store the filed documents based on the above categorization rules. The method of accomplishing the filed document categorization is described below. First, the filed documents are roughly categorized based on the filing contents, such as mathematics, physics, biology, art, cognitive-science, history, geography, astronomy, agriculture, industry, ecology, information-technology, device technology and the like. Next, each file is categorized into L areas by the detailed contents. For example, the information-technology can be categorized into sub-categories such as the computation hardware or software, communication network hardware or software, filing system hardware or software and the like. Each area will be decomposed into smaller clusters. Suppose that each area is decomposed in K cluster based on the similarity to cluster center candidate documents or to the others (miscellaneous). It is noted that area and cluster are the temporal names for hierarchical categorization. They can be considered the first layer and second layer categories. For the categorization of filed documents, the cluster center documents should be as typical of the documents as possible. For example, such a document will be parts of handbooks regarding the filed documents or the text books or the term dictionaries. These documents have a term index. The term set can be determined from the term index. The handbook parts call be optimized by knowing the distribution of filed document contents or by searching the filed documents with the SSP to store the sample term sets. These documents can be stored in the DFM 10. The corresponding filed pattern can be stored in the string search processor SSP 16. The SSP has a parallel character string matching function to compare 128 or less strings (terms) with an input text string, and has the selective comparison function to compare either of 16 term sets (128 string/set) with an input text string in one chip. Thus, using the SSP to store L sets of search terms and a K cluster center pattern memory, the query document call be compared with the filed documents categorized into L.times.K clusters in such a macro category as information-technology. This invention call be used for very wide range library retrieval and specified area document retrieval. In either document retrieval case, the filed documents are hierarchically categorized. Then, first, they are categorized by the term sets into L=16 areas. Second, the documents in each area are categorized into K clusters with the same term set. The term set common to K cluster documents in the Lth area is selected from the cluster center patterns in the Lth area. The set is stored in the lth CAM area of the SSP. The procedure to categorize the documents which are stored as filed patterns in FPM is as follows: First, the SSP stores L terms sets and the cluster center documents are converted into the cluster center patterns. They are stored in the cluster center pattern memory CCPM 32 with selected segment address. Second, the documents to be filed are searched by the SSP to store L term sets. When the search results are sent as query patterns to the QPB via MSC and MFM, .SIGMA.Gij/n is checked by the match grade accumulator MGA. This .SIGMA.Gij/n is sent to the ranker FIFO. In the ranker this FIFO, the segment address 1 corresponding to the largest .SIGMA.Gij/n is indicated by HD signal. When the ranker FIFO outputs the HD signal, the query pattern is transferred from QPB to the QPM 70 and the FPM 20. Third, the query pattern buffered in the QPM is compared with k cluster center patterns in the CCPM 32 with the selected lth segment address, using the PSM. When k is scanned, the largest similarity can be detected by the ranker FIFO 28. If the similarity of the query to the kth cluster center pattern is the greatest, the query pattern is categorized in the kth cluster. Thus, the filed pattern address is stored in the DjAM in association with the lth segment address code and the kth cluster addresses code in sequence. If the largest similarity detected by the ranker FIFO is very small, then the query pattern will be categorized in the miscellaneous class. Then, the similarity threshold must be optimized so that the documents in each cluster and those in the miscellaneous cluster are well balanced. After all filed documents are stored in the FPM, the query document can be easily classified by the .SIGMA.Giq/n and the similarity Sqk. Then, transitive fuzzy-logic inference is used. The sum of term match grades between the query document and lth set of search terms can be used to infer the search domain SDl of the query. The similarity of Sqk between the query and the kth cluster center pattern in SDl can be used to infer the cluster Ck of the query pattern. Assume that the kth cluster contains R filed patterns. The similarity Sqr between the query and the rth filed pattern FPr in Ck call be used to decide the nearest pattern NPr, r=1,2, . . . ,R. Thus, fuzzy-logic inference rules are assumed as follows: ›.SIGMA.Gliq/nSDl!=Rl lth term set select (16) ›SDlSqkCk!=Rk kth center pattern select (17) ›CkSqrNPr!=Rr rth filed pattern select (18) where Rl, Rk and Rr are the rule truth grades of .SIGMA.Gliq/nSDl, SDlSqkCk and CkSqrNPr. As described previously, there are two methods for determining the rule truth grades such as Rl, Rk, and Rr. The first method is to define the rule truth grade as follows: Rl=l-min{.alpha..sub.l',.alpha..sub.l } (19) Rk=k-min{.beta..sub.k',.beta..sub.k } (20) Rr=r-min{.gamma..sub.r',.gamma..sub.r } (21) where .alpha..sub.l, .beta..sub.k, and .gamma..sub.r are defined as min.sub.j of the antecedent truth grade {.SIGMA.Glij/n, Sjk and Sjr} for the filed documents in categories l,k,r while .alpha..sub.l', .beta..sub.k', and .gamma..sub.r' are defined as max.sub.i of the antecedent truth grade {.SIGMA.Glij/n, Sjk and Sjr} for the filed documents in the categories, except the specified categories l, k and r. The second method is to set Rl, Rk and Rr to 1 by introducing thresholds min{.alpha..sub.l',.alpha..sub.l }, min{.beta..sub.k',.beta..sub.k } and min {.gamma..sub.r',.gamma..sub.r } for the antecedent truth grades .SIGMA.Gliq/n, SDlSqk and CkSqr, respectively. The first method decreases the rule truth grade Rl, Rk and Rr by min{.alpha..sub.l',.alpha..sub.l }, min{.beta..sub.k',.beta..sub.k } and min{.gamma..sub.r', .gamma..sub.r }. The second method increases the threshold ftl, ftk and ftr by min{.alpha..sub.l',.alpha..sub.l }, min{.beta..sub.k',.beta..sub.k }, and min{.gamma..sub.r',.gamma..sub.r } in order to decide which categories will not be searched. If the data fidelity is not positive, the category is skipped, i.e. not searched. Otherwise, the category is searched. For example, if the lth area documents have large values of .SIGMA.Gij/n and the other area documents have small values of .SIGMA.Gij/n, because min.sub.j .SIGMA.Gij/n=.alpha..sub.l >max.sub.j .SIGMA.Gij/n=.alpha..sub.l', and Rl becomes 1-.alpha..sub.l', then if .SIGMA.Giq/n is less than 1-Rl=.alpha..sub.l', the lth area does not have to be searched. Otherwise, the lth area is searched. The lth area contains many clusters with kth cluster center patterns. The other rule truth grades Rk and Rr can be determined in the same scanner as Rl is determined. Assuming that R filed patterns in each cluster call be categorized in more narrow sub-clusters SCz, z=1,2, . . . ,Z by using the similarities between the filed pattern and sub-cluster center pattern, each sub-cluster will contain R/Z filed patterns. If Z is 32, R/Z also becomes 32. In the same manner, categorizing the filed patterns hierarchically, the search domain becomes narrower and narrower, though the inference rules increase. These inference rules remind us of a fuzzy transitive inference from the query to infer the cluster containing the nearest filed patterns Fj. That is, the fuzzy-logic inference is performed as follows: ›SDl!=Rl whenever fdql>0 (22) ›Ck!=Rk whenever fdlk>0 (23) ›NPr!=Rr whenever fdkr>0 (24) where, based on the first method for determining the rule truth grades, fdql=Rl.crclbar.›.SIGMA.Gliq/n!', l=1,2, . . . ,L (25) fdlk=Rk.crclbar.›SDlSqk!', k=1,2, . . . ,K (26) fdkr=Rr.crclbar.›CkSqr!', r=1,2, . . . ,R (27) fdqr=min(fdql, fdlk, fdkr) (28) and where, based on the second method for determining the rule truth grades , fdql=.SIGMA.liq/n-min{.alpha..sub.l',.alpha..sub.l }, l=1,2, . . . ,L(29) fdlk=›SDlSqk!-min{.beta..sub.k',.beta..sub.k }, k=1,2, . . . ,K(30) fdkr=›CkSqr!-min{.gamma..sub.r',.gamma..sub.r }, r=1,2, . . . ,R(31) fdqr=min(fdql, fdlk, fdkr) (32) The values fdql, fdlk, fdkr and fdqr are the data fidelities of the inference results ›SDl!, ›Ck!, ›NPr! and ›SDlCkNPr!. The clusters containing the documents nearest to the query are found by sending the fidelity fdqr to the ranker FIFO. At the last stage when R filed patterns in the kth cluster which are stored in the FPM are read out and compared sequentially with the query pattern, the filed pattern addresses providing the larger data fidelities are stored in the ranker FIFO 28 in order of magnitude. The advantages of the transitive fuzzy-logic inference application to the closest document retrieval are in the filtering characteristics to skip the search of filed patterns in the mismatched categories and in the flexible decision characteristics to rank the nearest document addresses in order of data fidelity. The filtering characteristics are obtained by cutting the consequent category searches based on zero data fidelity of the inference results. The flexible decision characteristics are accomplished by continuing the search for so long as the data fidelities are not zero. The high possibility solutions are stored in the ranker FIFO. The nearest document retrieval system using the transitive fuzzy-logic inference can be realized using a circuit as shown in FIG. 2. The major difference between the circuits in FIG. 1 and FIG. 2 is the introduction of a match grade among calculator MGA 30, cluster center pattern memory CCPM 32, term weight calculator TWC 31, weight memory WM 33, transitive fuzzy inference processor TFIP 34, rule truth grade extractor RTE 36, and rule truth grade memory RTM 35. These hardware elements are explained below. The match grade accumulator MGA 30 is an accumulator for generating the sum of term match grades, as shown in FIG. 8. The MGA 30 comprises an adder 301 and a register 302 which receives as an input either grades Giq or Gij and provides as an output either .SIGMA.Giq or .SIGMA.Gij, respectively. A schematic block diagram of a cluster center pattern memory CCPM 32 is shown in FIG. 9. The contents of QPM, FPM and CCPM are selected by the lth segment address coming from the SiA-CNT 325 to count the area number. In general, the lowest address code is TiA, next is CkA. The highest address code is the segment address SiA. The address switches 321 and 324 are used to select between the contents of address counters SiA-CNT 325, CkA-CNT 326, TiA-CNT 327 and DjA-CNT 328 or the specified address coming from the host computer 12 provided along bus 84. In FIG. 9, a query pattern buffer (QPB) 329 is used to select the query pattern based on the best term set. The output of the MFM 24, provided via switch 330, is stored in the QPB 329. When HD (highest detection) signal is output from the ranker FIFO 28, the content of QPB is transferred to the QPM 18. The address of the FPM 20 is provided from the DjAM 331 until the Pr-CNT 332 is reset by the cluster end signal output from DjAM 331. Since the DjAM 331 stores the filed pattern address in sequence together with the lth segment and the kth cluster addresses, the filed pattern addresses can be generated by giving the lth segment and kth cluster addresses. Next, the tern weight calculation hardware 31 is explained. When the pattern similarity between the query and the categorized filed pattern is calculated in the PSC 26, the term contribution weights must be taken into account because the contribution of each search term to the filed patterns in each cluster is different from each other. Thus, the PSC is modified as shown in FIG. 10. The weights are stored in the weight memory WM 33. The coincidence grade Ci between the query and the kth cluster center patterns center patterns is calculated as follows: Ciqk=wik{1-.vertline.Gik-Giq.vertline.} (33) where wik is the term weight for the kth cluster. The similarity Sqk is calculated by taking the average of Ciqk, i=1,2, . . . ,n. Assume that ACtqk is the normalized sum of Ctqk, t=1,2, . . . ,n-1. The ACtqk becomes: ACtqk=AC(t-1)qk+{Ctqk-wtk.multidot.AC(t-1)qk}/.SIGMA.wtk. (34) When t reaches n, ACtqk equals Sqk. Thus, the pattern similarity accumulator PSA 92 in FIG. 6 is modified to estimate recursively ACtqk as shown in FIG. 10. The weight memory contents prepared in the term weight calculator TWC shown schematically in FIG. 11, when the filed patterns are categorized into any of K clusters. The term weight wik is calculated as follows: wik=.SIGMA..sub.r (1-GDik/(AGDi+GDik)/.SIGMA..sub.r (35) where GDik=.SIGMA..sub.r GDik is a grade difference between the kth cluster center pattern and the rth filed pattern in the lth cluster. AGDi is the average of GDik, k=1,2, . . . ,K. Assume that the average AWikt is the sum of Wikt, t=1,2, . . . ,r-1. The average AWikt becomes: AWikt=AWik(t-1)+{Wikt-AWik(t-1)}/.SIGMA.r (36) where Wikt is {1-GDikt/(GDikt+AGDi)}, t=1,2, . . . ,r. When t reaches r, AWikt equals wik. The term weight calculator TWC is shown schematically in FIG. 11. Two BDCs 311,312 are used to calculate the absolute value of the difference between Giq and Gik. The result is temporally stored in the GD-Buf 313. When the ranker FIFO 28 indicates the highest detection HD signal at the kth cluster, the content of GD-Buf 313 is transferred to the grade difference memory GDM 314. After k cluster center patterns are scanned, the GDM content is divided by the sum of AGDi and GDik at DIV and ADDER 315. AGDi is calculated in the average calculator 316 containing two subtractors 3161, 3162 register AGDM 3163 and divider 3164, in the same manner as described in conjunction with the PSA 92 in FIG. 6. The output of DIV and ADDER 315 is subtracted from 1 in subtractor 317 and the result is averaged by the document number r in the kth cluster at the average calculator 318 containing the subtractor 3181, divider 3182 and adder 3183. The output of the adder 3183 is equal to wik when the document number in the cluster reaches r. The result is stored in WM 33. The document number r in each cluster is counted by adder 319, though the WM 33 must store the document number r(k) for each cluster. As shown in FIG. 12, TFIP 34 can be realized primarily with an input truth grade data switch 110, minimum selection circuits MIN 112,114, the output truth grade register (TRG1 or TRG2) 116,118, fidelity registers fd-Reg 120,121, polarity-detectors PD 122,124 and bounded-difference calculators BDC 126,128,130. The BDC estimates the data fidelity fd and output truth grade which is the purpose of TFIP 34. The input truth grade is provided from the data switch 110 and the MIN 112,114 to select the minimum between the present input and the previous output truth grades. The result is compared with the rule truth grade provided from the rule truth grade memory RTM 35, to calculate the data fidelity such as fdlk=Rk-›SqkSDl!'. The fidelity is stored in the fidelity register fd-Reg 120,121 and the polarity of the data fidelity is decided in the polarity detector PD 122,124. When the PD 122,124 outputs "1", the rule truth grade is latched in the truth grade register 116,118. The BDC 128,128 is shown schematically in FIG. 13. The BDC comprises three input adder 1261, polarity detector 1262 and data switch 1263 which is controlled by the output from polarity detector 1262. If X+Y-1>0, the polarity detector 1262 output is a logic level "1" and switch 1263 is closed to pass through X.crclbar.Y'=X+Y-1. If X+Y-1<0, then polarity detector 1262 output is a logic level "0" and switch 1263 is open causing the output to be zero. The rule truth grade extractor RTE 36 is shown in FIG. 14. The RTE comprises rule parameter memory RPM/RPM' 130 which stores the category thresholds such as .alpha..sub.l, .beta..sub.k, .gamma..sub.l', .alpha..sub.l', .beta..sub.k', and .gamma..sub.r' which are extracted from the ranker FIFO 28 through MIN detector 132 and MAX detector 134. The BDC 136 and subtractor 138 are used to calculate Eq. (19),(20) and (21). RPM may be divided in .alpha.-M, .beta.-M and .gamma.-M to store .alpha..sub.l, .beta..sub.k, .gamma..sub.r', .alpha..sub.l', .beta..sub.k', and .gamma..sub.r', by using the address register 140. The transitive fuzzy-logic inference is carried out as shown by EQs (22),(23),(24), (25),(26), (27) and (28). The three layer transitive inference process is explained below. The closest document retrieval system using the transitive fuzzy-logic inference processor TFIP is explained below. The system has two operation modes. The first mode is the filed document categorization and rule acquisition. The second mode is the query document classification and closest document address searching. The process is explained below. In the first mode, the filed documents containing the cluster center candidate documents are searched by the SSP at the next sequence. First step, the cluster center candidate documents are transferred to the host-computer and displayed. Then the search terms of the lth area are picked, filed as the lth area term set and stored in the SSP. Second step, L.times.K cluster center candidate documents are searched by the SSP while l is incremented from 1 to L. The search results are buffered as a cluster center pattern in the QPB via MSC, MFM1 and the match grade accumulator MGA to calculate .SIGMA.Glij/n, until the segment address l to give the maxl .SIGMA.Glij/n is determined by the ranker FIFO. Whenever 1 is reached, the cluster center pattern is stored in the CCPM with the lth segment address. Then, the highest and the next values of .SIGMA.Glij/n, j=1,2, . . . ,L.times.K are stored as .alpha..sub.l and .alpha..sub.l' in the rule parameter memory RPM (.alpha.-M) with the lth address activated, where .alpha..sub.l =min.sub.j .SIGMA.Glij/n, and .alpha..sub.l' =max.sub.j .SIGMA.Gl'ij/n. The symbols 1' means "not 1." Third step, the other filed documents are sequentially searched by the SSP to store L term sets. When the search results of the filed documents are converted to the filed patterns through MSC 22, MFM 24, MGA 30 and QPB 329. The filed pattern buffered in the QPB 329 is transferred to the QPM 70 and FPM 20, when an HD signal is output from the ranker FIFO 28 to rank .SIGMA.Glij/n coming from the PSC 26. Fourth step, the contents of QPM 70 is compared with the contents of CCPM 32 in sequence in PSC 26. The output of PSC 26 is sent to the ranker FIFO 28 via switch 27. When HD signal is output from the ranker FIFO, the cluster address and the lth segment address are used to decide the address of DjAM to store the filed pattern address. After the best k is determined, the highest and the second similarities in the ranker FIFO are sent to the rule parameter memory RPM/RPM' (.beta.-M) in a form of .beta.lk=min.sub.j Sjk and .beta.lk'=max.sub.j Sjk'. The rule truth grade extractor RTE 316 converts the rule parameters (.alpha..sub.l =min.sub.j .SIGMA.Glij/n, .alpha..sub.l' =max.sub.j .SIGMA.Gl'ij/n, .beta.lk=min.sub.j Sjk and .beta.lk'=max.sub.j Sjk') to the rule truth grades Rl and Rk are shown in FIG. 14. Fifth step, when the filed documents in each area are hierarchically divided into the sub-categories such as clusters, the contribution of each term to the cluster are updated as the filed document is categorized, and stored as weight coefficients in the weight memory so that the weights can be used when the similarity of query document to the cluster center is calculated. The weight is calculated by the average calculator to give the average of each term match grades for the filed documents in each cluster, and thereby to calculate {1--the difference of the ith term match grades for the filed document and for the kth cluster center document}. Sixth step, the contents of alpha and beta threshold memory in the rule parameter memory are sent to rule truth-grade extractor RTE 36 to estimate the rule truth grades using Eqs.(18) and (19). These six steps are used for the categorization of filed documents. In the second operation mode of the closest document retrieval system, the query document is classified into some clusters suggested by the transitive fuzzy-logic inference, and then the filed patterns in the suggested clusters are searched to show the closest document addresses. The procedure is explained below. First step, the query document is searched by the SSP 16 to store L term sets. The search result of SSP is sent to the QPM 70 via MSC 22, MFM 24 and MGA 30 to calculate .SIGMA.Glij/n. Second step, the fuzzy-logic inference based on ›.SIGMA.Giq/nSDl!=Rl is used to check whether or not ›SDl! is equal to Rl. Of course, if Rl><›.SIGMA.Giq/n!', then ›SDl!=Rl. Thus, the Transitive Fuzzy-logic Inference Processor TFIP 34 calculates the data fidelity fdql=Rl.crclbar.›.SIGMA.Gliq/n!' using MGA to output the .SIGMA.Gliq/n and the rule parameter memory RPM to store Rl, in the TFIP. The contents of RPM is determined in the RTE 36. If fdql is positive, ›SDl!=Rl is stored in the first stage truth grade register TGR1. Third, whenever fdql is positive, the content of QPM 70 is sequentially compared with k cluster center patterns stored in the FPM 20 with the lth segment address. Then, the similarity Sqk is sent to MIN 112,114 together with ›SDl!stored in the TGR1. The output of the MIN is input to the BDC 128,130 together with the rule truth grade Rk stored in RPM 130. The BDC estimates ›Ck!=Rk by calculating the data fidelity fdlk=Rk-›SDlSqk!'. When fdlk is positive, the inference result ›Ck!=Rk is held in the second stage truth grade register TGR2. The similarity is recursively calculated using weight wik, while the term number i is incremented, the similarity is calculated as follows: Sqk=1-.SIGMA.wik(.vertline.Gik-Giq.vertline.)/.SIGMA.wik. For this calculation of similarity, PSC is modified as shown in FIG. 10. The difference is in the pattern similarity accumulator 92. That is, the multiplier 926 is introduced between the subtractor 921 and the divider 922. The counter 924 is replaced by the weight accumulator 927. The weight is read out from the weight memory 33 with term address i. Fourth step, when the data fidelity fdlk is positive, the content of the QPM is compared with all filed patterns stored in the FPM with the kth cluster address in the PSC. The resultant similarity Sqr is sent to the MIN together with the content of TGR2. When the data fidelity fdkr=Rr-›CkSqr!' is positive, ›NPr! is estimated to be Rr=1 based on the rule CkSqrNPr, because ›Ck! stored in the TGR2 and the output of PSC are used to infer the truth grade ›NPr! for the nearest pattern. Fifth step, the data fidelity fdqr=min(fdql,fdlk,fdkr) is calculated in MIN 131 and sent to the ranker FIFO 28 together with the corresponding filed pattern address. Then, the cluster address k is incremented. If fdlk is not positive, r is not scanned and Sqr or fdkr is not calculated in PSC or BDC. The value k is incremented successively. When k reaches K, l is incremented. Similarly, if fdql is not positive, k is not scanned and fdlk is not calculated in TFIP 34. Until l reaches the maximum L, the above process is repeated. The output of the TFIP 34 is sent to the ranker FIFO 38, as long as the data fidelity is positive. At the end of the process, the nearest document address can be output from the ranker FIFO to the host computer 12. The RTE 36 comprises rule parameter memory RPM and RPM' 130 to store .alpha..sub.l, .beta..sub.k, .gamma..sub.r and .alpha..sub.l', .beta..sub.k ' and .gamma..sub.r' respectively, minimum and maximum selection circuits MIN detector 132 and MAX detector 134, a MIN circuit 136, subtractor 138 and rule truth grade memory RTM 35 as shown in FIG. 14. The rule parameters .alpha..sub.l, .beta..sub.k, .gamma..sub.r and .alpha..sub.1', .beta..sub.k' and .gamma..sub.r' are extracted from the ranker FIFO through MIN detector 132 and MAX detector 134. The MIN circuit and subtractor are used to calculate Eq.(19),(20) and (24). RTM 35 must store L.times.K.times.R inference rule truth grades. But, using the fact that Rr always becomes 1, RTM memory capacity becomes L.times.K bytes. In general, if the filed documents can be categorized into narrower sub-clusters by the similarity to the sub-cluster center patterns, the query document can be classified into the narrower subclusters. In the same way, even each sub-cluster can be categorized into sub-sub-clusters. As the layers of hierarchical categorization increase, the search domain of the nearest patterns can be narrowed, resulting in the shorter search time. In this example, only two layers of categorization are shown. Even if the layers are two, the performance is markedly improved. When the filed document number m is 10.sup.6 per document file, two layers are enough to shorten the search time within a few seconds. Rather in the first step of the query document classification, the search time of the query document L times by the SSP. If L SSPs are used in parallel to search the query document, the search speed is enhanced, because the search results can be processed in MSC in parallel without any collision. This is equivalent to the use of large capacity SSP to store L sets of search terms. Assume that the search results are stored in the MSC whose addresses are separated by the term set code. When the query document was searched, the content of MSC is sent to the QPM in sequence of the term set while the term match grades are amounted in the MGA for every term set. When the output of MGA causes TFIP to output the positive data fidelity, the term set code is fixed, and then the CCPM with the current term set address is scanned to compare the content of the QPM with the content of CCPM. The following process is the same as the third and fourth steps during the query document classification. FIGS. 15(a), 15(b) and 15(c) show the timing diagram for categorization and classification for the closest pattern address search, and FIG. 16 shows the timing diagrams skipping of the pattern search in the classification period. In the categorization period, first, cluster center patterns (CCP) are categorized in the best area which is decided by the SSP, and second, filed patterns are categorized based on the similarity between the filed and cluster center patterns. The rule truth grades are estimated in RTE when the cluster of filed pattern is decided by the ranker FIFO. In FIG. 15(a) each pulse at the SSP corresponds to a document search operation by the SSP. The pulse at the CCPM corresponds to a write operation for center pattern categorization. In FIG. 15(b) the pulse at FPM corresponds to a write operation for categorization of filed patterns. In FIG. 15(c) the pulses at CCPM and FPM memories correspond to read operations. The pulses at the output of ranker FIFO corresponds to the query pattern classification ranked by similarities. In FIG. 16, which is a single continuous waveform including the shaded pulses at the SSP and CCPM which refer to skipped searches resulting from a mismatch result. When the query document is classified, first, the query is searched by SSP. The term match grade amount is calculated in MGA and sent to the TFIP 34. If the data fidelity is judged to be positive in the TFIP, the query is compared with a cluster center pattern stored in CCPM. The similarity is calculated and sent to the TFIP. If the data fidelity is judged to be positive in TFIP, the query is compared with filed patterns in FPM. The similarity is sent to TFIP. The overall data fidelity is calculated in TFIP and sent to the ranker FIFO. It is noted that whenever TFIP outputs a negative data fidelity, the following pattern searches are skipped. The above example shows the case when h.sub.L =2/L, h.sub.K =1/K and 2/K, where h.sub.L and h.sub.K are the hit ratio of positive data fidelity in the TFIP. The full text search time of m filed documents is Ts1=s.multidot.m/f, where s is document size to be normalized to 10.sup.5 bytes. In this case, the FPM, PSC and ranker FIFO are not used, the matched document addresses are not ranked by the similarity between the query and filed documents, though the software in the host-computer call analyze the text search results to rank the similarity. The software processing time is long and the software is expensive, though the preparation time is Tp1=0. Using FPM, PSC and ranker FIFO, Ts becomes as follows: Ts2=s/f+n.multidot.m/f, where n shows term numbers in SSP. If the document area is fixed, n is nearly equal to 10.sup.2. If the area is not fixed, n is 10.sup.3 or more. As long as n is fewer than s, Ts2 becomes shorter than Ts1. In this case, hardware such as FPM, PSC and ranker FIFO are needed. Furthermore, the preparation time of FPM contents becomes as follows: Tp2=(s+n).multidot.m/f. Though this preparation time Tp2 is nearly as long as Ts1, the contents of FPM is seldom updated. Thus, the performance to find the nearest document addresses is determined mainly by Ts2. When the filed documents are categorized into L areas and k clusters per area, the transitive fuzzy-logic inference can be used to quickly decide the cluster of the query document. Even if r patterns in each cluster are sequentially searched, the overall search time Ts3 becomes as follows: Ts3=(s+n)/f+h.sub.L .multidot.h.sub.K .multidot.m.multidot.n/f, where h.sub.L and h.sub.K are hit ratios of matched patterns in L areas and K clusters, respectively. The first term shows the period when the query document is searched by the large capacity SSP and the search results in MSC are sent to QPM. Though the content of QPM must be compared with m=L.times.K.times.R filed patterns in FPM, if area and cluster do not meet the query, the following search can be omitted. Since the hit ratios h.sub.L, h.sub.K become very low because the search domain is narrowed by the transitive fuzzy-logic inference. Thus, Ts3 will be determined mainly by the second term. In this case, in addition to FPM, PSC and ranker FIFO, the CCPM, FIP and some accessory circuits must be added to the hardware. These are not as large as FPM or PSC. The preparation time to fill the contents of CCPM, FPM and RPM is as follows: Tp3=L.multidot.K.multidot.(s+n)/f+m{(s+n)+n.multidot.K}/f+2L.multidot.K/f where the first term shows the time to store L.times.K cluster center patterns in the CCPM and the second term shows the time to store the categorized filed patterns into the FPM, and the third term shows the time to store the rule truth grade in RTM 35 via the RTE 36. When s=10.sup.5, n=10.sup.3, m=10.sup.6 and f=10.sup.6 /sec in the conventional system, Ts1 becomes 10.sup.5 sec. When FPM, PSC and ranker FIFO are used, Ts2 becomes 10.sup.3 sec. When the filed patterns are categorized into L areas and K clusters per area, if L=16, K=64, R=m/(L.multidot.K)=10.sup.3, n=10.sup.2, h.sub.L =1/10 and k.sub.K =1/10, Ts3 becomes 1.1 sec. This value is markedly small compared with Ts1 and Ts2. The SSP search speed f can be increased to 10.sup.7 /sec. In that case, Ts1, Ts2 and Ts3 become: Ts1=10.sub.4 sec, Ts2=10.sup.2 sec and Ts3=0.11 sec. The preparation times Tp2 and Tp3 at f=10.sup.6 /sec become: Tp2=(s+n).multidot.m/f=(10.sup.5 +10.sup.3).multidot.10.sup.6 /10.sup.6 =1.01.times.10.sup.5 sec.div.28 hr. Tp3=(s+(K+1)n).multidot.m/f=1.6.times.10.sup.5 sec.div.44.4 hr. Although this preparation time (28 or 44.4 hours) is very long, it is not fatal, because the preparation is needed only when the filed patterns in the FPM are updated. By virtue of this long preparation time, the retrieval time becomes very short. Of course, it is anticipated that f will be increased by 10 times, because Tp3 can become 4.4 hours. This time 4.4 hours will be allowable when the document retrieval system is started up and tested. In any event, since the preparation time of 4.4 or 44 hours seldom is needed, the performance is governed by the search time Ts. The search time Ts<1 sec is very useful and attractive for practical use, because the user can quickly find the nearest document addresses by inputting only the query document. Since the inference processor is realized by simple logic circuits, it will be possible to make the speed f higher than 10.sup.7 /sec. In order to improve the performance using the cluster center patterns, it is necessary to prepare the MSC, MFM, MGA, QPM, FPM, CCPM, PSC and FIP. The hardware comprises the memory and arithmetic logic operation. Thus, the hardware can be realized by a microcomputer board of a LSI chip, if the operation program is stored as micro-programs in the read-only-memory ROM, EPROM, EEPROM and the like. However, the document retrieval system may be realized as a document retrieval acceleration machine (DRAM) to operation together with a personal computer or other workstation. FPM needs m.times.n bytes memory capacity. When m is 10.sup.6 and n is 10.sup.3, FMP must be 1 Gbytes. This is very large. Compared with an FPM, the other processors PSC, ranker FIFO, MGA and FIP are very small. Since the VLSI circuit technologies to realize 1 gigabit memory and 1 million gate processor in chip die are markedly progressed, the above hardware can be easily integrated in eight chips of functional memory in the future. At least, the document retrieval acceleration machine will be realized by two modules: an application specific CPU and a large capacity memory. To accelerate the document retrieval by full-text search processor, two strategies using the fuzzy logics have been developed. The first strategy is to deal with search results as the term match grade patterns and to store all document search results as filed patterns. The nearest pattern is found by fuzzy logic matching between query and filed patterns. The second strategy is to categorize the filed patterns by an area specific term set and cluster center patterns and to estimate the nearest filed pattern area and cluster by the fuzzy-logic inference from the term match grade volume and the similarity between the query and cluster center pattern. Finally, the nearest document addresses are output from the ranker FIFO in order of data fidelity. The effects of these strategies on the document retrieval performance improvement were studied. If DFM stores 100 GB, the full-text search time becomes 10.sup.4 sec even at 10 MB/s of SSP. Using the first strategy and using FPC, PSC and ranker FIFO, the search time is shortened to about 10.sup.2 sec. Using the second strategy, the time will become around 1 sec. The preparation time of the rule truth grades is kept equal to the full text search time 10.sup.4 sec or 2.8 hour. Since the preparation is needed only during the period when FPM contents is to be updated, the time 2.8 hours is not fatal, when considering the resultant performance improvement. While there has been described and illustrated a preferred method and apparatus for document retrieval using fuzzy-logic inference, it will be apparent to those skilled in the art that variations and modifications are possible without deviating from the broad principles and spirit of the present invention which shall be limited solely by the scope of the claims appended hereto.
|
Same subclass Same class Consider this |
||||||||||
