String comparator device system circuit and method4490811Abstract The string comparator device for comparison of strings of indicia at high speeds for use in a system circuit in a computer system. The string comparison device provides a numeric measurement of the degree of similarity between the compared indicia strings as defined by a mathematical algorithm. The algorithm is solved through a new string comparator device or a new program in a computer system. The system circuit in chip form can be connected in a storage loop of a computer system to locate and quickly extract records that are very similar to the supplied query. Inexact queries will rapidly locate records similar with respect to indicia string related measurements of similarity. The method of indicia string comparison in the improved string comparator device can provide rapid response to queries in a computation time proportional to the average length of the indicia string. Claims We claim: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
CMD SIGNALS
VALUE COMMAND
______________________________________
000 NOP (Idle State)
010 Pre-Forward Clear
001 Process Character
(repeated during Forward Scan)
011 Pre-Reverse Clear
101 Load CB into R
001 Process Character
(repeated during Reverse Scan stage 1)
100 Load CA into R
001 Process Character
(repeated during Reverse Scan stage 2)
110 Transmit Results
000 NOP (Idle State)
______________________________________
During the Forward Scan and the Reverse Scan, the CHAR output 32 is an 8-bit signal containing a character from either String-A or String-B. Each character from String-A or String-B is transmitted once during the Forward Scan and the Reverse Scan. The SCID output signal 34 is low (inactive) during the Forward Scan and high (active) during the Reverse Scan. The STRID output signal 33 is high (active) when a character from String-B is being output on CHAR lines 32. The STRID output signal is low (inactive) when a character from String-A is being output. The STRID output signal 33, the SCID output signal 34, and the CHAR output 32 are meaningful only when the CMD output 31 is equal to `001` indicating that a character should be processed. The Forward Scan phase of the String Control means' operation causes the computation of the function M.sub.f as described in the previous section defining the string similarity function .theta.. In effect, the String Control right-justifies String-1 and String-2 and then transmits every character beginning with the leftmost and ending with the rightmost. More precisely, String Control implements the following Forward Scan Program:
______________________________________
char strA[128], strB[128];
/*Two Strings*/
int lenA, lenB; /*Strings' lengths.
LenB.gtoreq.LenA*/
fscan(){
int disp; /*DISP register*/
int pos; /*POS register*/
pos = 0; /*Initialization*/
disp = LenB-LenA;
for (; pos LenB; ++p){
int i; /*Temporary Variable*/
i = pos-disp;
if (i == 0){
printf("CMD`001`,OUTPUT CHAR`%c`
from STRING A. ", strA[i]);
printf("CMD`001`, OUTPUT CHAR`%c` from
STRING B..0.", strB[pos]);
}
}
______________________________________
The variable POS and DISP in the above program correspond exactly in function and purpose to the 8-bit registers POS 42 and DISP 41 shown in FIG. 5. The Reverse Scan phase of the string control means' operation causes the computation of the function M, as described in the previous section defining the string similarity function .theta.. The Reverse Scan is divided into two stages to facilitate the computation of the compensation functions. In effect, String Control left-justifies the String-1 and String-2 and transmits every character beginning with the rightmost and ending with the leftmost. More precisely, String Control implements the following Reverse Scan Program:
______________________________________
char strA[128]; strB[128];
/*Two Strings*/
int lenA, lenB; /*Strings' lengths.
/*LenB.gtoreq.LenA*/
rscan(){
int disp; /*DISP register*/
int pos; /*POS register*/
pos = lenB ; /*Initialization*/
disp = LenB-LenA;
/*Reverse Scan - Stage 1*/
for (; disp = 0; disp--, pos--){
printf("CMD`001`, OUTPUT CHAR`%c` from
STRING B. ".0.", strB[pos]);
/*LOAD CA INTO R*/
printf(CMD`100`, LOAD CA INTO R.0);
/*Reverse Scan - Stage 2*/
for (; pos 0; pos--){
printf("CMD`001`, OUTPUT CHAR`%c` from
STRING A.".0.", strA[pos]);
printf("CMD`001`, OUTPUT CHAR`%c` from
STRING B..0.", strB[pos]);
}
}
______________________________________
The variables POS and DISP in the above program correspond exactly in function and purpose to the 8-bit registers POS 42 and DISP 41 shown in FIG. 5. Also, POS and DISP are initialized in the program RSCAN to the values which they held at the end of the program FSCAN. Thus RSCAN can conveniently be run directly after FSCAN with no extra initialization. The purpose of the Paramater Generation means 12 is to obtain the character weight and compensation values. The character itself is received from the String Control circuit 11. The character weight and compensation values are used by the rest of the string similarity computer 10. The Parameter Generation means 12 is shown in FIG. 6. FIG. 6 shows the input signals 31-34, the output signals 39-43 and the internal Random Access Memory 38. The input signals are the CHAR input 32, the SCID input 34; the STRID input 33, and the CMD input 31. All of these signals are outputs of the String Control means 11. The Random Access Memory 38 contains, for each of the 255 characters, a compensation value between 0 and 7, a weight value between 0 and 7, and a bias value between -2 and +1. These values correspond to the compensation function C, the forward weight function W.sub.f, and the bias function B which were described in the appendix defining the string similarity function .theta.. The Random Access Memory 38 is loaded via the Bus Control means 21. The CHAR output signal 39 is always the same as the previous CHAR input signal 32. The CHAR signals 32, 39 are 8-bit data items. The STRID output signal 40 is always the same as the previous STRID input signal 33. The STRID signals 33, 40 are 1-bit signals. The CMD output signal 43 is always the same as the previous CMD input signal 31. The CMD signals 31, 43 are 3-bit signals. The C output signal 42 is a 3-bit positivie binary signal equal to the compensation value for the character denoted by the CHAR signal 39. This compensation value is read from the Random Access Memory 38. The WGHT output signal 41 is a 3-bit signal equal to either the forward or the reverse weight of the character denoted by the CHAR signal 39. When the SCID input signal 34 is low (inactive) denoting the Forward Scan phase, the WGHT output signal 41 is equal to the value W-f read from the Random Access Memory 38. When the SCID signal is high (active) denoting the Reverse Scan phase, the WGHT output signal 41 is equal to the sum of the values of W.sub.f and B read from the Random Access Memory 38. The Core Section 13 is the decision-making part of the string similarity computer 10. It receives data from the Parameter Generation means 12, maintains character counts and determines what the rest of the string similarity computer must do. The Core Section 13 is pictured in FIG. 7 with its input signals 39-43, internal TALLY memory 44, and output signals 45-52. The input signals are CHAR 39, STRID 40, WGHT 42, C 42 and CMD 43. These input signals are all outputs of the Parameter Generation means 12. The TALLY memory 44 is a fast clear memory means of size 256.times.4. The TALLY memory contains a 4-bit signed (two's complement) number in the range -7 to 7, inclusive, for each character specified by the CHAR input signal 39. The clear control of the TALLY memory zeros the entire memory. Furthermore, individual entries in the TALLY memory may be incremented or decremented. Attempting to increment the value 7 or to decrement the value -7 results in an unchanged state. This is referred to as latching at .+-.7. The TALLY memory 44 corresponds directly to the array T used in the C programs in the earlier sections defining the string similarity function .theta.. The WX output signal 46 is equal to the arithmetic (two's complement) inverse of the WGHT input signal 41. WX 46 is a 4-bit non-positive integer. The WGHT output signal 49 is equal to the previous WGHT input signal 41. The WGHT signals 41, 49 are 3-bit unsigned integers. The CMD output signal 50 is a 3-bit signal that is always equal to the previous CMD input signal 43. The CMDX output signal 48 is a 1-bit signal equal to the low-order bit of the CMD input signal 43. The CMDB output signal 47 is a 2-bit signal. The low-order bit of CMDB 47 is high (active) only if the CMD input 43 is equal to `001` denoting a Process Character command. The high-order bit of CMDB 47 is high only if the CMD input 43 is equal to `010` or `011`, denoting a Clear command. When the CMD input 43 is equal to `010` or `011`, a Clear command is specified. The entire contents of the TALLY memory 44 are cleared in this case. When the CMD input 43 is equal to `001`, a Process Character command is specified. The CHAR input 39 specifies a character to be processed. Each character designates a 4-bit entry in the TALLY memory 44. Character processing consists of incrementing or decrementing the entry in the TALLY memory 44 which corresponds to the character denoted by the CHAR input 39. When the CMD input 43 is equal to `001` and the STRID input 40 is high (active), the appropriate TALLY memory entry is incremented and latched at 7. If the result is not positive then the T output signal 52 is high (active) otherwise the T output 52 is low (inactive). When the CMD input 43 is equal to `001` and the STRID input 40 is low (inactive), the appropriate TALLY memory entry is decremented and latched at -7. If the result is non-negative then the T output signal 52 is high (active) otherwise the T output 52 is low (inactive). If the T output signal 52 as computed above is high (active), then the C output signal 51 is the arithmetic (two's complement) inverse of the C input 42. Otherwise the C output 51 is equal to the C input 42. The C input 42 is an unsigned 3-bit integer. The C output 52 is a signed 4-bit integer. The CA section 14 is used to compute the total compensation value for the characters in the STRING-A. The CA section 14 is shown in FIG. 7 with its inputs 45, 49-52 and outputs 53-58. The signal COMPA 53 is an output signal which is also fed-back as an input to CA. The input signals are STRID 45, WGHT 49, CMD 50, C 51, and T 52. These signals are outputs of the CORE section 13. The output signals STRID 54, WGHT 55, CMD 56, C 57 and T 58 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged. The COMPA output signal 53 is a 9-bit unsigned non-negative integer. When the CMD input signal 50 is equal to `010`, denoting a Pre-forward Clear command, the COMPA output value 53 is zero. When the CMD input signal 50 is equal to `001`, denoting a Process Character command, and the value of the STRID input signal 45 is equal to the value of the T input signal 52; then the 4-bit signed input C 51 is added to the previous value of COMPA 53; the resulting sum is output as the COMPA signal 53. If an overflow occurs on this addition, the carry bit is lost; an overflow will not occur if the programming restraints documented in Appendix 3 are obyeed. When neither of the above condition is met, the COMPA output signal 51 is the same as the previous COMPA input signal 51. The CB section 15 is used to compute the total compensation value for the characters in the STRING-B. The CB section 15 is shown in FIG. B with its inputs 53-58 and outputs 59-63. The signal COMPB 59 is an output signal which is also fed-back as an input to CB. The input signals are COMPA 53, STRID 54, WGHT 55, CMD 56, C 57, and T 58. These signals are outputs of the CA section 14. The output signals STRID 60, WGHT 61, CMD 62, and T 63 are always the same value as the corresponding inputs; i.e. these signals are passed through unchanged. The COMPB output signal 59 is a 9-bit unsigned non-negative integer. When the CMD input signal 56 is equal to `010`, denoting a Pre-forward Clear command, the COMPB output value 59 is zero. When the CMD input signal 56 is equal to `100`, denoting a Load CA into R command, the COMPB output value 59 is equal to the COMPA input value 53. When the CMD input signal 56 is equal to `001`, denoting a Process Character command, and the value of the STRID input signal 54 is not equal to the value of the T input signal 58; then the 4-bit signed input C 57 is added to the previous value of COMPB 59; the resulting sum is output as the COMPB signal 59. If an overflow occurs on this addition, the carry bit is lost; an overflow will not occur if the programming restraints documented in Appendix 3 are obeyed. When none of the above conditions are met, the COMPB output signal 59 is equal to the previous COMPB input signal 59. The R section 16 computes an intermediate subtotal value. The values computed by the R section correspond precisely to the R function described in the appendix 1 defining the string similarity function .theta.. The R section 16 is pictured in FIG. 10 with its inputs 59-63 and its outputs 64-66. The RSUM signal fedback is an input to the R section 16. The values of the output signals STRID 65 and CMD 66 are always the same as the values of the inputs STRID 60 and CMD 62. The RSUM output signal 64 is a 10-bit non-negative (unsigned) integer. When the CMD input 62 is equal to `010` or `011`, denoting a Clear command, then the RSUM output 64 is zero. When the CMD input 62 is equal to `101`, denoting the Load CA into R command, then the value of the COMPB signal 59 is added to the previous value of the RSUM input signal 64; this sum is the next RSUM output 64. When the CMD input 62 is equal to `001`, denoting a Process Character command, and the T input signal 63 is high (active); then twice the value of the WGHT input signal 61 is added to the value of the previous RSUM input 64; the result is the next RSUM output 64. When none of the above conditions is met, the RSUM output signal 64 is equal to the previous RSUM input signal 64. If the result of the above additions causes an overflow, the carry bit is lost. An overflow will not occur if the programming restrictions documented in Appendix 3 are obeyed. The M section 17 computes the numerator of the ratio defining the .theta. string similarity function. The M function is described in appendix 1. The M section 17 is shown in FIG. 7. The inputs to the M section 17 are the signals RSUM 64 (a 10-bit unsigned integer), STRID 65 and CMD 66. The outputs are the READY output signals 68 and the MVAL output signal 67. MVAL 67 is a 16-bit non-negative integer which is fedback as an input to the M section 17. The output READY 68 is high (active) only when the CMD input 66 is equal to `110`, denoting a Transmit Result command. While the READY signal 68 is active, the output of the M section 17 and the TOTM section 19 are valid and ready for the DIVIDER section 23 to use them. When the CMD input 66 is equal to `010`, denoting a Pre-Forward Clear command, the MVAL output 67 is zero. When either of the following two conditions hold (1) the CMD input 66 is equal to `100`, denoting a Load CA into R command, or (2) the STRID input 65 is high (active) and the CMD input 66 is equal to `001`, denoting a Process Character command; then, the MVAL input 67 is added to the RSUM input 64 and the result of the addition is the next MVAL output 67. When none of the above conditions are met, then the MVAL output 67 is unchanged from the previous MVAL input 67. If the result of the above additions causes an overflow, the carry bit is lost. An overflow will not occur if the programming restrictions documented in Appendix 3 are obeyed. The TOTR section 18 computes an intermediate subtotal value. The values computed by the TOTR section correspond precisely to the values of the variable TOTR used in the C programs in the appendix 1 defining the .theta. string similarity function. The TOTR section 18 is shown in FIG. 12 with its input signals 45-48 and its output signals 71-74. The TOTRSUM output 71 is an 11-bit signed (two's complement) integer which is fedback as an input to the TOTR section 18. The WX input 46 is a 4-bit signed (two's complement) integer. Both WX 46 and TOTRSUM 71 have only non-positive values. The CMDX output 74 and the STRID output 72 are always equal to the previous CMDX input 48 and STRID input 45, respectively. When the high-order bit of the CMDB input 47 is high (active), denoting a Clear command, then the TOTRSUM output 71 is zero. When the low-order bit of the CMDB input 47 is high (active), denoting a Process Character command, then the WX input 46 is added to the TOTRSUM input 71 and the result of the addition is the next TOTRSUM output value 71. When neither of the above conditions is met, then the TOTRSUM output 71 is unchanged from the previous TOTRSUM input 71. If the result of the above additions causes an underflow, the carry bit is lost. An underflow will not occur if the programming restrictions documented in Appendix 3 are obeyed. The TOTM section 19 computes the denominator of the ratio defining the .theta., string similarity function. The TOTM function was described in appendix 1. The TOTM section 19 is shown in FIG. 19 with its inputs 71-74 and its output TOTMVAL 75. The TOTMVAL signal 75 is a 16-bit non-positive integer which is fedback to the M section 19 as an input. A non-positive integer is either zero or is a negative number in two's complement format with an implicit negative sign bit. The reason TOTMVAL 75 is a non-positive integer is to simplify the circuit design of the DIVIDER section 23. The TOTRSUM input 71 is an 11-bit signed (two's complement) number. When the high order bit of the CMDB input signal 73 is high (active) and the CMDX input signal 74 is low (inactive) denoting a Pre-Forward Clear command, then the TOTMVAL output 75 is zero. When the low-order bit of the CMDB input 73 is high (active), denoting a Process Character command, and the STRID input 72 is high (active); then the TOTRSUM input 71 is added to the TOTMVAL input 75 and the result is the next TOTMVAL output 75. When neither of the above conditions is met, then the TOTMVAL output 75 is unchanged from the previous TOTMVAL input 75. If the result of the above additions causes an overflow, the carry bit is lost. An underflow will not occur if the programming restrictions documented in Appendix 3 are obeyed. Complete logic drawings for the chip including the string comparator means shown in FIGS. 1 and 2 are shown in FIGS. 27-67, 66A-66D, 67A-67Z, and 67AA. These drawings are being used by and made by American Microsystems Inc. under contract to Proximity Devices Corporation to manufacture the chip to be known by the trademark "PF474" as a Large Scale Integrated (LSI) circuit using a 5-micron NMOS process. Below, we give the correspondence of node numbers on the complete logic drawings to the numbers shown in FIGS. 1 through 13. This will show how the logic functions described in the body of the patent have been implemented in the logic drawings. FIG. 2 shows a block diagram of the PF474. The Master Control means 20 and the Bus Control means 21 are implemented with nodes 2700-3299. The Ranker means 22 is implemented with nodes 3300-3900. The Divider means 23 is node numbers 1-299. The String Similarity Computer 10 is implemented with nodes 700-2000. The String Similarity Computer 10 consists of several different logic blocks. The String Control is implemented as nodes 300-600. The Core section 13 is implemented as nodes 700-1023. The CA section 14 is implemented as nodes 1024-1122, 1182-1207. The CB section 15 is implemented as nodes 1123-1170, 1214-1303. The R section 16 is implemented as nodes 1326-1462. The M section 17 is implemented as nodes 1463-1622. The TOTR section 18 is implemented as nodes 1630-1738, 1768-1788. The TOTM section 19 is implemented as nodes 1740-1767, 1789-1978. We present in tabular form the node numbers corresponding to the signals, registers and memories shown in FIGS. 4-13.
______________________________________
NODE NUMBER CORRESPONDENCE
Node Numbers in
Signal Name Reference Figure
FIG. 67
______________________________________
GO 30 FIG. 4 3044
RBUSY 36 FIG. 4 111
CTS 37 FIG. 4 108
SBUSY 35 FIGS. 4, 6 2509
CMD 31 FIGS. 4, 6 2511-2513
CHAR 32 FIGS. 4, 6 2500-2507
STRID 33 FIGS. 4, 6 2510
SCID 34 FIGS. 4, 6 2508
String Control
FIG. 5 2530-2699
RAM 80
LEN1 register
81 FIG. 5 2486-2492
LEN2 register 82
FIG. 5 2493-2499
DISP register 83
FIG. 5 2333-2399
POS register 84
FIG. 5 2153-2160, 2215-2266
Parameter Generation
RAM 38 FIG. 6 380-563
CHAR 39 FIGS. 6, 7 357-364
STRID 40 FIGS. 6, 7 371
WGHT 41 FIGS. 6, 7 368-370
C 42 FIGS. 6, 7 365-367
CMD 43 FIGS. 6, 7 372-374
Tally Memory 44
FIG. 7 823-926
STRID 45 FIGS. 7, 8 811
WX 46 FIGS. 7, 8 807-810
CMDB 47 FIGS. 7, 8 1700-1701
CMDX 48 FIGS. 7, 8 1702
WCHT 49 FIGS. 7, 8 701-703
CMD 50 FIGS. 7, 8 815- 817
C 51 FIGS. 7, 8 818-821
T 52 FIGS. 7, 8 822
COMPA 53 FIGS. 8, 9 1114-1122
STRID 54 FIGS. 8, 9 1106
WCHT 55 FIGS. 8, 9 1107-1109
CMD 56 FIGS. 8, 9 1102-1104
C 57 FIGS. 8, 9 1110-1113
T 58 FIGS. 8, 9 1105
COMPB 59 FIGS. 9, 10 1227-1235
STRID 60 FIGS. 9, 10 1240
WGHT 61 FIGS. 9, 10 1241-1243
CMD 62 FIGS. 9, 10 1236-1238
T 63 FIGS. 9, 10 1239
RSUM 64 FIGS. 10, 11 1453-1462
STRID 65 FIGS. 10, 11 1450
CMD 66 FIGS. 10, 11 1447-1449
MVAL 67 FIG. 11 1607-1622
READY 68 FIG. 11 1605
TOTRSUM 71 FIGS. 12, 13 1774-1784
STRID 72 FIGS. 12, 13 1788
CMDB 73 FIGS. 12, 13 1785-1786
CMDX 74 FIGS. 12, 13 1787
TOTMVAL 75 FIG. 13 1931-1946
______________________________________
The chip manual entitled Advanced Product Description is included as Appendix 3 and made a part hereof to provide disclosure of use of the chip. Appendix 4 is included herein and made a part hereof as an example of an electrical interface of the chip with the S-100 Bus, a widely used system for computer interconnection. The circuit described in Appendix 4 supports appropriate communication between the chip, a 4K-word by 8-bit RAM, and any of the widely used computer system which are compatible with the S-100. Appendix 4 included drawings A4-1 through A4-8. The improved string comparator is based on the associative memory circuit originally disclosed and filed on Mar. 14, 1979 as Ser. No. 20,618 includes a word associator circuit shown in detail as an electrical digital circuit in FIGS. 14 through 26. The system or associative memory circuit is shown in FIG. 14. This associative memory circuit is an improved associatve retrieval device that includes the use of the word associator or comparator circuit connected in a storage loop to locate and extract records that are most similar to the supplied query. Inexact queries will rapidly locate records similar with respect to word, numeric and mask related measurements of similarity. The new and improved method that is set forth below in detail provides a method of word comparison and a method of processing in the improved associative memory circuit or associative retrieval device. The processing is preferably in a parallel configuration that provides rapid response to queries, while processing a large number of simultaneous requests. Referring now to FIG. 14, most internal data traffic within the associative memory circuit 10 passes through shared memory 312, such as a time multiplexed multi-port random access read-write memory of any well known design such as TI's 74200. Each of the many ports of the shared memory 312 is allotted a brief time slice on the order of one millisecond. A port may disconnect prior to this time elapsing. The associative memory circuit 10 communicates with the outside world through its communications modules 314 and 314' of any well known design. A plurality of communications modules may be connected as illustrated by numeral 318 and the small circles or dots. The communications modules 314 and 314' are microcomputer based flexible interfaces responsible for decoding requests and then supervising the operations of the associative memory circuits 310 to satisfy the request. The communications modules or circuits may communicate with the other associative memory circuits 10 using shared memory through buses 316 and 316' in any well known manner and by use of any well known design. These communication modules might also perform considerable preprocessing before passing a query onto the other system components. The main storage units (MSU) 320 and 320' of any well known design are devices that contain the actual records to be searched in memory units of any well known design. The main storage units contain any of a variety of well known control circuits to transmit these records in a fixed format over a bus. A plurality of main storage units may be used as illustrated by number 322 and the dots. The transmission format requires the simultaneous transmission of record characters taken sequentially from the record moving from right to left and from left to right, see FIG. 25 and the in-use description set forth hereinbelow. Numeric portions of a record are transmitted separately. The bus or lines 324 and 324' also contain control and timing signals, error correction codes and a data path of well known design for use in the communications between associator circuits 342, 344, 342' and 344' and extractor circuits 356 and 356'. Within an MSU 320 and 320', data might be compressed to conserve resources by any well known means. The main storage units (MSU) 320 and 320' might be formed using virtually any of today's data storage devices. Following the transmission of each record along lines 324 and 324', a short blanking period is required to permit the associator circuits to initialize themselves for another record. Prior to the transmission of each records data, the MSU 320 and 320' must transmit an internal record number for the record that follows. These numbers should be assigned sequentially by the MSU 320 and 320'. The control circuit 330 is connected to the MSU devices 320 and 320' by bus 326. The control circuit 330 is responsible for all update and control of the MSU's. The control circuit may consist of one or more simple microcomputers of well known design. Control circuit 330 communicates through shared memory 312 over bus 340. Optionally, a direct interface 332 of well known design might be attached by bus 334. This would permit a direct data path from an MSU 320 and 320' to an external high speed device. This would facilitate the rapid loading of an entire MSU 320 and 320' as might occur at bootstrap time. All data storage loops generated by the MSU devices 320 and 320' feed into network switching 336 by bus 324 and 324'. The network switching circuit 36 is responsible for routing through bus 338 or 338' data from an MSU 320 and 320' to a vacant associator circuit 342 or 344 as well as 342' or 344' to satisfy a query. Additional associator circuits may be connected between 342 and 344 and 342' and 344' as illustrated by numerals 346 and 346' and the dots. Additional parallel circuits may also be interconnected as illustrated by numeral 348 and the dots. Network switching circuit 336 is connected to a control device 350 of well known design by bus 352 which processes requests communicated through shared memory 312 by connection bus or line 354. Control device 350 decodes these requests and decides which requests are to be processed and in what order. Then control device 350 communicates to network switching 336 over line 352, a specific order to reconfigure the network. The associator circuits (ac) 342, 344, 342' and 344', are an important part of this invention. The associator circuits 342, 344, 342' and 344' are connected in strings terminated at one end by a single or multiple extractor circuits 356 and 356' respectively by continuations of bus 338 or 338' respectively and at the other end by the network switching module 336. Data from a selected MSU passes through network switching 336 and then through an associator circuit 342, 344, 342 and 344'. This circuit scrutinizes the data as it passes, looking for records that are very similar to the query provided. Of significance here is the fact that the word associator circuits, a part of 342, 344, 342', and 344' (within each associator circuit) can look for similar records at very high data rates. It is expected that data rates on the bus in excess of 20,000,000 characters per second are quite possibly using today's standard technology. The associator circuits 342, 344, 342', and 344' flag the most similar records and they are then extracted from the data stream by the extractor circuits 356 and 356' and eventually passes back through shared memory 312 over bus 360 to the communications circuits 314 and 314'. The diagnostic computer 364, also of any well known design, is connected in a well known manner to the associative memory 310 to provide system performance statistics and maintenance information in a well known manner. Referring now to FIG. 15, the basic module is referred to by numeral 343 which is a more detailed block diagram of associator circuits 342 and 342' of FIG. 14. Each pair of associator circuits in FIG. 14 is similar to the FIG. 15 illustration. FIG. 15 shows the associator circuit 343 along with the basic interconnections. The query is stored in query storage 370. As records pass by on the data bus 338, the records are received by the associator circuit on interface 372 over bus 374 where the records are merged with query characters transmitted over bus 376 in an appropriate manner and then forwarded through buses 378, 380, and 382 to the three types of associator circuitry. The three types of associator circuity are: (1) a word associator circuit 384, (2) a number associator circuit 386, and (3) a mask associator circuit 388. Within the word associator circuits 384 exists two circuits designated by numerals 400 and 400', one of which is shown in greater detail in block diagram in FIG. 17 which is illustrated in schematic form in FIGS. 19 through 25. The word associator circuit 384 combines the output of circuits 400 and 400' at the end of each record to arrive at the degree of word similarity. If the basic circuit of FIG. 17 is used then at the end of each record, the M output from each copy of the circuit are added together by any well known means to arrive at the numerator of the fraction that equals the degree of word similarity. The denominator is computed by any well known means including table lookup by circuit 384 and is equal to L(L+1) where L is the length of the compared words. If the more complex circuit of FIG. 18 is used, then the numerator is computed by any well known means and is equal to twice the sum of the M quantities output from the two copies of the circuit. The denominator is computed by any well known means and is equal to the sum of the TOTM quantities output from each copy of the circuit. The word associator circuit 384 may or may not actually perform a division to arrive at the degree of word similarity. Instead, the ranker 396 and the other associator circuits 380 and 388 might work entirely with fractional representations of similarity. Using the basic circuit of FIG. 17 as 400 and 400' in FIG. 15, computes the basic form of word similarity given by the mathematical formula disclosed herein. It should also be noted that FIG. 18 is an enhanced version of the circuit in FIG. 17. The word associator circuit 384 is mainly made up of circuits 400 and 400' and interconnecting circuitry of well known design. Again, referring to FIG. 15, at the end of each record, the three associators forward their "opinion" of how similar the record and the query were over bus 390, 392, and 394 respectively to a ranker 396 of well known design. If the record was a perfect match, then it is marked by the ranker 396 for immediate extraction. Otherwise it is ranked relative to the prior records processed. Only the N highest ranking records are maintained in the ranker 396 by their internal record numbers. Here, N is an integer design parameter in any well known manner. Loading of the many parameters involved in the association process is controlled by an onboard microprocessor or controller 398 of well known design. The microprocessor is connected to the basic module in a well known manner. When all records have been observed, the ranker waits for the highest ranking records to appear again in the storage loop. As they appear, the ranker 396 marks them for extraction in any well known manner. The basic method for computing the degree of word similarity in a non-complex and expeditious manner involves processing sybmols as they occur in a bidirectional serial data stream. By bidirectional serial, we mean that successive positions from each of the two words under comparison and from their flips are simultaneously transmitted. For example, imagine two observers of the data stream performing the procedure or method as illustrated in FIG. 26. Of significance is the fact that each observer needs knowledge only of the data instantaneously before him in the data stream. We describe the procedure from the standpoint of these observers. We describe the information they must remember, computations they must perform and decisions they must make. Each observer performs an identical method, the first observes the transmission of the words, the second observes the transmission of the flips of the words. Before starting, each observer must perform certain initial tasks. After all data has passed, they combined their knowledge to arrive at the degree of word similarity between the transmitted words. Hereafter, the block diagram shown in FIG. 17 will be described, second the method performed by each observer will be described and then the method in which their knowledge is eventually combined will be described. Referring now to FIG. 17, the one word associator circuit in block diagram form is illustrated as numeral 400. Two of these circuits 400 are included in the word associated circuit 384 in FIG. 15. The data selection 1 illustrated in FIG. 17 is shown in detail in FIG. 19 and may utilize two quadruple bus gates such as Texas Instruments, Incorporated's circuit 74125 and 74126, Exhibit A in the original application and made a part of this application. It has two input buses 404 and 406 entering from above. One of these two in one time frame is routed to a single output 408 exiting below. FIG. 19 discloses the schematic for a single bus line. Clock input potential .theta. 110 is connected to the selector 402. Two clocks are utilized in circuit 400, FIG. 17. They are referred to as .phi. and .theta. and graphically illustrated in FIG. 16 to disclose their interrelationship. Their interrelationship and purpose is described and disclosed in the diagram illustrated in FIG. 16. The purpose of .phi. is to select either the read or write mode of operation in the tally memory 114 in FIG. 17 and to provide certain triggers in latches 170 and 136 of FIG. 17. When .phi. is low, the read mode of operation in 114 is selected. When .phi. is high, the write mode of operation in 114 is selected. The event consisting of a low to high transition of .phi. may serve as a trigger to latch components 170 and 136 of FIG. 17. The purpose of .theta. is to define whether a query or a bus character is currently being processed. A complete cycle of .phi. corresponding to a read/write cycle in 114 occurs during each half cycle of .theta. as shown in FIG. 16. Another purpose of .theta. is to provide a trigger to latch 182 of FIG. 17. The event consisting of a high to low transition of .theta. and may serve as a trigger to latch 182. The clock .theta. is used as a control input in blocks 144, 126 and 402 of FIG. 17. The main portion of the FIG. 17 circuit is designated by numeral 112 and includes a random access read/write memory 114 referred to as tally or a tally memory. The memory address enters from above through bus 408. The read/write mode of operation is selected by the read/write potential .phi. entering from the right through line 116 from a clock means of well known design not shown. When read/write potential is low, the read mode is selected. When read/write potential is high, the write mode is selected. Data exists from the lower left on bus 120 and .phi. enters from the upper right. In the preferred embodiment, tally is organized as 256 8 bit words. Thus, addresses and data are 8-bit quantities. FIG. 20 shows the schematic for each bit of a tally word. The tally may include one or more Texas Instrument Incorporated Hex inverter 122 No. 7404, Exhibit AA in the original application, and 256-bit read/write memory 124, No. 74200, Exhibit B in the original application and made a part hereof disclosing the circuits. It should be designed so that its contents may rapidly be set to zero by clear means 118. Simply setting all cells simultaneously to zero, may however not be practically feasible due to power surge and overheating considerations. Therefore, several cycles may be necessary to clear the memory. Each cycle would then clear a fixed portion of the memory. Also, it is not necessary to actually set all of the cells to zero. An extra bit associated with each word could be maintained. To clear the memory, only those extra bits would be cleared. Then when a memory word is read, this extra bit is tested. If it is zero, then zero is read out instead of the actual memory contents. This extra bit is set only when its associated location is written into. Now, if one is read out, then the actual memory contents are presented as usual to the outside word. In this way, the extra bits affect a logical clearing of the memory while avoiding a physical clearing of all of the cells. Techniques such as these serve to significantly expedite the clearing operation, but such procedures are not necessary because there are well known standard procedures available. The add block FIG. 17 is an incrementer/decrementer 126. Data enters from the right. Depending on the state of input potential, the input potential value of the entering data is either incremented or decremented before exiting the two four bit binary full adders circuit 130 on bus 132, as shown in FIG. 21. The adders 130 may include a Texas Instrument Incorporated 4-bit binary full adder No. 7483, Exhibit C in the original application and made a part hereof. Hex inverters 134 No. 7404, Exhibit AA in the original application is connected between the input potential over bus 128. The incrementer/decrementer 126 in FIG. 17 is actually an adder in which one of the summands is restricted to either plus or minus one. If input potential of 128 is zero, then plus one is added. If input potential of 128 is one, then minus one is added. A positive edge triggered data latch 136 in FIG. 17 is connected to bus 132 by bus 138. Data enters through bus extension 138 from the left and exits from the right on bus 140 to tally 114. On the positive going edge of read/write potential actuated by input .phi., the contents entering latch 136 are latched and become the output from the latch over bus 140. In the preferred embodiment shown in FIG. 22, the latch is a two-bit D-type register 5 with 3-state output 142 shown in FIG. 22, Texas Instrument Incorporated Number 74173 attached to the original application as Exhibit D and made a part hereof. A convertible sign tester 144 in FIG. 17 has an input through bus 132 for data entering from the right and the tester 144 determines if this data is non-positive or non-negative, depending on the state of the input potential on bus 146 entering from below. The test is performed relative to two's complement arithemetic. If the input potential is low, then test will output high on output bus 148 to the left, provided that its input is greater than or equal to zero. In the preferred embodiment, tester 144 is shown in FIG. 23 as an 8-bit device. The input 132 is connected to two dual 5-input positive no gate 150 and 152 Texas Instrument No. 74260 attached in the original application as Exhibit E and made a part hereof connected to one gate of a quadruple 2-input positive and gate 154, Texas Instrument No. 7408 attached in the original application as Exhibit EE and made a part hereof. Gate 154 is connected to one gate of a quadruple 2-input positive or gates 156 Texas Instrument Incorporated No. 7432 attached in the original application as Exhibit EEE and made a part hereof. Gate 156 Texas Instrument No. 7432 is also connected to and gate 158. And gate 158 Texas Instrument No. 7408 is connected to one of the input lines 132 and line 146 through inverter 160, a Texas Instrument No. 7404. The output of 156 is connected to the input of 162 that is the same as 156 having another input from and gate 164 with input from bus 146 and the output of inverter 168, a Texas Instrument Incorporated 7404. The exhibits are incorporated herein and made a part hereof. A clearable edge triggered latch 170 is shown in FIG. 17 and shown in detail in FIG. 24 with an adder attached as described below. Input 172 inserts a one (1) into latch 170. The output is transmitted on bus 174 to add latch 182. The adder 170 has two inputs on busses 148 and 172 and a single output on bus 174 which is the sum of the inputs. The output 178 of the adders 176, such as a 4-bit binary full adder of Texas Instrument, Incorporation No. 7483 attached as Exhibit C on the original application become the input to the latches 180, such a 4-bit D-type registers with 3-state outputs, Exhibit D. One of the inputs to the adder is the output of the latch. The other input enters from above in FIG. 24 and is wired permanently to be equal to 1. The latches 180 are similar to item 142 and are triggered from bus 148 when it is high at the positive going edge of the read/write potential. The current latch contents exit from below over bus 174, to an add-latch 182. In the preferred embodiment, latch 170 is an 8-bit latch coupled to an adder with an 8-bit output and two 7-bit inputs as shown in FIG. 24. The clearing connections are not shown in FIG. 24, but may be accomplished by any well known manner. Add-latch 182, shown in FIGS. 17 and 25, is like item 170. Its input enters from bus 174 from above. Add-latch 182 is triggered on the negative going edge of the read/write potential .theta.. In the preferred embodiment, add-latch 182 includes 4-bit binary full adders 184 such as Texas Instruments Corporation No. 7483, is a 13-bit latch coupled, that is connected to a 4-bit D-type register with 3-stage outputs 186 such as Texas Instrument Corporation No. 74173. Indicator/readout devices of well known design may be connected thereto. The output of item 182 is an electrical output signal that may be translated to readable or other indications by any well known device. Clearing connections are not shown in FIG. 25, but any well known devices or procedures may be used. Referring now to illustration FIG. 26, the memory 190 contains the words ABC and ABB which are to be compared. They are transmitted via two transmitters 192 and 194 over a data stream four characters wide, as illustrated. The top half of the stream contains the transmission of the unaltered words and the bottom half contains the transmission of the flips of the words. On each side of the stream sits an observer 196 and 200. Each observer is watching a single column at a time as columns flow from left to right. The memory 190 and transmitters 192 and 194 correspond to the MSU 320 of FIG. 14 and the data stream roughly for illustration purposes corresponds to the data bus 338 of FIG. 15 (although query characters do not occur on the data bus or in the MSU). The two observers correspond for illustration purposes to the two copies of the circuit 100 shown in FIG. 17 contained within the word associator 384 of FIG. 15. To perform his appointed task, each observer must "remember" a numeric quantity assocaited with each alphabet member. In this example there are but three; A, B, and C. This collection of quantities corresponds for illustration purposes to tally 114 of FIG. 17. Each observer must also "remember" a numeric quantity R that is 170 and another M that is 182. These correspond for illustration purposes directly with FIG. 17. At each instant in time, each observer notices two characters before him. He processes one at a time in some fixed order, say top to bottom. First, he increments the quantity corresponding to the top character. Then if it is less than or equal to zero, he increments R. Next, he decrements the quantity corresponding to the bottom character. Then if it is greater than or equal to zero, he increments R. Finally, now that both characters are processed, he updates M by adding R to it. This continues for each column as they flow past. The relationship between FIGS. 17 and 26 are now more particularly described. To illustrate this relationship a description of what occurs in the circuit of FIG. 17, step by step, as the example of FIG. 26 is processed by the circuit is presented. First, OBSERVER ONE shown as 196 in FIG. 26 corresponds to one copy of the circuit 400 of FIG. 17. We will also call the word "ABC" the query word and "ABB" the bus word. The circuit of FIG. 17 is controlled by two clock cycles shown in FIG. 16. The bottom signal (THETA) is the major timing signal. Within each THETA cycle, another clock PHI makes a complete cycle. PHI is called the minor clock. These clocks are an essential part of the invention only insofar as they specify the order of processing which occurs within the circuit. Three major clock cycles are required to process the examples with three characters. Within each of these, two characters are processed, first a character from the query word is processed and then a character from the bus word. Since OBSERVER ONE is considered at this point, the circuit first processes the character "A" from the query and "A" from the bus word. Then it processes the character "B" from the query word and "B" from the bus word. Then it processes the character "C" from the query word and "B" from the bus word. It is useful to imagine that these characters flow downward into selections 404 and 406 as follows:
______________________________________
Minor clock cycle
1 2
______________________________________
(Major clock cycle 3)
C B
(Major clock cycle 2)
B B
(Major clock cycle 1)
A A
______________________________________
In other words external means are used to access the two words being compared and "feed" these words into the core circuit. Then external means may be used to interpret the result of the circuit which at the completion of the computation is available at output 182. This interpretation normally involves a division but the essential point is that the degree of similarity is directly indicated by the magnitude of the number present at output 182 at the end of the computation, i.e. larger numbers mean more similarity. In other words this quantity may be "normalized" by a process involving a division as discussed elsewhere if one desires a similarity measure that ranges from zero to one. Before operation begins, R, M. and TALLY are reset to zero. During Major clock cycle 1 Minor clock cycle 1: During the start of this period while PHI is low, the character "A" is routed from 404 to become the address of the memory 114 and since PHI is low that memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low this block adds one to the value producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "A". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period while PHI is high, TALLY is writing its updated information. The character "A" is present at the selection input 404. The character "A" is present at the selector input 406. The R register shown as 170 in FIG. 4 contains zero. The M register shown as 182 in FIG. 4 contains zero.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
0 1
Tally location "B" =
0
Tally location "C" =
0
______________________________________
During Major clock cycle 1 Minor clock cycle 2: During the start of this period while PHI is low, the character "A" is routed from 406 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high this block subtracts one to the value producing 0. This is then routed both into LATCH 136 and TEST 144. Since THETA is high TEST outputs a 1. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "A". Also at this transition the R register is incremented if signal 148 is 1. In this case it is. During the second half of this period while PHI is high, Tally is writing its updated information. The character "A" is present at the selector input 404. The character "A" is present at the selector input 406. The R register shown as 170 in FIG. 17 contains zero at the start of this period and 1 at the completion. The M register shown as 182 in FIG. 17 contains zero at the start of this period and 1 at the completion.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
1 0
Tally location "B" =
0
Tally location "C" =
0
______________________________________
The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 1. During Major clock cycle 2 Minor clock cycle 1: During the start of this period while PHI is low, the character "B" is routed from 404 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low this block adds one to the value producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by Add back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period while PHI is high, TALLY is writing its updated information. The character "B" is present at the selector input 404. The character "B" is present at the selector input 406. The R register shown as 170 in FIG. 17 contains 1 throughout. The M register shown as 182 in FIG. 17 contains zero.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
0 1
Tally location "B" =
0
Tally location "C" =
0
______________________________________
During Major clock cycle 2 Minor clock cycle 2: During the start of this period while PHI is low, the character "B" is routed from 406 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high this block subtracts one to the value producing 0. This is then routed both into LATCH 136 and TEST 144. Since THETA is high TEST outputs a 1. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is. During the second half of this period while PHI is high, TALLY is writing its updated information. The character "B" is present at the selector input 404. The character "B" is present at the selector input 406. The R register shwon as 170 in FIG. 17 contains 1 at the start of this period and 2 at the completion. The M register shown as 182 in FIG. 17 contains 1 at the start of this period and 3 at the completion.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
0 0
Tally location "B" =
1
Tally location "C" =
0
______________________________________
The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 3. During Major clock cycle 3 Minor clock cycle 1: During the start of this period while PHI is low, the character "C" is routed from 404 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is low this block adds one to the value producing 1. This is then routed both into LATCH 136 and TEST 144. Since THETA is low TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "C". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period while PHI is high, TALLY is writing its updated information. The character "C" is present at the selector input 404. The character "B" is present at the selector input 406. The R register shown as 170 in FIG. 17 contains 2 throughout. The M register shown as 182 in FIG. 17 contains 3 throughout.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
0
Tally location "B" =
0
Tally location "C" =
0 1
______________________________________
During Major clock cycle 2 Minor clock cycle 2: During the start of this period while PHI is low, the character "B" is routed from 406 to become the address of the memory 114 and since PHI is low the memory responds by reading out this location and presenting it to the ADD block 126. Since THETA is high this block subtracts one to the value producing 0. This is then routed both into LATCH 136 and TEST 144. Since THETA is high TEST outputs a 0. At the positive going edge of PHI, the LATCH contents are frozen as TALLY switches into write mode. This in effect writes the updated quantity just computed by ADD back into location "B". Also at this transition the R register is incremented if signal 148 is 1. In this case it is not. During the second half of this period while PHI is high, TALLY is writing its updated information. The character "C" is present at the selector input 404. The character "B" is present at the selector input 406. The R register shown as 170 in FIG. 17 contains 2 at the start of this period and 2 at the completion. The M register shown as 182 in FIG. 17 contains 3 at the start of this period and 5 at the completion.
______________________________________
Original Value
Updated Value
______________________________________
Tally location "A" =
0
Tally location "B" =
0 -1
Tally location "C" =
1
______________________________________
The end of this cycle is marked by the negative going edge of THETA. At this time M is updated and becomes 5.
______________________________________
FINAL TALLY CONTENTS:
______________________________________
Tally location "A" =
0
Tally location "B" =
-1
Tally location "C" =
1
FINAL R CONTENTS: 2
FINAL M CONTENTS: 5
______________________________________
The circuit copy representing OBSERVER TWO would work in the same manner except that the order of characters for both words is reversed. In the above, we have assumed that the observer started with all quantities equal to zero. Once the record has passed, the two observers add together the values for M that they have arrived at. This result is then divided by L(L+1) which in this case is 3(3+1)=12. L is defined as te length of the compared words. Thus, we have 8/12=2/3 as the final similarity between the two words. FIG. 26 displays the final state of all numeric quantitites involved. A basic mathematical formula has been created. If A is an alphabet, then words in A are finite concatenations of members from A. If w is a word in A, then w denotes the flip of w. For example, the flip of the word "abcd" is "dcba". If x and y are numbers then (x,y) denotes the greater of the two quantities x-y and 0. If w is a word in A, then n(a,w,i) denotes the number of occurances of alphabet member "a" in word w found in position i or beyond where position is measured canonically from left to right. If w and v are words in A, both of length L, then the degree of word similarity between them is denoted S(w,v) and is given by the formula below: ##EQU1## This formula, as set forth above and well understood by those skilled in the art, produces a number between 0 and 1 inclusive. It produces 1 if and only if w and v are identical. It produces 0 if and only if w and v share no common alphabet members. Intermediate values are interpreted as degrees of similarity between these two extremes. Formula exist and are discussed in "The Definition, Computation and Application of Symbol String Similarity Functions" referred hereinabove, which do not presume equality of the lengths of w and v. The above formula is, however, the most fundamental. It equally weighs all alphabet members and word positions. It corresponds to the circuit of FIG. 17. In the equation above, the fundamental computation is that involving the double summation. Various forms of the equation might still perhaps produce a useful measurement of word similarity. For example, the whole equation might be raised to some positive integral power. Falk in the article referred to hereinabove, discussed a function which also associated with each pair of words, a number between zero and one. But his function is considerably more complex and in practice is much more difficult to compute. This is discussed in the thesis referred to hereinabove. The disclosed formula rests on a simpler and more rigerous mathematical foundation, as pointed out in the thesis. Faulk also makes little attempt towards justifying his formula. It should be noted that the circuit invention evolved from the formula to the algorithm to the circuit. The following criteria are met: 1. Mathematical simplicity 2. Ease of computation 3. Agreement with human intuition 4. Flexibility to permit varied applications. The formula derived in the thesis and disclosed herein and made a part hereof is mathematically simple and produces results that appear to agree well with human intuition as referred to in the IEEE articles referred to hereinabove. Computation of the formula in a straight forward fashion requires quite a bit of work, primarily due to the double summation. The next evolutionary step is the invention of an algorithm which quickly computes the formula. This algorithm is presented herein in the form of a Fortran function subprogram. It is called with three input parameters: IQ, IR and N. IQ and IR each of which is a dimension N integer vector. Upon return, the variable Theta is the degree of similarity between the input words IQ and IR. Alphabet members are integers between 1 and 256 inclusive. To process a character from each of the two words under comparison, the algorithm implemented in machine language on a modern general purpose CPU such as the IBM 370, requires the execution of dozens of instructions each comprised of many micro instructions. To perform this same task, our circuit requires only two internal clock cycles. Each represents approximately a read/write cycle pertaining to a high speed random access memory. Actually, the clock must be slightly slower than the memory's maximum speed to permit other circuit components to operate. In use, however, the multiple should be less than 2. Therefore, we see that our circuit is capable of computing our definition of word similarity much faster than any existant general purpose processor. A complete similarity memory system may contain many such circuits. Therefore, the system would than be performing a search function beyond the capabilities of existant general CPU's. The basic algorithm as programmed in Fortran is:
__________________________________________________________________________
FUNCTION THETA (IQ, IR, N)
__________________________________________________________________________
C IQ and IR ARE EACH WORDS OF LENGTH N IN THE ALPHABET CONSISTING
C OF THE NUMBERS 1 THROUGH 256. THEY ARE PASSED AS INTEGER VECTORS
C EACH OF DIMENSION N. UPON RETURN, THETA ASSUMES THE VALUE OF
C THE BASIC WORD SIMILARITY BETWEEN IQ AND IR.
C
INTEGER R
DIMENSION ITALLY (256), IQ(N), IR(N)
DO 1 I=1,256
1 ITALLY(I)=0
M=0
R=0
C
DO 2 I=1,N
ITALLY (IQ(I))=ITALLY(IQ(I))+1
IF (ITALLY(IQ(I)).LE.0) R=R+1
ITALLY (IR(I))=ITALLY(IR(I))-1
IF (ITALLY(IR(I)).GE.0) R=R+1
2 M=M+R
C
R=0
DO 4 I=1,256
4 ITALLY(I)=0
C
DO 3 J=1,N
I=N+1-J
ITALLY (IQ(I))=ITALLY(IQ(I))+1
IF (ITALLY(IQ(I)).LE.0)R=R+1
ITALLY(IR(I))=ITALLY(IR(I))-1
IF (ITALLY(IR(I)).GE.0) R=R+1
3 M=M+R
C
THETA=FLOAT(M)/(N*(N+1))
C
RETURN
END
__________________________________________________________________________
Referring now to FIG. 18, the SEL is a random access read/write memory 208. Its address enters on bus 224 from above and data output is to the right on bus 214. In FIG. 18, it is assumed that the read mode is selected as the device is only written to during master initialization. In the preferred embodiment the read/write memory 208 is organized as 127 2-bit words. SYN consists of three random access read/write memories, 202, 204, and 206. In each, the address enters on bus 406' from above the data output transmitted on bus 220, 220', and 220" from below. In the Figure, we assume that the read mode is selected as the device is only written to during master initialization. In the preferred embodiment, it is organized as three memories, each consisting of 256 8-bit words. Select is a data selector 210. It has three input busses 220, 220', and 220" entering from above. One of these three is routed to a single output 224 existing below, depending upon the numeric value of the 2-bit value entering from the left. If this value is 0, then Select ignores its input and outputs zero. If this value is 1, 2 or 3, then the first, second or third input data bus respectively is routed to the output. In the preferred embodiment, it has 8-bit inputs and outputs. MV is a one bit latch 222. It is set/reset during master initialization. The current state of MV exists above. Select 402 is a data selector. It has two input busses 224 and 404' entering from above. One of these is routed to a single output exiting below, depending upon the state of .theta. entering from the right. If .theta. is low, then the right input bus is selected. Otherwise, the left bus is selected. In the preferred embodiment, Select has 8-bit inputs and outputs. Double skip on zero is a circuit to block the propagation of .phi. and .theta., for the duration of two .phi. cycles, provided that .theta.=0, MV=1 and the output from Select is zero, at the time of a positive transition of .phi.. This effectively causes the later circuit stages to ignore completely the current column. These altered versions of .phi. and .theta. are then used by the central stage of the circuit. PW is a random access read/write memory. Its address enters from the left and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 127 2-bit words. CW is a random access read/write memory. Its address enters from above and data output is to the right. In the figure, we assume that the read mode is selected, as the device is only written to during master initialization. In the preferred embodiment, it is organized as 256 2-bit words. Distributor is a data distributor. It has three outputs above and a 2-bit control input to the left. When this input is zero, all three outputs are zero. When it is 1, 2, or 3, the first, second or third output goes high respectively, leaving the others zero. Single skip on zero is a circuit to block the propagation of .phi., for the duration of one .phi. cycle, provided that the cumulative output from the gate circuits is zero at the time of a positive transition of .phi.. This altered version of .phi. is then used by the lowest circuit stage. Each gate is a pair of logical and gates used to control the propagation of the data output from CW. Both bits leaving CW center each gate to the left. Inside gate there are two 2-input and gates. One input from each becomes a common control input shown entering from below. The remaining two inputs connect to the two entering data lines. The outputs from the gates are shown to the right. When the control input is low, the gate outputs zero. When it is high, gate simply propagates its two bit input. The combination of the three gate circuits and the distributor circuit forms a variable shift register which shifts the output of CW, depending upon the output of PW. This affects the computation of the final weight. TOTM is an add-latch device. Its input enters above. It is triggered on a negative .theta. transition. In the preferred embodiment, TOTM is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input. R is an add-latch device. Its input enters above and its output exists below. It is triggered on a positive .phi. transition provided that T entering from the right is equal to one. In the preferred embodiment, R is an 11-bit latch coupled to an adder having one 11-bit input and one four bit input. TOTR is an add-latch device. Its input enters above and its output exists below. It is triggered on a positive transition of .phi. provided that .theta.=1. In the preferred embodiment, it is an 11-bit latch together with an adder having one 11-bit input and one four-bit input. M is an add-latch device. Its input enters above. It is triggered on a negative .theta. transition. In the preferred embodiment, it is a 17-bit latch together with an adder having one 17-bit input and one 11-bit input. Sign test is defined as test of FIG. 17. L is defined as L of FIG. 17. Tally is defined as tally of FIG. 17. Adder is defined as add of FIG. 17. Referring to FIG. 18, this diagram illustrates how the basic circuit of FIG. 17 may be considerably enhanced without sacrificing speed of processing. In FIG. 18, before data reaches the core or data selector, circuit 402 and the basic word associator circuit 112 that is identical to that shown in FIG. 17, several tasks are performed. First, a memory word is fetched corresponding to the current column position being processed. If this word is zero, then the current column is ignored. This is accomplished by using the double skip on zero circuit 220 of well known design. This circuit merely blocks propagation of all timing signals during the current character pair. Therefore, the circuit ignores the current column. Whereas the circuit of FIG. 17 processed every column unconditionally, this facility permits column selection in the circuit of FIG. 18. If the fetched word is non zero, then it is used to select one of three tables to be used in translating the data character from the record before it reaches the select circuit 402. This is called synonym processing and permits additional flexibility. The facilities above are implemented via the random access memory's SEL 208 and the three random access memory's labels SYN 202, 204, and 206, and by the SEL 208 component which simply selects one of the three outputs from the SYN memories 202, 204, and 206. The SEL 208 is of any well known design. The SYN 202, 204 and 206 is a set of three random access memories of a well known design. If the translated value of a record character is zero and the MV flag is set, then the entire current column is ignored as above. MV 222 is a one bit latch of a well known design. This permits the definition of missing value fields so as not to detract from the measure of the similarity between the record and the query. Position bus 224 is connected to SEL 208 and PW 226 that is a random access memory of well known design. The enhancements we have discussed so far constitute simple preprocessing and are not crutial to the basic invention. We now discuss some more crucial enhancements. In the circuit of FIG. 17 you will note that the quantity "1" is always added to R 107. This has the effect of weighing all alphabet members and column positions equally. FIG. 18 implements a weighing scheme wherein the column number currently under consideration and the current character are used to determine a weight which is to be added to R instead of "1". In this way, one can weight characters heavier than others. Generally speaking, many circuits might compute a weight to be used to update R. FIG. 18 contains one such circuit. In this circuit, each alphabet character is assigned a two bit weight. This weights 0, 1, 2, 3 are possible. Each column position is also assigned a weight also two bits. But this weight is used to control a shift register so that here the possible weights are 0, 1, 2, 4. The character weight is effectively multiplied by the column weight to arrive at the final weight. A complete word associator circuit must of course contain two copies of the circuit of FIG. 18. We observe that the positional weight tables defined for each copy might differ. This might allow certain columns to be processed with more emphasis on initial or on final characters. When the two tables agree, there is no such directional bias. It should be noted that when the final weight of a column/character pair is zero, the character is not processed. This is accomplished by the "single skip on zero" circuit 230 which blocks propagation of timing signals for the duration of a single character. The weight scheme is implemented by the random access memories PW 226 and CW 242 of well known design and by the selectable shift register 244 formed by the distributor 246 and gate components 248, 250, and 252 and by the single skip on zero circuit 230, all of which are of well known designs. In the circuit of FIG. 17, the final result M needed to divide by N(N+L) where N is the length of the words under comparison, to arrive at the measure of similarity. In the circuit of FIG. 18, this denominator must be computed since it will depend upon the weights encountered during processing. In FIG. 18, TOTR 254 and TOTM 256 that are add-latches, compute a denominator term. A corresponding term is computed by the other copy of the circuit. The sum of these two terms is the final denominator. The final numerator is twice the sum of the two M values read out of the two circuit copies. The similarity between the query and bus words is the quotient of the numerator and the denominator quantities. The above is just one way in which the information read out of the circuit may be interpreted to arrive at a measurement of similarity. Other schemes might weight various terms unequally. The circuit 112' in the lower right of FIG. 18 is recognizable as very similar to the circuit of FIG. 17. The only difference is that the selection component is now external and R may now be updated by quantities other than "1". The circuit of FIG. 18 is divided by dashed lines into three stages designated by I, II and III. Note that busses passing from stage to stage are broken. These indicate that buffers might be inserted to achieve a pipeline with three stages. In this way the circuit can process data as fast as the circuit of FIG. 17. Without pipeline buffers, the circuit is two to three times slower. The timing signals are labeled identically in each stage but may vary from stage to stage both because of the optional pipeline and because of the skip on zero circuits. The circuit of FIG. 18 must be initialized before use. A master initialization must be performed once per search to establish weights, etc. This initialization must load the SEL 208, SYN 202, 204 and 206, PW 226, and CW 242 memories. Also, the MV 222 flag must be set or reset. Before each record is processed, additional initialization is required. The tally memory in 112' not illustrated must be set to zero as must the R and M, not shown, the TOTR 254 and TOTM 256 add-latches. After each record is processed, the contents of M, not shown in 112', and TOTM 256 are read out of the circuit. Connections describing initialization and readout are trivial and of well known design and are therefore not shown in the drawings. Finally, note that in the circuit of FIG. 18, the bus data character must be stable on the bus even during the processing of the query character. This permits the bus character to be translated while the query character, which does not pass through synonym translation, is processed. The query translation is better left to software since the query is fixed during a search. The instant invention has been shown and described herein in what is considered to be the most practical and preferred embodiment. It is recognized, however, that departures may be made therefrom within the scope of the invention and that obvious modifications will occur to a person skilled in the art. ##SPC1## ##SPC2## ##SPC3## ##SPC4## ##SPC5##
|
| ||||||||||
