Efficient implementations of constructs such as feature tables6192374Abstract A methodology for producing efficient implementations of constructs such as feature tables is disclosed. The method of superimposed coding and the method of inverted list tables are combined. Although they both accomplish similar things, each method has advantages over the other under particular circumstances. The method of superimposed coding is used where feature table rows are "dense" or contain many records of interest, while the method of inverted list tables is used where feature table rows are "sparse" or contain few records of interest. The combination of the two methods results in a synergistic method that loses no accuracy and is generally faster in operation than what could be achieved by the application of either method alone. Claims What is claimed is: Description FIELD OF THE INVENTION
TABLE 1
Records
Features a b c d e
1 1 1 0 0 0
2 0 1 0 1 1
3 1 0 1 1 0
In this tiny example, there are five Records, a through e, and three Features, 1 through 3. There are many types of feature tables utilizing these and other bit definitions and logical and other operations. The current invention can be applied effectively to all of these, as will be apparent to those skilled in the art, because the method of superimposed coding and the method of inverted list tables are entirely equivalent, but have different processing speeds in different regions of a given problem. For example, some types of feature tables require that a "1" bit in the (i,j) element means that Feature i is DEFINITELY present in Record j, while other types require only that Feature i MAY be present in Record j. Other feature tables reverse this logic, such that a "1" guarantees that the Feature is present, while a "0" means that the Feature may or may not be present. Still other definitions are possible and well known to those skilled in the art. Which of those many types to use in a given situation is beyond our present scope. The current invention deals only with efficient implementations of constructs such as feature tables, not with their construction or ultimate use. Here, for purposes of illustration, we will deal with only one type, as described below, although it will be understood that the discussion applies equally well to all constructs such as feature tables. Superimposed Coding According to this example of the method of superimposed coding, each Feature is represented by a row of bits, a "bitmap," that represents the possible presence or definited absence of the given Feature in the corresponding Record. For example in Table 1, Reconds c, d, and e are guaranteed NOT to have Feature 1 present in them, while Records a and b may or may not have Feature 1 present in them. A zero in the (i,j) element means that Feature i is definitely NOT present in Record j. Thus the feature table shown in Table 1 can be read directly as a set of rows, each having a bitmap suitable for superimposed coding. The superimposed coding of Table 1 is shown in Table 2.
TABLE 2
Features Bitmaps
1 11000
2 01011
3 10110
To use superimposed coding, one specifies one or more Features and (typically) performs a logical AND on them, element by element. The result is a bitmap whose length in bits is equal to the number of Records. For example, specifying only Feature 1 would produce a "result bitmap" of:
Record a b c d e
result bitmap 1 1 0 0 0
which is interpreted to mean that Records "c", "d", and "e" do NOT have Feature 1. In other words, the only Records that need to be considered by other evaluation software are Records "a" and "b". These Records of interest can be read directly from the result bitmap. They are the Records corresponding to "1" bits in the result bitmap. Specifying Features 1 and 3 would produce this "result bitmap":
Record a b c d e
Feature 1 1 1 0 0 0
Feature 3 1 0 1 1 0
result bitmap 1 0 0 0 0
since only Record "a" has a "1" for both Features 1 and 3. This result bitmap is interpreted to mean that Records "b", "c", "d", and "e" do NOT have both Features "1" and "3". The only Record that needs to be considered by other evaluation software is Record "a", which again can be read directly from the result bitmap. Again, in this use of superimposed coding, a "0" in the result bitmap indicates that the Feature, or Feature combination, is definitely NOT present in the corresponding Record. A "1" means that the Feature may be present. The important point in this example is that a "0" in the resulting bitmap guarantees that one or more of the Features specified is not present in the corresponding Record. This means that it is completely unnecessary to look at that Record in more detail. Ignoring such Records can result in a substantial savings in processing time, since it may be possible to eliminate looking further at a substantial portion of the Records. This is one important reason superimposed coding is widely known and used. The number of operations that must be performed is proportional to the sum of the number of Records indexed times the number of Features being searched. The exact number of operations in a given case has to be determined by analysis of the particular superimposed coding algorithm being used and how it is applied. Some superimposed coding algorithms do bitwise logical operations on whole words rather than bits, which can be more efficient on many computers. Other efficiencies are possible, such as checking the result bitmap for zeroes which, if found during the course of an application of the method, means that it is unnecessary to perform further logical AND operations against those bits containing zeroes. Inverted Lists An inverted list table presents the information in Table 1, but in a different form. Each Feature is itemized as in Table 1, but adjacent to each Feature is a list of the Record numbers of those Records that have a "1" bit in Table 1. Table 3 shows an inverted list table for the same information that is shown in Table 1.
TABLE 3
Features Lists of Records
1 a b
2 b d e
3 a c d
Repeating the previous examples, if Feature 1 is specified, we know immediately by accessing the list of Records associated with Feature 1 that the only Records that need to be considered are Records "a" and "b". If Features 1 and 3 are specified, we have to work a little harder. For each Record listed in the inverted list table list for Feature 1, we need to check to see if that Record is also listed in the inverted list table list for Feature 3. This requires performing a number of comparisons equal to the product of the number of Records in the Feature 1 list times the number of Records in the Feature 3 list, or six comparisons in this case. Checking these six cases turns up the fact that only Record "a" occurs in both lists, and therefore only Record "a" needs to be considered further. In general in the use of inverted list tables, an outer product must be formed of all the lists associated with the Features of interest in the inverted list table. The Records that are to be considered by other evaluation software are only those Records that appear on all of the lists associate with the Features of interest. Therefore, the number of comparisons that must be made under the inverted list table method in order to determine the final list of Records for further consideration, is the product of the lengths of the lists involved. However, note that this number may be reduced substantially by performing dyadic operations, in which case the computational complexity involves the sum of the products of pairwise comparisons of lists, where all but the first comparison involve intermediate result lists generated by earlier operations. The number of comparisons may also be reduced by other optimizations, all of which should be done to either or both methods prior to computing computational costs. Putting the Methods Together Although the end results are the same, the computational costs of using the two coding methods, superimposed coding and inverted list tables, are very different. Depending on the precise situation, the computational cost of either of these methods may be substantially lower than the computational cost of the other. Among the teachings of this current invention are guidance for those skilled in the art of estimating computational costs, regarding estimating the computational costs of the two coding methods in a given situation. This current invention also teaches how to put the two methods together into a single system, so as to employ the lower cost method consistently. Thus the use of the two methods in combination will be seen to be superior in almost every instance to the use of either method by itself, and except for the overhead of the additional computational work introduced by the current invention, which is usually negligible compared to the overall task, the combination is never inferior to the use of the better method alone. The general principles behind this synergy can be understood as follows. If there are a large number of Records and if any given Feature appears in only a few Records, then the bitmap for the superimposed coding representation of that Feature will be almost entirely zeroes and only a few ones. Such a Feature row is called "sparse." An inverted list table entry for such a row will contain correspondingly few Record numbers. In this "sparse" case, it can be far more efficient to use the inverted list rather than the superimposed coding. This can be confirmed by counting required operations in such a case. On the other hand, if any given Feature appears in a large number of the Records, then the bitmap for the superimposed coding representation of that Feature will contain a large number of ones and only a few zeroes. Such a Feature row is called "sdense." An inverted list table entry for such a row will contain a number of Records equal to the aforementioned large number of ones. In this "dense" case, it can be far more efficient to use superimposed coding rather than an inverted list. Again, this can be confirmed by counting the required operations in such a case. Moreover, this decision between the two coding methods can be applied on a Feature-by-Feature basis. If the Feature row is "sparse" (has many zeroes), then that row should be processed using the method of inverted list tables; if the Feature row is "dense" (has many ones), then that row should be processed using the method of superimposed coding. It is the recognition of the existence of this important tradeoff between the two methods that is one of the important aspects of this invention. By attention to this fact, and by taking into account the specific details of a given situation in light of this fact, the synergistic combination of the two methods can be achieved. It is only necessary to examine the exact situation in detail, taking into account such things as the number of computational operations, the speed of those operations in a given machine environment, and the specific uses to which the feature table will be put, to arrive at a beneficial application of the current invention in terms of the optimal combination of the two methods. In the application of the current invention, each row may be represented as a bitmap of the method of superimposed coding, or as a list of the method of inverted list tables, or as both. The specific counts of computational operations involved in using each method will determine whether a given Feature row is "dense" or "sparse." For each row, if that row is "dense," then the bitmap should be used; if that row is "sparse," then the list should be used. For rows that are in doubt, both should be provided. In the application of the current invention, when combining Feature rows, the length of each of the chosen rows is considered. The length of a row is the number of "1" bits in the row's bitmap in the case of the method of superimposed coding and alternatively, the number of Records in the row's list in the case of the method of inverted list tables. The length of a given row is the same under either method. The chosen Features are sorted into a Sorted Features List according to the lengths of their respective rows. They are then divided into two groups, the "dense" group and the "sparse" group, according to the lengths of their rows. Either group may contain zero, one, or more members. The "dense" group is then combined using the method of superimposed coding, by applying appropriate logical operations to the bitmaps in the manner well understood by those skilled in the art, to produce a result bitmap. The "sparse" group must be combined using the method of inverted list tables, by applying appropriate logical operations to the lists in the manner well understood by those skilled in the art, to produce a result list. The result bitmap from the "dense" combination must then be combined with the result list from the "sparse" combination by converting either of these forms to the other form, choosing as the target form whichever form results in the fewest operations, and combining the resulting converted form result with the other result to produce a final list of Records. If one of the groups contains no members, then the final combination of the two groups just described is unnecessary. The final list of Records is simply the result from the group that does contain members. If both groups contain no members, then the final list of Records contains no members. In the case where both groups contain members, a final combination must be performed. This should be done in the form that produces the fewest computational operations. The calculation of computational cost is important in the application of this current invention. The calculation of computational cost is a process that is well known and understood by those skilled in the art. Here it is assumed that these fundamental skills are understood clearly by the reader, and only those certain points that are unique to this current invention, or that may be less well understood, are discussed here. Here are the means to decide how to form the sparse and dense groups, and to decide which form to convert to the other form in order to make the final comparison, if necessary. Analysis of the computational cost of each alternative for a given computer, implementation, and feature table objective will show the needed information. The principle is this: When two rows must be logically combined, every bit in both rows must be dealt with under the method of superimposed coding, but under the method of inverted list tables, it is the outer product of the two row lists that must dealt with. The computational cost, C(SC), of the former is approximately proportional to the total number of bits, both "1" bits and "0" bits, in each row bitmap involved; that is, the total number of Records, a number that does not depend on the number of "1" bits in either row bitmap. The computational cost, C(ILT), of the latter is approximately proportional to the product of the number of Records in each row list, a number that does depend on the number of Records in each row. As a first approximation, if the product of the number of Records in each row list is less than the total number of Records, then it is computationally less expensive to combine the rows using the method of inverted list tables than it is to use the method of superimposed coding, and conversely. This first approximation, of course, is very crude, and can be dramatically improved in accuracy, by considering further details of the situation. For example, it is customary in the implementation of the method of superimposed coding to pack the bits of each row's bitmap into the natural "word" of the machine on which the program will run. Most machines deal with bits in such groups of words, and require additional overhead to deal with bits individually. A common optimization is to perform logical bit operations using bitwise logical operators that operate on the entire word as a unit, rather than operating on each bit in turn. This reduces the total computational cost of combining two bitmaps by a factor approximately equal to the number of bits in a word. When determining C(ILT) and C(SC) for comparison in this case, the value of C(SC) will be found to be reduced over the previous value by a factor approximately equal to the number of bits in a word. This means that C(ILT) would have to be smaller by approximately that same factor to remain less than C(SC) in this case. As another example of the types of details of the situation that must be considered in computing C(ILT) and C(SC), consider the fact that under the method of inverted list tables, each row lists Records. What are typically listed are really Record numbers, or other identifiers that uniquely identify specific Records, typically integers. The total number of Records referenced in the feature table determines what minimum machine representation is necessary to uniquely hold the Record numbers. This is required, because most computers have one or a few "natural" representations of integers, each consisting of a definite and fixed number of bits. Examples of such representations are the "byte," which is typically in units of 8 bits, the "short integer," which is typically in units of 16 bits, and the "integer," which is typically in units of 32 bits. The byte type can uniquely represent 256 different numbers; for the short integer type, the number is 65536; for the integer, it is 4294967296. Larger representations are possible. Thus if the row lists are composed of bytes, up to 256 Records could be uniquely referenced; if the row lists are composed of short integers, up to 65536 Records could be uniquely referenced; and if the row lists are composed of integers, up to 4294967296 Records could be uniquely referenced. The size of these representations enters into the calculation of a particular C(ILT), because the computational cost of comparisons between two such types, or other operations on them, is often a function of which type is being used. Thus C(ILT) has to be adjusted in well-known ways to take such factors into account. In many cases, C(ILT) and C(SC) can be computed at the time a specific implementation of this current invention is made for a particular situation, and they will not change thereafter, unless some assumptions or conditions that influence them change. Things that could cause a need for changing them include increasing the number of Records handled by the system, which will affect C(SC), since it depends on the total number of Records. In making such an increase, C(ILT) can be affected by surpassing the maximum number of Records that can be represented by the chosen data type, necessitating a larger data type. Another common situation that will likely necessitate re-examining C(ILT) and C(SC) both is the porting of an existing system of this current invention from one computing environment to another. There are other examples, but if the implementation of this current invention is to be a static thing, then it is likely that both C(ILT) and C(SC) can be computed a priori. To determine how to divide the Sorted Features List into two groups, one "sparse" and one "dense," apply C(ILT) and C(SC) in the following ways. Beginning at the end of the Sorted Features List that is most likely to favor C(SC) as the smaller measure, which is the end of the Sorted Features List that has the "densest" rows, compute C(SC) for the first two rows and note the value of C(SC) in a table, T1. Next, compute C(SC) between the likely result of applying the method of superimposed coding to the previous two rows, and the next row, add that value of C(SC) to the previous table entry, and insert the resulting sum as the next entry in the table T1. Continue this accumulation of computational complexity until all rows in the Sorted Features List have been considered in order. If the Sorted Features List contains only one row, or if it contains none, then it is not necessary to form groups. Note that it is not necessary to know the likely result of applying the method of superimposed coding between two rows or between a row and a previous result row in order to compute C(SC), since C(SC) depends primarily on the number of Records, not the specific bit patterns. Next, beginning at the end of the Sorted Features List that is most likely to favor C(ILT) as the smaller measure, which is the end of the Sorted Features List that has the "sparsest" rows, compute C(ILT) for the first two rows and note the value of C(ILT) in a table, T2. Then compute C(ILT) between the likely result of applying the method of inverted list tables to the previous two rows, and the next row, add that value of C(ILT) to the previous table entry, and insert the resulting sum as the next entry in the table T2. Continue this accumulation of computational complexity until all rows in the Sorted Features List have been considered in order. Again, if the Sorted Features List contains only one row, or if it contains none, then it is not necessary to form groups. Note that the likely result of applying the method of inverted list tables between two rows or between a row and a previous result row depends on the specific Records in each and therefore may not be known a priori. In order to compute C(ILT) in this case, simply assume the worst case: that no Records are eliminated in the combining of rows under this method. This will provide an upper bound on C(ILT). If a better bound can be found, it should be used. Otherwise, this upper bound value of C(ILT) will suffice for the purposes of the current invention. Now compare tables T1 and T2, taking into account that they were written in reverse order from each other. Reverse the order of either of the tables and place the two tables side by side. The intent here is for one of the tables to list increasing values and the other to list decreasing values. To form the two groups, inspect the two tables side by side. If all of the values of T1 are uniformly smaller than the values of T2, then there is only one group, the superimposed coding group, and all combinations of rows should be done using that method. If all of the values of T2 are uniformly smaller than the values of T1, then there is again only one group, the inverted list tables group, and all combinations of rows should be done using that method. However, most often there will be a point in the two tables where the values "cross over," that is, where the decreasing entries in one table fall below the increasing entries in the other. In this case, the two groups are formed by grouping all the rows corresponding to the entries in one table just prior to the point where the lists cross over into the group for the method corresponding to that table, and grouping the remaining rows into the group for the method corresponding to the other table. If exactly the same value appears in both tables at the crossover point between the two tables, that row should be assigned arbitrarily to either one of the two groups. To decide which form to convert to the other form in order to make the final comparison, consider which is less, C(SC) or C(ILT), in the two cases where each row is first converted into the form of the other. Choose the conversion that results in the smaller of C(SC) and C(ILT). It is emphasized here that the computation of C(SC) values and C(ILT) values can be made by a combination of the teachings of this current invention together with the ordinary skills of those skilled in the art of computing computational complexities, and that, as is taught in this current invention, these computations can be made, as required by this current invention, without an a priori knowledge of the specific results of the application of either the method of superimposed coding or the method of inverted list tables to specific situations that may arise within the coding table. Preferred Embodiment The preferred embodiment is now presented. It is assumed that a suitable feature table of k rows and N Records has been constructed by means outside of the scope of this current invention, and the discussion here will focus on the optimization of the speed of access to that feature table. FIG. 1 shows coding table 10, which encodes the subject feature table according to the invention. The number of Records, N, in the subject feature table is placed in block 11 for subsequent use. The k rows of the subject feature table are represented by k rows in the coding table, here illustrated by blocks 22, 23, and 24. Block 22 corresponds to the first row of the both the coding table and the subject feature table. Block 23 corresponds to the second row of both tables. Block 24 corresponds to the last row of both tables. The ellipsis 25 represents the rows that are omitted from FIG. 1 for clarity. Blocks 15, 16, and 17 serve the same function for the row of block 23 as blocks 12, 13, and 14, respectively, serve for the row of block 22. Similarly, blocks 18, 19, and 20 serve the same function for the row of block 24 as blocks 12, 13, and 14, respectively, serve for the row of block 22. The blocks represented by the ellipsis 25 each similarly have blocks serving the same function for its respective row as blocks 12, 13, and 14, respectively, server for the row of block 22. Each row of the coding table 10 of FIG. 1 contains three data elements. Taking coding table 10 row 22 as exemplary of all of the coding table 10 rows, block 13 of coding table row 22 holds the bitmap produced by applying the method of superimposed coding to the corresponding row of the subject feature table. Block 14 of coding table row 22 holds the Record list produced by applying the method of inverted list table coding to the corresponding row of the subject feature table. Block 12 of coding table 10 row 22 contains the number of Records indicated by either the number of "1" bits in the bitmap held in block 13 or the number of Records in the list held in block 14, which is the same number in both cases, or block 14, the list form, may be empty as described below. FIG. 2 is a flow chart representing the method of constructing the coding table 10 of FIG. 1 of the invention. Beginning with Start, block 30, control moves immediately to block 32, which represents a loop control that considers each row of the subject feature table and its corresponding row in the coding table 10, in turn. Block 32 exits through block 34 when there are no more rows to process. For each row, control passes to block 36, where a bitmap representation of the corresponding row of the subject feature table is constructed in the corresponding row of the coding table 10. First, the number of Records, N, in the feature table is stored in block 11. Taking row 22 of coding table 10 of FIG. 1 as exemplary of all of the rows of coding table 10, data is inserted into the blocks of the row as follows. The method of superimposed coding is applied to the corresponding row of the feature table, and the resulting bitmap is stored in the bitmap, block 13. The number of Records of interest in the bitmap, block 13, is stored in block 12. Control passes to block 38 of FIG. 2, where a determination is made whether or not to leave the inverted list, block 14, empty or to fill it. This determination is made as follows. If the expected computational cost of utilizing the bitmap encoded form, block 13, of the corresponding feature table row, is less than the expected computational cost of utilizing the inverted list encoded form, block 14, of the same feature table row, then the list, block 14, should be left empty. This minimizes storage requirements with no loss of efficiency in the event that the row is in fact "dense," because the bitmap form, block 13, must be present in all cases, while the list form, block 14, is more inefficient as compared to the bitmap form, block 13, if the coding table row 22, or equivalently its corresponding feature table row, is more dense. In many cases, a simple calculation will often be adequate to make the determination of block 38. If the number of bits involved in representing a given list will be less than the number of bits involved in representing a given bitmap, then the list should be included, that is, the answer to the question of block 38 is "yes"; otherwise, the answer to the question of block 38 is "no". This is a rough approximation to the actual computation of the computational complexities involved, and is quick and easy to apply. If a more reliable measure is desired, more elaborate means of determining computational complexity must be used, in the manner of such things known to those skilled in the art of determining computational complexities. If it is determined that the list should be left empty, then the "no" exit is taken from block 38 and control passes to block 32. If the determination is to fill the list, block 14, then the "yes" exit is taken from block 38 and control passes to block 40. In block 40, the method of inverted list tables is applied to the corresponding row of the feature table, and the resulting inverted list is stored in the appropriate list for the row, for example block 14. Control then passes to block 32. FIG. 3 is a flow chart representing the method of utilizing the coding table 10 of the invention. Block 50 represents the process by which some of the rows of the feature table are selected, which is accomplished by means beyond the scope of this current invention. What is important here is that a selection of rows is done in some manner, with that selection then passed to block 52. In block 52, the selected rows are sorted according to the number of Records of interest in each row. For example, in FIG. 1, if the rows of block 22, block 23, and block 24 were selected, then these selected rows would be sorted by utilizing block 12, block 15, and block 18, respectively. In block 54 of FIG. 3, the sorted rows are separated into two groups as previously described. The group of rows designated for processing by the method of superimposed coding are so processed in block 56 as previously described. The group of rows designated for processing by the method of inverted list tables are so processed in block 58, as previously described. Finally, in block 60, the results produced by block 56 and block 58 are combined to produce the final list of Records of interest for this utilization of the coding table 10, in the manner previously described. Conclusions, Ramifications, and Scope of Invention The foregoing disclosure and description of the invention are illustrative and explanatory thereof. It will be understood by those skilled in the art to which this invention pertains that various modifications may be made in the details and arrangements of the processes and of the structures described herein in order to explain the nature of the invention without departing from the principles and/or spirit of the foregoing teachings. It is to be further understood that the descriptions contained herein are not limited to the specific forms disclosed by way of illustration, but may assume other embodiments limited only by the scope of the appended claims and their legal equivalents.
|
Same subclass Same class Consider this |
||||||||||
