Data compression and encryption system and method representing records as differences between sorted domain ordinals that represent field values5678043Abstract Records in the relational database are converted into ordinal numbers. The numbers are then sorted by a predetermined ordering rule. Next, for each record, the difference between the number and the preceding number is computed and this difference is then used to represent that record. Alternatively, for each record, the difference between the number and some other record's number that is within the same block is computed and the difference is then used to represent that record. The compression technique results in highly compressed data that can still be handled by conventional relational database software for record insertion, deletion and other standard database operations. Claims What is claimed is: Description BACKGROUND AND SUMMARY OF THE INVENTION
______________________________________
0.sub.10 0.sub.10 2.sub.10 021.sub.10 012.sub.10
=00.sub.2 00.sub.2 10.sub.2 0010101.sub.2 0001100.sub.2
=0000.sub.2 1000.sub.2 1010.sub.2 1000.sub.2 1100.sub.2
=08.alpha.8c.sub.16
=35468.sub.10
______________________________________
Supporting Standard Database Operations Since our compression method is designed for use at the lowest levels of a database system, it is important to understand how it might interact with other system components, and particularly, whether its use might require changes to their structure. In this section, we demonstrate that no re-thinking or re-design of other database system components is required, and that our method may be integrated cleanly with standard approaches to structuring them. In particular, we now consider how access mechanisms may be constructed on the coded tuples, and how the tuples may be retrieved and modified. We focus on demonstrating the use of our method with standard access and retrieval mechanisms. We have restricted our attention to these basic operations rather than to queries for several reasons: (1) All queries, simple or complex, reduces to a set of basic tuple operations. (2) The variety of queries is too large to derive a set of representative, typical queries. The feasibility of these operations on a compressed database carries over to more complex queries which are built upon them. Access method The record mapping, where function .phi. (Equation 1) is one instance of, defines a clustering order among tuples. FIG. 3 shows the placement of the difference-tuples of Table (b) in FIG. 2 into 10 data blocks, corresponding to the demarcations shown in the table. Each block begins with a head tuple which is the numerically smallest tuple in the block. All tuples following the head tuple are difference tuples. Notice that the leading zeroes of the difference tuples are replaced by a number indicating the counts of the number of leading zero (run-length coding). Thus, the first difference tuple <2,2,021,012> in block 1 may be decoded into <0,0,2,021,012>. The head tuple can be arithmetically added (via mixed-radix addition) to the differences to derive the actual tuples. For instance, block 2 begins with head tuple <0,2,1,031,035> because the first difference tuple <0,0,1,007,010>=<0,2,2,038,045>-<0,2,1,031,035> (see Table ›a! of FIG. 2). The purpose of starting a block with a head tuple is to restrict the scope of decompression to within a data block. If only a block is searched, the difference tuples may be decoded immediately without decompressing an preceding blocks. Hence, access is localized. In order to permit random access, an index scheme is constructed. FIG. 4 shows an order-3 B.sup.+ tree index where A.sub.4 is the search key. Since the relation is physically clustered via .phi., the index is non-clustering and secondary. This explains the extra level of indirection provided by the buckets in the figure. Suppose we wish to execute .sigma..sub.A4=34 (R). Starting with the root index node, we follow the pointer to the last index node <36> since 34 is after the last key 29 in the root node. Following the first link in this index node, bucket 5 is reached, indicating that the tuple resides in disk block 9. The combination of head tuples and indices realize random localized access to a compressed database. For the other two techniques, the same mechanism is applicable. Assuming the tuples are ordered via some attribute A.sub.k, there is a key associated with each compressed data block. A tree index may then be defined upon A.sub.k. Tuple Insertion and Deletion How are tuple insertion and deletion supported in a compressed database? Suppose we wish to insert in our previous database the tuple t=<1,1,0,016,021>, which differs from <1,1,0,016,020> in the last attribute value. We need a means of locating the block which contains tuples that are physically ordered in the neighborhood of t. We realize this by constructing an order-3 B.sup.+ tree index using .phi. as the search key, as shown in FIG. 3. Since .phi. orders the tuples physically, this is a primary index, which we call the basic index. Key comparison while traversing the index is accomplished by comparing entire tuples. With this index, data block 3 is found to be the candidate block for inserting t, and is updated as shown in FIG. 5. Notice that only tuples succeeding t are recomputed, and that the changes are confined to the affected block. For tuple deletion, the basic index is similarly used to locate the data block, and changes made within the block. Tuple modification may simply be defined as a combination of tuple insertion and deletion. In summary, standard database operations are the same even when the database is compressed. The only difference is that the search key of the primary index is the entire tuple. All other indices are non-clustering and secondary, as in standard databases. An advantage of a compressed database is that the storage requirements for the indices will be reduced because the number of data blocks for storing the database has been reduced by compression. Although we have illustrated the use of tree indices as the access mechanisms, we do not preclude the use of other methods such as hashing. Working Models We have implemented TDC and applied it to the compression of census data. The 1990 Public User Microdata Samples (PUMS) from the US Bureau of Census ›9! contain records representing 5% and 1% samples of the housing units in the U.S. and of persons residing in them. The 5% PUMS comes in a set of 52 ASCII text files with a total size of approximately 4 gigabytes, each corresponding to the samples taken from a state. With TDC, we achieve a compression ratio of approximately 80% consistently on each of the text files comprising the 1% PUMS. Let B and A.sub.TDC be the size of a text file before and after compression by TDC respectively. Then A.sub.TDC /B=0.2. With bit-compression on the text file alone, a compression ratio of approximately 70% is achieved. Let A.sub.bit be the size of the text file after it is bit-compressed. That is, A.sub.bit /B=0.3. We deduce that A.sub.TDC /A.sub.bit =0.2/0.3=0.66. Therefore TDC achieves a compression ratio of approximately 33% over the bit-compression method and reduces the size of the files to around 800 megabytes. We have also compressed the files using the Unix "compress" utility, which uses a variant of the Ziv and Lempel class of algorithms. We obtained a compression ratio of approximately 85%. Although this ratio is higher, we have noted above that LZ algorithms are unsuitable for database compression. The most important reason is that it is a global compression technique that is unable to support random tuple access and modification (including insertion and deletion) that are required for general database operations. From the foregoing, the TDC method of the invention provides a new method for compressing tables, i.e., data organizable as tuples. TDC achieves its objectives without violating any operational requirements. It exhibits the following features: Field mapping: Arbitrary field values may be mapped into numeric numbers. This step itself achieves compression. A record now becomes a collection of numbers. Tuple ordering and field ranking: A record may be mapped into a single numeric number. In order to do so, the collection of fields may be ranked, and a tuple ordering rule must be chosen. Thus, different mappings are available, depending on the field ranking and ordering rule chosen. Tuple differencing: A tuple may be represented as the numerical difference between itself and some other tuple. Since any two lexicographically close tuples share common attributes, their numerical correspondences are numerically close. Hence, their differences are numerically small, much smaller than the tuples themselves. With less storage required to store the difference, further compression is achieved. Localized access: By partitioning the differences into blocks, with each block beginning with a full tuple rather than a tuple difference, decompression is localized to within a disk block. This prevents the costly operation of decompressing the entire table when only a small portion of the table is accessed. Access methods: As TDC retains the tuple-structure within a table, conventional access methods may be used to provide random access to the compressed tuples, with little or no changes. While the presently preferred form of the invention has been described, the following are potential variants of the technique: (a) The difference relation may be represented in various different ways. For example, the differences may be expressed as tuples rather than as integers. Other innovative variants may be possible. (b) Variations are also possible in the lower-level representations for the difference relation. Different codes may be used to represent or store the numerical difference (or tuple difference). For example, if a variable-length code were used, the number of bytes allocated to store each difference could be indicated by a count field such that if the count value is x, then the number of bytes allocated is some function f(x). The function f(x) determines the semantics of the count field. There are two commonly used classes of f: linear and exponential. In the former case, f(x)=ax for some integer a. Generally, a=1 so that x gives the number of bytes directly. In the latter case, f(x)=a.sup.x for a=2,3, . . . Other variants are possible. (c) There are many possible definitions for the attribute ranking and tuple ordering rule. Each potential combination gives rise to different a tuple mapping rule, which naturally affects the amount of achievable compression from tuple differencing. The invention is applicable to the compression and storage of very large databases where data are organizable as tables of tuples. During normal database operations, multiple accesses are made to the database where each access retrieves a large portion of the database, thus incurring many disk I/Os. The invention not only reduces the amount of I/Os, but also increases the efficiency of each I/O transfer because more data are fetched during each I/O. The invention is also suitable for reducing the size of very large database to so that it can be stored on a relatively inexpensive mass storage medium, such as a hard disk of a personal computer, so that the database may be made available to more users. Exemplary Computer-Implemented Software Embodiment The principles of the invention may be applied in a variety of database and computer software applications. The next section describes one possible software implementation. The software implementation comprises an encoding module which compresses data according to the principles of the invention. Naturally, the decoding module to decompress the data follows the described process in reverse. Also, while the invention has particular utility as a data compression system and method, it can also serve as a data encryption system and method, by hiding or encrypting the head tuple. Without access to the head tuple, the difference tables are meaningless. FIG. 6 is the data and process flowchart of the encoding module. This figure shows how the encoding module of the invention interfaces with the relational database management system application 20. Examples of relational database management system application 20 include Ingres Oracle or SYBASE. The encoding module may be made up of several lower level modules which will now be described. The first module, domain mapping module 24 has as its primary function the mapping of field values to domain ordinals. Module 24 requires two principal inputs: (1) table to be compressed 22; and (2) domain range for each table field. The table to be compressed 22 will contain: the table name or some type of a table identifier; field names; and the values for each of those fields, namely field values. As described above, each field name has an associated domain value. This domain value specifies the number of possibilities of the values that a particular field can assume. The domain mapping module 24 determines the ordinal position of a particular field value within the field's domain. This module 24 produces the domain mapped table 26. The domain mapped table 26 comprises the table name/identifier, the field names, and the domain ordinal tuples. The domain mapped table 26 and the other intermediate tables which will be discussed below may be stored in memory if the tables are not too large. If the intermediate tables are too large to be stored in memory, then they can be stored on the hard drive. To further save room on the hard drive, each intermediate table can be deleted after it has served its appropriate function. After the domain mapped table 26 is produced, the field rearranging module 28 alters the order of the fields within the table. Field rearrangement is based upon each field's domain value. The field with the lowest domain value is placed first. The field with the next lowest domain value is placed second. This process continues until all the fields have been accordingly rearranged. The goal of the rearrangement process is to increase the likelihood of zero values being in the beginning fields for each tuple. As will be shown below, greater compression will occur for those tuples with the greater amount of leading zeros for a given tuple. The field rearranging module 28 outputs the result into the field rearranged table 30. Next, the tuple sorting module 32 sorts the tuples that are in the field rearranged table 30 based on each tuple's mixed radix value. A single mixed radix number is calculated for each tuple. The following pseudocode is one implementation method for producing the mixed radix value. The pseudocode below illustrates how to transform the domain ordinals of a single tuple into its mixed radix representation. There are five fields that have the following domain ordinal values. The first field has a domain ordinal value of 0; the second has a value of 0; the third a value of 3; the fourth a value of 32; the fifth has a value of 32. The domain values for each field within that tuple are as follows: 4; 4; 4; 128; 128. COMMENT: n=# of fields within a Table. N=5 COMMENT: A.big() is an array which contains the field's Domain Value. A.big(1)=4 A.big(2)=4 A.big(3)=4 A.big(4)=128 A.big(5)=128 Comment: a.small(b) is an array which contains the tuple's Domain Ordinals. a.small(1)=0 a.small(2)=0 a.small(3)=3 a.small(4)=32 a.small(5)=32 sum=0 product=1 phi=0 FOR i=1 TO n product=1 FOR j=(i+1) TO n product=product*ABS(A.big(j)) NEXT j sum=sum+(a.small(i)*product) NEXT i phi=sum END The input values described above will generate the number 53,280, which is the single mixed radix representation of that entire tuple. The tuple sorting module 32 performs this type of calculation for each tuple within the field rearranged table 30. The tuples are then sorted according to their single mixed radix number in ascending order and placed in the tuple-sorted table 34. The TDC calculating module 36 calculates the tuple differences within a predefined block. Thus the TDC calculating module 36 requires two inputs: the first being the block size of the data which can usually be obtained from the RDBMS application 20; and the second being the tuples from the tuple-sorted table labeled 34. The TDC calculating module 36 takes the first tuple of the tuple-sorted table 34 and stores that tuple as the head tuple within the TDC table 38 for the first block. The module 36 then reads in the next tuple and subtracts from it the previous tuple, which in this case is the head tuple. This difference is then stored in the TDC table 38. The next tuple from the tuple-sorted table 34 will be read and will have subtracted from it the preceding tuple's value. The resulting tuple difference is stored in the TDC table 38. This differencing process will continue until the number of tuples have reached the block size. When the number of tuples has reached the block size, then the TDC calculating module 36 begins a new head tuple from which subsequent differences will ultimately be based. The following pseudocode illustrates how the head tuple and the subsequent differencing is accomplished within a block. Lastly, for this operation the domain ordinal values of each tuple are used in the differencing operation. size=0 n=# of total tuples within the table i=1 WHILE (i<=n) IF (blocksize>size) THEN Set this tuple as the head tuple and store it in the TDC Table size=size of this head tuple ELSE Store the result of the difference between Tuple(i) and Tuple (i+1) size=size+(size of this difference) END IF i=i+1 END WHILE Finally the storing module 40 takes the values from TDC table 38 and variably encodes them to achieve further compression. These variably encoded tuples are then stored within the RDBMS application 20. Both the head tuples and the difference tuples are variably encoded. Many encoding techniques exist to achieve the variably encoded result. The technique used here replaces any leading zeros of a tuple with an integer value representative of the number of leading zeros it had replaced. By way of summary, the software embodiment of FIG. 6 performs the operations illustrated in FIG. 7. The input database, comprising a plurality of tuples having predefined field values according to a predefined schema, is mapped (operation 100) according to the predefined mapping rules 102. This mapping operation produces tuples with fields mapped to domain ordinals. Next, a rearrange operation 104 is performed according to predefined rearrangement rules 106. This produces tuples with fields rearranged into a permuted schema. The tuples are then sorted (operation 108) according to predefined sorting rules 110. This produces a sorted table. Then, if desired a block of predefined size may be established (operation 112) so that the sorted tuples are further operated upon in block-sized increments. These further operations involve difference operation 114, which produces a head tuple and a table of difference values to represent the remaining tuples in the block. This entire tuple may be used as the key for the primary index to the database. The tuple is then stored (operation 116) according to the predefined storing rules 118 of the database system. Lastly, several other ways exist to perform differential coding instead of computing the difference between consecutive pairs of tuples. One alternative way is to establish the head tuple of each block as the reference tuple for its respective block. Then each succeeding tuple within the block is subtracted from the respective head tuple. Another alternative is to establish the middle tuple of each block as the reference tuple for its respective block. Then each succeeding tuple within the block is subtracted from the respective middle tuple. Both of these alternatives are exemplified below. Table (a) in FIG. 8 shows a relation with five attribute domains A.sub.1, A.sub.2, A.sub.3, A.sub.4, A.sub.5 denoting the employee number, department, job title, insurance grade and income in thousands, respectively. The size of each domain, i.e., the number of different attribute values, is 64, 8, 16, 64, 64, respectively. Table (b) shows the same relation, except that the attribute values have all been mapped to numbers. This is usually the raw form in which a statistical data set is available; a file of numerals corresponding to contiguous records. We preserve tuple identity by displaying them as individual tuples. Continuing with the example relation in FIG. 8, the results of the subsequent operations of TDC is illustrated i n FIG. 9. The tuples are lexicographically ordered into Table (a). Notice that the attributes have been reordered under the permutation .tau. defined as: ##EQU6## Column N shows the ordinal numbers of the corresponding tuples. Due to the similarity among tuples of a block, one may capture the difference among these tuples instead of storing the tuples explicitly. If these differences require less space for storage on average than the original tuples, compression is achieved. One way of such differential coding is to designate a tuple in a block as a reference tuple and encode all other tuples in the block with respect to the preceding tuple. We illustrate two possibilities with Table (b) and (c) in FIG. 9, where the reference tuples are the first and middle tuple of each block respectively. In Table (b), the reference tuple of block 1 is <2,06,26,20,36> or 10069284 numerically. The four differences in the same block with respect to it are: 12318=10081602-10069284, 1040770=11122372-10081602, 2637701=13760073-11122372 and 229372=13989445-13760073. In Table (c), the reference tuple of block 1 is <2,10,27,27,04> or 11122372 numerically. Tuples before and after the reference tuple are subtracted differently. The two differences before the middle tuple are: 12318=10081602-10069284 and 1040770=11122372-10081602 and the two differences after are: 2637701=13760073-11122372 and 229372=13989445-13760073. Our invention permits the use of any form of differential coding within a block. The above examples illustrated three possibilities. In general, we may choose any tuple in the block as the reference tuple and encode the other tuples with respect to it. Since the differences are numerically smaller than the tuples, they require fewer bytes of storage, as illustrated by the leading zero components in each tuple difference. By using run length coding to encode these components, compression is achieved. For instance, the first difference in block 1 of Table (b) <0,00,03,00,30> is encoded into <2,03,00,30> where the two leading zero components are replaced by a count of 2. In essence, TDC performs differential coding on tuples of a relation that is totally ordered by a relation .phi., hence giving rise to its name. The result of encoding is the set of differences as illustrated by column R.sub.d of Table (b) in FIG. 9. The differences, which are stored in binary representation, are actually the bit-wise concatenation of the respective tuple-differences in binary representations. For instance, the first tuple-difference is:
______________________________________
0.sub.10 00.sub.10 03.sub.10 00.sub.10 30.sub.10
=000.sub.2 0000.sub.2 000011.sub.2 000000.sub.2 011110.sub.2
=0000.sub.2 0000.sub.2 0000.sub.2 0011.sub.2 0000.sub.2 0001.sub.2
1110.sub.2
=000301e.sub.16
=12318.sub.10
______________________________________
Supporting Standard Database Operations With the Alternative Differencing Approaches Since our compression method is designed for use at the lowest levels of a database system, it is important to understand how it might interact with other system components, and particularly, whether its use might require changes to their structure. In this section, we demonstrate that no rethinking or redesign of other database system components is required, and that our method may be integrated cleanly with standard approaches to structuring them. In particular, we now consider how access mechanisms may be constructed on the coded tuples, and how the tuples may be retrieved and modified. We focus on demonstrating the use of our method with standard access and retrieval mechanisms. We have restricted our attention to these basic operations rather than to queries for several reasons: (1) All queries, simple or complex, reduces to a set of basic tuple operations. (2) The variety of queries is too large to derive a set of representative, typical queries. The feasibility of these operations on a compressed database carries over to more complex queries which are built upon them. Access Method With The Alternative Differencing Approaches FIG. 10 shows an order-3 primary B.sup.+ tree index constructed using the data block of Table (b) in FIG. 9. Notice that the search key in the index is an entire tuple. In conventional primary indices, the search key is usually only an attribute value. Operations on the tree-index are performed as usual. Suppose a query wishes to locate the tuple <4,06,40,27,27> ›which from Table (a) in FIG. 9 is located in block 6!. Starting with the key in the root index node, index node 2 is searched next since it is lexicographically smaller than the root key. There are two search keys in node 2. Following the link corresponding to the smaller of the differences between the tuple and each of the keys, index node 6 is searched. We find again that the second search key is closer to the tuple than the first. This leads us to data block 6, where the tuple resides. This block is now transferred to main memory and decompressed. Thus, traversing the index is the same except that key comparison requires measuring the difference between the key and the target tuple. When tuples are to be retrieved given certain attribute values only, secondary indices based on the primary index above may be constructed. FIG. 11 shows an order-3 B.sup.+ tree index where A.sub.5 is the search key. Since the relation is physically clustered via .phi., the index is nonclustering and secondary. This explains the extra level of indirection provided by the buckets in the figure. Suppose we wish to execute .sigma..sub.A5=39 (R). Starting with the root index node, we follow the pointer to the last index node <42> since 39 is after the last key 33 in the root node. Following the first link in this index node, bucket 5 is reached, indicating that the tuple resides in disk block 9. Therefore, the combination of reference tuples and conventional indices realize random localized access to a compressed database. Although we have illustrated the use of tree indices, we do not preclude the use of other methods such as hashing. Tuple Insertion and Deletion With The Alternative Differencing Approaches Suppose we wish to insert in our previous database the tuple t=<3,07,37,31,50>. Using the primary index, we identify data block 3 as the block for insertion. The tuple is found to lie lexicographically between the second and third tuple in the block. FIG. 12 shows the result of tuple insertion. Notice that only tuple succeeding t is recoded, and that the changes are confined to the affected block. If the inserted tuple is lexicographically smaller than all the other tuples in the block, then it becomes the new reference tuple in the block. The process of tuple deletion is similar. Tuple modification is just a combination of tuple insertion and deletion. While the invention has been described in its presently preferred form, with several examples and an exemplary software implementation given, it will be understood that the invention is not limited to these examples or this software implementation. Rather, the invention is capable of modification without departing from the spirit of the invention as set forth in the appended claims.
|
Same subclass Same class Consider this |
||||||||||
