|
|
|
Manipulating data structure (e.g., compression, compaction, compilation) |
Encoding method for compressing a tabular database by selecting effective compression routines for each field and structure of partitions of equal sized records5546575
Abstract
A method whereby a database storage structure is created by selectively applying one or more data compaction methods to fields of a database. A specific compaction method is applied to a field if the field data characteristics satisfy criteria for that compaction method. The compaction methods used are: single-field encoding, where codes are substituted for data values in a field; multiple-field combining, where a single code is substituted for data values from two or more fields; pattern suppression, where recurring character patterns within data values are removed; numeric substitution, where binary values are substituted for numeric character data; and text compression, where codes are substituted for words and phrases in a text field. These compaction methods create compacted records which are reduced storage equivalents of the database records. Translation tables and auxiliary tables created by the compaction methods allow the compacted data to be retranslated into a user-readable format. The compacted records are grouped into storage partitions, each containing compacted records of the same length, to form a database image. The database image advantageously resides in a computer system's mass storage while the translation and auxiliary tables advantageously reside in the computer system's fast access memory. Along with an accessing subsystem, the database image and tables function as a reduced storage equivalent of the original database.
Claims
What is claimed is:
1. A database storage method for encoding a database structure having a plurality of data records and a plurality of data fields, said method comprising the following steps:
selecting a field of the database structure from the plurality of data fields;
reading a plurality of data values from the selected field, each individual data value being read from a separate record;
determining a plurality of field characteristics and a plurality of compaction criteria, the field characteristics being dependent on the data values within the selected field and the compaction criteria corresponding to a plurality of compaction methods wherein each compaction method has a relative priority level;
identifying a first field characteristic from said plurality of field characteristics which satisfies a corresponding one of said compaction criteria, said first field characteristic corresponding to a first compaction method with the highest priority level of any remaining compaction methods which also have compaction criteria satisfied by a corresponding field characteristic;
applying the first compaction method to the selected field to create a plurality of compacted data values, the compacted data values being reduced storage equivalents of the data values; and
storing the compacted data values in a plurality of compacted records.
2. A database storage method according to claim 1, wherein one of the compaction methods is single-field encoding, comprising the steps of:
assigning a numeric equivalent to each data value; and
creating a plurality of compacted data values corresponding to the numeric equivalents.
3. A database storage method according to claim 2, wherein the assigning step comprises the steps of:
listing the different data values;
sorting the list by frequency of occurrence of the data values within the field; and
associating a unique number with each listed data value, each number corresponding to the rank of the listed data value, where the smallest number corresponds to the most frequently occurring data value and the largest number corresponds to the least frequently occurring data value, the associated number being the numeric equivalent of the data value.
4. A database storage method according to claim 2, wherein the creating step comprises the steps of:
encoding each numeric equivalent with a binary number; and
replacing each data value with the binary number which is the encoded numeric equivalent of the data value, the binary numbers being the compacted data values.
5. A database storage method according to claim 1 wherein the compaction methods comprise single-field encoding, multiple-field combining, pattern suppression, numeric substitution, and text compression; and
the relative priority levels of the compaction methods are from highest to lowest priority as follows: multiple-field combining, single-field encoding, pattern suppression, numeric substitution, and text compression.
6. A database storage method according to claim 1, wherein one of the compaction methods is multiple-field combining, said method further comprising the steps of:
selecting a second field of the database structure;
reading a second plurality of data values from the second field, each individual data value from the second plurality of data values being read from a separate record;
identifying a plurality of data value combinations, each of which comprise one data value and one second data value occurring together in the same record;
assigning a numeric equivalent to each data value combination; and
creating a plurality of compacted data values corresponding to the numeric equivalents.
7. A database storage method according to claim 6, wherein the assigning step comprises the steps of:
listing the different data value combinations;
sorting the list by frequency of occurrence of the data value combinations within the field and second fields; and
associating a unique number with each listed data value combination, each number corresponding to the rank of the listed data value combination, where the smallest number corresponds to the most frequently occurring data value combination and the largest number corresponds to the least frequently occurring data value combination, the associated number being the numeric equivalent of the data value combination.
8. A database storage method according to claim 6, wherein the creating step comprises the steps of:
encoding each numeric equivalent with a binary number; and
replacing each data value combination with the binary number which is the encoded numeric equivalent of the data value combination, the binary numbers being the compacted data values.
9. A database storage method according to claim 1, wherein one of the compaction methods is pattern suppression, comprising the steps of:
identifying a pattern, the pattern being a character which occurs repeatedly at a specific character position of the data values;
identifying those data values which contain the pattern;
creating a plurality of compacted data values, the compacted data values being the data values, with the pattern deleted from the data values which contain the pattern; and
associating one of a plurality of storage method designations with each of the compacted data values, such that each of the designations indicate whether the pattern was deleted from one of the compacted data values contained in one of the compacted records.
10. A database storage method according to claim 1, wherein one of the compaction methods is numeric substitution comprising the steps of:
identifying any data values which are numeric;
computing a binary equivalent of each data value that is numeric; and
creating the compacted data values by replacing the data values that are numeric with the binary equivalent of the data values.
11. A database storage method according to claim 1, wherein one of the compaction methods is text compression, comprising the steps of:
identifying a plurality of text portions of the data values, the text portions selected from the group consisting of a delimiter, word or phrase;
assigning a numeric equivalent to each text portion; and
creating a plurality of compacted data values corresponding to the numeric equivalents.
12. A database storage method according to claim 11, wherein the assigning step comprises the steps of:
listing the different text portions;
sorting the list by frequency of occurrence of the text portions within the field; and
associating a unique number with each listed text portion, each number corresponding to the rank of the listed text portion, where the smallest number corresponds to the most frequently occurring text portion and the largest number corresponds to the least frequently occurring text portion, the associated number being the numeric equivalent of the text portion.
13. A database storage method according to claim 11, wherein the creating step comprises the steps of:
encoding each numeric equivalent with a binary number; and
replacing each text portion with the binary number which is the encoded numeric equivalent of the text portion, the binary numbers being concatenated to form the compacted data values.
14. A database storage method according to claim 1 wherein the step of reading a plurality of data values comprises the steps of:
determining an average sample interval;
reading the data value from the field in one of the records;
skipping at least one of the records; and
repeating the steps of reading the data value and skipping until all of the records have been processed by one of the reading and skipping steps, such that an average of the records skipped is equal to the average sample interval.
15. A database storage method comprising the following steps:
selecting a field of a database, the database containing a plurality of records;
reading a plurality of data values from the field, one data value being read from each record;
calculating a field characteristic, which is the number of different data values in the field;
calculating a threshold equal to a constant of proportionality multiplied by the number of records in the database raised to a power of less than one;
comparing the field characteristic to the threshold, to determine if the field characteristic satisfies the criterion that the number of different data values in the field is less than the threshold
applying the compaction method only if the field characteristic satisfies the criterion, the compaction method creating a plurality of compacted data values, the compacted data values being reduced storage equivalents of the data values;
storing the compacted data values in a plurality of compacted records, such that each compacted record contains one compacted data value.
16. A database storage method according to claim 15, wherein the constant of proportionality is in the range of 10 to 100 and the power is in the range of 0.25 to 0.75.
17. A database storage method according to claim 16, wherein the constant of proportionality is 31.623 and the power is 0.5.
18. A database storage method comprising the following steps:
selecting a first field of a database, the database containing a plurality of records;
reading a first plurality of data values from the first field, one data value from the first plurality being read from each record;
selecting a second field of the database;
reading a second plurality of data values from the second field, one data value from the second plurality being read from each record;
identifying a plurality of data value combinations, each of which comprises one data value from the first plurality and one data value from the second plurality occurring together in the same record;
calculating first and second field characteristics, respectively for the first and second fields, which are an estimated mean of the number of records per slot for each field and an estimated standard deviation of the number of records per slot for each field, where the estimated mean is equal to a sample mean multiplied by a sample interval and the estimated standard deviation is equal to a sample standard deviation multiplied by the square root of the sample interval;
calculating a threshold equal to a constant of proportionality multiplied by the number of records in the database raised to a power of less than one;
comparing the field characteristics to the threshold, to determine if the field characteristics satisfy the criteria that the estimated mean is greater than one half of the threshold, the estimated mean plus the estimated standard deviation is greater than the threshold and the estimated mean is greater than the estimated standard deviation;
applying a multiple-field compaction method whereby each data value combination is represented with a corresponding numeric equivalent only if the field characteristics satisfy the criteria, the compaction method creating a plurality of compacted data values, the compacted data values being reduced storage equivalents of the data values based on the numeric equivalents of the corresponding data value combinations; and
storing the compacted data values in a plurality of compacted records, such that each compacted record contains one compacted data value.
19. A database storage method according to claim 18, wherein the constant of proportionality is in the range of 0.001 to 0.01 and the power is in the range of 0.5 to 0.99.
20. A database storage method according to claim 19, wherein the constant of proportionality is 0.0056 and the power is 0.75.
21. A database storage method comprising the following steps:
selecting a field of a database, the database containing a plurality of records;
reading a plurality of data values from the field, one data value being read from each record;
identifying a pattern, the pattern being a character which occurs repeatedly at a specific character position of the data values;
identifying those data values which contain the pattern;
calculating first and second field characteristics, which are a per character percentage and a pattern determination percentage, respectively;
setting a first threshold equal to a first constant and a second threshold equal to a second constant;
comparing first and second field characteristics to the first and second thresholds respectively, to determine if the field characteristics satisfy the criteria that the per character percentage is greater than the first threshold and that the pattern determination percentage is greater than the second threshold;
applying a pattern suppression compaction method only if the field characteristic satisfies the criterion, the compaction method creating a plurality of compacted data values with the pattern deleted from those data values which contain the pattern, the compacted data values being reduced storage equivalents of the data values;
associating one of a plurality of storage method designations with each of the compacted data values, such that the designations indicate whether the pattern was deleted from the associated compacted data value; and
storing the compacted data values in a plurality of compacted records, such that each compacted record contains one compacted data value.
22. A database storage method according to claim 21, wherein:
the maximum number of characters used in calculating the pattern determination percentage is in the range of 3 to 5 characters;
the first constant is in the range of 5% to 20%; and the second constant is in the range of 40% to 60%.
23. A database storage method according to claim 22, wherein the maximum number of character used in calculating the pattern determination percentage is 4, the first constant is 10% and the second constant is 50%.
24. A database storage method comprising the following steps:
selecting a field of a database, the database containing a plurality of records;
reading a plurality of data values from the field, one data value being read from each record;
calculating a field characteristic, which is the percentage of records which are numeric;
setting a threshold equal to a constant;
comparing the field characteristic to the threshold, to determine if the field characteristic satisfies the criterion that the percentage of records which are numeric is greater than the threshold;
performing the following numeric substitution steps if the field characteristic satisfies the criterion:
(1) computing a binary equivalent of each data value which is numeric;
(2) creating a plurality of compacted data values by replacing the data values that are numeric with the binary equivalent of the data values, the compacted data values being reduced storage equivalents of the data values; and
(3) storing the compacted data values in a plurality of compacted records, such that each compacted record contains one compacted data value.
25. A database storage method according to claim 24, wherein the threshold is in the range of 75% to 100%.
26. A database storage method according to claim 25, wherein the threshold is 90%.
27. A database storage method comprising the following steps:
selecting a field of a database, the database containing a plurality of records;
reading a plurality of data values from the field, one data value being read from each record;
identifying a plurality of text portions of the data values, the text portions selected from the group consisting of a delimiter, word or a phrase;
calculating a field characteristic, which is the number of different words;
calculating a threshold equal to a constant of proportionality multiplied by the number of delimiters in the field raised to a power less than one;
comparing the field characteristic to the threshold, to determine if the field characteristic satisfies the criterion that the number of different words is less than the threshold;
applying a compaction method to the data values only if the field characteristic satisfies the criterion, the compaction method comprising the following steps:
(1) assigning a numeric equivalent to each text portion;
(2) creating a plurality of compacted data values based on the numeric equivalents of the corresponding text portions, the compacted data values being reduced storage equivalents of the data values; and
(3) storing the compacted data values in a plurality of compacted records, such that each compacted record contains one compacted data value.
28. A database storage method according to claim 27, wherein the constant of proportionality is in the range of 10 to 100 and the power is in the range of 0.25 to 0.75.
29. A database storage method according to claim 28, wherein the constant of proportionality is 63.246 and the power is 0.5.
30. A database storage method, which comprises:
reading a plurality of records from a database;
applying a plurality of compaction methods to the records to create a corresponding plurality of compacted records, each of the compacted records having a record length equal to the number of bits in the compacted record;
creating a plurality of storage partitions, each storage partition designated for storing compacted records of identical record length; and
storing the compacted records in the partitions to create a database image, such that each of the compacted records is stored in one and only one of the partitions and the record lengths of each of the compacted records in a specific one of the partitions are identical.
31. A database storage method according to claim 30, wherein the record length of each compacted record is an integral number of 8-bit bytes.
32. A database storage method according to claim 30, wherein the record length of each compacted record is an integral number of nibbles, where a nibble is in the range of 2 to 7 bits.
33. A database storage method according to claim 30, comprising the additional step of:
creating a plurality of subpartitions comprising the compacted records, all of the compacted records associated with a particular one of the subpartitions having an identical record type designation specifying a plurality of field-dependent pack methods.
34. A database storage method according to claim 33, comprising the additional step of:
building a record information table equating the record type designation to one of the subpartitions.
35. A compaction method, which comprises:
selecting a first group of fields from a database;
reading a first group of data values from a record of the database, each individual data value being read from a separate field from within the first group of fields to form a first data value combination;
assigning a numeric equivalent to the first data value combination, such that the numeric equivalent uniquely determines the first data value combination;
replacing the first data value combination with a representation of the numeric equivalent to create a compacted data value only if a field combining criterion based on the first group of fields, and dependent on a concentration level of unique data value combinations, is satisfied, the compacted data value being a reduced storage equivalent of the first data value combination; and
creating a translation table which contains an entry that equates the data value combination with the assigned numeric equivalent, to provide for retranslation from the numeric equivalent to the data value combination.
36. A compaction method according to claim 35 further comprising the following steps:
selecting a second group of fields from the database;
reading a second group of data values from a record of the database to form a second data value combination, each individual data value being read from a separate field from the second group of fields;
determining a first figure-of-merit based on the concentration of unique data value combinations occurring in the first group of fields;
determining a second figure-of-merit based on the concentration of unique data value combinations occurring in the second group of fields;
replacing the first data value combination with a representation of the numeric equivalent to create a compacted data value only if the first figure-of-merit is greater than the second figure-of-merit.
37. A compaction method according to claim 35 wherein said first group of fields comprises two separate fields of the database.
38. A compaction method according to claim 35 wherein said first group of fields comprises three separate fields of the database.
39. A compaction method according to claim 35 wherein said first group of fields comprises a first field and a second field, said first field being an uncompacted field from said database and said second field being a previously combined field representing two uncompacted fields from said database.
40. A compaction method, which comprises:
selecting one of a plurality of fields of a database;
reading a plurality of data values from the selected field;
identifying a single-position pattern, the single-position pattern being a character which occurs repeatedly within the selected field at a specific character position of the data values;
creating a plurality of compacted data values from the data values by deleting the single-position pattern from each of the data values within which the single-position pattern occurs;
storing a particular one of the compacted data values in the selected field of a compacted record; and
associating with the compacted record a designation, the designation indicating a pack method for each of the fields within the compacted record including that the selected field of the compacted record contains the particular compacted data value with the single-position pattern deleted.
41. A compaction method according to claim 40 wherein the designation is a type byte appended to the compacted record and specifying a subpartition, the compacted record being stored in the subpartition, the subpartition being associated with the pack method.
42. A compaction method, which comprises:
selecting one of a plurality of fields of a database;
reading a plurality of data values from the selected field;
identifying a multiple-position pattern, the multiple-position pattern being a plurality of characters and a plurality of character positions in one-to-one correspondence with the characters, such that all of the characters occur together at the corresponding character positions within more than one of the data values within the selected field;
creating a plurality of compacted data values from the data values by deleting the multiple-position pattern from each of the data values within which the multiple-position pattern occurs;
storing a particular one of the compacted data values in the selected field of a compacted record; and
associating with the compacted record a designation, the designation indicating a pack method for each of the fields within the compacted record including that the selected field of the compacted record contains the particular compacted data value with the multiple-position pattern deleted.
43. A compaction method according to claim 42 wherein the designation is a type byte appended to the compacted record and specifying a subpartition, the compacted record being stored in the subpartition, the subpartition being associated with the pack method.
44. A method of compacting data values within a database comprising the following steps:
selecting one of a plurality of fields of a database;
reading a plurality of data values from the selected field;
identifying a numeric pattern having at least two characters which occur repeatedly within the selected field at a plurality of specific character positions of the data values;
creating a plurality of compacted data values from the data values by deleting the numeric pattern from each of the data values within which the numeric pattern occurs, converting the numeric pattern into a binary equivalent, and appending the binary equivalent to the data values within which the numeric pattern occurs;
storing a particular one of the compacted data values in the selected field of a compacted record; and
associating with the compacted record a designation, the designation indicating a pack method for each of the fields within the compacted record including that the selected field of the compacted record contains the particular compacted data value with the numeric pattern deleted and the binary equivalent appended.
45. A database encoding method, which comprises:
selecting a field of a database;
identifying a plurality of data items within the field selected from a group consisting of a data value, a data value combination, a word or a phrase;
listing the different data items;
sorting the list by frequency of occurrence of the data items within the field to provide a rank for each data item;
associating a unique number with each listed data item to create a translation table, each unique number corresponding to the rank of the listed data item, where the smallest unique number corresponds to the most frequently occurring data item and the largest unique number corresponds to the least frequently occurring data item, the associated unique number being a numeric equivalent of the data item;
specifying a code length having a fixed number of bits L;
determining a first threshold as a function of the code length, said first threshold representing the maximum allowable numeric equivalents within the translation table;
determining a second threshold representing an acceptable number of occurrences of individual data item entries within a range of the numeric equivalents wherein said range is a function of L;
comparing the amount of numeric equivalents to the first threshold and the total number of individual data item entries within the range to the second threshold;
encoding all the numeric equivalents with a single-length code of length L if the comparison step indicates that the first threshold is not exceeded;
encoding the numeric equivalents with a combination code if the comparison step indicates that the first threshold is exceeded and the second threshold is exceeded, where the combination code is of length L for the numeric equivalents less than or equal to the maximum value of the range, and of length n*L for all other numeric equivalents, where n is the smallest positive integer necessary to represent a particular numeric equivalent as a binary number;
encoding all the numeric equivalents with a variable-length code if the comparison step indicates that first threshold is exceeded and the second threshold is not exceeded, where the variable length code is of length n*L; and
creating a plurality of compacted data values by replacing the data items with the encoded numeric equivalents.
46. A database encoding method according to claim 45, wherein L equals 8 and the single-length code comprises each numeric equivalent modulo 256.
47. A database encoding method according to claim 45, wherein L equals 8 and the combination code comprises:
a one byte binary number equal to the each numeric equivalent less than 256;
a one byte binary number for each numeric equivalent in the range of 256 to 511 which is equal to the numeric equivalent minus 256; and
a binary number equal to each numeric equivalent greater than 512 which uses the smallest necessary integral-number of bytes and is equal to the numeric equivalent minus 512.
48. A database encoding method according to claim 45 wherein L is equal to 8 bits, the first threshold is approximately between 960 to 990, the range of the numeric equivalents contains those having a rank of 256 to 511, and the second threshold is approximately between 0.25 to 0.35 times the total number of database records.
49. A database structure comprising:
a plurality of compacted data values;
a plurality of compacted records each having a record length and comprising a plurality of fields, the fields containing the compacted data values;
a plurality of subpartitions comprising the compacted records, all of the compacted records associated with a particular one of the subpartitions having an identical record type designation specifying a plurality of pack methods dependent on the fields;
a plurality of storage partitions comprising the subpartitions, all of the compacted records associated with a particular one of the partitions such that the compacted records of each partition are of identical record length;
a database image comprising the partitions; and
a record information table equating the subpartitions to the record type designations.
50. A database structure according to claim 41, which further comprises a translation table containing a plurality of entries, each entry associating a specific compacted data value to an equivalent uncompacted data value.
51. A database structure according to claim 41, further comprising an index, for efficiently searching the database image.
52. A database structure according to claim 41, further comprising a plurality of indices for efficiently searching the database image, such that a single index that results in the least number of records to be searched can be chosen from the plurality of indices.
53. A database structure according to claim 49, further comprising:
a mass storage medium capable of being accessed by a computer system, the database image and record information table being stored on the mass storage medium.
54. A database structure according to claim 53, wherein the mass storage medium is a CD-ROM.
55. A database system, which comprises:
a computer system, comprising a random access memory, a mass storage memory, an input device, an output device, a processor and a bus for coupling the random access memory, the mass storage memory, the input device, the output device and the processor;
an operating system, executable by the processor, for controlling the functions of the computer system;
a database structure, comprising a database image residing in the mass storage memory and a translation table and a record information table residing in the random access memory, the database image comprising a plurality of partitions, the partitions comprising a plurality of subpartitions, the subpartitions comprising a plurality of compacted records, the compacted records comprising a plurality of fields, the fields comprising a plurality of compacted data values, the record information table equating the partitions and subpartitions to a plurality of record type designations, such that a particular one of the partitions comprises a subset of the compacted records all having an identical length and a particular one of the subpartitions comprises a subset of the compacted records all having a particular one of the record type designations, the particular record type designation specifying a plurality of pack methods dependent on the fields;
an access subsystem, executable by the processor, for performing a user request entered on the input device, which comprises reading a subset of the compacted records from the database image, reading the compacted data values from these compacted records, translating the compacted data values into the uncompacted data values and outputting the uncompacted data values to the output device.
56. A database system according to claim 55, which further comprises a build subsystem, executable by the processor, for creating the database image, the record information table and the translation table from a database, the database comprising a plurality of records, each of the records containing the uncompacted data values.
57. A database system according to claim 56, wherein the build subsystem includes a build preprocessor for inputting data into the database.
58. A database system according to claim 55, further comprising:
a remote device, comprising a second random access memory, a non-volatile memory, a second input device, a second output device, a second processor and a second bus for coupling the second random access memory, the non-volatile memory, the second input device, the second output device and the second processor, for accessing data in the database image at a location separate from that of the computer system;
a copy of the record information table and translation table, stored in the non-volatile memory; and
a communications channel connecting the remote device to the computer system.
59. A database system according to claim 58, wherein a partial copy of the database image is stored in the nonvolatile memory.
60. A database system according to claim 55, further comprising:
a database management system; and
an uncompacted database accessible by the database management system, such that deletes, inserts and updates are performed both by the access subsystem on the database image and by the database management system on the uncompacted database, a subset of database access requests are processed only by the access subsystem on the database image and database access requests which are not in the subset are processed only by the database management system on the uncompacted database.
61. A database system according to claim 55, wherein the access subsystem further comprises:
a statistical analysis subprocess for reading the table to derive information about relationships between the uncompacted data values; and
a reporting subprocess for creating a report based on the information derived by the statistical analysis subprocess and transmitting the report to the output device.
62. A database access method, comprising:
providing a user access request to a computer system comprising a mass storage memory, a random access memory, a database image stored in the mass storage memory, and a record information table and a translation table stored in the random access memory, the database image comprising a plurality of partitions, the partitions comprising a plurality of subpartitions, the subpartitions comprising a plurality of compacted records, the compacted records comprising a plurality of fields, the fields comprising a plurality of compacted data values, the record information table equating the partitions and subpartitions to a plurality of record type designations, such that a particular one of the partitions comprises a subset of the compacted records all having an identical length and a particular one of the subpartitions comprises a subset of the compacted records all having a particular one of the record type designations specifying a plurality of pack methods dependent on the fields;
identifying from the user access request a requested one of the fields and a requested one of the compacted data values;
searching the database image for a requested one of the compacted records, such that the contents of the requested field corresponds to the requested compacted data value; and
reading the requested compacted record from the database image.
63. A database access method according to claim 62, further comprising:
preparing a report which specifies the frequency of occurrence of the data values within the field, based on the translation table; and
outputting the report to the user.
64. A database access method according to claim 62, wherein the providing step comprises the steps of:
inputting a user request to a remote device, the remote device comprising a non-volatile memory in which are stored a second translation table identical to the translation table and a second record information table identical to the record information table, the remote device being in a location separate from the computer system and connected to the computer system with a communications channel;
interpreting the user access request, using the remote device, to determine the requested field and a requested data value within the requested field;
determining a compaction method which was applied to the requested field during the build of the database image;
reading a requested compacted data value from the second translation table corresponding to the requested data value; and
transmitting a message to the computer system via the communications channel, the message identifying the requested field and the requested compacted data value.
65. A database access method according to claim 64, wherein the identifying step comprises:
receiving the message from the remote device over the communications channel to the computer system; and
determining from the message the requested field and the requested compacted data value.
66. A database access method according to claim 64 wherein the transmitting step comprises:
creating a storage format identifier from the second record information table based on the requested field and the compaction method applied to the requested data value to create the requested compacted data value;
creating a field identifier based on the requested field; and
sending the message from the remote device to the computer system over the communications channel, the message comprising the requested compacted data value, the storage format identifier and the field identifier.
67. A database access method according to claim 66, wherein the storage format identifier is a type byte and a partition number, such that the type byte describes a subpartition which contains the requested compacted data value and the partition number describes a partition of the database image which contains the subpartition.
68. A database access method according to claim 66, wherein the field identifier is a field identification bit map comprising a plurality of bits which are in one-to-one correspondence with a plurality of fields of the database image, a specific bit corresponding to the requested field being turned-on to indicate that the requested field is transmitted.
69. A method of accessing a database comprising the following steps:
specifying a compacted data record to be retrieved from a database; retrieving the compacted data record from within a partition and a subpartition of the database;
determining a partition number associated with the partition and a subpartition number associated with the subpartition;
specifying a field of the compacted data record;
reading a compacted data value from the field;
determining a field number associated with the field;
locating a pack method in a record information table, the pack method dependent upon the partition number, the subpartition number and the field number; and
creating a data value from the pack method and the compacted data value, the data value associated with the specified compacted data record.
70. A database access method according to claim 69 wherein the pack method specifies a storage number; and
the creating step comprises the following steps:
locating a code in a storage definition table, the code equated to the storage number;
calculating a numeric equivalent associated with the compacted data value and the code; and
locating a data value in a translation table, the data value equated to the numeric equivalent.
71. A database access method according to claim 69 wherein the pack method specifies a pattern number; and
the creating step comprises the following steps:
locating a position number and a position value in a pattern definition table, the position number and the position value equated with the pattern number;
forming a new character position at a location in the compacted data value consistent with the position number; and
inserting the position value at the new character position.
72. A database access method according to claim 69 wherein the pack method specifies a storage number; and
the creating step comprises the following steps:
locating a code in a storage definition table, the code equated to the storage number;
calculating a numeric equivalent associated with the compacted data value and the code;
locating a field number combination in a field combination table, the field number combination comprising the field number and a second field number; and
locating a data value combination in a multiple-field translation table, the data value combination equated to the numeric equivalent and comprising the data value.
73. A database access method according to claim 69 wherein the pack method specifies a storage number; and
the creating step comprises the following steps:
locating a status value in a text table, the status value equated to the field number and indicating that the associated field is text;
locating a code in a storage definition table, the code equated to the storage number;
calculating a numeric equivalent associated with the compacted data value and the code; and
locating a data value in a text translation table, the data value equated to the numeric equivalent.
74. A database access method according to claim 69 wherein the pack method specifies a storage number; and
the creating step comprises the following steps:
locating a first status value in a mostly numeric table, the first status value equated to the field number and indicating that the associated field is mostly numeric;
locating a second status value in a storage definition table, the second status value equated to the storage number; and
converting the compacted data value to character values if the second status value indicates that the compacted data value is numeric.
75. A method of compacting a database containing data values stored in a plurality of records wherein each record contains a plurality of fields, said method comprising the following steps:
selecting a field of the database from the plurality of fields, the field containing a plurality of data values corresponding to the plurality of records of the database;
analyzing the data values within the field to determine a first compaction characteristic for the field;
analyzing the data values within the field to determine a second compaction characteristic for the field, wherein one said first and second compaction characteristics is based at least in part on the number of times a type of data occurs in the field;
applying a first compaction method to the field when the first compaction characteristic satisfies a first criterion;
applying a second compaction method to the field when the second compaction characteristic satisfies a second criterion; and
storing a set of compacted data values within a group of storage partitions, the set of compacted data values representing the data values which have been compacted according to the first and second compaction methods.
76. The method of compacting a database according to claim 75 wherein the first compaction method is numeric substitution and the second compaction method is pattern suppression.
77. The method of compacting a database according to claim 75 further comprising the following steps:
selecting a second field of the database from the plurality of fields, the second field containing a second plurality of data values corresponding to the plurality of records of the database;
analyzing the second plurality of data values within the second field to determine a third compaction characteristic for the second field;
applying a third compaction method to the second field when the third compaction characteristic for the second field satisfies a third criterion;
storing a second set of compacted data values within the group of storage partitions, the second set of compacted data values representing the second plurality of data values which have been compacted according to the third compaction method.
78. A method of compacting a database containing data values stored in a plurality of records wherein each record contains a plurality of fields, said method comprising the following steps:
selecting a first field of the database from the plurality of fields, the first field containing a first plurality of data values corresponding to the plurality of records of the database;
applying a first compaction method to the first plurality of data values within the first field to create a first set of compacted records;
selecting a second field of the database from the plurality of fields, the second field containing a second plurality of data values corresponding to the plurality of records of the database;
applying a second compaction method to the second plurality of data values within the second field to create a second set of compacted records, the second compaction method different than the first compaction method, wherein at least one of said compaction methods is selected based at least in part on the number of times a type of data occurs in the corresponding field; and
storing the first and second set of compacted records to form a compacted database image.
79. The method of compacting a database according to claim 78 wherein:
the step of applying a first compaction method further comprises the following steps:
analyzing the first plurality of data values within the first field to determine a first compaction characteristic for the first field; and
applying a first compaction method to the first field when the first compaction characteristic satisfies a first criterion; and
the step of applying a second compaction method further comprises the following steps:
analyzing the second plurality of data values within the second field to determine a second compaction characteristic for the second field; and
applying a second compaction method to the second field when the second compaction characteristic satisfies a second criterion.
80. The method of compacting a database according to claim 78 wherein the first compaction method is single-field encoding and the second compaction method is numeric substitution.
81. A distributed network for accessing and transmitting a compacted database image between a host computer and a remote device comprising:
a host computer comprising a CPU, a host memory storage device, and a communication link;
a database image residing in the host memory storage device comprising a plurality of compacted records corresponding to a database;
a first record information table and a first translation table stored in the host memory storage device, the first record information table and the first translation table associated with the database image to enable translation of the compacted records into uncompacted records;
a remote computer comprising a CPU and a remote memory storage device, the remote computer communicating with the host computer over the communication link; and
a second record information table and a second translation table stored in the remote memory storage device, the second record information table and the second translation table associated with the database image to enable translation of compacted records transmitted over the communication link to the remote computer.
82. The distributed network of claim 81 wherein the second record information table and the second translation table are a subset of the first record information table and the first translation table, the information contained in the second record information table and the second translation table corresponding to a limited set of operations performed by the remote computer.
83. The distributed network of claim 82 wherein each compacted record has a unique translation designation attached thereto, the translation designation corresponding to a particular location of the second record information table, the particular location of the second record information table providing details of compaction methods used to compact a corresponding compacted record.
84. The distributed network of claim 83 wherein the translation designation comprises a storage partition number for the compacted record and a type byte indicating a method of compaction for the compacted record.
85. The distributed network of claim 83 wherein the translation designation for each compacted record transmitted over the communication link has a length which is a fixed number of bytes.
Description
BACKGROUND OF THE INVENTION
This invention relates to databases. In particular, this invention relates to an improved database storage structure.
A database is a collection of interrelated data, typically stored on a computer, to serve multiple applications. Data in a database is logically represented as a collection of one or more tables, composed of a series of rows and columns. Each row in the table, called a record, represents a collection of related data. Each column in the table, called a field, represents a particular type of data. Thus, each row is composed of a series of data values, one data value from each field.
A database is organized from two perspectives. A logical perspective describes how the database is organized from a user's viewpoint. A physical perspective describes how data is actually recorded in computer storage. The prior art describes various techniques designed to alter the physical organization of a database while maintaining the same logical perspective, in order to reduce computer storage requirements. One technique, data compression, is well-known in the prior art. Peter Alsberg's paper, "Space and Time Savings Through Large Database Compression and Dynamic Restructuring," Proceedings of the IEEE, Vol. 63, No. 8, August 1975, pp. 1114-1122, describes the use of binary number codes to represent data consisting of character strings. Alsberg noted that when specific data values occur repeatedly, it is feasible to use a variable-length compression code for that field. Using shorter codes for the frequently occurring data values and longer codes for the infrequently occurring data values achieves greater compression. The paper by Dennis Severance, "A Practitioner's Guide to Database Compression," Information Systems, Vol. 8, No. 1, 1983, pp. 51-62, describes a similar binary encoding compression scheme. Severance outlines a method where the data values are ordered by probability of occurrence and then assigned a variable-bit-length binary code using Huffman coding, a well-known optimum code for this purpose.
Besides data compression, pattern recognition can also be used to reduce the data storage requirements of a database without altering the logical organization of the database. In Fred McFadden's book, "Database Management," 1983, Benjamin/Cummings Publishing Company, a technique called pattern substitution is described. This technique first identifies repeating sequences of characters that occur within a particular field, then replaces these sequences of characters by a single character which represents the pattern.
Besides reducing data storage size, other techniques for altering the physical organization of a database to allow faster data access have been described. For example, data records may be grouped in a way which allows data items which are accessed more frequently to be stored on the fastest storage devices. This can be achieved by splitting the stored records into separate segments and allocating separate segments to separate physical storage devices, some of which permit faster data access than others. As another example, records can be physically grouped together if they are frequently accessed together, such as grouping records on the same disk sector or disk track. In this manner, fewer disk accesses are needed to transfer data to or from the main computer memory for a particular application. This technique is described in McFadden's book, "Database Management," cited above.
The prior art also describes techniques for altering the logical organization of a database in order to create a more efficient physical organization. As an example, W. Kent, in his paper, "Choices in Practical Data Design," Proceedings: Very Large Data Bases, Sep. 8-10, 1982, pp. 165-180, describes field design options. Kent describes alternative representations of the same data relationships. Each specific data type can be represented by a separate field or grouped with other data types into a combined field. In Kent's article, data types are either combined into a single field or separated into distinct fields based on the known relationships between the data types. For example, "date" can be a single field, or date information can be represented by three separate fields: "month," "day" and "year."
SUMMARY OF THE INVENTION
The present invention is directed to an improved physical structure for databases which are logically composed of a series of rows and columns and a method for creating the structure. A conventional database typically makes inefficient use of computer storage space and requires a significant amount of data access time. By redesigning the physical structure of such a database, the same information can be recorded in less space and can be accessed faster. In accordance with the present invention, this redesign is accomplished, in part, by the automatic selection or rejection of one or more compaction methods to be applied to individual database fields. In a preferred embodiment, there are five compaction methods which can be applied to the database fields: single-field encoding, multiple-field combination, pattern suppression, numerical substitution and text compression. A particular compaction method is applied to a particular field only if the benefit from storage savings is sufficiently large to offset the penalty in overhead storage and increased storage complexity. The overhead storage is mostly due to translation tables needed to convert compacted data back to original form. A favorable tradeoff between storage savings and overhead is achieved by setting criteria for each compaction method which must be satisfied by the data within a particular field. That is, various data-dependent characteristics are calculated for each field, and these field characteristics are then compared to specific compaction method criteria to determine which compaction methods are applicable to which fields. In this manner, a system is provided whereby a compaction method or methods for each field are automatically selected to yield favorable results. The method or methods are then applied to each record of the database to create equivalent compacted records, requiring less storage.
In accordance with another aspect of this invention, data access performance can be improved over a conventional database by a method of grouping compacted records according to record length, where record length is the number of storage bytes required for a compacted record. Each equal record-length group is written to a common storage area, called a partition. This results in a "database image" containing multiple partitions, each partition containing records of the same record length. This same-length record structure simplifies the data processing required to read, modify, delete and add records and speeds database access.
In accordance with a further aspect of this invention, various compaction methods convert data values into equivalent compacted data values which require less storage space. The single-field encoding method accomplishes this by assigning a code to each different data value in a field. A field which can be encoded is also a candidate for the multiple-field combining method. The multiple-field combining method combines two or more fields into a single field by assigning a unique code to each different combination of data values which occur in a single record, within the fields to be combined. The code assigned to a data value combination is a reduced storage equivalent of the data value combination. The data value combination can be reconstructed from the code, and the code requires less storage space in the database. That is, the codes generated by the multiple-field combining method become compacted data values that replace each of the data value combinations from the fields to be combined, creating a single combined field.
A field which cannot be encoded might be compacted by the pattern suppression method. For pattern suppression, a single character repeatedly occurring in the same position within a field is identified as a pattern. Also, multiple characters that repeatedly occur together in a single record in the same positions within a field are identified as a pattern. These characters need not be contiguous within the field. A pattern is removed from each record in which it occurs and a designation of the pattern removed is associated with the record. In a preferred embodiment, a "type byte" at the end of each compacted record, used to specify the storage method for each field, is used to designate which patterns have been removed from each field. This saves the storage space required by the pattern and there is no need to use a substituted character.
A method of encoding using integral-byte length codes is used in the single-field encoding and multiple-field combination compaction methods. Hence, an item to be encoded can be a data value (single-field encoding) or a data value combination (multiple-field combination). Encoding begins by selection of a field to be encoded. A field may be a combined field containing data value combinations as a result of the multiple-field combining. The different items to be encoded are identified, and the relative frequency of occurrence of the items within the field is determined. A unique variable-byte length code is then assigned to each different item, where the more frequently occurring items are assigned smaller codes. The codes then replace all occurrences of the items in the field to be encoded.
It should be noted that run-length encoding schemes, such as Huffman codes, have variable-bit length codes and are often used to encode data streams. In databases, the database records may consist of many fields. If a number of these fields are encoded using a variable-bit length code, the overall bit length or storage length of each record will vary widely. The partitioning of compacted records is impractical if the number of possible record lengths is not limited. Hence, a tradeoff exists between the optimum compaction achieved by variable-bit length codes and the resulting widely variable record lengths. In accordance with a preferred embodiment of the present invention, the partitioning of compacted records is made practical by using a variable-length code of integral-byte length.
Another aspect of this invention relates to a database structure which includes a database image and various tables for uncompacting the data in the database image. The database image is made up of partitions. Each partition contains compacted records of the same length. Within each partition are subpartitions of compacted records using the same storage configurations for each field in the records. The compacted records are all of integral-byte length and contain the compacted data values created by the various compaction methods. This database structure resides on a readable computer storage medium, and is a reduced storage version of a conventional database.
A further aspect of this invention relates to a complete database system which is implemented on a conventional computer system using the database structure described above. The database image resides in the computer system's mass storage, and the tables reside in the computer system's random access memory. A conventional operating system controls conventional computer system functions such as program execution and input/output. An access program is required to interpret and perform user requests such as data reads, insertions, deletions and updates.
Further features and advantages of the present invention will become apparent from the detailed description of preferred embodiments which follow, when taken together with the appended figures and claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1A-E show an example of a database from a logical perspective and FIGS. 1F-I show the results of applying an embodiment of the present invention to this database;
FIG. 2 comprising views shown in FIGS. 2A-B are an illustration showing a physical data structure for a database using the example database of FIGS. 1A-E;
FIG. 3 is an illustration of the physical data structure of an example of the database image created according to the present invention;
FIG. 4 is a top-level block diagram showing an embodiment of a computer-implemented database system according to the present invention;
FIG. 5 is a top-level functional flow diagram showing an embodiment of the database image build process according to the present invention;
FIG. 6 comprising views shown in FIGS. 6A-E are tables summarizing compaction methods, preferred field characteristics and compaction criteria according to the present invention;
FIG. 7 is a top-level flowchart of a database image build process in accordance with a preferred embodiment of the invention;
FIG. 8 is a top-level flowchart of a preferred field characteristic test process used to determine which fields are amenable to particular compaction methods;
FIG. 9 is a detailed flowchart of a preferred field characteristic test pertaining to the single-field encoding compaction method;
FIG. 10 is a detailed flowchart of a preferred field characteristic test pertaining to the multiple-field combining compaction method;
FIG. 11 comprising views shown in FIGS. 11A-E show a detailed flowchart of a preferred field characteristic test pertaining to the pattern suppression compaction method;
FIG. 12 is a detailed flowchart of a preferred field characteristic test pertaining to the numeric substitution compaction method;
FIG. 13 is a detailed flowchart of a preferred field characteristic test pertaining to the text compression compaction method;
FIG. 14 is a detailed flowchart of a preferred translation table creation process applicable to single-field encoding and to multiple-field combining compaction methods;
FIG. 15 comprising views shown in FIGS. 15A-B show a detailed flowchart of a preferred text parsing process applicable to the text compression compaction method;
FIG. 16 comprising views shown in FIGS. 16A-C show a detailed flowchart of a preferred process for creating compacted records for the database image;
FIG. 17 is a detailed flowchart of a preferred pattern suppression process;
FIG. 18A is a detailed flowchart of a preferred text compression process;
FIG. 18B is a chart of a preferred text compression encoding scheme;
FIG. 19 is a detailed flowchart for a preferred code determination process;
FIG. 20 comprising views shown in FIGS. 20A-B show a detailed flowchart of a preferred process for generating the integral-byte length codes, including single-byte, variable-byte and combination codes;
FIG. 21 comprising views shown in FIGS. 21A-B show a detailed flowchart of a preferred storage partitioning process advantageously used to produce a database image;
FIG. 22 comprising views shown in FIGS. 22A-C show a detailed flowchart of a preferred process for creating the index access structure for the database image;
FIG. 23A is a top-level diagram illustrating an embodiment of the database data structures and information flow between data structures according to the present invention;
FIGS. 23B-L show examples of the tables referenced in FIG. 23A;
FIG. 23M is a table of a storage method identification scheme;
FIG. 24 is a top-level flowchart of a preferred database image access process for data reads, data deletes and data updates;
FIG. 25 comprising views shown in FIGS. 25A-B show a detailed flowchart of a preferred data compaction subprocess used in the access process of FIG. 23A;
FIG. 26 comprising views shown in FIGS. 26A-B show a detailed flowchart of a preferred data search subprocess used in the access process of FIG. 23A;
FIG. 27 is a detailed flowchart of a preferred record processing subprocess used in the access process of FIG. 23A;
FIG. 28 is a detailed flowchart of a preferred access process for data updates; and
FIG. 29 is a detailed flowchart of a preferred access process for data inserts.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
In accordance with the present invention, a database image is created from a database logically organized as one or more tables of rows and columns. Shown in FIGS. 1A-E are tables for the database "COMPANY." Each table contains a collection of related data. The dashed lines at the bottom of the tables indicate that just the first few entries of each table are shown. For example, the table "EMPLOYEE.sub.-- INFO," contains all data related to Company employees. Each row of a table contains data on one specific member of the related data collection. For example, row 1 of the table "EMPLOYEE.sub.-- INFO" contains all data describing employee Arnold Benjamin. The columns of a particular table contain a specific type of data common to the entire data collection. For example, the column "SSN" contains the social security numbers for all employees. The intersection of a particular row and column contains a data value describing a very specific aspect of one member of the related data collection. For example, in FIG. 1A, column "EID" of row 1 of table "EMPLOYEE.sub.-- INFO" specifies "the employee identification number of Arnold Benjamin." The associated data value is the character string "10639." Note that a blank is also a data value. For example, in FIG. 1A, employee "Arnold Benjamin" does not have a middle initial. Hence, a blank is a data value in the "MI" field. Tables like FIGS. 1A-E represent the logical or human perspective of a database. There are various ways to logically represent the same data relationships. FIGS. 1A-E show multiple, interrelated tables. For simplicity, the preferred embodiments of the database structure and method are described here in terms of a single table. These operations can be extended to multiple tables in a manner described below.
An example of the type of physical data structure which might be used by a conventional database is shown in FIG. 2. FIG. 2 corresponds to the logical database structure of the EID.sub.-- WRK.sub.-- ASSGN table of FIG. 1C. In this particular example, data records, which correspond to the table rows, are ordered numerically by a particular field, which corresponds to a table column. The dots between records represent records which are not shown in the figure. The field used for the ordering is "EID," and the records are stored sequentially in blocks of disk storage. The length of each record, that is, the amount of storage the record requires, varies from record to record. The means of storage of data in the original database from which the database image is created is unimportant. For example, the data may be stored in a user specific file or in a commercial database. The database may already be in existence, or data can be directly input into a preprocessor which compiles data and creates a database image when sufficient data is available. The nature of the database fields is also unimportant. These may be numeric or alphanumeric, fixed form or free form. The database, however, must be capable of being logically organized in a row-column format.
An example of the physical structure of a database image made in accordance with a preferred embodiment of the present invention is shown in FIG. 3. This figure illustrates that all records in a partition are of equal length. The vertical axis labels the various partitions by partition number. The horizontal axis shows the length (storage required) for each record, including the type byte, in a particular partition. The dots at the end of the records indicate additional records in each partition which are not shown in the figure. Additional data structures, which include translation, auxiliary and access tables, are described below. The database image consists of compacted records, each of which is a reduced storage version of a database record. The compacted records contain fields which may be fewer in number than those in the original database due to field-combinations during the database image build. The data within a particular field of the database image may be character strings, binary codes, binary numbers or a combination of these. The database image records are organized into computer-storage partitions, where each partition contains records of equal length.
FIG. 4 illustrates a preferred embodiment of the database structure, which includes the database image, as configured for a typical computer system. The build subsystem 10 creates a database image 12 in mass storage 14 and a variety of tables in random access memory (RAM) 16. These tables include translation tables 20, auxiliary tables 22 and access tables 24. The build subsystem consists of a database image build processor 26. The database image build processor 26 assumes the existence of a database 30 from which to create the database image 12. Optionally, the database image build processor 26 is interfaced to a build preprocessor 32 used for constructing an initial database 30 if none exists. Alternatively, a conventional database management system (DBMS) 34 can be used to initially construct a database 30. The database image build processor 26 need not remain on the system after the initial build is complete. A rebuild, however, may be necessary after many data insertions, deletions and updates are done on the database image. Such operations may eventually change the frequency of occurrence of data values in a field to the extent that the order of data values in the corresponding translation table is incorrect (i.e. some of the more frequently occurring data values are assigned greater length codes). This may result in degraded performance characteristics for the database image. The database image build processor 26 would be needed to perform a rebuild in these circumstances.
FIG. 4 also illustrates the access substructure for the database. A user interface is achieved through input/output devices 36. The access subsystem 40 provides the means of accessing the database image 12, the translation tables 20, auxiliary tables 22, and access tables 24 and transferring data to and from the input/output devices 36. The translation tables 20 consist of single-field encoding translation tables, multiple-field encoding translation tables, partial translation tables and text field translation tables. The auxiliary tables 22 consist of the record information table, pattern information tables, the mostly numeric fields table, and the combined fields table. The access tables 24 consist of index tables and page tables. The access subsystem 40 is the database image access processor 42. Optionally, the database image access processor 42 is interfaced to an access language preprocessor 44 which provides the database image user with a standardized data access language, such as SQL (Structured Query Language). As part of the access subsystem 40, the DBMS 34 might remain on the system to operate on the database 30 in parallel to the database image access processor 42 operating on the database image 12. In this mode, the database image access processor 42 might serve as a background process for the DBMS 34, providing quick access for routine data queries while the DBMS 34 provides more sophisticated data reports. Alternatively, the database image 12 might provide a backup role, providing the capability of regenerating the database 30 in the event of a system failure. Also shown in FIG. 4 is the CPU 46 which provides the hardware processing function for all database operations. This processing function is generally under the control of the operating system 50. The input/output devices 36 include such devices as a keyboard, printer, and cathode-ray tube (CRT). The mass storage 14 would include such devices as hard disk, floppy disk or compact disk - read only memory (CD-ROM).
Although the database image 12 typically resides in mass storage 14, if small enough, the database image may reside in RAM 16. Translation tables 20, auxiliary tables 22 and access tables 24, which are critical to fast data access, are preferably stored in RAM 16, but, if too large for RAM 16, the translation tables 20 might be split between RAM 16 and mass storage 14. In that case, translation table entries corresponding to the most frequently occurring data would still reside in RAM 16. All data structures (the database image 12, translation tables 20, auxiliary tables 22 and access tables 24) are permanently stored in mass storage, and those data structures which reside in RAM 16 during operation of the database are loaded into RAM 16 upon start-up of the computer system.
FIG. 5 shows that the database image is built by selectively compacting database fields. A particular compaction method is selected 54 by comparing field characteristics 56 to compaction criteria 60. The field characteristics 56 are derived from a specific database 62 and are features of the data in a particular field that determine whether a particular compaction method 54 will be beneficial. The criteria 60 are compaction-method dependent limits placed on specific field characteristics 56. A particular compaction method will be applied 52 to a particular field only if the field characteristics 56 satisfy the criteria 60 for applying that compaction method. Five compaction methods 54 are possibly applied 52 to each field in the database 62: single-field encoding, multiple-field combining, pattern suppression, numeric substitution and text compression. More than one compaction method 54 may be applied 52 to the same field. By selectively compacting 52, a favorable tradeoff is achieved between data compaction and overhead storage due to the compaction method itself. The build configuration parameters 64 are constants which are given a value by the user or are given a default value and which affect the criteria 60 described above.
FIG. 6 summarizes the build configuration parameters, field characteristics and compaction criteria used for each compaction method. The preferred values for these constants, given in FIG. 6, are starting points for creating a database image. After the database image is built using these initial criteria, there is nothing to prevent alteration of the criteria and one or more rebuilds of the database image to determine if overall compaction can be improved. One of ordinary skill in the art could easily achieve desired tradeoffs in a specific database application by using other build configuration parameters or other criteria that may work as well or possibly even better than those given in FIG. 6. It is contemplated that such other build configuration parameters or other criteria fall within the scope of the present invention.
Referring back to FIG. 5, applying the compaction method selected 52 creates translation and auxiliary tables 66 used to create compacted records 70 and to interpret database image data. The database image 72 is generated by partitioning 74, which groups compacted records according to the number of bytes of compacted data they contain. Each group is assigned a separate area of computer storage, called a "partition."
A method of accessing the data in the database image is also required. This involves constructing an index access structure. Indexes are additional data structures which make the data searches more efficient by referencing certain fields, called indexing fields. For example, in the database table of FIG. 1A, the field containing last names may be made an indexing field. This field would be sorted alphabetically and pointers to records containing each last name would be created. In this manner, records containing a specific last name could be quickly located without searching the entire database. The database image partitioning structure easily accommodates indexing because same-length records are grouped together, allowing simple algorithms to be used to locate individual records.
Referring back to FIG. 5, a preferred embodiment uses indexing 76 to create access tables 80 consisting of index and page tables for each specified indexing field. The access tables 80 are derived from the database image 72 and the translation tables 66. The index table consists of a page pointer associated with each index value, where an index value is a unique data value in the indexing field. A page pointer specifies a block of entries in the page table. Each page table entry is a record pointer. A record pointer specifies a partition number and a relative location within the partition where a record is located. In summary, specifying an index value specifies a page pointer which, via blocks of record pointers in the page table, specifies the location of records containing this index value within the database image. The database image 72, translation and auxiliary tables 66 and access tables 80 form the reconfigured physical structure of the original database 62.
As shown in FIG. 7, the database image build 82 begins by reading the build configuration parameters 84 and then testing field characteristics and selecting compaction methods 86. Then, translation tables are created 90 for fields marked for encoding 94 or fields marked for text compression 102. Next, compacted records are created 106 and grouped into partitions 110 to form the database image. The final build step creates an access structure for the database image by indexing 111, completing the build process 112.
Referring to FIG. 8, the field characteristic test process begins by selecting one of five tests 114. Each of the five tests, encoding 116, combining 120, pattern 122, numeric 124 and text 126 correspond to the five compaction methods. For each test, the field characteristics for the specific compaction method are calculated, these characteristics are compared with the applicable compaction criteria and the fields satisfying the criteria are marked for compaction by that method. The encoding test 116 determines which fields have data values which can feasibly be replaced by a code. The combining test 120 determines which encoded fields are sufficiently related with other fields to be combined into a single field. The pattern test 122 determines which fields contain one or more patterns which can be removed from the data values containing patterns. The numeric test 124 determines which fields contain mostly numeric data and the text test 126 determines which fields contain text which can be compressed. After it is determined that all tests have been performed on all fields 130, the field characteristic test procedure is finished 131.
Single-field encoding substitutes a numeric code for each data value in a particular field, with shorter-length codes substituted for the most frequently occurring data values. For ease of description, the records containing each data value of a field can be envisioned as being sorted into cubbyholes or "slots," where there is a slot allocated for each different data value occurring in the field. A record is envisioned as being placed in a slot if the record contains the data value associated with that slot. The single-field encoding method can then be described as assigning a numeric equivalent to these slots, representing the numeric equivalent with a unique binary code and substituting the code for the data value in every record contained in the slot. A translation table is constructed which equates data values with numeric equivalents.
Single-field encoding is performed on a field if the field's characteristic satisfies the encoding criterion. The relevant field characteristic for single field encoding is the number of slots, or different data values, in the field. The corresponding criterion to satisfy is a maximum allowed number of slots. There is a translation table entry for each unique code assigned to each slot and the single-field encoding criterion ensures that storage overhead resulting from translation table size does not cancel the compaction benefit from encoding.
FIG. 1C illustrates, by example, the concept of slots, records contained in slots and numeric equivalents. In FIG. 1C, the table "EID.sub.-- WRK.sub.-- ASSGN" has a field "WO" containing work order codes. The records for work orders from BC00009 through BC00014, BR00015 and BR00016 are included in this figure. There are four records in the slot BC00012, three records each in slots BC00011 and BR00015, and two records each in slots BC00009, BC00010, BC00013, BC00014, and BR00016. A numeric equivalent is sequentially assigned to each slot in decreasing order of the number of records in a slot. Assuming the sample shown in FIG. 1C of the EID.sub.-- WRK.sub.-- ASSGN table represents all records, the resulting translation table would be as shown in FIG. 1F. To complete the single-field encoding process, these numeric equivalents would be represented with binary codes. Shorter-length codes could be used to represent the smaller value numeric equivalents.
In the example, when a sufficient number of records are associated with work orders, the entries in the WO field can be replaced with an equivalent binary code. The condition where many employees are assigned to work on a specific job is common in several industries. Thus, in practice, the number of records in these work order slots could be significantly higher than demonstrated in FIG. 1C, as denoted by the dashed line at the bottom of the table. In such cases, replacement of the work order value in the EID.sub.-- WRK.sub.-- ASSGN table with a binary code is justified.
If a database is logically represented as multiple, interrelated database tables, the translation table for a field which is common to several database tables can be handled in a variety of ways. The field may have a translation table for each occurrence in a database table, a single translation table which covers all occurrences or a few translation tables, some of which are multiply used by several database tables. Multiple-usage depends on a given translation table meeting the criteria for several database tables.
For single-field encoding, the relevant field characteristic, as shown in FIG. 6, is the number of different data values, or slots, in the field. Also shown in FIG. 6 is the compaction criteria for single-field encoding, which is the maximum allowed number of slots. This criteria limits the number of entries in the corresponding translation table. In a preferred embodiment, this maximum equals C.sub.1 .times.(number of records).sup.k, k<1, where C.sub.1 is a build configuration parameter. With this criteria, the translation table size grows with larger databases, but not in direct proportion to the size of the database. In the most preferred embodiment, C.sub.1 equals 31.623 and k=0.5. This value sets the translation table size for a 100,000 record database to 10,000 entries or 10 percent of the total number of database records. For a 100 million record database, the translation table size would be 316,228 entries or 0.3 percent of the database size. In a preferred embodiment, C.sub.1 is in the range of 10 to 100, and k is in the range of 0.25 to 0.75. Other values for the build configuration parameter C.sub.1 and k can be used in this formula, and other completely different formulas can be used to achieve different tradeoffs. For example, the single-field encoding criterion might be made proportional to the number of bytes in system RAM, where the translation tables would normally reside. As another example, this criterion might be made dependent on the number of fields in the database. One of ordinary skill in the art will recognize that the criteria design for single field encoding may be modified in a number of ways, dependent on the particular system resources and the particular application.
As shown in FIG. 9, the encoding test 132 begins by selecting a particular field 134, reading the data values from the database records 136 and computing the number of slots 140, that is, the number of different data values in the field. At this point, the field characteristic for the single-field encoding method has been determined. This characteristic is compared to the criterion 142, a maximum. If the number of slots is less than this maximum, then this field is marked for encoding 144. Otherwise, the field is rejected for this compaction method. If a field is rejected for single-field encoding, it is tested to see if the partial encoding criterion is met 146. If so, the field is marked for partial encoding 150. When all fields are tested in this manner 152, the encoding test process is complete 153.
Partial encoding associates codes for most of a field's data values and leaves the remaining data values unencoded. Partial encoding is feasible if the largest slots, those containing the most records, account for a significant portion of the total number of records in the database. Hence, the field characteristic for partial encoding, as shown in FIG. 6, is the total number of records contained in the first C.sub.2 largest slots. In a preferred embodiment, C.sub.2 is 256. This value of C.sub.2 is chosen because a one byte code can then be associated with these largest slots. Also shown in FIG. 6 is the partial encoding criterion, a minimum number of records. In a preferred embodiment, this criterion is C.sub.3 .times.(total number of records in the database). That is, the C.sub.2 largest slots in a field must contain a minimum fraction, C.sub.3, of all records in the database. In a preferred embodiment, C.sub.3 is 0.33. This value is chosen because one-third is a large enough fraction of the database to make partial encoding overall worthwhile. Other values for C.sub.2 and C.sub.3 can be used. For example, C.sub.2 can be set to 512. Then the first and second bytes of the single-byte code could be used for encoding. Thus, using the same one-byte code length and using only a 512 entry translation table, the likelihood that a field is partially encoded is increased. Another example which increases the likelihood that a field rejected for encoding would be partially encoded is to lower C.sub.3 to 0.25. Then the C.sub.2 largest slots would only have to contain one fourth of the total database records. For a large database, this may still result in significant compaction. The partial encoding criterion need not be a constant. For example, the criterion may be made proportional to: (total number of database records).sup.1/2. In this manner, the total number of partially encoded records does not have to increase in direct proportion to the database size. As noted above with respect to the single-field encoding criterion, one of ordinary skill in the art will recognize other values for C.sub.2 and C.sub.3, other characteristics and other criteria may be chosen to achieve other system tradeoffs.
The multiple-field combining method creates a single field from two or more fields by substitution of a numeric code for each data value combination occurring in two or more fields within the same record. Shorter-length codes are substituted for the most frequently occurring data value combinations. For ease of description, the records containing each data value combination can be envisioned as being sorted into slots, where there is a slot allocated for each different data value combination occurring in the same record, within the fields to be combined. The multiple-field combining method can then be described as assigning a code to each slot and substituting the slot code into every record contained in the slot. Combining fields, however, may result in so many different slots that the combined field cannot practically be encoded. The tradeoff involved is similar to that for single-field encoding discussed above--compaction benefit versus the storage penalty due to the translation table. Potentially, the number of different slots from two fields is the smaller of: (1) the product of the number of slots for each individual field; or (2) the total number of records in the database. Combining two or more fields into one compacted field, however, is practical if these fields are sufficiently related. Fields are related, in this context, when there is a relatively limited number of different data value combinations or slots for these fields. For example, in a database consisting of the physical characteristics of individual persons, "Height" and "Weight" might be two fields, consisting of the recorded individual heights and weights, respectively. The possible combinations of recorded height and weight are numerous. However, there is a strong relationship between height and weight. It is unlikely that there are any individuals in the slot "5 feet tall, 200 pounds" or the slot "6 feet tall, 90 pounds." Therefore, if these fields were combined into a field "Height/Weight," the resulting number of slots is likely to be such that an individual's recorded height and weight could be represented by a single code. In a preferred embodiment of the present invention, the field characteristics used to determine if two fields are sufficiently related to combine are based on record per slot statistics. Specifically, the characteristics for a field combination are the mean, .mu., and the standard deviation, .sigma., of the number of records per slot. The corresponding criteria are: .mu.>.theta./2; .mu.+.sigma.>.theta.; and .mu.>.sigma., where the threshold .theta. is a function of database size. Together, these three criteria restrict the distribution of records per slot. Field pairs are only combined if the greatest concentration of records in the fewest slots result. This simplifies coding because shorter codes are used for the largest slots, as noted below. The multiple-field combination characteristics are tested two fields at a time, for all possible field pairs in the database. However, overlapping field pairs may satisfy the criteria, so a figure-of-merit is used to decide which of these field pairs to combine. The figure-of-merit, .GAMMA.=.mu.+.sigma., is a measure of the concentration of records per slot and is computed for each field pair which satisfies the criteria. Pairs having the largest .GAMMA. are combined, maximizing the total .GAMMA.. After fields are combined, repeated "passes" must be made to determine if these new combined fields can be combined with other fields.
For multiple-field combination, a pair of fields would be combined and encoded as a single field if the number of records per slot for the two fields is sufficiently concentrated. Unfortunately, it is impractical to simply determine the number of different data value pairs, or slots, for every pair of fields in the database and then compute the number of records per slot. There are many combinations of field pairs to consider in even a small database. For example, at least 7 of the 12 fields of the EMPLOYEE.sub.-- INFO table shown in FIG. 1 demonstrate properties which often result in single-field encoding. There are N!/(n!(N-n)!) different combinations of N items taken n items at a time. Hence, for 7 fields taken 2 at a time, there are 7!/(2!5!)=21 unique field pairs. This means 21 characteristic tests for the first pass alone. Subsequent passes treat a field pair satisfying the criteria as a single field. All field pairs are re-tested, where a field "pair" would be either two single fields; a single field and a combination field; or two combination fields. These passes continue until no more field pairs satisfy the combining criteria. The straightforward computation of records per slot would involve a very large amount of computer processing time. Instead, a preferred embodiment of this invention uses statistical field characteristics derived from a sample of the database, rather than performing direct computations on the entire database.
A sample database is a subset of the records contained in the full database. One method of sampling is to only process every Mth record of the full database, where M is the sample interval. This uniform sampling of the full database has a disadvantage in that a single event may be responsible for a number of contiguous records in the database. In a preferred embodiment, in order to provide a good statistical sample of the full database, staggered sampling is performed. For staggered sampling, the requirement is that the average sample interval, .lambda., equals M. For example, if M is 16, then the number of records skipped might be the repeating sequence 1, 2, 4, 7, 14, 28, 56, 1, 2, 4, 7, 14, 28, 56, . . . . The average number of records skipped is (1+2+4+7+14+28+56)/7=16. The average sample interval, .lambda., is defined as: Int[(number of records)/sample size]; where Int[ ] is the largest integer less than the quantity inside the brackets [ ]. Sample size is the number of records in the sample database. In a preferred embodiment, as shown in FIG. 6, the sample size is C.sub.4 .times.(number of records).sup.1/2, where C.sub.4 is a build configuration parameter. The sample size grows with larger databases, but not in direct proportion to the size of the database. In a preferred embodiment, C.sub.4 =31.623. This value sets the sample size for a 100,000 record database to 10,000 sample records, corresponding to 10 percent of the total number of records in the database. For a 100 million record database, the sample size is 316,228 sample records, corresponding to 0.3 percent of the total number of records in the database.
As shown in FIG. 6, the field characteristics for the multiple-field combination method are the mean, .mu., and standard deviation, .sigma., of the number of records per slot for a field pair. The criteria, also shown in FIG. 6, are: .mu.>.theta./2; .mu.+.sigma.>.theta.; .mu.>.sigma., where .theta. is a specified minimum threshold. Also as shown in FIG. 6, the figure-of-merit for deciding between field combination choices is .GAMMA.=.mu.+.sigma.. In this embodiment, .mu. is projected from the sample mean, .mu..sub.s, and .sigma. is projected from the sample standard deviation, .sigma..sub.s : .mu.=.mu..sub.s .times..lambda.; .sigma.=.sigma..sub.s .times.(.lambda.).sup.1/2. These projections hold true for slot distributions satisfying the criteria above.
In a preferred embodiment, as shown in FIG. 6, .theta.=C.sub.5 .times.(number of records).sup.k ; k<1. If the relative distribution of records per slot remains constant as the database size increases, the average records per slot will increase proportionally. In this embodiment, however, the threshold, and hence the mean records per slot, does not increase in direct proportion with database size. Thus, for larger databases, a wider distribution of records per slot is allowed. This is analogous to single-field encoding where a preferred embodiment allows the translation table to increase somewhat with larger databases. Here, a wider allowed distribution of records per slot implies the percentage of records encoded with the shortest codes is allowed to decrease somewhat with larger databases. For the most preferred embodiment, k=0.75 and C.sub.5 =(1/1000).sup.k. For a database ranging in size from 100,000 to 100,000,000 records, this results in .theta. ranging from about 30 to about 5,000 records per slot. In a preferred embodiment, C.sub.5 is in the range of 0.001 to 0.01 and k is in the range of 0.5 to 0.99. Other values of C.sub.5 can be chosen to allow fewer or greater numbers of field combinations and hence greater or fewer records per slot. The dependency of record distribution per slot on database size can be altered with other values of k. Also, other functionally different criteria might be chosen. For example, if .mu.>.sigma. and .mu.>.theta./2, the number of sample database slots is a good projection of the number of database slots. Fields also could be combined based on minimization of the number of database slots rather than maximization of the figure-of-merit .GAMMA.. This would result in control over the translation table size, as in single-field encoding, rather than control over the concentration of records per slot. One of ordinary skill in the art will recognize that the parameters and criteria disclosed above for multiple-field combination may be modified in a number of ways to achieve various tradeoffs for specific database applications.
As shown in FIG. 10, the combining test 154 begins by selecting a pair of fields marked for single-field encoding 156. Then the database is read by sampling 160, that is, skipping an average of .lambda. records at a time, where .lambda., the sample interval, is given in FIG. 6. Next, the sample mean, standard deviation and figure of merit .mu..sub.s, .sigma..sub.s and .GAMMA. for this field pair are computed 162 using the formulas of FIG. 6. If these characteristics meet the combining criteria 164, the pair is marked as a candidate for combining 166. Otherwise, the pair is rejected for this translation and another field pair is selected 156. All pairs of fields are tested in this manner 170. When all possible combinations of fields are tested, a list of candidate pairs is sorted in descending values of .GAMMA.172. By first sorting all candidate pairs by .GAMMA., the most favorable pair combinations can be made. To do this, all fields are initially set as being available for combining 174. The sorted list is read 176 beginning with the candidate pair having the largest .GAMMA.. If both fields are available for combining 180, the candidate pair is marked for combining and both fields are marked as hereafter unavailable 182. If either or both fields are unavailable, the candidate pair is not marked for combining and the availability status of the fields is unchanged. If all entries in the list have not been processed 184, the next candidate pair from the sorted list is selected for combining 176. All candidate pairs are processed in this fashion. If any candidate pairs were marked for combining 186, fields marked for combination are combined as a new field 190, and a new pass is started 156. Further passes are initiated until no new pairs are marked for combining 192. In this manner, two or more fields can eventually be combined into a single field. The field combination table is then created by listing the groups of fields which are part of each combination. The sorted list of .GAMMA. values for each pass can be printed and would provide insight into database field relationships which might otherwise be unknown or unsuspected.
Pattern suppression compaction is a feasible method to apply to fields that have not been encoded and that at least have single-position patterns, that is, the same characters occurring in the same character positions for many records. A field is first tested for single-position patterns in each character position of the field. Then a field is tested for multiple-position patterns, that is, combinations of single-position patterns occurring together in many records.
The characteristics for pattern suppression, shown in FIG. 6, are: the "per character percentage," which is the percentage of record occurrences for each of the C.sub.6 most frequently occurring single-position patterns; the "single-position pattern determination percentage," which is the total percentage of record occurrences for all of the C.sub.6 most frequently occurring single-position patterns; and the "multiple-position pattern determination percentage," which is the percentage of record occurrences for combinations of single-position patterns. The first two characteristics are determined for each character position of a field for single-position patterns. The last characteristic is determined for each different multiple-position (multi-position) pattern which occurs in a record. The corresponding compaction criteria, also given in FIG. 6, are the specified minimums, C.sub.7, C.sub.8 and C.sub.9, for these three percentages, respectively. C.sub.6 through C.sub.9 are build configuration parameters.
Characters meeting the criteria for a single-position pattern are stored as a single-position pattern and are also considered for multiple-position pattern formation. In the most preferred embodiment, C.sub.6 =4, C.sub.7 =10% and C.sub.8 =50%. That is, only the 4 most frequently occurring characters in each position are considered for pattern determination, and each of these characters must occur in at least 10% of the records. The characters satisfying the first criterion together must occur in 50% of the records to be stored as a pattern. As an example, in the work order field WO of the WORK.sub.-- ORDER.sub.-- INFO table, FIG. 1B, in the first character position, "A" occurs in 26.7 percent, "B" occurs in 53.3 percent and "Q" in 13.3 percent of the records for a total percentage of 93.3 percent of the records. Hence, the values "A", "B" and "Q" in the first character position each satisfy the 10% per character percentage and in total satisfy the 50% pattern determination percentage. The character "R" occurs in the second character position in more than 50 percent of the records, thus satisfying both criteria. The value in the third character position is always a "0," also satisfying both criteria. No values in any other character position meet the single-position pattern suppression criteria. Hence "A", "B" and "Q" are the single-character pattern values for column 1, "R" is the single-character pattern value for column 2 and "0" is the single-character pattern value for column 3.
A combination of single-position pattern characters might be a multiple-position pattern. Each of these character combinations must occur in a minimum C.sub.9 percentage of the records. In the most preferred embodiment, C.sub.9 =10%.
Continuing the example from above, the percentage of occurrences determined for all multiple-position patterns for the work order WO field in the WORK.sub.-- ORDER.sub.-- INFO table is shown in FIG. 1G. The patterns which have multiple-position pattern percentages greater than 10 percent are advantageously stored as multiple-position patterns. These patterns are "AR0," "BR0," and "B.sub.-- 0," where the ".sub.-- " refers to any character value. Note that the combinations "AR.sub.--," "BR.sub.-- " and QR.sub.-- " do not exist because 0 always occurs in the third character position.
Combinations of single-position pattern characters which are rejected as not satisfying the multiple-position pattern criterion, can be used as component combinations to create a new combination which does satisfy the criterion. A new combination is formed by intersecting two or more component combinations, such that the new combination contains the columns and values that the component combinations have in common, with the new combination's multiple-position pattern percentage being the sum of the component combination's multiple-position pattern percentages. These new combinations are tested to determine if the multiple-position pattern criterion is satisfied. Testing is started by considering the component combinations that result in new combinations with the most characters. Testing continues, considering component combinations that result in new combinations with fewer numbers of characters, until all possible intersections of component combinations greater than a single character are considered. Note that the number of characters in a combination is defined as the number of single-position pattern characters which compose the combination. Thus, for example, "B.sub.-- 0" is a two character combination.
When component combinations create a new acceptable multiple-position pattern, the components cannot be used to intersect with other components. If two new combinations exist with the same number of characters and they contain one or more component combinations in common, the new combination with the largest multiple-position pattern percentage is saved. Acceptability of a multiple-position pattern formed by intersecting component combinations is the same as for a multiple-position pattern formed directly from combinations of single-position patterns. The multiple-position pattern percentage must be greater than C.sub.9 percentage of all records. In the most preferred embodiment, C.sub.9 =10%.
Continuing the example from FIG. 1G, the component combinations from rejected single-position pattern combinations are ".sub.-- R0," "A.sub.-- 0," "QR0" and "Q.sub.-- 0.". Of all possible intersections of these components, only the intersections of components ".sub.-- R0" with "QR0" and "Q.sub.-- 0" with "QR0" create new combinations with more than a single character. The pattern ".sub.-- R0," composed of ".sub.-- R0" and "QR0," has a multiple-position pattern percentage of 6+7=13 percent, while the pattern "Q.sub.-- 0," composed of "Q.sub.-- 0" and "QR0" has a multiple-position percentage of 5+7=12 percent. Thus ".sub.-- R0" is saved as an additional multiple-position pattern. FIG. 1H summarizes the patterns which are accepted for pattern suppression in this example. A data value is tested for the existence of a pattern by comparing it first with patterns containing the most characters. FIG. 1H is presented in this order, that is by descending number of characters. The single-position pattern value "Q" in FIG. 1H has been maintained because it was not used in any multiple-position pattern. This pattern could be discarded because its percentage is less than the per character percentage C.sub.7. For this example, FIG. 1I illustrates how the records shown in the WORK.sub.-- ORDER.sub.-- INFO table of FIG. 1B are affected by pattern suppression compaction.
C.sub.6 through C.sub.9 are chosen as a tradeoff between the storage savings achieved by pattern suppression and the penalty of compacted record complexity. This record complexity is due to the number of subpartitions (different type byte values within a partition) and partitions created by pattern suppression. Each removed pattern must be noted by a different type byte, and pattern removal causes variations in record length which may generate additional partitions. Hence, these constants are chosen to reject patterns unless they are prevalent. In a preferred embodiment, the C.sub.6 is in the range of 3 to 5, C.sub.7 is in the range of 5% to 20%, C.sub.8 is in the range of 40 to 60% and C.sub.9 is in the range of 5 to 20%. Other values for these parameters, however, can be used to increase or decrease the number of patterns recognized. Alternatively, the pattern suppression criteria might be made dependent on specific system or database characteristics, such as the number of fields or number of records. One of ordinary skill in the art will recognize that the criteria and parameters disclosed above may be modified in a variety of ways.
During the pattern test, a field is also tested for the existence of a numeric pattern. A numeric pattern exists when at least two positions in the field have digits for all records. A number is formed by combining these positions and this "numeric pattern" is converted to its binary equivalent. Positions do not have to be adjacent for a numeric pattern to be identified. The binary equivalent of a numeric pattern is stored with the original form of the remaining positions. For example, suppose the first, third and fifth positions in a field are all digits, and a record contains the data value 8a4x1npq for the field. The number 841 would be converted to its binary equivalent and stored with axnpq as the field's data value for that record.
The presence of an alphanumeric "wild card" character in any position in the pattern table for a field indicates that numeric values occupy all other positions within the field which are not included in the pattern table. For example, if positions 6, 7 and 9 of a particular pattern in a pattern table are shown as wild cards, and no other characters are included in that particular pattern, then the first three bytes of a compacted data value with this pattern are the character values (e.g. in ASCII format) for positions 6, 7 and 9. The remaining portion of the compacted data value is a binary number, which when translated to ASCII yields the numeric characters which occupy all other positions of the uncompacted data value. Using FIG. 23I as another example, if the position 2 value of pattern number 3 was a wild card (rather than blank as shown), all compacted data values with this pattern number would have "B" and "0" suppressed from positions 1 and 3. Also, the first byte of the compacted data value would be the character in position 2 of the uncompacted data value. The remainder of the compacted data value would be the binary equivalent of the numeric characters in the remaining positions of the uncompacted data value.
As inferred above, a numeric pattern may be combined with single-position or multi-position patterns or may exist by itself. Also, numeric pattern conversion differs from the numeric substitution compaction method in that numeric substitution applies to every position in a field. Note that a numeric pattern is also distinguished from a single-position or multi-position pattern containing a digit, where a digit is treated like any other character. That is, a digit which frequently occurs in a position that is not all digits may be a single-position pattern or part of a multiple-position pattern. Whereas a numeric pattern only requires two or more positions of all digits.
FIG. 1I illustrates numeric pattern conversion applied to the work order "WO" field of the WORK.sub.-- ORDER.sub.-- INFO table of FIG. 1B. The third column of the table shows that the third through sixth positions of the WO field, after pattern suppression, comprise a numeric pattern. The fourth column of the table shows the data values that are not part of the numeric pattern. Each number associated with the numeric pattern is converted to its binary equivalent and stored with the remaining data values shown in the fourth column. Although the preferred embodiment of numeric pattern conversion requires that a position considered as part of the numeric pattern must contain all digits in all records, one of ordinary skill in the art will recognize that the same concept will work if less than all records contain digits in the positions considered as part of the numeric pattern. In that case, only those records containing the numeric pattern would be converted.
As shown in FIG. 11, for the pattern test 194, a field is selected 196 and checked to determine if it is marked for encoding 200. If so, pattern suppression is inapplicable, and if all fields have not been tested 202 another field is selected 196. If the field is not to be encoded, the database records are read for this field 204. Next, all character positions are scanned for pattern characters. This is done by first selecting a particular position within the field 206, and then counting the number of occurrences of each character in that position 210. For a particular position, the C.sub.6 most frequently occurring characters are identified 212. The field characteristics for that position are determined next by computing the per character percentage and pattern determination percentage 214. It is determined whether any of these characters satisfy the per character criterion 216, and if so, it is determined if the pattern determination criterion is satisfied 218. If the position satisfies both criteria, the position is marked as containing a pattern 220. All positions are tested in this fashion 222. If at least two positions are all digits 224, these positions are marked as a numeric pattern 226. After all fields are tested in this manner 202, the multiple-position test is begun.
Continuing with FIG. 11, the multiple-position test starts by selecting a field which contains more than one single-position pattern 230. If there are none, the process ends 231. Otherwise, the database records are read 232. The number of occurrences of each different combination of single-position patterns which occur together in a record is determined for the field 234. These combinations are sorted by length 236, where the length is the number of single-position patterns in the combination. Going down the list from the longest to shortest combination, each combination is compared with the multiple-position pattern determination criterion 240. If a combination satisfies the criterion, then it is marked as a multi-position pattern 242. If a combination does not satisfy the criterion, it is nevertheless retained for further processing 244. All occurring combinations are tested in this manner 246. Combinations which did not satisfy the pattern determination criterion 250 are intersected with other rejected combinations to see if there is a smaller common pattern which meets the pattern determination criterion. First, the longest rejected combination is determined 252, and a search length for intersection patterns is initialized at one less than the longest rejected combination 254. A search is made for a pattern in common with two or more component combinations equal to the search length 256. If any intersections of this length are found 260, then the occurrence count of each rejected component combination is added 261 and the total is compared with the multiple-position pattern determination criterion 262. If this intersection combination satisfies the criterion, the component combinations are discarded and the intersection is saved as a multi-position pattern 264. Otherwise, the search length is reduced by one 266. Until all component combinations have been considered 270, the intersection searches continue 256. All fields with more than one single-position pattern are processed in this fashion 272. For each field, the identified multi-position patterns and the single-position patterns which are not part of a multi-position pattern are stored in each field's pattern table and assigned a pattern number.
A field which cannot be encoded might be compacted by numeric substitution. Numeric substitution is applied on fields in which most of the values are numeric. The numeric values of these fields are encoded with the binary rather than the character string representation. The remaining values in these fields, those containing values other than digits, are stored in their original character representation. Each digit in a numeric field represented by character strings typically requires a byte of storage. In binary form, however, one byte can represent up to 3 digits (numbers 0 through 255), two bytes can represent up to 5 digits (numbers 0 through 65,536), three bytes can represent up to 8 digits (numbers 0 through 16,774,216), and so on. Hence, there can be considerable compaction by this substitution.
For numeric substitution, unencoded fields with many records containing only numeric character strings would have each of those data values represented by a binary equivalent. The criterion for numeric substitution, shown in FIG. 6, is a minimum C.sub.10 percentage of records containing all numeric characters. This criterion determines the tradeoff between achievable compaction versus partition complexity. Each character string would be replaced by a 1 to 4 byte number. The type byte at the end of each compacted record must specify the number of bytes substituted by this method. This potentially quadruples the number of subpartitions. However, a 4-byte number can represent the character string 4294199296, which would otherwise require ten bytes (one byte per character). This reduces storage by many bytes per substituted record. In the most preferred embodiment, C.sub.10 is 90%. That is, at least 90% of the records for a particular field must contain numeric data values before numeric substitution is applied. In a preferred embodiment, C.sub.10 is in the range of 75% to 100%. Other values for this build configuration parameter are possible, with smaller values of C.sub.10 allowing this method to be applied to more fields. Alternatively, the criterion may be chosen to vary with specific database or system parameters. For example, the criterion for numeric substitution may be made proportional to the number of fields, because fewer fields would tend to generate fewer subpartitions. One of ordinary skill in the art would recognize other variations in the parameter and criterion for numeric substitution for achieving different tradeoffs. As an example, FIG. 1I shows the records of the WO field of the WORK.sub.-- ORDER.sub.-- INFO table of FIG. 1B for which numeric substitution can be applied.
As shown in FIG. 12, for the numeric test 274, a specific field is selected 276, then checked to determine if it is marked for encoding 280. If so, numeric substitution is inapplicable. If all fields have not been tested 282, another field is selected 276. Otherwise, the database records are read for this field 284. If the field is marked for pattern suppression 286, the detected patterns are masked out 290. Next, the field characteristic consisting of the percentage of records containing all numeric data values (after patterns are removed) is computed 292. If this characteristic meets the numeric substitution criteria 294, shown in FIG. 6, the field is marked for numeric substitution 296. When all fields have been tested in this manner 282, the numeric test is complete 300. The mostly numeric table is then created by listing the field numbers of all fields marked for numeric substitution.
Text compression is only applied to text fields. Text fields are unencoded fields containing multiple words of text in most records, with a relatively small number of different words in the field. Specifically, a field is determined as being a text field when at least half of its values contain multiple words and the total number of words is greater than the total number of records. Note that a "word" in this context means any sequence of alphanumeric characters separated from other alphanumeric characters by delimiters, such as spaces, commas, periods, etc. A word could include a sequence of numeric characters, for example. The criterion for text compression, shown in FIG. 6, is a maximum number of different words. The rationale for this criterion is the same as for single-field encoding--a tradeoff between storage savings by compression and storage overhead due to the resulting translation table. In a preferred embodiment, the criterion is C.sub.11 .times.(number of delimiters).sup.k, k<1 where delimiters are defined by the user or are, by default, a space, comma, period, slash, dash, colon, semicolon, quotation mark or end-of-line character. A sequence of delimiters is treated as a single delimiter. The number of delimiters is an estimate of the total number of words in the field. Hence, this criterion is analogous to the criterion C.sub.1 .times.(number of records).sup.k, k<1 used for the single-field encoding method. That is, the translation table size is allowed to grow with the number of words, but is not allowed to grow in direct proportion to the size of the database. In the most preferred embodiment, C.sub.11 =63,246 and k=0.5, which yields the result that 100% of a field of 4,000 words can be different words, 6.3% can be different if there are 1,000,000 total words and 0.63% can be different if there are 10,000,000,000 words. In a preferred embodiment, C.sub.11 is in the range of 10 to 100, and k is in the range of 0.25 to 0.75. One of ordinary skill in the art will recognize that other values of the build configuration parameter can be used in this formula, and different criterion can be used to achieve different tradeoffs.
Referring to FIG. 13, the text compression test 302 begins by selecting a particular field 304. Next, it is determined if the field selected is marked for any other compaction method 306. If so, text compression is inapplicable. If all fields have not been tested 310, another field is selected 304. If no other compaction methods have been applied, the records for this field are read 312. The field characteristic tested is the number of different words in the field 314. This characteristic is compared to the text compression criterion 316, shown in FIG. 6. If the criterion is satisfied, this field is marked for text compression 320. Otherwise, the field is not marked, and if all fields have been tested 310, the text compression test is complete 322.
Referring back to FIG. 7, after the fields are tested 86 to determine the applicable compaction methods for each field, the next step is selected 90 based on the compaction method marked for a field. Fields marked for single/multiple field encoding are encoded separately 94 from fields marked as text fields 102. All other fields skip the encoding step. As shown in FIG. 14, the single/multiple field encoding process 324 begins by first selecting a field marked for single-field encoding, including partial encoding or a set of fields marked for multiple-field combining 326. Then, the database records are read 328. Next, the slots are identified 329, and the number of records in each slot is determined 330. Slots are each different data value for single-field encoding or each different combination of data values for multiple-field combination. The slots are sorted in descending order by the number of records in each slot 332, with the slot having the most records (the "largest" slot) being first and the slot having the least records (the "smallest" slot) being last. Numbers are then assigned to this sorted list 334 by the position of the slot in the sorted list, with the largest slot assigned the number 0, the second largest slot assigned the number 1, etc. These assigned numbers are the numeric equivalents of the slots. That is, an assigned number can be used to uniquely identify a slot. The steps of identifying the slots 329, computing the number of records per slot 330, sorting the slots 332 and assigning numeric equivalents 334 create a single-field translation table or partial field translation table for a field or a multiple-field translation table for a field combination. The particular integer-byte-length code to use for this field (single-byte code, combination code or variable-byte code) is determined next 336, (see below). If another marked field or field combination exists 338, it is selected 326 and another translation table is created. Otherwise, the translation table creation process is complete 340.
Again referring to FIG. 7, a text field is encoded separately from other fields 102. FIG. 15 shows the text encoding process 342. A field marked for text compression is first selected 344. Next, the database records are read 346. Lists of different words and of different delimiters are created 348, and the number of occurrences for each different word and delimiter are determined 350. These lists of different words and delimiters are sorted separately in descending order by frequency of occurrence 352, with the most frequently occurring word and delimiter being first and the least frequently occurring word and delimiter being last. Words and delimiters are then assigned numbers 354, based on position in the sorted lists. The most frequently occurring word and delimiter are assigned the number 0 and the next most frequently occurring word and delimiter are assigned the number 1, etc. These assigned numbers are the numeric equivalents of the words and delimiters. The steps of listing the words and delimiters 348, computing the number of occurrences of these words and delimiters 350, sorting the lists 332 and assigning numeric equivalents 334 create a text phrase translation table (for single words) and a text delimiter tr |