Efficient multidimensional data aggregation operator implementation5822751Abstract An efficient implementation of a multidimensional data aggregation operator that generates all aggregates and super-aggregates for all available values in a results set by first generating a minimal number of aggregates at the lowest possible system level using a minimal number of function calls, and second categorizing the aggregate function being applied and applying the aggregate function with the fewest possible function calls. The aggregates are generated from a union of roll-ups of the n attributes to the GROUP BY clause of the SELECT statement. The number of roll-ups are minimized by including a barrel shift of the attributes being rolled up. A scoreboard array of 2.sup.n bits is updated during the roll-up and barrel shifting process to keep track of which roll-ups are complete and with are not yet complete. Generating super-aggregates is further optimized by identifying the type of aggregate function being applied and facilitating the most efficient application of the aggregate function. A lter.sub.-- super() function is implemented to facilitate the most efficient application of algebraic aggregate functions that require access to intermediate aggregate data that heretofore was not available to any algebraic aggregation operator. Claims What is claimed is: Description FIELD OF THE INVENTION
TABLE 1
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
ABCDE 00000 0
ABCD 00001 1
ABC 00011 3
AB 00111 7
A 01111 15
11111 31
______________________________________
In a preferred embodiment each roll-up string in Table 1 can be represented by a corresponding binary string of unique bits that contain zeros where an attribute is present in a given roll-up string and a 1 when an attribute is missing from a give roll-up string. These binary representations make it easy to determine the scoreboard bit position in the 32 bit scoreboard array. Thus, the first roll-up of ABCDE, requires six iterations of the steps 431 and 438. The barrel shift roll-up steps 458-480, are illustrated below in Tables 2-5. The first roll-up on a barrel shifted roll-up string is illustrated in Table 2.
TABLE 2
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
BCDE 10000 16
BCD 10001 17
BC 10011 19
B 10111 23
______________________________________
Beginning with the original n attributes ABCDE, the first barrel shifted roll-up string identified in step 458, contains BCDE. Note that the binary representation of each of the barrel shifted roll-up strings maintains the original n attribute order. Thus, the binary representation of the barrel shifted roll-up string BCDE would be 10000 indicating a base 10 scoreboard bit position of 16. Note also that for each iteration of the barrel shifted roll-up in Table 2, only the leftmost x attributes are aggregated in the four iterations of step 470 and 473. The barrel shift sequence repeats at step 453 as illustrated below in Table 3.
TABLE 3
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
CDEA 01000 8
CDE 11000 24
CD 11001 25
C 11011 27
______________________________________
The second barrel shifted roll-up string contains CDEA which corresponds to the binary representation of 01000 and the base 10 equivalent of scoreboard bit position 8. The results of the third barrel shift and roll-up sequence is illustrated in Table 4 below.
TABLE 4
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
DEAB 00100 4
DEA 00110 12
DE 11100 28
D 11101 29
______________________________________
The results of the barrel shift roll-up is illustrated in Table 5 below.
TABLE 5
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
EABC 00010 2
EAB 00110 6
EA 01110 14
E 11110 30
______________________________________
When the original string of n attributes has been barrel shifted so that the original attribute order is represented once again at decision step 465, then the scoreboard is evaluated to determine if all bits in the scoreboard array have been set from a 0 to 1. Table 6 below illustrates all 2.sup.5 scoreboard bit positions 0-31 along with the binary attribute representations and the associated values for each bit in the scoreboard array.
TABLE 6
______________________________________
Binary Representation
Bit Position
Value
______________________________________
00000 0 1
00001 1 1
00010 2 1
00011 3 1
00100 4 1
00101 5 0
00110 6 1
00111 7 1
01000 8 1
01001 9 0
01010 10 0
01011 11 0
01100 12 1
01101 13 0
01110 14 1
01111 15 1
10000 16 1
10001 17 1
10010 18 0
10011 19 1
10100 20 0
10101 21 0
10110 22 0
10111 23 1
11000 24 1
11001 25 1
11010 26 0
11011 27 1
11100 28 1
11101 29 1
11110 30 1
11111 31 1
______________________________________
In the present example scoreboard bit positions 5, 9-11, 13, 18, 20-22, and 26 have not been set. For this reason the attribute strings associated with each of the above bit positions must be located and aggregated in step 490. The details of step are disclosed in the text accompanying FIG. 5. FIG. 5 illustrates the operational steps 500 for rolling up the remaining strings identified by the 0 bits in the scoreboard array. Operational steps 500 begin at step 508 and illustrate the details of step 490 from FIG. 4. At step 512, variable x is set equal to 2 because the attribute string associated with the first and second bit positions 0 and 1 will always have been previously set during the barrel shifting operational steps 400. If at decision step 518 it is determined that scratch variable x is less than the 2.sup.n bit scoreboard array size and there is at least one scoreboard bit that is not set, then processing continues at decision step 524. Alternatively, if at decision step 518 it is determined that either scratch variable x is greater than or equal to the 2.sup.n bits of the scoreboard array or there are no set bits in the scoreboard array then processing continues at step 521. If it is determined at 524 that bit position x is set in the scoreboard array, then processing continues at step 580 where scratch variable x is incremented by one and processing continues at decision step 518 as previously disclosed. Alternatively, if it is determined at decision step 524 that bit position x is not set in the scoreboard array, then processing continues at step 528 where the set of attributes associated with the scoreboard bit in position x are identified. The attributes associated with scoreboard bit position x are easily identified from the binary representation of x. In a preferred embodiment, a 1 in any position of the binary representation of x indicates the absence of an attribute in the set of attributes, and a 0 in any position of the binary representation of x indicates the presence of an attribute in the set of attributes. At step 535, permutations are generated for the set of attributes identified in step 528 and the permutation that will produce the most new roll-up aggregates is saved at step 542. Ordinarily for a set of m attributes, m| permutations can be generated. However, at this stage of the process, a roll-up string of m attributes can produce at most only m-1 new roll-ups. For this reason, once a permutation is encountered that produces m-1 new roll-ups, no more new permutations need be generated at step 535. If more than one permutation has the property of producing m-1 new roll-ups, then the first such permutation encountered is saved at step 542. At step 547, scratch variable y is set to the number of roll-up string attributes in the identified permutation. At step 556, the present roll-up string is aggregated for the first y attributes in the roll-up string, and the corresponding bit in the scoreboard array is updated at step 563. At step 568, scratch variable y is decremented by one and processing continues at processing step 575. If it is determined at step 575 that y is greater than 1, then processing continues at step 556 as previously disclosed. Alternatively, if it is determined at step 575 that y is less than or equal 1, then processing continues at step 580 where scratch variable x is decremented by one and processing continues at decision step 518 as previously disclosed. To continue the illustrative example developed in Tables 1-6 above, the operational steps 500 will first encounter scoreboard array bit position 5 at decision step 524, as a bit position that does not contain a 1. Table 7 illustrates the binary representation associated with bit position 5 and the attributes that the binary representation represents.
TABLE 7
______________________________________
Bit Binary
Position Representation
Attributes
______________________________________
5 00101 A,B,D
______________________________________
The binary representation of attributes at scoreboard array bit position 5, as illustrated in Table 7, not only quickly identifies the scoreboard bit position but also quickly identifies which of the attributes A, B, and D, that have not been previously rolled up. However, at step 535, the various permutations of the attribute string associated with scoreboard array x, must be evaluated to determine the permutation that will produce the most roll-ups. Table 8 below illustrates two possible permutations of the string ABD.
TABLE 8
______________________________________
Permutation Roll-ups
______________________________________
ABD 1
ADB 2
(maximum possible)
______________________________________
Step 542 sets the roll-up string to the permutation ADB because the string will produce m-1=2 new roll-up aggregates at scoreboard positions 5 and 13 as illustrated in Table 9.
TABLE 9
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
ADB 00101 5
AD 01101 13
______________________________________
At step 558, the first y roll-up string attributes of the string ADB, are aggregated and the corresponding scoreboard array bit updated at step 563. None of the individual attributes A, B, C, D, or E need be individually aggregated because the operational steps 400 guarantee that they have been previously aggregated. Following the roll-up of the first permutation attribute set, processing continues at step 518 where the next scoreboard array bit position 9 is identified at decision step 524 as not being set as shown in Table 10.
TABLE 10
______________________________________
Bit Binary
Position Representation
Attributes
______________________________________
9 01001 A,C,D
______________________________________
Permutations of ACD identified at step 535 include only ACD. The ACD permutation yields the maximum number of m-1 unaggregated possibilities as illustrated in Tables 11 and 12 below.
TABLE 11
______________________________________
Permutation Roll-ups
______________________________________
ACD 2
(maximum possible)
______________________________________
Tables 13-15 illustrate the next series of roll-ups after having identified scoreboard bit position 10 as not having been set in decision step 524.
TABLE 13
______________________________________
Bit Binary
Position Representation
Attributes
______________________________________
10 01010 A,C,E
______________________________________
The generated permutations of the string ACE and the corresponding number of possible roll-ups are illustrated in Table 14.
TABLE 14
______________________________________
Permutation Roll-ups
______________________________________
ACE 1
AEC 1
CAE 1
CEA 2
(maximum possible)
______________________________________
The new roll-ups and the scoreboard bit positions that are updated corresponding to each roll-up string is illustrated in Table 15.
TABLE 15
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
CEA 01010 10
CE 11010 26
______________________________________
Tables 16-18 illustrate the next series of roll-ups after having identified scoreboard bit position 18 as not having been set.
TABLE 16
______________________________________
Bit Binary
Position Representation
Attributes
______________________________________
18 10010 B,C,E
______________________________________
The generated permutations of the string BCE and the corresponding number of possible roll-ups are illustrated in Table 17.
TABLE 17
______________________________________
Permutation Roll-ups
______________________________________
BCE 1
BEC 2
(maximum possible)
______________________________________
The new roll-ups and the scoreboard bit positions that are updated are illustrated in Table 18.
TABLE 18
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
BEC 10010 18
BE 10110 22
______________________________________
Tables 19-21 illustrate the final series of roll-ups after having identified scoreboard bit position 20 as not having been set.
TABLE 19
______________________________________
Bit Binary
Position Representation
Attributes
______________________________________
20 10100 B,D,E
______________________________________
The possible roll-up permutations on a string BDE is determined as illustrated in Table 20.
TABLE 20
______________________________________
Permutation Roll-ups
______________________________________
BDE 2
(maximum possible)
______________________________________
The new roll-ups and the corresponding scoreboard bit positions that are updated in Table 21.
TABLE 21
______________________________________
Roll-up Binary Scoreboard
String Representation
Bit Position
______________________________________
BDE 10100 20
BD 10101 21
______________________________________
Table 22 illustrates benchmark results of the numbers of roll-ups required for n-dimensional arrays. The Optimal column is the theoretically optimal number of roll-ups required each dimension. The Present Invention column illustrates the actual number of roll-ups resulting by using the techniques of the present invention. The Present Invention numbers of roll-ups are matched one-for-one with or less than 5% of the Optimal number of roll-ups, yet the implementation of the present invention is less complex and less processing intensive than Optimal implementations.
TABLE 22
______________________________________
Data Cube on N Dimensions
N Dimensions Optimal Present Invention
______________________________________
1 1 1
2 2 2
3 3 3
4 6 6
5 10 10
6 20 20
7 35 37
8 70 71
9 126 129
10 252 252
11 462 478
12 924 926
13 1716 1784
______________________________________
FIG. 6 illustrates the super-aggregate operational steps 600 for generating super-aggregates from the roll-up process disclosed in FIGS. 4-5. The super-aggregate operational steps 600 are the details of step 342 from the FIG. 3 overview. The super-aggregate operational steps 600 begin at step 608 and proceed to step 615 where the aggregate function being applied is classified so that the most efficient processing technique can be applied. Aggregation functions can be classified into three categories: distributive, algebraic and holistic. Each of the three categories can be further defined by considering the aggregation of a two dimensional set of values {Xij.backslash.l-1, . . . l; j=1, . . . , J}. An aggregate function F() is distributive if there is a function G() such that F({Xi,j})=G({F({Xi,j.backslash.l=1, . . . , l}).backslash.j=1, . . . , J}). Aggregate functions including, but not limited to, COUNT(), MIN(), MAX(), and SUM() are common distributive functions. In fact, F=G for all but COUNT() where G=SUM() for the COUNT() function. Once order is imposed on the values, cumulative aggregate functions also fit into the distributive category. An aggregate function F() is algebraic if there is an M-tuple valued function G() and a function H() such that F({Xi,j})=H({G{Xi,j.backslash.l-1, . . . ,l}).backslash.j=1, . . . ,J}). Aggregate functions including, but not limited to, Average(), Standard Deviation, MaxN(), MinN(), and Center.sub.-- of.sub.-- mass(), are common algebraic functions. For Average, the function GO records the sum and count of the subset. The H() function adds these two components and then divides to produce the global average. Similar techniques apply to finding the N largest values, the center of mass of a group of objects, and other algebraic functions. The key to algebraic functions is that a fixed size result such as an M-tuple, can summarize the sub-aggregation. An aggregate function F() is holistic if there is no constant bound on the size of the storage needed to describe a sub-aggregate. That is, there is no constant M, such that an M-tuple characterizes the computation F({Xi,j.backslash.l=1, . . . ,l}). Aggregate functions including, but not limited to, Median(), MostFrequent() or Mode(), and Rank(), are common holistic functions. Once at step 615 a classification is determined for the aggregate function being applied, different aggregation techniques are used to minimize the number of function calls required to complete the super-aggregates. If it is determined at decision step 625 that the aggregate function being applied is a holistic function, then processing continues at step 628 where the most efficient technique is to call the lter(handle, value) function 2.sup.n times, that is, once for each super-aggregate using standard GROUP BY techniques. To use the lter(handle, value) function, a handle must be allocated for each cell in the data cube. A handle is a temporary variable or place-holder that facilitates access to an intermediate result of an otherwise separate operation. Such as for example, an average of a set of values, where each of the values is the sum of at least two numbers, requires a handle for each of the values in the set of values so that all of the values can be counted and summed when it is time to generate the average. Absent the handles, the intermediate results that are the individual values would not be as readily available to generate the average. When a new tuple (x.sub.1, x.sub.2, . . . x.sub.n, v) arrives, the lter(handle, value) is called for each handle of each cell of the data cube that matches the value. The 2.sup.n comes from the fact that each coordinate can either be x.sub.i or ALL. When all the input tuples have been computed, the controlling system invokes the final(&handle) function for each of the (C.sub.i +1) nodes in the data cube. Alternatively, if it is determined at decision step 625 that the aggregate function being applied is not a holistic function, the processing continues at decision step 638. If it is determined at decision step 638 that the aggregate function being applied is a distributive function, then processing continues at step 645 where super-aggregates are generated for the data cube using distributive functions. Because the core of a data cube is represented as an n-dimensional array in memory where each dimension has a size C.sub.i +1, the n-1 dimensional slabs of the array are computed by aggregating one dimension of the core. For example, the following computation aggregates the first dimension of the array: CUBE(ALL, x.sub.2, . . . ,x.sub.n)=F({CUBE(l, x.sub.2, . . . ,x.sub.n).backslash.l=1, . . . C.sub.1 }). N such computations will compute the n-1 dimensional super-aggregates. The distributive nature of the function F() allows aggregates to be aggregated. The process can continue to aggregate lower dimensions of the array. For example, in terms of a cross tab, there exists a choice of computing the final super-aggregate by aggregating either the lower row as in F(ALL, *) or by aggregating the right column F(*, ALL). Either approach will generate the same super-aggregate, however, the most efficient choice is to aggregate the row or the column with the fewest C.sub.i to aggregate. Thus, the super-aggregates can be computed by dropping one dimension at a time. Alternatively, if it is determined at decision step 638 that the aggregate function being applied is not a distributive function, the processing continues at decision step 650. If it is determined at decision step 650 that the aggregate function being applied is an algebraic function, then processing continues at step 658. Algebraic aggregates are difficult to generate because an intermediate computational value is saved in a handle prior to generating the final result in a separate computation. For example, the Average() function maintains the count and sum values in its handle until the final count and sum are available. Thus, the algebraic functions must maintain a handle for each element of the cube so that the final subaggregates are available to generate the super-aggregate. When the core GROUP BY operation is complete, the CUBE operator passes the set of handles for each n-1 dimensional super-aggregate. The handles of the super-aggregates are then passed to the super-super-aggregates and so on until the (ALL, ALL, . . . ,ALL) aggregate has been computed. This passing of the handles is done by a lter.sub.-- super(&handle, &handle) function that folds sub-aggregate results into super-aggregates. The lter.sub.-- super() function has not previously been disclosed or claimed prior to the present documentation. Alternatively, if it is determined at decision step 650 that the aggregate function being applied is not an algebraic function, then the processing is complete at step 660. The lter.sub.-- super() function is also useful in computing aggregates for parallel database systems. In parallel database systems, the aggregates are computed for each partition of a database in parallel prior to combining the results of the parallel computations. Thus, the lter.sub.-- super() function is useful to fold the result of a first computation into the input of successive computations. Another special data cube processing situation occurs if an entire data cube does not fit into concurrent memory. Because traditional array processing techniques will not work with a data cube that does not fit into memory, the cube can either be partitioned with a hash function or sorted. Both the hash function and the sort are standard techniques available with the GROUP BY clause. Because the numbers of super-aggregates are typically orders of magnitude smaller than the core of the data cube, the super-aggregates will typically fit into memory. Sorting is the preferred partition technique because the roll-up process phase of computing a data cube requires that the results set sorted anyway. Summary The efficient implementation of a multidimensional aggregation operator generates a data cube containing a set of sub-aggregates from a roll-up and barrel shift technique that is tracked by a scoreboard array, and a set of at least one super-aggregate from an efficient aggregate function application that is applied as determined by the aggregate function type. Although specific embodiments are disclosed herein, it is expected that persons skilled in the art can and will design alternative multidimensional aggregation operators that are within the scope of the following claims either literally or under the Doctrine of Equivalents.
|
Same subclass Same class Consider this |
||||||||||
