Discovery-driven exploration of OLAP data cubes6094651Abstract A method for locating data anomalies in a k dimensional data cube that includes the steps of associating a surprise value with each cell of a data cube, and indicating a data anomaly when the surprise value associated with a cell exceeds a predetermined exception threshold. According to one aspect of the invention, the surprise value associated with each cell is a composite value that is based on at least one of a Self-Exp value for the cell, an In-Exp value for the cell and a Path-Exp value for the cell. Preferably, the step of associating the surprise value with each cell includes the steps of determining a Self-Exp value for the cell, determining an In-Exp value for the cell, determining a Path-Exp value for the cell, and then generating the surprise value for the cell based on the Self-Exp value, the In-Exp value and the Path-value. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Products
Stores 1 2 3 4 5 6 7
______________________________________
1 80 12 8 8 10 11 14
2 15 3 2 1 2 2 14
3 90 14 11 10 10 10 12
4 99 11 10 9 7 12 14
______________________________________
There are several well-known methods of varying complexity for identifying extreme values in a flat set of numbers. For example, a straightforward approach is to mark as exceptions all values that are greater than a predetermined standard deviation away from an average value. This approach, though, does not work well when the distribution of numbers is multi-modal. In such cases, it is better to cluster numbers and mark as exceptions all values that do not belong to any large cluster. Other methods for identifying exceptions in a flat set of numbers are presented by A. Arning et al., A linear method for deviation detection in large databases, In Proc. Of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oreg., August 1996, and incorporated by reference herein. The simplest approach is to mark as exceptions values that are extreme in comparison to the entire set of data cube values. Thus, in accordance with the simplest approach, the values 80, 90 and 99 in column 1 of Table 1 would be marked as exceptions because they appear to be exceptionally high values when compared to all the other values. The problem with the simplest approach is that the position of a value within the table is not taken into account when a value is identified or marked as an exception. For example, considering the percentage increase in sales of Product 1 over all stores (Table 1, column 1), the quantity 15 corresponding to Store 2 appears to be more of an exception than the extreme values 80, 90 and 99 for the simplified approach. For this second approach, <Product 1, Store 2> would be marked or flagged as an exception. Consequently, a better approach for flagging a value as an exception is when the value is an extreme value within a smaller grouping of numbers, such as a grouping corresponding to either rows or columns. In general, a value can be flagged as an exception by modifying the approach by considering a value in all possible groupings of the value and using a well-known extreme-value finding algorithm for each respective group. This modified approach is used by the Explora project, disclosed by W. Klosgen, Efficient discovery of interesting statements in databases, Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies, 4(1):53-69, 1995, and incorporated by reference herein, where a value is marked as interesting or exceptional when it is significantly different than the average value of a larger population values to which it belongs. The problem with the modified approach is that when the row and column in Table 1 in which the value 15 occurs are simultaneously considered, Store 2, in general, has a smaller increase in sales for almost all products in comparison with other stores. Therefore, a value of 15 for <Product 7, Store 2> is perhaps not an exception. It is more useful in this situation to call the whole of Product 1 as an exception because it is significantly different from all the other products in Table 1. Similarly, Store 2 is an exception when compared with other stores in Table 1. The only real exception in the product-store space is the value of 14 corresponding to <Product 7, Store 2> because the increase in sales for any product for Store 2 is expected to be much smaller when compared with other stores. The present invention places two qualitative requirements on the definition of an exception for overcoming drawbacks of conventional table analysis approaches. That is, the present invention takes into account trends along different rows, columns, levels and faces that a particular cell belongs to in addition to exceptions associated with the cell at all levels of detail. An exception is defined, according to the present invention, by first considering a two-dimensional data cube having p values along a first dimension A and q values along a second dimension B. The element or quantity corresponding to the ith value of dimension A and jth value of dimension B is denoted as y.sub.ij. To estimate the degree of surprise y.sub.ij holds in this data cube, an expected value Y.sub.ij of y is calculated as a function f of three terms: (1) a term .mu. that denotes a trend that is common to all y values of the cube, (2) a term .alpha..sub.i that denotes special trends along the ith row with respect to the rest of the cube, and (3) a term .beta..sub.j that denotes special trends along the jth column with respect to the rest of the cube. The residual difference r.sub.ij between the expected value y.sub.ij =f(.mu.,.alpha..sub.i,.beta..sub.j) and the actual value y.sub.ij represents the degree of surprise of element y.sub.ij based on its position in the cube. When the data cube has three dimensions, for example, with dimension C being the third dimension, the expected value y.sub.ijk is calculated by taking into account not only the kth value of the third dimension, but also the three values corresponding to the pairs (i,j) in the AB plane, (i,k) in the AC plane and (j,k) in the BC plane. The expected value y.sub.ijk is then expressed as a function of seven terms as: y.sub.ijk =f(.mu.,.alpha..sub.i,.beta..sub.j,.gamma..sub.k,(.alpha..beta.).sub.ij,(. alpha..gamma.).sub.ik,(.gamma..beta.).sub.kj), (1) where (.alpha..beta.).sub.ij denotes the contribution of the ijth value in AB plane, (.alpha..gamma.).sub.ik denotes contribution of jkth value in the AC plane, and (.gamma..beta.).sub.kj denotes the contribution of the kjth value in the BC plane. In general, for any k dimensional cube, the y value can be expressed as the sum of the coefficients corresponding to each of the 2k-1 levels of aggregations or group-bys of the cube. To illustrate the present invention, a 3-dimensional cube will be considered. The function f() can take several forms or models. Two particularly useful forms are an additive form, where function f() is a simple addition of all its arguments, and a multiplicative form, where function f() is a product of all its arguments. The multiplicative form can be transformed to the additive form by performing a logarithm on the original data values. Thus, the final form of Eq. (1) is, y.sub.ijk =y.sub.ijk +r.sub.ijk =.mu.+.alpha..sub.i +.beta..sub.j +.gamma..sub.k +(.alpha..beta.).sub.ij +(.alpha..gamma.).sub.ik +(.gamma..beta.).sub.kj)+r.sub.ijk, (2) where r.sub.ijk is the residual difference between the expected value y.sub.ij =f(.mu.,.alpha..sub.i,.beta..sub.j) and the actual value y.sub.ij. The relative importance of an exception is based on the relative value of its residual, that is, the higher the value of the residual, the higher the importance of the exception. For a multiplicative model, the y.sub.ijk values denote the log of the original y-values of the cube. The choice of the best form of the function depends on the particular class of data, and is preferably selected by a user having the understanding and experience with the data at hand. Table 2 shows residuals in parentheses for the different products and different stores for the data of Table 1.
TABLE 2
______________________________________
Products
1 2 3 4 5 6 7
Stores (1.8) (0.1) (0.4) (0.5) (0.4) (0.4) (0.1)
______________________________________
1 (0.3)
0.0 0.0 0.1 0.1 0.1 0.2 0.2
2 (1.0) 0.2 0.0 0.1 0.6 0.1 0.1 1.2
3 (0.30 0.1 0.0 0.1 0.2 0.0 0.0 0.5
4 (0.4) 0.1 0.0 0.1 0.2 0.0 0.0 0.5
______________________________________
There are several ways for deriving values of the coefficients of Eq. (2). One approach is by a mean-based method where the coefficients are estimated as follows: .mu.=y.sub.... =overall mean or average (3) .alpha..sub.i =y.sub.i.. -.mu., (4) where y.sub.i.. is the mean over all numbers with the ith value of A. Thus, for a two-way table, .alpha..sub.i denotes the magnitude of the difference of the average of the numbers along the ith row from the overall average .mu.. .beta..sub.j =y.sub..j. -.mu., (5) where y.sub..j. is the mean over all numbers with the jth value of B. .gamma..sub.k =y.sub...k -.mu., (6) where y.sub...k is the mean over all numbers with the kth value of C. Lastly, (.alpha..beta.).sub.ij =y.sub.ij -.alpha..sub.i -.beta..sub.j -.mu..(7) The remaining terms are defined analogously. In general, the coefficient corresponding to any group-by G is recursively calculated by subtracting all coefficients from group-bys that are at a smaller level of detail than group G from the average y value at group-by G. The mean-based approach for calculating the coefficients of the present invention is not particularly robust in the presence of extremely large outliers. Consequently, a number of well-known alternative approaches for handling large outliers can be used, such as the median polish method and the square combining method, disclosed by D. Hoaglin et al., Exploring data tables, trends and shapes, Wiley series in probability, 1988, and incorporated by reference herein. These two alternative approaches are based on using a "median" instead of "mean" for calculating the coefficients. Nevertheless, these alternative approaches have an associated high computational cost. Consequently, the mean-based approach is preferred for most OLAP data sets because significantly large outliers are uncommon in most data sets. FIG. 2 shows the absolute value of the residuals and coefficients after applying the multiplicative model of Eq. (2) to the data in Table 1. In the Product-Store plane, the entry corresponding to <Product 7, Store 2> has the highest residual value. That is, Store 2 has the highest residual in the store plane, and Product 1 has the highest residual in the product plane. Thus, the present invention identifies the correct exceptions at the appropriate level of detail by examining the residual values at different levels. Similarly, consider the earlier example of a data cube shown in FIGS. 1 and 2. The entry <Birch-B, Dec>, having a y value of -10%, is indicated as an exception, whereas the entry <Old-B, Sep>, having the same y value, is not because in the December column nearly all products other than Birch-Beer have a large positive value, whereas in the September column, most products have a negative value. The product "Jolt-C" is the only product in the September column having a positive y-value and is, according to the approach of the present invention, indicated as an exception. The method for determining residual and coefficients of the present invention can be extended to handle hierarchies along one or more dimensions of a data cube. The basic idea is to define the expected value of a data cube element, not only based on its row and column position, but also on its hierarchical parents. For instance, consider values y.sub.ij in a data cube consisting of two dimensions A and B, where dimension A has two levels of hierarchies: A.sup.1 .fwdarw.A.sup.2 .fwdarw.ALL. To calculate an expected value y.sub.ij at the A.sup.1 B level, the row coefficient .alpha..sub.i at level A.sup.1, the column coefficient .beta..sub.j at level B and overall coefficient .mu. at level ALL, two new terms corresponding to the two new aggregations A.sup.2 and A.sup.2 B along the hierarchy on A are used. Equation (2) thus becomes: y.sub.ij =.mu.+.alpha..sub.i +.beta..sub.j +.alpha.'.sub.i' +(.alpha.'.beta.).sub.i'j, (8) where i' denotes the parent of i at hierarchy level A.sup.2, .alpha.'.sub.i' denotes the contribution of the ith value at level A.sup.2, and (.alpha.'.beta.).sub.i'j denotes the contribution due to the ijth value at level A.sup.2 B. The general formula for handling hierarchies is to express a y value in a cube in terms of coefficients obtained from all higher level aggregations of the cube. For example, for the y-values at A.sup.1 B in Eq. (8), coefficients from the five higher level aggregates are used, that is, A.sup.1, A.sup.2, B, A.sup.2 B, and ALL. The same recursive rule of the previous subsection is followed for estimating the coefficients, where .mu. is first estimated as the overall average, and then for terms corresponding to each group-by G. The average at group-by G is computed and then the coefficients from each child of G are subtracted from the computed average. Data in an OLAP cube often includes a time dimension. Time is special in not only that it is a quantitative attribute (unlike product and store that are categorical), but may include seasonality effects and trends that are specific to time. Although the table analysis method of the present invention has been described thus far as treating time as a categorical attribute, the present invention can automatically filter any seasonal pattern that is common to all elements of a data cube. For example, suppose the total sales of all stores is much higher in December than for any other month of each of five previous years. When looking at the sales for a single store S with respect to time, the sales in December might appear exceptionally high compared to other months. However, when comparing the December sales for all stores, the high December sales figure for store S in does not appear exceptional. Thus, the coefficient for store S corresponding only to the month of December each year has a high value. By defining a hierarchy that is based on months of the year, a seasonal-type exception can be filtered so that instead of December of each year being indicated as an exception, one exception is obtained in the month of December over all years. When different products exhibit different seasonality patterns, each Product-Store pair is considered separately, and the seasonality and other trends are filtered out by comparing the values as a function of time. The table analysis method of the present invention is then applied on the transformed cube for locating exceptions. Exceptions occurring in lower levels of a data cube as are summarized by the present invention as individual numbers at higher levels of the cube. In particular, for all cells displayed, exceptions in their respective group-bys are made apparent and particular cells that require further drilling-down for exceptions occurring underneath are indicated. For each cell that is an exception, the present invention provides a clear indication which particular path is the best path for getting to exceptions. These features are provided by definitions of the quantities Self-Exp, In-Exp and Path-Exp, which are now formally defined. Self-Exp denotes the exception value of a cell, and is defined to be the scaled absolute value of the residual of the cell obtained from Eq. (2). The residuals at each group-by of the cube are scaled by the root mean squared residual over all elements of the group-by so that the residual is invariant to changes in scale of the y-values. In-Exp denotes the degree of surprise underneath a cell, and represents the total surprise value over all elements reachable by drill-downs from the cell, not just the immediate detail-level elements. Preferably, In-Exp is defined as a "Max" function of all of the cells underneath a cell, that is, the maximum surprise value of all of the cells underneath a cell. Alternative definitions for In-Exp can be based on the total count of surprises of all cells underneath the cell, the sum of the surprise values of all cells underneath the cell, the sum of the Self-Exp of all cells underneath the cell, a polynomial combination of surprise values of all cells underneath the cell, the product of the surprise values of all cells underneath the cell, the product of the Self-Exp of all cells underneath the cell, the variance of surprise values of all cells underneath the cell, an average (sum or product) of all surprise values underneath the cell, and a median (sum or product) of all surprise values underneath the cell. One caveat when using a sum-based definition is that a dimension having several small residuals may have a larger sum when compared to a small dimension with a few large entries. Path-Exp denotes the ranking of the degree of surprise expected if a drill down is performed along a particular path for each drill-down path from a cell. Preferably, Path-Exp is defined as the "Max" of the Self-Exp and In-Exp over all cells reachable by drilling-down along a particular path. An important characteristic of the definition of Path-Exp includes the additional goal of providing a path that requires as few cells as possible to be drilled-down. Therefore, when comparing two paths A and B on the basis of Path-Exp, the Path-Exp of path A is ranked higher than the Path-Exp of path B whenever the number of high In-Exp value cells in path A is fewer than the number of high value In-Exp cells in path B, provided that the max(Self-Exp over all A values) equals the max(Self-Exp over all B values) and the max(In-Exp over all A values) equals the max(In-Exp over all B values). FIG. 6 shows a Product-Store plane 60 of an exemplary data cube that illustrates the definitions of Self-Exp, In-Exp and Path-Exp. Scaled absolute residuals are shown in an inner cell area 61 of the data cube. The cells in the Product-Store plane have no In-Exp values, while for the Product plane or the Store plane, the In-Exp is the maximum residual over all cells in the corresponding column or row. For the ALL group-by, the In-Exp is the maximum over the In-Exp and Self-Exp of the product group-by and store group-by. For the ALL group-by, the Path-Exp is also of interest since there are two drill-down paths from it. The Path-Exp for the product path is ranked higher than the store path because fewer cells have a high In-Exp value in the product plane, although the "max" exceptions for both paths is the same. The approach of the present invention may appear at first glance to be unrealizable in practice because of the apparent cost of computing exceptions. Nevertheless, the present invention uses fast computation techniques that make the approach of the present invention feasible for large OLAP data sets. According to the invention, there are three logical phases in the computation of exceptions in an entire cube. The first phase involves computing aggregate values, as specified by a user- provided aggregate function, over which exceptions are found at each group-by of the cube. The second phase involves determining coefficients of the model-fit equations and then using the coefficients for finding the residuals as described in connection with Equations (1) - (8). The third phase involves summarizing exceptions found during the second phase. The computation must be planned carefully because, in most cases, the quantity of data will be too large to fit into main memory. Consequently, intermediate results often are read from and written to a disk, thus requiring carefully planned optimizations. A three dimensional data cube having dimensions A, B and C is used in the following discussion of the fast computation techniques. During the first phase, aggregate values are computed for each group-by over which exceptions will be found. For OLAP-type data, there is interest in finding exceptions not only at the lowest level of detail, but also at higher levels of aggregation. Preferably, the function(s) used for aggregations is (are) specified by a user. For example, a user may specify "sum of sales" as the aggregate function. In this case, exceptions in "total sales" are reported at various levels of the cube. As another example of an aggregate function, a user might be interested in a more complicated scenario like finding exceptions in "percentage change in total revenue from the same month last year". The problem of efficiently computing aggregates has been addressed by S. Agarwal et al., "On the computation of multidimensional aggregates," In Proc. Of the 22nd Int'l Conference on Very Large Databases, pages 506-521, Mumbai (Bombay), India, September 1996, and is incorporated by reference herein. The present invention uses a sort-based approach that is similar to that disclosed by S. Agarwal et al. for accomplishing the tasks in phase 1. During the second phase, model Eq. (2) is fit for computing the residuals at each group-by of the cube. In general, separate equations are fitted at different levels of the cube. When the aggregate function is a "mean" function and the mean-based additive model of Eq. (2) is being fitted, a single equation can recursively fit all the group-bys of the cube because the coefficients are calculated using the "mean" function. Consequently, the residuals of a higher level aggregate are the coefficients for all the lower aggregates. For instance, for the three attribute cube ABC, if AB is aggregated using the "mean" function and fit to a mean-based additive model at AB, then the residuals at AB correspond to the (.alpha..beta.).sub.ij coefficients in Eq. (2) of a model fit at the ABC level. A straightforward method of computing residuals at each group-by of a cube is first presented for illustrating the present invention. Improvements are presented subsequently. The coefficients at each group-by G of a data cube are equal to the average y value at that level minus the sum of the coefficients of all group-bys that are at a lower level of detail than group-by G. Thus, a straight forward and an efficient way for computing the coefficients is using a two pass approach. First, the average y value is computed at each group-by starting from the most detailed group-by. Then, starting from the least detailed level (ALL), for each group-by subtract the coefficients of the group-by that are less detailed than the group-by for which the coefficients are being computed. For example, consider the three attribute cube ABC. The average y value is computed for each of the 2.sup.3 -1=7 group-bys of the cube by starting from the ABC level and computing the average at AB, AC and BC from ABC, computing the average at A from one of AB or AC, and so on. The coefficients are computed starting from ALL. The coefficient of each member or element of group-by A is the average value at the member minus the coefficient of its parent ALL. The coefficient at AB is the average at AB minus the coefficients at A, B and ALL, and so on. Finally, the coefficients at AB, AC, BC, A, B, C, and ALL are subtracted from the average value at ABC for obtaining the residuals. The second phase of computation has two parts. The first part of the second phase is computationally similar to the computation of the user-specified aggregate functions performed in phase 1. The second part is computationally intensive because, in general, to compute the coefficients of a k attribute group-by the coefficients must be subtracted from 2.sup.k-1 other group-bys. This is equivalent to joining the k-attribute group-by with 2.sup.k-1 -1 other group-bys. When the size of the group-bys is large, computing many multi-attribute joins for each group-by can involve large sorting and comparison costs. Regardless, the first and second phases of computation are easily combined, providing a savings in terms of disk scans and sorting costs. Model Eq. (2) can be rewritten in a way that remarkably speeds up the computation of phase 2. Instead of Eq. (2), the expected value y.sub.ijk can be expressed as a sum of three terms: y.sub.ijk =(ab).sub.ij +(ac).sub.ik +(bc).sub.jk, (9) where (ab).sub.ij =Avg.sub.k (y.sub.ijk), (10) (ac).sub.ik =Avg.sub.j (y.sub.ijk -(ab).sub.ij), and (11) (bc).sub.jk =Avg.sub.i (y.sub.ijk -(ab).sub.ij -(ac).sub.ik).(12) The residuals and coefficients from Eq. (2) can be rewritten in terms of the new coefficients as: r.sub.ijk =y.sub.ijk -(ab).sub.ij -(ac).sub.ik -(bc).sub.jk,(13) a.sub.i =Avg.sub.j ((ab).sub.ij), (14) b.sub.j =Avg.sub.i ((ab).sub.ij -a.sub.i), (15) (.alpha..beta.).sub.ij =(ab).sub.ij -a.sub.i -b.sub.j, (16) c.sub.k =Avg.sub.i ((ac).sub.ik), (17) (.alpha..gamma.).sub.ik =(ac).sub.ik -c.sub.k, (18) (.gamma..beta.).sub.kj =(bc).sub.jk, (19) .mu.=Avg.sub.i (a.sub.i), and (20) .alpha..sub.i =a.sub.i -.mu.,.beta..sub.j =b.sub.j,.gamma..sub.k =c.sub.k.(21) Computationally, Eqs. (13)-(21) work as follows: The (ab) values are first computed from the Y.sub.ijk values. The computed (ab) values are then subtracted from the Y.sub.ijk values. The (ac) values are computed from the once-subtracted y values. The computed (ac) values are subtracted from the once-subtracted y-values. The (bc) values are computed from the twice-subtracted y values and then subtracted from the twice-subtracted y-values. The final y-values (thrice-subtracted y-values) directly provide the residuals. The foregoing rewriting procedure can be generalized to any number of dimensions. For example, consider a k-dimensional cube. Any y-value of the cube can be expressed as the sum of k coefficients. The final residual is obtained by averaging along each of the k dimensions, in turn, and subtracting the average from all values along that dimension of the cube. The k average values yield the k coefficients of the equation. Mathematically, the (i.sub.1, . . . i.sub.k)th y value can be expressed as the sum of k terms as: Y.sub.i1...ik =G.sup.1 +. . . G.sup.k (22) where G.sup.r =Avg.sub.ir (y.sub.i1...ik -G.sup.1 -. . . -G.sup.r-1)(23) and G.sup.r denotes the (i.sub.1 . . . i.sub.r-1 i.sub.r+1 . . . i.sub.k)th term in the group-by obtained by averaging along the rth dimension. Hierarchies can easily be incorporated into Eq. (22). That is, Eq. (22) would still contain k terms for a dimensional cube. The only difference is that instead of averaging along all the values along a dimension, the elements that belong to the next level of the hierarchy are averaged together along that dimension. The advantage of rewriting Eq. (2) is two fold: first, the residuals of a k dimensional cube can be computed by joining the residual with k other group-bys instead of .sub.2.sup.k-1 group-bys as in Eq. (2). For instance, the residuals at ABC are computed by joining with coefficients from only three different group-bys AB, AC and BC, instead of joining with seven other group-bys, as would be indicated by Eq. (2). Secondly, the residuals are computed at the same time as the aggregate values of phase 1 are computed, thus saving on the sorting and comparison costs. When ABC is sorted in the order ACB for computing AC, the AC values can be directly subtracted from ACB, and ABC does not require resorting. Next, ACB can be sorted in the order BCA, the coefficients for BC computed from BCA, and then the coefficients subtracted from BCA. This method of computation can be generalized for data cubes having higher dimensions. FIG. 7 shows an example of a sequence of computations for obtaining coefficients and residuals for a four attribute group-by according to the present invention. An upwardly-pointing arrow denotes an averaging phase and a downwardly-pointing arrow denotes a subtraction phase. In terms of sorting costs, sorts beyond what are required for simple aggregate computation at all levels of the group-by are not incurred. Consequently, the second phase of the computation can be combined with the first phase of the computation. During the third phase, summaries of exceptions for each group-by of the cube are computed. Computationally, this phase is similar to phase 1 in which aggregate values for each group-by are computed, but with one difference. Each group-by in the third phase aggregates all of its immediate parents for calculating the Path-Exp values, whereas during phase 1, any one of the parent group-by can be used for computing the aggregate functions. Therefore, all phase 1 optimizations that are based on choosing a good parent for each group-by are inapplicable for the third phase. However, for optimizations based on pipelining, such as disclosed by S. Agrawal et al., supra, the computation of multiple group-bys and using partially sorted parents are applicable. Each cell of the data cube is associated with Self-Exp, In-Exp and Path-Exp values that are treated just like other precomputed aggregates (e.g., average sales) and, consequently, are stored and indexed in the same framework as aggregate values. Depending upon the particular data cube, it may be useful to update the In-Expand, Path-Exp values at higher level group-bys based on exceptions previously observed in some lower drill-down paths. Further, for high-dimensional data cubes, it may be useful to annotate cells with explanations regarding why the cell was (or was not) flagged as an exception. Additionally, the space required for storing the different exception values can be reduced significantly using three optimizations. The exact floating point value of an exception is not particularly useful. Often, only a rough estimate of the level of exception is of interest. Therefore, for the first optimization, instead of storing an exception value as a 32-bit floating point number, the exception can be discretized to a small set of eight or sixteen different levels requiring three or four bit per value. Second, only a few of the cells, by definition, have associated non-zero exception values. Storage and retrieval cost are reduced by keeping only the cells with non-zero exceptions. Lastly, the In-Exp value is not stored, but is computed on the fly. The improvement in time needed for the rewrite-based approach of the present invention to find exceptions when applied to three real-life datasets is now described. The time to find exceptions is also compared with the time for precomputing aggregates in a OLAP data cube. The following examples were performed on an RS/6000 250 workstation running AIX 3.2.5. The workstation had a total physical memory of 256 MB. A buffer of size 32 MB was used. The datasets were stored as flat files on a local 2GB SCSI 3.5" drive. Three datasets were used that were derived from sales transactions of several department stores and mail order companies. Each dataset differs in the number of transactions, the number of attributes, and the number of distinct values for each attribute. In each case, a "sum" function was used for aggregating the cube. In the first dataset, the data relates to supermarket purchases and includes a total of 5 million transactions. Each transaction has three attributes: store, data and product category. The second dataset uses data from a mail order company. A sales transaction for this data set consists of three attributes: a Customer Identifier, an Order Date (month and year) and a Product. Additionally, the Product dimension has a hierarchy that groups products based on their category. The and a hierarchy on the date dimension that groups the monthly dates to "years". There are a total of one million transactions. The third dataset includes data having a total of one million transactions relating to grocery purchases of customers from a supermarket. Each transaction has four attributes: the Date of Purchase, the Shopper Type, the Store Code and the Product Group of the item purchased. FIGS. 8A-8C show performance of the rewrite-based approach (RW) and the straight-forward approach (SF) for calculating coefficients and residuals according to the present invention with respect to the performance for computing all of the group-bys (Agg) for the first through third datasets, respectively. A relative performance scale is shown at the right side of FIG. 8C and is applicable for each of FIGS. 8A-8C. The total execution time is normalized by the time taken to compute all of the group-bys Agg. From FIGS. 8A-8C, the performance of computing exceptions improves almost three-fold by using the rewrite-based approach of the present invention. The improvement is greater for datasets B (FIG. 8B) and C (FIG. 8C) than for dataset A (FIG. 8A) because the presence of hierarchies and/or a greater number of dimensions (as for dataset-B and C) causes the number of group-bys with which each group-by is joined with (during phase 2) to increase exponentially when compared to the straight-forward approach of the present invention. In contrast, with the rewrite-based approach, the number of group-bys that a particular group-by is joined with increases only linearly. After the rewriting, the time needed for finding exceptions and summarizing the exceptions comes to within a factor of three of the time needed for computing group-bys. Thus, computationally speaking, pre-mining for exceptions in OLAP data is not too expensive compared to simply precomputing aggregates. FIG. 9 shows a program storage device 90 having a storage area 91. Information stored in the storage area in a well-known manner that is readable by a machine, and that tangibly embodies a program of instructions executable by the machine for performing the method of the present invention described herein for interactively exploring OLAP data cubes. Program storage device 90 can be a magnetically recordable medium device, such as a magnetic diskette or hard drive, or an optically recordable medium device, such as an optical disk. While the present invention has been described in connection with the illustrated embodiments, it will be appreciated and understood that modifications may be made without departing from the true spirit and scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
