Query transformation and simplification for group by queries with rollup/grouping sets in relational database management systems6574623Abstract A method, apparatus, and article of manufacture for optimizing database queries, wherein the query is analyzed to determine whether the query includes at least one GROUP BY operation that computes at least one of the following: (1) a ROLLUP and (2) a GROUPING SET, and when it does, the query is rewritten to optimize one or more predicates that are applied after the GROUP BY operation. The query is also analyzed to determine whether the query includes at least one GROUP BY operation that computes two or more stacked GROUP BY operations, and when it does, the query is rewritten to collapse the stacked GROUP BY operations into a single GROUP BY operation. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
SELECT DISTINCT Q1.PARTNO, Q1.DESCR, Q2.SUPPNO
FROM INVENTORY Q1, QUOTATIONS Q2
WHERE Q1.PARTNO = Q2.PARTNO AND
Q1.DESCR=`ENGINE`AND
Q2.PRICE <= ALL (SELECT Q3.PRICE
FROM QUOTATIONS Q3
WHERE Q2.PARTNO=Q3.PARTNO);
This query provides information about suppliers and parts when the supplier's price is less than that of ALL other suppliers. FIG. 4 illustrates the QGM for this query. The graph contains four boxes 400, 402, 404, and 406, wherein Boxes 400 and 402 are associated with base tables INVENTORY and QUOTATIONS, Box 404 is a SELECT box associated with the main part of the query, and Box 406 is a SELECT box associated with the subquery. Each box is comprised of two main components, i.e., a head and a body, wherein the head describes the output table produced by the box, and the body specifies the operation required to compute the output table. Base tables can be considered boxes that have empty or non-existent bodies. With regard to Box 404, the head of this Box includes output columns PARTNO, DESCR and SUPPNO, as specified in the select list of the query. The specification of these columns includes column names, types, and output ordering information. The head has a Boolean attribute called DISTINCT that indicates whether the associated table contains only distinct tuples (head.distinct=TRUE), or whether it may contain duplicates (head.distinct=FALSE). The body of a Box contains a graph. The vertices of this graph (dark circles in the diagrams) represent quantified tuple variables, called QUANTIFIERS. In Box 404, there are quantifiers Q1, Q2, and Q4. Quantifiers Q1 and Q2 range over the base tables INVENTORY and QUOTATIONS, respectively, and correspond to the table references in the FROM clause of the SQL query. Note that nodes Q1 and Q2 are connected via an inter-box edge to the head of the INVENTORY and QUOTATIONS boxes. The edge between Q1 and Q2 specifies the join predicate. The (loop) edge attached to Q1 is the local predicate on Q1. In fact, each inter-quantifier edge represents a conjunct of the WHERE clause in the query block, i.e., the conjuncts being represented in the diagram by the labeled rectangle along the edge. Such edges are also referred to as Boolean factors. Quantifier Q4 is a universal quantifier, associated with the ALL subquery in the WHERE clause. This represents that for all tuples associated with Q4, the predicate represented by the edge between Q2 and Q4 is TRUE. In Box 404, Q1 and Q2 participate in joins, and some of their columns are used in the output tuples. These quantifiers have type F (ForEach), since they come from the query's FROM clause. Quantifier Q4 has type A, representing a universal (ALL) quantifier. SQL's predicates EXISTS, IN, ANY and SOME are true, if at least one tuple of the subquery satisfies the predicate. Hence, all of these predicates are existential, and the quantifiers associated with such subqueries have type E. Each quantifier is labeled with the columns that it needs from the table it ranges over. Additionally, quantifiers may be ordered within a box to support asymmetric operators, such as EXCEPT. In QGM, the quantifiers associated with existential and universal subqueries are called counting quantifiers. Scalar subquery quantifiers have the type S, requiring that (1) the subquery returns at most one row and (2) if the subquery does not produce any row, a null value will be returned via the S quantifier. Box 406 represents the subquery. It contains an F quantifier Q3 over the QUOTATIONS table, and has a predicate that refers to Q2 and Q3. The body of every box has an attribute called DISTINCT (not shown) that has a value of ENFORCE, PRESERVE or PERMIT. ENFORCE means that the operation must eliminate duplicates in order to enforce head.distinct=TRUE. PRESERVE means that the operation must preserve the number of duplicates it generates. This could be because head.distinct=FALSE, or because head.distinct =TRUE and no duplicates could exist in the output of the operation even without duplicate elimination. PERMIT means that the operation is permitted to eliminate (or generate) duplicates arbitrarily. For example, the DISTINCT attribute of Box 406 can have the value permit because its output is used in a universal quantifier (Q4 in Box 404), and universal quantifiers are insensitive to duplicate tuples. Like each box body, each quantifier also has an attribute called DISTINCT (not shown) that has a value of ENFORCE, PRESERVE or PERMIT. ENFORCE means that the quantifier requires the table over which it ranges to enforce duplicate elimination. PRESERVE means that the quantifier requires that the exact number of duplicates in the lower table be preserved. PERMIT means that the table below may have an arbitrary number of duplicates. Existential and universal quantifiers can always have distinct=PERMIT, since they are insensitive to duplicates. In the body, each output column may have an associated expression corresponding to expressions allowed in the select list of the query. These expressions are called head expressions. In the preferred embodiment, the RDBMS supports derived tables, which are similar to view definitions, and can be defined anywhere a table can be used. In the RDBMS, derived tables and views, just like queries and subqueries, have a QGM, with one or many boxes. When a derived table or view is referenced in a query, its QGM becomes part of the QGM graph of the query. The output of a box can be used multiple times (e.g., a view may be used multiple times in the same query), creating common subexpressions. In the remainder of this specification, only rough sketches of QGM graphs are used, and details that are not critical to the specification are omitted. Overview of GROUP BY Constructs In SQL92, GROUP BY items can be simple columns or expressions, wherein each of these grouping items is separated by a comma ",". For example, consider the following table and a simple GROUP BY query
CREATE TABLE T (A INT, B INT, C INT, D INT);
INSERT INTO T VALUES (1, 2, 3, 4);
INSERT INTO T VALUES (1, 2, 4, 4);
INSERT INTO T VALUES (1, 3, 5, 4);
INSERT INTO T VALUES (1, 3, 6, 4);
SELECT A, B, C COUNT(*) as COUNT
FROM T
GROUP BY A, B, C;
The above query returns the following results:
A B C COUNT
1 2 3 1
1 2 4 1
1 3 5 1
1 3 6 1
Similarly, consider the following GROUP BY query with expression:
SELECT A+B as AB, C, COUNT(*) as COUNT
FROM T
GROUP BY A+B, C;
The above query returns the following results:
AB C COUNT
3 3 1
3 4 1
4 5 1
4 6 1
Since SQL92 was introduced, there has been several extensions to the GROUP BY clause. In particular, GROUPING SETS and ROLLUP are allowed. For example, the following view and query illustrates the use of GROUPING SETS function:
CREATE VIEW V1 (A, B, C, COUNT) AS
(SELECT A, B, C, COUNT(*)
FROM T
GROUP BY GROUPING SETS (A, B, C));
According to the SQL semantics for "GROUPING SETS", the contents of the view V1 comprises:
A B C COUNT
-- -- 3 1
-- -- 4 1
-- -- 5 1
-- -- 6 1
-- 2 -- 2
-- 3 -- 2
-- -- -- 4
Note that, in the above table, the "--" character represents a NULL value.
Following is another example involving ROLLUP:
CREATE VIEW V2 (A, B, C, COUNT) AS
(SELECT A, B, C, COUNT(*)
FROM T
GROUP BY ROLLUP(A, B, C));
The contents of the view V2 comprise:
A B C COUNT
-- -- -- 4
1 -- -- 4
1 2 -- 2
1 3 -- 2
1 2 3 1
1 2 4 1
1 3 5 1
1 3 6 1
The present invention includes techniques that can be applied to optimizing queries involving these new GROUP BY constructs. Optimization For Queries With GROUPING SETS In relational databases, views are often used because views can provide a higher level of abstraction and access authorization. Complex queries can be built by referencing views. As an example, consider the following simple query referencing view V1 defined earlier:
SELECT*
FROM V1
WHERE A > 0;
In this query, the WHERE predicate will filter out any rows whose column A value is greater than 0, including the null values. That is, the query is not interested in other GROUPING SETS (B and C). Hence, the query can be equivalently simplified as:
SELECT A, NULL, NULL, COUNT(*) as COUNT
FROM T
GROUP BY A
The above query returns the following results:
A B C COUNT
1 -- -- 4
Note that the values of columns B and C are NULL, because these columns are
selected. Computing the above simplified GROUP BY query is obviously
faster than the original query.
Now, consider a slightly more complex query.
SELECT*
FROM V1
WHERE A > 0 OR B > 0;
When both columns A and B are NULL, the WHERE predicate becomes FALSE, and the query can then be simplified such that it requires only a subset of columns in the GROUPING SETS function:
SELECT A, B, NULL, COUNT(*) as COUNT
FROM T
GROUP BY GROUPING SETS (A, B);
The above query is equivalent to a simple union of two GROUP BY subselects:
SELECT A, NULL, NULL, COUNT(*) as COUNT
FROM T
GROUP BY A
UNION ALL
SELECT NULL, B, NULL, COUNT(*) as COUNT
FROM T
GROUP BY B;
The above query returns the following results:
A B C COUNT
-- 2 -- 2
-- 3 -- 2
1 -- -- 4
Again, the column C is always NULL. Like the previous example, this rewritten query avoids computing the GROUP BY operation for column C. According to the optimization techniques of the present invention, the rule can be defined as follows: 1. Given a set of column references (denoted as COL) in a GROUPING SETS function, a predicate to be applied after the GROUP BY clause where the predicate is FALSE when a subset of the column references in COL are simultaneously NULL (this subset is denoted as S_COL), then the GROUPING SETS function can be simplified such as only S_COL appears in the function. 2. If S_COL is a single GROUP BY column, then the GROUPING SETS function can be effectively removed. Optimization for Queries with ROLLUP Consider an example to illustrate the optimization technique for queries with ROLLUP. Suppose there is a fact table SALES that stores the sales transactions, including the location of the sales, the sales date, the amount and item bought. Further, assume there is a view defined that rolls up the number of transactions along the location dimension, i.e., region, state and city:
CREATE TABLE SALES (ITEM, REGION, STATE, CITY, SALES.sub.--
DATE, SALES_AMOUNT, . . . );
CREATE VIEW SUMMARY (REGION, STATE, CITY, COUNT) AS
(SELECT REGION, STATE, CITY, COUNT(*)
FROM SALES
GROUP BY ROLLUP (REGION, STATE, CITY));
Semantically, a GROUP BY operation with "ROLLUP(REGION, STATE, CITY)" is equivalent to four different GROUP BY operations unioned together. Therefore, the view can be equivalently written as:
CREATE VIEW SUMMARY (REGION, STATE, CITY, COUNT) AS
(
SELECT REGION, STATE, CITY, COUNT(*)
FROM SALES
GROUP BY REGION, STATE, CITY
UNION ALL
SELECT REGION, STATE, NULL, COUNT(*)
FROM SALES
GROUP BY REGION, STATE
UNION ALL
SELECT REGION, NULL, NULL, COUNT(*)
FROM SALES
GROUP BY REGION
UNION ALL
SELECT NULL, NULL, NULL, COUNT(*)
FROM SALES
);
That is, the first GROUP BY operation groups on columns REGION, STATE and CITY. The second GROUP BY operation groups on only REGION and STATE. The third GROUP BY operation groups on REGION and the last GROUP BY operation has no grouping column. There are few other ways of manually rewriting the ROLLUP construct using UNION ALL and temporary tables, i.e., the focus of this invention is not on how these constructs can be implemented using UNION ALL constructs. Instead, the present invention concentrates on query optimization in presence of predicates to be applied after the GROUP BY operation. Consider the following query that selects the sales count in Los Angeles, San Francisco and San Jose:
SELECT *
FROM SUMMARY
WHERE CITY IN (`LA`, `SF`, `SJ`);
By definition, the IN predicate is FALSE if the CITY column value is NULL. The above query can then be simplified to the following:
SELECT REGION, STATE, CITY, COUNT(*)
FROM SALES
WHERE CITY IN (`LA`, `SF`, `SJ`);
GROUP BY REGION, STATE, CITY
Instead of computing the entire rolled-up data, only a single GROUP BY operation is required. Not all predicates on the view can lead to a simplified GROUP BY clause. Following is an example query that selects the sales count in California, Arizona and Oregon states:
SELECT *
FROM SUMMARY
WHERE STATE IN (`CA`, `AZ`, `OR`);
The query can be simplified to the following:
SELECT *
FROM TABLE
(
SELECT REGION, STATE, CITY, COUNT(*)
FROM SALES
GROUP BY REGION, STATE, CITY
UNION ALL
SELECT REGION, STATE, NULL, COUNT(*)
FROM SALES
GROUP BY REGION, STATE)
AS Q(REGION, STATE, CITY, COUNT)
WHERE STATE IN (`CA`, `AZ`, `OR`)
);
Or, equivalently using a common subexpression:
WITH BASE AS
(
SELECT REGION, STATE, CITY, COUNT(*) AS COUNT
FROM SALES GROUP BY REGION, STATE, CITY),
ROLLUP AS (SELECT REGION, STATE, SUM(COUNT) AS COUNT
FROM BASE GROUP BY REGION, STATE)
SELECT REGION, STATE, CITY, COUNT
FROM BASE
UNION ALL
SELECT REGION, STATE, NULL AS CITY, COUNT
FROM ROLLUP
);
Essentially, the predicate filters out the rows due to the data along the ROLLUP hierarchy. In the first example, where the predicate involves the CITY column, rows for STATE, REGION and above in the ROLLUP hierarchy are eliminated when the CITY column is NULL. In the second example, where the predicate involves the STATE column, rows for REGION and above in the ROLLUP hierarchy are eliminated when the STATE column is NULL, resulting effectively two GROUP BYs unioned together. According to the optimization techniques of the present invention, the rule can be defined as follows: 1. Consider ROLLUP (COL1, COL2, . . . , COLn) where by definition COL1 is the highest level in the ROLLUP hierarchy. Suppose a predicate is to be applied after the GROUP BY clause where the predicate is FALSE when one of the grouping column (denoted as COLi, where i is between 1 and n) is NULL. 2. If COLi is exactly COLn, then the ROLLUP function can be eliminated resulting in COL1, COL2, . . . , COLn. 3. If COLi is not COLn, then the grouping results due to higher level of ROLLUP do not contribute to the answer set, and hence their computation can be eliminated. Effectively, this results in (n-i+1) different GROUP BYs unioned together:
GROUP BY COL1, COL2, . . . , COLi
GROUP BY COL1, COL2, . . . , COLi + 1
. . .
GROUP BY COL1, COL2, . . . , COLn
Optimization For Stacked GROUP BY Operations The following example illustrates the idea of optimization of a GROUP BY query involving another GROUP BY operation defined in a view or a derived table. A view is defined for summarizing the sales amount and the number of sales along the time and location dimensions:
CREATE VIEW SUMMARY (STATE, CITY, YEAR, COUNTSALES,
SUMSALES AS (SELECT STATE, CITY, YEAR (SALES_DATE),
COUNT(*), SUM(SALES_AMOUNT)
FROM SALES
GROUP BY STATE, CITY, YEAR (SALES_DATE));
The following query aggregates the sales information for California along the time dimension only using the above pre-defined view:
SELECT YEAR, SUM(COUNTSALES) As COUNTSALES,
SUM(SUMSALES) AS SUMSALES
FROM SUMMARY
WHERE STATE = `CA`
GROUP BY YEAR;
The QGM diagram is depicted in FIG. 5, where there is a GROUP BY box 500 "stacked" on top of another GROUP BY box 502 that accepts the output of a SELECT box 504 that accesses a base table 506. A traditional RDBMS typically computes the lower level of GROUP BY operation 502 (denoted as LGB) prior to the upper level of GROUP BY operation 500 (denoted as UGB). This is unnecessary, because the query can actually be rewritten and optimized so that there is only one GROUP BY operation to be applied directly on the base table. This involves flattening the stacked GROUP BY operations as illustrated by the following:
SELECT YEAR (SALES_DATE), COUNT(*) AS COUNTSALES,
SUM(SALES_AMOUNT) AS SUMSALES
FROM SALES
WHERE STATE = `CA`
GROUP BY YEAR (SALES_DATE);
The conditions for such query optimization technique include the following: 1. There is no filtering to be done between the two GROUP BY operations LGB and UGB or any row filtering requirement can be applied prior to the lower GROUP BY operation LGB. 2. The grouping columns in UGB is a subset of the grouping columns in LGB. 3. The aggregate function in UGB is computable using the input to the lower GROUP BY operation LGB. For example, SUM(COUNT_SALES) can be computed as COUNT(*) over the base table `SALES`. Similarly, SUM(SUM.sub.- SALES) can be computed as SUM(SALES_AMOUNT). The same applies to other aggregate function such as MAX( ) and MIN( ). Furthermore, when the aggregate function in UGB involves a GROUP BY column in LGB, the aggregate function remains unchanged. It can be seen from FIG. 5 that these conditions can be met. As such, one can "collapse" the two GROUP BY operations 500 and 502 resulting in only one new GROUP BY operation which has the following properties: 1. The GROUP BY columns remain unchanged with respect to UGB. 2. The aggregate functions in UGB are written as so that the functions are computed using the input to the lower GROUP BY operation LGB. The optimized query is depicted using QGM in FIG. 6, where there is a single GROUP BY box 600 that accepts the output of a SELECT box 602 that accesses a base table 604. LOGIC OF THE OPTIMIZATION TECHNIQUE FIGS. 7A, 7B, 7C, and 7D together are a flowchart illustrating the method of optimizing SQL queries in step 202 of FIG. 2 and step 312 of FIG. 3 according to the preferred embodiment of the present invention. Specifically, this flowchart analyzes a query to determine whether the query includes at least one GROUP BY operation that computes at least one of the following: (1) a ROLLUP and (2) a GROUPING SET, and when it does, rewrites the query to optimize one or more predicates that are applied after the GROUP BY operation. This flowchart also analyzes a query to determine whether the query includes at least one GROUP BY operation that computes two or more stacked GROUP BY operations, and when it does, rewrites the query to collapse the stacked GROUP BY operations into a single GROUP BY operation. Referring to FIG. 7A, Block 700 represents the server system 100, specifically an optimizer function of the RDBMS software 106, analyzing the query. Block 702 is a decision block that represents the server system 100 determining whether the query includes at least one GROUP BY operation. If so, control transfers to Block 704; otherwise, control transfers to Block 716. Block 704 is a decision block that represents the server system 100 determining whether the query includes a GROUPING SET function. If so, control transfers to FIG. 7B via Block 706; otherwise, control transfers to Block 708. Block 708 is a decision block that represents the server system 100 determining whether the query includes a ROLLUP function. If so, control transfers to FIG. 7C via Block 710; otherwise, control transfers to Block 712. Block 712 is a decision block that represents the server system 100 determining whether the query includes stacked GROUP BY operations. If so, control transfers to FIG. 7D via Block 714; otherwise, control transfers to Block 716. After these query transformation steps are performed, Block 716 returns control to Block 202 in FIG. 2 or Block 312 in FIG. 3 for subsequent processing steps, including the execution of the SQL query against the relational database and the output of the result set. Referring to FIG. 7B, Block 718 is a decision block that represents the server system 100 performing a GROUPING SET test, i.e., determining whether the predicates that are applied after the GROUP BY operation are false when the subset of column references in the GROUPING SETS function are simultaneously null. If so, control transfers to Block 720; otherwise, control transfers back to FIG. 7A via Block 706. Block 720 represents the server system 100 rewriting the query to simplify the GROUPING SETS function so that only a subset of column references remain in the GROUPING SETS function. Block 722 is a decision block that represents the server system 100 determining whether the subset of column references comprises a single column referenced by the GROUP BY operation. If so, control transfers to Block 724; otherwise, control transfers back to FIG. 7A via Block 706. Block 724 represents the server system 100 rewriting the query to eliminate the GROUPING SETS function. Thereafter, control transfers back to FIG. 7A via Block 706. Referring to FIG. 7C, Block 726 is a decision block that represents the server system 100 performing a first ROLLUP test, i.e., determining whether the predicates that are applied after the GROUP BY operation are false when a lowest column reference in a hierarchy of column references in the ROLLUP function is null. If so, control transfers to Block 728 to rewrite the query to eliminate the ROLLUP function; otherwise, control transfers to Block 730. Block 730 is a decision block that represents the server system 100 performing a second ROLLUP test, i.e., determining whether the predicates that are applied after the GROUP BY operation are false when a column reference in a hierarchy of column references in the ROLLUP function does not contribute to an answer set for the ROLLUP function and the column reference is not a lowest column reference in the hierarchy of column references in the ROLLUP function. If so, control transfers to Block 732 to rewrite the query to simplify the ROLLUP function. Thereafter, control transfers back to FIG. 7A via Block 710. Referring to FIG. 7D, Block 734 is a decision block that represents the server system. 100 determining whether there is no filtering performed between the two stacked GROUP BY operations. If not, control transfers to Block 736; otherwise, control transfers back to FIG. 7A via Block 714. Block 736 is a decision block that represents the server system 100 determining whether grouping columns in an upper one of the two stacked GROUP BY operations are a subset of grouping columns in the lower one of the two stacked GROUP BY operations. If not, control transfers back to FIG. 7A via Block 714; otherwise, control transfers to Block 738. Block 738 is a decision block that represents the server system 100 determining whether any aggregate function in the upper one of the two stacked GROUP BY operations is computable using inputs to the lower one of the two stacked GROUP BY operations. If not, control transfers back to FIG. 7A via Block 714; otherwise, control transfers to Block 740. Block 740 represents the server system 100 collapsing two stacked GROUP BY operations into a single GROUP BY operation, wherein the columns of the upper one of the two stacked GROUP BY operations are unchanged in the single GROUP BY operation and any aggregate functions in the upper one of the two stacked GROUP BY operations are written so that the aggregate functions are computed using inputs from the lower one of the GROUP BY operations. Thereafter, control transfers back to FIG. 7A via Block 714. CONCLUSION This concludes the description of the preferred embodiment of the invention. The following describes some alternative embodiments for accomplishing the present invention. For example, any type of computer, such as a mainframe, minicomputer, or personal computer, could be used with the present invention. In addition, any software program performing queries in a like manner (either partially or entirely) could benefit from the present invention. In summary, the present invention discloses a method, apparatus, and article of manufacture for optimizing database queries, wherein the query is analyzed to determine whether the query includes at least one GROUP BY operation that computes at least one of the following: (1) a ROLLUP and (2) a GROUPING SET, and when it does, the query is rewritten to optimize one or more predicates that are applied after the GROUP BY operation. The query is also analyzed to determine whether the query includes at least one GROUP BY operation that computes two or more stacked GROUP BY operations, and when it does, the query is rewritten to collapse the stacked GROUP BY operations into a single GROUP BY operation. The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description.
|
Same subclass Same class Consider this |
||||||||||
