Method for evaluating relational database queries with automatic indexing and ordering of join components5423035Abstract A computer-implemented method that speeds up relational database qualification processing by emulating the function of a multiple dimension index, including constant expressions and joins involving complex functions. The method comprises the following steps. Establishing a command comprising a plurality of range variables related by join operators. Breaking up the qualification portion of the command into sub-expressions of the general type f(a) (join) f(b), or f(a) join "constant expression", where "a" and "b" are genetic range variables. The sub-expressions are then mapped onto component joins that may be used repeatedly for numerous commands. The component joins may contain boolean operators. Establishing a range variable processing order to resolve processing ambiguities, both in ordering of the range variables and in specialized join types such as outer-joins, and eliminate false roots. Evaluating the component joins to make a partial index for each join, wherein the partial index comprises a pointer table such that it may be used simultaneously with other partial index. Finally, looping over range variables using boolean processing to combine the join vectors, as needed, to complete the qualification. The present method eliminates the need for a user defined index. A processing gain is achieved without the associated increase in storage requirements normally encountered with a multiple dimension index. The measured performance gain for join processing is about two orders of magnitude, while the overall performance gain was measured at about an order of magnitude. Claims What is claimed is: Description BACKGROUND
__________________________________________________________________________
range of A is RELATION.sub.-- 1
range of B is RELATION.sub.-- 2
range of C is RELATION.sub.-- 3
retrieve (A.key,B.key,C.key)
<targetlist>
where
A.comndid = C.comndid.sup.1
<qualify operations>
and (B.setnum > 0 or B.fldnum > 2)
and A.object = B.object.sup.2
and A.setnum = B.setnum.sup.3
and (B.setnum*128 + B.fldnum) = C.record.sub.-- id
and B.set.sub.-- id = C.set.sub.-- id
The joins which are processed using join vectors using the present
method are
designated on the same command are identified by numbered footnote
below.
range of A is RELATION.sub.-- 1
range of B is RELATION.sub.-- 2
range of C is RELATION.sub.-- 3
retrieve (A.key,B.key,C.key)
<target list>
where
A.comndid = C.comndid.sup.11
<qualify operations>
and (B.setnum > 0 or B.fldnum >2).sup.12
and A.object = B.object.sup.13
and A.setnum = B.setnum.sup.14
and (B.setnum*128 + B.fldnum) = C.record.sub. -- d.sup.15
and B.set.sub.-- id = C.set.sub.-- id.sup.16
__________________________________________________________________________
The method of the present invention is employed with a relational database that comprises at least one relation corresponding to a table (file), a plurality of attributes corresponding to columns (fields) of the relation, and a plurality of tuples corresponding to rows (records) of the relation. A qualify operation restricts the cross-product of two relations. The cross-product is formed by placing a tuple from one relation followed by every tuple from another relation. The qualify operation is comprised of joins that selectively control the commands operation, restricting the cross-product of the referenced relations. The qualify operation thus controls the target tuples retrieved by the operation. A target list controls the columns listed in the target tuple and hence projects the cross-product. Referring now to the drawing figures, FIG. 1 shows a flow diagram of a computer-implemented method 10 of performing relational database qualification processing in accordance with the principles of the present invention. The method 10 comprises the following steps. The first step comprises establishing a query (command) 11 comprising a plurality of range variables related by join operators. The next step comprises breaking up the command (query) 12 into component joins of the type f(a) (join) f(b). The next step comprises establishing a range variable processing order 13. The next step comprises evaluating the sub-expressions 14 to establish partial index for each sub-expression, wherein the partial index comprises a pointer table. The final step comprises looping over range variables 15 evaluating the established join vectors as required. The present method 10 processes all joins by means of join vectors, using boolean processes operating upon partial index, having a form such as f(a)=f(c) and f(b)=f(c), for example. In contrast, prior art processes use a cross-product of relations (as referenced through the range variables) for that portion of the relation that is not excluded by index processing, and then eliminates invalid target tuples which result from the cross-product by evaluating the cross-product results against the qualification statements. The method of qualify expression evaluation of present method 10 may be illustrated as follows, with reference to ranges "B" and "C". A No join vector is used for this range variable, the range variable is simply stepped once over its allowed range of values. B index by join vector which results from boolean combination of partial index of constant expression upon "B" and sub-expressions of "A" and"B" for the current value of "A" explicitly the partial index for the sub-expressions: A.setnum=B.setnum A.objectid=B.objectid (B.setnum>0 or B.fldnum>2) C index by join vector which results from boolean combination of partial index of sub-expressions of "A" and "C" for the current value of "A", and of partial index for sub-expressions of "B" and "C" for the current value of "B" explicitly the partial index for the sub-expressions: A.comndid=C.comndid B.set.sub.-- id=C.set.sub.-- id (B.setnum* 128+B.fldnum)=C.record.sub.-- id The inclusion of the joins for A, B, to C is unique to the present processing method 10, and represents the central aspect thereof. However the ability to determine where to optimally resume the stepping process is also unique to the present processing method 10 and is an important optimization. An optimization of aggregate processing is also provided by the present method 10 wherein the dimensions of the aggregates are reduced through analysis of the aggregate expression and, if the join to the aggregate result fails, then a default result is joined. Given the above, FIGS. 2a-2f show a more detailed flow diagram of the method 10 of FIG. 1. The method 10 breaks qualify expressions into component parts 12 so that each part involves a join of either one range variable to a constant or two range variables to each other. These component parts may contain OR operators. The inclusion of AND operators in a sub-expression is a performance decision. Examples of typical component parts of the qualify portion of a relational database command are: A.setnum=B.setnum (B.setnum*128+B.fldnum)=C.record.sub.-- id. Component parts of the qualify expression in which the range variables are not segregated across the relational operators and that cannot be reprocessed to conform to this role, are often resource-expensive to produce and are excluded from join vector processing as a practical matter. The technique of assigning an alternative function into the partial index function, as an optimization to restrict cross-product evaluation of this class of joins, is useful. An example of such a join is A.setnum+B.setnum=C.setnum. The present method 10 as shown in FIGS. 2a-2f comprises building a directed graph 22 (examples are shown in FIGS. 3a, 3b, 4 and 5 with range variables at nodes and with all join relationships represented between them. The join relationships are an abstraction of the functional relationship between two range variables as expressed in the subexpressions. Before the directed graph 22 may be used to facilitate processing of the qualify expression, the method 10 resolves processing ambiguities 23, eliminates false roots 24, and finally assigns a range variable processing order 13. Steps 22, 23, 24, and 13 are graphically illustrated 3a, 3b, 4 and 5, respectively. More specifically, FIG. 3a illustrates building a directed graph 22 and also illustrates assigning a range variable processing order 13. This is a directed graph with join directions randomly assigned. The join directions go in a circle making the parent-child relationship ambiguous. This directed graph also has only one possible processing order wherein each range variable appears before it's parent. The processing order is "A", "C", "B". FIG. 3b shows a directed graph with join directions rearranged so that the parent-child relationships are defined. Range "A" is now a root (having no parents). FIG. 4 illustrates resolving processing ambiguities 23. A range variable "D" has been added to illustrate a graph with two root ranges. Both "A" and "D" do not have parents. FIG. 5 illustrates eliminating false roots 24. A range variable "D" has been added to illustrate a graph with two root ranges. Both "A" and "D" do not have parents. The direction of the relationship of "C" and "D" in FIG. 5 has been reversed so that "D" is now a grandchild of "A". This eliminates "D" as a root. Each of the sub-expressions assigned a directional component may then be combined with other similarly processed sub-expressions for a given range so that collectively the sub-expressions form a join vector. Join vectors of this type replace corresponding qualify expressions. In this way join vectors are combined to emulate a complex multiple-dimension index. Additional join vectors may be derived from components of the qualify expression that irreducibly represent a join between more than two range variables. In this case the join vector does not replace the portion of the qualify it is derived from, since it represents a superset of the original expression. The first step in the processing of the join vectors in the method 10 initializes the join vectors 26. Then starting at a first range variable 27, which is always a root, the method 10 loops over all range variables 28, in accordance with the range variable processing order (step 13), and as a function of the join vector processing. The method 10 checks the range variable if it is subject to outer-join processing (looping step 28) which is a special value which the range variable may assume in addition to the tuple values it may assume for the particular relation which it represents. This value allows the method to bypass the range variable increment until the outer-join has been evaluated. The range variables become subject to outer-join processing when the method 10 requires that the range variables be transitioned up the range variable processing list and the parent of an outer join is transitioned and no target list tuple has been produced for that range variable value. Ranges that have been designated as outer-joined are ignored by range process step 29 until the parent range that is outer-joined is re-encountered, when the range variables record numbers are set for initialization, and step range function continues as before. Then the method 10 determines the next or first value for the range variable for the tuples for the relation which it represents in range process step 29. If the range variable is subject to join vector processing then the join vectors are combined in a boolean process which takes intersections or unions, as appropriate, and determines the next or first range variable value. If no value results from the join vector processing then the range variable value is set to undefined so that re-evaluation of the parent range values may proceed directly. This is a unique feature of this process. If however the range variable is not subject to join vector processing then it is incremented through the tuple values for the relation which it represents, until it has assumed all allowed values and then it is set to undefined. Next the range variable is tested 30 to determine if it has assumed a defined tuple value. If it has, the range variable is tested 31 if it is the last range variable in the processing order (step 13). If it is, then a test 32 is made to see if the qualify is completed in the join vector processing. If the qualify is completely processed in the join vector processing then the target list is processed 35 to produce the target list tuple result. If the qualify was not completely represented then the remainder of the qualify is completed 33, and if the remainder of the qualify evaluates as true, determined in qualify processing step 34 then the target list is processed 35 to produce the target list tuple result. Then the method 10 sets the range variables to indicate that a tuple was produced for all current range variable values and the backs up in the range variable processing list to the deepest range variable in the list that is not subject to outer join processing 36. If however, at looping step 28, the range variable was subject to outer join processing then the current position in the range variable processing order 13 is incremented 44 and looping step 28 is repeated. In the method 10 if the range variable is undefined when tested in step 30, then if the range variable is the first range variable in the processing order then the evaluation of the command is complete 39. If the range variable is not the first range variable in the processing order then if the range variable is immediately subject to an outer-join parent 38 then a test is performed to check if a tuple was produced for the range variables parents current value 40. If no value has yet been performed for the parent then the outer join is propagated to all subject outer-join children 41, and the processing continues down the range variable processing list. Otherwise all child combinations have been eliminated or evaluated and processing proceeds up the range variable processing order 42. The ability of the method 10 to decrement its processing of the range variable processing order to the deepest failing parent join is a unique feature of the method 10 and is critical to the processing speed of the method. Other processing methods are not able to reverse the processing order until joins involving multiple parents are detected in the qualify processing step 34. If the qualify processing step 34 fails then the method 10 sets the current range variable to the deepest failing range variable in the processing order 45 and proceeds with the range variable processing loop in step 28. Complete details of an implementation of the method 10 including a procedural description language (PDL) are provided in the PDL DESCRIPTION. Those skilled in the programing art may use this PDL to generate code to implement the method 10 of the present invention. The technique of processing the join vectors at the parent ranges, rather at the child ranges is also provided by the present invention, but is not described in detail since it is a simple extension of the method 10. The processing method 10 has been implemented and tested in conjunction with relational database report generation. The processing method 10 achieved a reduction of over an order of magnitude in the processing time required to process the reports. An overall performance gain of a factor of 20 was initially realized. Additional relational database processing brought this gain down to a factor of 14, unadjusted for additional processing load, but this is in itself is very good, since the above-described processing method 10 automatically determines optimal processing for the database manipulations and relations and required no database analysis or optimization. The qualify processing time improvement realized with the method 10, without the inclusion of input and output processing, was 2-3 orders of magnitude for historically troublesome commands. This was achieved with no reduction in processing speed for commands involving simple joins. The baseline relational database that was used as a benchmark for relative performance evaluation was a Britton-Lee based setup conversion process that has been in use for over five years, and which was heavily optimized. Thus there has been described new and improved method of evaluating relational database qualifications. It is to be understood that the above-described embodiment is merely illustrative of some of the many specific embodiments that represent applications of the principles of the present invention. Clearly, numerous and other arrangements can be readily devised by those skilled in the art without departing from the scope of the invention. PDL DESCRIPTION A procedural description language (PDL) for the present method 10 described above is provided below. Those skilled in the programming art may use this PDL to generate code to implement method of the present invention.
__________________________________________________________________________
Data Structures
Per Range
Processing Order
Outer-join flag
PARENT- "Outer-join to child relations"
CHILD - "Child (grandchild...) of outer-join"
BLANK - "not a parent or a child"
Outer-join status
PASS - "Some Tuple produced for range current
record"
OUTERJOIN - "No record yet produced for range
current record
Max records
max value for record count
Current record
internal record count
Current range
range variable in processing list being looped on
Direction
1 - go downward in range variable processing list
-1 - go upward in range variable processing list
STEP RANGES
Initialize "current-range" to
Initialize "direction" to "1"
Initialize all "outer-join-status" to "PASS"
Initialize all "current-record" to "0"
For "current-range", while > "0"
.sup. if "record-number" for "current-range" >= 0
.sup. then
if direction < 0
for each join
if any "outer-join-status" is set to "OUTERJOIN" in list of joins
then
set "outer-join-status" to "PASS"
perform propagate outer join
set "direction" to "1"
elseif ("outer-join-flag" is "PARENT"
clear all subordinate range record numbers
endif
enddo
if "range-type" is "SCAN"
perform step scan type range
else
perform step join vector type range
endif
else
if "range-type" is "SCAN"
perform step scan type range
else
perform step join vector type range
endif
endif
endif
current-range" = "current-range" + "direction"
if "current-range" is greater than "last-range" then
perform Qualify processing and Target List processing
<<< note: if the expression evaluates as true >>>
<<< then all outer-join-status are set to "pass"
set "direction" to "-1"
decrement "current-range"
endif
endfor
INITIALIZE INDEX
clear current-record
for each parent/index
set pointer-to-partial-index-list to first element in set-of-records
set end-of-list to last element in set-of-records
endfor
STEP INDEX
if current record = 0
set current pt to 0 for each index
endif
set "new-record" to "current-record + 1"
set "index-point" to "first-index in partial index list of
current-range"
do while index point >o
if new-record > "current-record of pointer-to-partial-index-list of
index-
point"
then
increment "pointer-to-partial-index-list of index-point"
if "pointer-to-partial-index-list" > "end-of-list"
then
clear "record-number" for "current-range"
set "direction" to "-1"
set current range to deepest range where null result occured
goto ::return::
endif
elseif "new-record" < "current-record of pointer-to-partial-index-list
of index-
point"
set "new-record" to "current-record of pointer-to-partial-index-list of
index-
point"
reset "index-point" to "first-index"
else
set "index-point" to "next-index in partial index list"
endif
enddo
set "current-record" to "new-record"
for each join
if "outer-join-flag" is "PARENT"
set "outer-join-status" to "OUTERJOIN"
endif
endif
set "direction" to "1"
::return::
STEP SCAN
increment "current-record"
if "current record" > "max record" then
clear "record-number" of "current-range"
set "direction" to "-1"
else
for each join
if "outer-join-flag" is "PARENT"
set "outer-join-status" to "OUTERJOIN"
endif
enddo
set "direction" to "1"
endif
__________________________________________________________________________
|
Same subclass Same class Consider this |
||||||||||
