Method of performing a parallel relational database query in a multiprocessor environment5797000Abstract A method of performing a parallel join operation on a pair of relations R1 and R2 in a system containing P processors organized into Q clusters of P/Q processors each. The system contains disk storage for each cluster, shared by the processors of that cluster, together with a shared intermediate memory (SIM) accessible by all processors. The relations R1 and R2 to be joined are first sorted on the join column. The underlying domain of the join column is then partitioned into P ranges of equal size. Each range is further divided into M subranges of progressively decreasing size to create MP tasks T.sub.m,p, the subranges of a given range being so sized relative to one another that the estimated completion time for task T.sub.m,p is a predetermined fraction that of task T.sub.m-1,p. Tasks T.sub.m,p with larger time estimates are assigned (and the corresponding tuples shipped) to the cluster to which processor p belongs, while tasks with smaller time estimates are assigned to the SIM, which is regarded as a universal cluster (cluster 0). The actual task-to-processor assignments are determined dynamically during the join phase in accordance with the dynamic longest processing time first (DLPT) algorithm. Each processor within a cluster picks its next task at any given decision point to be the one with the largest time estimate which is owned by that cluster or by cluster 0. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Query 1
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
4 ORDER BY NAME
______________________________________
Line 1 of this query specifies the columns requested as output: in this case, the NAME, DEPT and SALARY columns of table 200 (EMPLOYEE). The notation EMPLOYEE.DEPT means that the DEPT column is to be taken from the EMPLOYEE table; otherwise, since DEPT appears in both tables, the specification would be ambiguous. The particular table from which the NAME and SALARY columns are taken need not be specified (although it may be specified), since these column names appear only in the EMPLOYEE table and hence there is no ambiguity. Line 2 specifies that these columns are taken from the set of all tuples formed by concatenating a tuple from table 200 (EMPLOYEE) with a tuple from table 300 (DEPT). This set of tuples is known as the Cartesian product of tables 200 and 300. Line 3 of the query, which is referred to as the predicate (the join predicate in this instance, since the query is a join), specifies that only those tuples of the Cartesian product for which the NAME column of the EMPLOYEE component is equal to the MGR column of the DEPT component are selected. Since in this case the join predicate specifies an equality, this type of query is known as an equijoin. Finally, line 4 specifies that the output listing is to be sorted by the NAME column of the output listing. In general, Query 1 may be partitioned into concurrently executed tasks by having each processor 104 process only that part of the query corresponding to particular values of one of the columns of a table, for example, table 200 (EMPLOYEE). More specifically, Query 1 may be partitioned into tasks corresponding to a particular range of one of the columns of table 200. Thus, if the underlying domain of the NAME column of table 200 (EMPLOYEE) extends from A from ZZZ . . . ZZZ, the original query may be split into 26 independent tasks for concurrent execution, so that the first task, for example, might be as follows:
TABLE 2
______________________________________
Task 1
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND A .ltoreq. NAME < B
4 ORDER BY NAME
______________________________________
Lines 1-3 and 4 of task 1 are similar to lines 1-4 of query 1. Line 3a adds another condition to the predicate, namely, that only those tuples of the EMPLOYEE table for which NAME begins with A are considered. The two conditions on lines 3-3a taken together imply that MGR is subject to the same range condition as NAME. Therefore, instead of having to consider the entire DEPT table, only those tuples for which A.ltoreq.MGR<B need be considered in this particular task. Task 1 might be restated, therefore, as follows:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND A .ltoreq. NAME < B
3b AND A .ltoreq. MGR < B
4 ORDER BY NAME
______________________________________
Task 1 is equivalent to the original query as performed on subsets (or partitions) of the EMPLOYEE and DEPT tables 200 and 300 in which NAME and MGR begin with the letter A. Other tasks into which Query 1 is divided would be similar in form to task 1; except that the additional condition would reference a different part of the EMPLOYEE and DEPT tables. Thus, task 2 might be:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND B .ltoreq. NAME < C
3b AND B .ltoreq. MGR < C
4 ORDER BY NAME
______________________________________
while task 26 might be:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND NAME .gtoreq. Z
3b AND MGR .gtoreq. Z
4 ORDER BY NAME
______________________________________
It may be readily verified that these tasks are mutually exclusive and that their results, when merged, replicate the results that would be obtained by performing the original unpartitioned query. As can be inferred from the above example, the tasks into which tables 200 and 300 are partitioned, using a partitioning scheme of one task per processor, can vary widely in the number of tuples. For example, relatively few if any names start with X, whereas a great many may start with other letters such as E. Similar skew patterns may exist for other types of data (e.g., zip codes). Since the query completion time is determined by the time required to complete the longest task, such data skew can greatly prolong the query time, nullifying the benefits of parallelization. Partitioning Procedure Referring to FIG. 4, in accordance with the present invention, each table or relation 400 forming part of the query is first partitioned into P ranges p (402) of equal size, where P is the total number of processors 104 in the system 100 and 1.ltoreq.p.ltoreq.P. This partitioning is preferably performed on the basis of the underlying domain of a particular column of the relation (preferably the join column), not the actual number of tuples in a particular range. Thus, if NAME and MGR are the respective join columns for joining the EMPLOYEE table 200 (FIG. 2) and the DEPT table 300 (FIG. 3), as in the above example, the EMPLOYEE table may be divided into a first range consisting of those tuples for which NAME begins with the letter "A", a second range consisting of those tuples for which NAME begins with the letter "B", and so on, as already described. Correspondingly, the DEPT table 300 would be divided into a first range consisting of those tuples for which MGR begins with the letter "A", a second range consisting of those tuples for which MGR begins with the letter "B", and so on, as already described. Next, each range p (402) into which a relation 400 forming part of the query has been partitioned is further divided into M subranges 404 to form tasks T.sub.m,p of progressively decreasing estimated task time, where 1.ltoreq.m.ltoreq.M. This division into subranges 404, like the original division into ranges 402, is preferably performed on the basis of the underlying column domain, not the actual number of tuples in a particular range. As a particular example, the ranges 402 might be divided in such a manner that each subrange m (404) has an extent in the underlying column domain that is 1/.sqroot.2 that of the preceding subrange m-1. Since the completion time for a join of two relations is roughly proportional to the product of the number of tuples in each relation, such a partitioning results in an estimated completion time for a task T.sub.m,p that is approximately half that of the preceding task T.sub.m-1,p. In the event that additional tuple cardinality information is available, the creation of these tasks can be slightly modified. For example, some database catalogs keep track of the cardinalities of the K most frequently occurring tuples for some small value of K. Thus, referring to FIG. 11, such a catalog may keep track of the K most frequently occurring tuples 1102 in a first relation R1, as well as the K most frequently occurring tuples 1104 in a second relation R2. If this information is available, there will be K' frequently occurring values 1108 (where 0.ltoreq.K'.ltoreq.K) for which the tuple cardinalities of both relations are known, and 2(K-K') values 1106, 1110 for which the tuple cardinality of one relation is known and the other is bounded from above. These latter values can either be estimated or determined explicitly during the sort phase, and those values with large enough task times turned into separate tasks. These tasks can even be split into multiple subtasks to be performed on several processors if necessary, as is described in J. L. Wolf et al. (1990) and J. L. Wolf et al. (1991), cited above. Processors 104 are initially assigned those tasks having the largest estimated completion times. Upon the completion of a task by a processor, it is assigned an awaiting task having the largest estimated completion time, so that the tasks having the smallest estimated completion times are the last to be assigned. This allows the smaller tasks to be assigned in such a manner as to smooth out load imbalances that may develop among the processors 104. At the same, having the initially performed tasks relatively large minimizes scheduling overhead. Overall Query Procedure Referring to FIG. 5, the overall query procedure is conveniently divided into four successive phases: a sort phase 510, a partition phase 520, a transfer phase 530, and a join phase 540. Each of these phases is performed in parallel within the clusters 102 (FIG. 1), with each cluster performing its portion of a particular phase as it relates to the tuples within the cluster. Referring to FIG. 6, in the sort phase 510, the tuples of each relation forming part of the query that are locally stored on a particular cluster 102 are sorted (if necessary) in accordance with the value in the column forming the basis for partitioning (step 512); in most instances, this column is preferably the join column. This sorting facilitates the partitioning of the relations in the next phase as well as their subsequent transfer, since the tuples of each partition are consecutively accessible. Referring to FIGS. 4 and 7, in the partition phase 520, the join request is partitioned into P sets of M tasks T.sub.m,p of progressively decreasing estimated task time, for a total of MP tasks overall. In the case of an equijoin query (as in the above example), this may be accomplished by partitioning the join column domain of each relation 400 forming part of the query into P ranges p of equal size, as described above (step 522), and then further dividing each such range p into M subranges 404, also as described above (step 524). In general, however, any of a number of mechanisms may be used to partition the original query into sets of independent tasks of decreasing estimated task time for concurrent execution. A particular mechanism is described further below. Referring now to FIG. 8, in the transfer phase 530, the portions of the tasks T.sub.m,p resident on a given cluster 102 (i.e., the actual tuples forming the corresponding partitions of the relations to be joined) are transferred to the clusters that will actually be processing them, or to the universal cluster (SIM) 108 (FIG. 1), depending on the relative task size. In general, as described more fully below, "smaller" tasks are transferred to the universal cluster 108, while "larger" tasks are transferred to a particular processor cluster 102. Each processor 104 of the system 100 is allotted an equal portion 1/P of the memory capacity of universal cluster 108. In the initial portion of the transfer phase, for each processor p (104) of the system 100, the tasks T.sub.m,p corresponding to that processor and residing on a particular cluster 102 are transferred from that cluster to the universal cluster 108, beginning with the task T.sub.M,p having the smallest estimated completion time and progressing in order of increasing task size (i.e., decreasing m), until the allotted portion 1/P is filled (step 532). The remaining tasks T.sub.m,p for each processor p (104) are transferred to the cluster 102 owning the processor, unless they are already resident there (step 534). Referring to FIG. 9, in the join phase 540, each processor p (104) within a particular cluster is initially assigned task T.sub.1,p (step 542). The processors 104 perform the tasks assigned to them in a conventional manner that does not form part of the present invention. Subsequent tasks T.sub.m,p, where m>1, are assigned dynamically as the initially assigned tasks are completed. Thus, upon the completion of a task (step 544), a determination is made of whether there are any remaining tasks in that cluster 102 that have not been assigned (step 546). If so, then an available processor 104 in the cluster 102 (e.g., the processor that was executing the just-completed task) is assigned the task T.sub.m,p from the cluster with the longest estimated completion time (step 548). In these subsequent assignments, the task T.sub.m,p assigned a processor 104 need not have the same index p as the processor; the task need only be one of the tasks that was transferred to the cluster 102 in the transfer phase. The task assignments are in this sense dynamic, since only the cluster 102, not the processor 104, is predetermined (in the transfer phase). If there are no remaining tasks in the cluster 102 that have not been assigned, a determination is made of whether there are any remaining tasks in the universal cluster 108 that have not been assigned (step 550). If so, then an available processor 104 in the cluster 102 is assigned the task T.sub.m,p from the universal cluster 108 with the longest estimated completion time (step 552). If there are no remaining tasks in either the processor cluster 102 or the universal cluster 108, then scheduling is complete for that query, and the only remaining item in this phase is the actual completion of task processing by the various processors 104, together with any required post-processing steps such as merging the results of the individual tasks and reporting the query results to the user. The invention has been described with particular reference to a sort merge join. However, as noted above the same principles can also be applied to other query types such as a hash join. Thus, in a hash join, each relation might be correspondingly hashed into P partitions, each of which is in turn further divided into M subpartitions of decreasing size to create MP independent tasks which are transferred to the various clusters 102 and 108. Particular Partitioning Mechanism As noted above, any of a number of mechanisms may be used to partition an original query R into sets of independent tasks of decreasing estimated task time for concurrent execution. A particular mechanism in the form of a query processor (QP) is described below. The problem domain assumes a requester which issues requests for data, in a language such as SQL and one or more data management instances such as a relational database which store data, such as relational tables and provide a language such as SQL--see D. Chamberlin et al., "SEQUEL 2: A Unified Approach to Data Definition, Manipulation, and Control", IBM Journal of Research and Development, vol. 20, no. 6, pp. 560-575 (November 1976), for a further discussion of SQL--for issuing requests to retrieve such data. For the purposes of explanation, in the description below the SQL language will be used for examples and relational database terminology will be used; however, the procedure is not limited to the relational database scope. The SQL request is denoted as R, and the relational database instances as DB.sub.1, DB.sub.2, and so on. The request R may involve data stored at any subset or all of DB.sub.1 . . . DB.sub.N, for any number N. Each database provides, upon request, catalog information about the data that it stores. All tables are accessible from any of the DB.sub.i instances, either through multiple processing unit access to the same data--one example is shared data mechanisms such as described in C. Mohan, "Parallelism in Relational Data Base Systems: Architectural Issues and Design Approaches", IEEE Proceedings of Distributed and Parallel systems conference, Dublin, Ireland (July 1990)--or through replicated copies as described in Proceedings of the Second International Conference on Parallel and Distributed Information Systems, San Diego, IEEE Computer Society Press (January 1993) (hereinafter Proceedings 1993). The join request R for data stored in multiple tables is assumed to have the following form or to be transformable through various means to such a form: SELECT (expressions) FROM (list of tables) (rest of query) where (expressions) are any computation supported by the request language and (rest of query) is an optional phrase, which, if present describes any qualifications about the data or any other computations permitted by the request language. FROM (list of tables) provides in the list of tables the names of two or more tables whose data is being requested. These tables are denoted as T.sub.1, T.sub.2, T.sub.3 . . . T.sub.K. These tables are expected to be among those stored by DB.sub.1, . . . DB.sub.N. An example of such a join request is the following: SELECT NAME, SALARY, DEPTNO FROM EMPLOYEE, DEPT WHERE EMPLOYEE.ENO=DEPT.MGRENO ORDER BY 3 Referring to FIG. 12, upon receipt of the original data request R, the QP performs the following steps: 1. The QP examines the FROM (list of tables) contents of the request and estimates the cost of providing the answer to this query (step 1202). Such an estimate could be provided from many mechanisms that are not part of this invention. One alternative is to execute the query and measure elapsed time, or CPU time, or I/O time, or any combination. Another alternative is to measure or estimate any quantity of importance to the system designer or the requester, such as response time; such a measurement could differ by type of data or requester or time of day, etc. Still another alternative would be to analyze the request using known mechanisms such as those in P. Selinger et al., "Access Path Selection in a Relational Database Management System", Proceedings of ACM SIGMOD Conference, pp. 23-34 (June 1979), or P. G. Selinger et al., "Access Path Selection in Distributed Database Management Systems", Proceedings International Conference on Data Bases, Deen and Hammersly (eds.), University of Aberdeen, pp. 204-215 (July 1980), also available as IBM Research Report RJ2883 (August 1980), resulting in an estimated total cost in terms of an arithmetic combination of CPU and I/O and possibly message costs. 2. The QP determines the contribution of each table in the FROM list of tables to the overall cost obtained in step 1, using any one of a variety of techniques (step 1204). One such technique is to use database tools, such as the EXPLAIN mechanism of the IBM DATABASE 2.TM. (DB2.RTM.) 3.1 relational database manager, described in the IBM manual DATABASE 2 Version 3.1 General Information (GC26-4886-00), which provides a description of the execution of the request plus the costs associated with each table. (DATABASE 2 is a trademark, and IBM and DB2 are registered trademarks, of IBM Corporation.) An alternative technique would be to construct in the QP a cost analyzer that formulates its own estimates of cost and to calculate and record the contribution of each table T.sub.i to the overall cost. 3. The QP constructs a list L.sub.1, L.sub.2, . . . , L.sub.K of the tables in the request in order of their cost contribution to the overall cost of the request, with the most costly as L.sub.1 (step 1206). 4. Next the QP identifies the tables which permit partitioned access (step 1208). This determination can be done with various mechanisms such as using the catalogs of DB.sub.1, . . . DB.sub.N, or any extract of those catalogs which QP itself may keep. In some implementations, all tables may be partitioned, while in others only certain tables may be partitioned. By partitioned access is meant being able to access only a portion of the entire table using database mechanisms. Such mechanisms include indexes or scans on subsets of the table, such as the partitions of DB2.RTM. 3.1 described in the above-identified manual or the ROWID predicates of the Oracle relational database manager as described in Proceedings 1993, cited above. 5. The QP chooses as the table to be split the first table in the list L.sub.1, L.sub.2, . . . L.sub.K that also permits partitioned access (step 1210). This table is called T.sub.j. 6. Each such T.sub.j will have a means of expressing the partitioned access request. A preferred means is to capture this partitioned access request by use of a predicate or predicates, denoted by P.sub.j that can be attached to the original request in order to narrow that request to a single partition of T.sub.j. 7. The QP determines, using mechanisms that do not form part of this invention, how many partitions T.sub.j should be split into, and how to describe the partition scope (step 1212). Such mechanisms might include a predetermined number from the QP or database instance catalogs, the number of working processing units, the number of database instances available, or a computation based on current system load across all of the processing units or database instances. Once selected, the number of partitions for T.sub.j is denoted as M.sub.j. (M.sub.j corresponds to MP in the discussion further above.) 8. Given T.sub.j and M.sub.j, the QP determines the scope of each partition, using mechanisms that do not form part of this invention (step 1214). One such mechanism might be a list of value ranges for a partitioning index on table T.sub.j in the database or QP catalog; another might be ranges of row IDs defined by an equal number of rows per M.sub.j partitions. The M.sub.j instances of partition scope will be denoted as P.sub.j,i, i=1 . . . M.sub.j. Using as an example the query given in the background section, the EMPLOYEE table might have partitioned access using the DEPTNO attribute, with every 10 values in a different partition scope. In this example, then, the partitioned access request P.sub.j would be EMPLOYEE.DEPTNO BETWEEN 1 AND 10, and EMPLOYEE.DEPTNO BETWEEN 11 and 20, etc. 9. QP then makes M.sub.j copies of the original request R (step 1216) and attaches to the ith copy a WHERE clause with the conjunction of the original predicates if any plus the P.sub.j,i associated with the ith partition scope of T.sub.j (step 1218). The result of this attachment is denoted the modified request R.sub.i. Using as an example the query given in the background section, the resulting R.sub.1 request will have the form: SELECT NAME, SALARY, DEPTNO FROM EMPLOYEE, DEPT WHERE EMPLOYEE.ENO=DEPT.MGRENO AND EMPLOYEE.DEPTNO BETWEEN 1 AND 10 ORDER BY 3 Requests R.sub.2 through R.sub.Mj will have a similar form, with different values for the scope of the partition (e.g., 11 through 20). The above steps 1-9 are performed in the partition phase 520 (FIGS. 5 and 7) of query processing. 10. Each of the M.sub.j modified requests is sent to a processing unit 104 for execution using mechanisms that do not form part of this invention (step 1220). This step is performed in the transfer phase 530 (FIGS. 5 and 8) of query processing. 11. Following the join phase 540 (FIGS. 5 and 9), as each of the M.sub.j modified requests returns its result to the QP, the QP performs further activities to merge the results back into a form expected by the requester (step 1222). Such further activities could involve sorting, aggregation, and other processing, for example. Further modifications can be made to the above procedure to adjust for the choice of access path or overall execution plan for accessing the requested data. For example, an examination after step 9 of the resulting plans for each of the M.sub.j requests may indicate that an unfavorable access path was selected. Or an analysis of the overall costs of each R.sub.i may indicate that T.sub.j was not a suitable candidate table for splitting; for example, a heuristic that the sum of the overall costs of R.sub.1, R.sub.2, . . . R.sub.Mj is more than 2 times the overall cost of the original request R could be used. In these cases, T.sub.j may be rejected as the table to use for partitioning, and list L may be modified to include only those L.sub.i 's that follow the L.sub.i that represented T.sub.j. Then the process of selecting another table for splitting is begun, starting at step 3 above. Unlike other possible mechanisms, the procedure described above exploits cost-based knowledge about the request. Furthermore, the procedure allows for the possibility of using the database itself in calculating the costs of the alternative tables in the original request, which will generally provide a better choice of table for splitting than such other mechanisms which are independent of the database engine. The procedure is capable of using a variety of partitioning access mechanisms, such as indexes or storage partitions, not simply row IDs. Because the procedure allows value-based partition ranges, which may include values in multiple attributes, it is more flexible, allowing more opportunity for optimization and therefore potentially better performance. Because the procedure expresses the split requests in terms of possibly complex predicates that are more nonprocedural than row IDs and less dependent on the physical storage of data by the underlying database instances, the procedure will allow for further transformations and processing, including but not limited to further partitioning and parallelizing and/or distributing the processing of each R.sub.i modified request instance. Furthermore, using value-based predicates has an inherent advantage over physical addressing schemes such as those that split requests based on row IDs. This is because the use of value-based predicates can be combined with language processors that perform transitive closure on predicates to result in splitting requests on possibly more than one table at a time without an exponential number of resulting requests. For example, in a join request that links 4 tables on the same attribute (e.g., t1.x=t2.x and t2.x=t3.x and t3.x=t4.x) the work on all four tables will be split if the request is split on the x attribute of any of the tables. Conclusion Various modifications may be made to the system described above, as described, for example, in the concurrently filed application entitled "Task Scheduler for a Multiprocessor System", now U.S. Pat. No. 5,437,032. Thus, as already noted, each of the elements denominated as a "processor" may be a tightly coupled processor complex rather than a uniprocessor. Further, each processor (uniprocessor or processor complex) may support a plurality of concurrently executing tasks rather than only a single task as described above. In such a case, each processor would be initially assigned a plurality of tasks, in accordance with the desired multiprogramming level, rather than only a single task as described above, and would be assigned new tasks as necessary to maintain such desired multiprogramming level. In addition, in assigning awaiting tasks, other factors such as the "affinity" of a task for a particular processor may be taken into account. Also, in a system in which multiple queries are being processed concurrently, tasks from different queries might be prioritized and executed concurrently in accordance with a desired scheme for ensuring "fairness" and avoiding undue starvation of low-priority tasks. Still other modifications will be apparent to those skilled in the art.
|
Same subclass Same class Consider this |
||||||||||
