Task scheduler for a miltiprocessor system5437032Abstract A task scheduler for use in a multiprocessor, multitasking system in which a plurality of processor complexes, each containing one or more processors, concurrently execute tasks into which jobs such as database queries are divided. A desired level of concurrent task activity, such as the maximum number of tasks that can be executed concurrently without queuing of tasks, is defined for each processor complex. Each job is assigned a weight in accordance with the external priority accorded to the job. For each job there is defined a desired level of concurrent; task activity that is proportional to its share of the total weight assigned to all concurrently executing jobs. The jobs are prioritized for execution of awaiting tasks in accordance with the discrepancy between the desired level of multitasking activity and the actual level of multitasking activity for each job. Awaiting tasks are preferentially scheduled from jobs with the largest discrepancy between the desired and actual levels of concurrent task activity and are preferentially assigned to the processor complexes with the largest discrepancy between the desired and actual levels of concurrent task activity. The scheduler attempts to assign each task to a processor for which the task has an affinity or at least neutrality in terms of relative execution speed. Claims What is claimed is: Description CROSS-REFERENCE TO RELATED APPLICATIONS
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
4 ORDER BY NAME
______________________________________
This type of query is known as a join, and the NAME column of table 300 (EMPLOYEE) and the MGR column of table 400 (DEPT) are known as the join columns. As described in copending application Ser. No. 08/148,769 (KI9-93-012), this query may be partitioned into independent tasks restricted to particular portions of the tables. Thus, a first such task (task 1) might be limited to those tuples in tables 300 (EMPLOYEE) and 400 (DEPT) whose values in the join columns (NAME, MGR) begin with the letter A:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND A .ltoreq. NAME < B
3b AND A .ltoreq. MGR < B
4 ORDER BY NAME
______________________________________
Task 1 is equivalent to the original query as performed on subsets (or partitions) of the EMPLOYEE and DEPT tables 300 and 400 in which NAME and MGR begin with the letter A. Other tasks created from the original query would be similar in form to task 1, except that they would be directed to different parts of the EMPLOYEE and DEPT tables. Thus, task 2 might be:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND B .ltoreq. NAME < C
3b AND B .ltoreq. MGR < C
4 ORDER BY NAME
______________________________________
while task 26 might be:
______________________________________
1 SELECT NAME, EMPLOYEE.DEPT, SALARY
2 FROM EMPLOYEE, DEPT
3 WHERE NAME = MGR
3a AND NAME .gtoreq. Z
3b AND MGR .gtoreq. Z
4 ORDER BY NAME
______________________________________
It may be readily verified that these tasks are mutually exclusive and that their results, when merged, replicate the results that would be obtained by performing the original unpartitioned query. In the partitioning scheme described in copending application Ser. No. 08/148,769 (KI9-93-012), an original join query such as the one illustrated above is partitioned into multiple independent tasks by first partitioning the join column domain of each relation forming part of the query into P equal-size ranges (where P is the number of processor complexes). Each range is in turn partitioned into M subranges of progressively decreasing extent to create a corresponding number of tasks of progressively decreasing estimated task time, resulting in MP tasks from the original query. However, other partitioning schemes could be used to divide a query into independent tasks. Also, as noted above, the jobs that the subject of this specification may be other than database queries. Each CEC p (102) has an actual multiprogramming level (MPL), defined as the number of tasks 204 concurrently executing on that CEC, of m(p). The maximum allowable MPL for CEC p, beyond which level tasks 204 are no longer assigned to the CEC, is M(p). This level M(p) can be set at the maximum number of concurrent tasks before tasks 204 begin to queue. However, if there exists a job 202 whose only remaining active tasks 204 are on a particular CEC 102, the level M(p) for that CEC can be artificially reduced dynamically in an effort to fence off that CEC, inducing a swifter completion of the job. To accomplish this, the desired MPL M(p) for the complex p is set at or below the actual MPL m(p), so that the complex p is not counted as an available complex to which processors may be assigned. The task scheduler of the present invention assigns awaiting tasks 204 making up the jobs 202 to available complexes 102. The following key notation, in addition to that already introduced, is employed: The time of the current invocation of the scheduler is TIME. The time of the last invocation of the scheduler is OTIME. U(i,j,d) is the number of accesses made to DASD d by task i of job j. The access level at DASD d (the number of accesses to DASD d made by all tasks currently executing in the system) is n(d). The maximum allowable access level at DASD d is N(d). Any one of a number of methods may be used to approximate the number of accesses made to a DASD by a task if more precise information is unavailable. Thus, one measure of the number of accesses made to a DASD by a task is the number of tables relevant'to the task that reside on that DASD. This type of information is commonly maintained by database managers, such as the IBM DATABASE 2.TM. (DB2.RTM.) relational database manager, in database catalogs that are available to the scheduler. (DATABASE 2 is a trademark, and IBM and DB2 are registered trademarks, of IBM Corporation.) S(j,i) indicates the assignment and completion status of task i of job j. S(j,i)=p>0 means that task i of job j is currently active on CEC p. S(j,i)=-p<0 means that task i of job j was previously active on CEC p but has completed. S(j,i)=0 means that task i of job j has not yet been assigned a CEC. ARRIVE(j) is the arrival time of job j. FRACTION(j) is the cumulated "fraction of the action" (a specific measure of concurrent task activity to be defined below) for job j between ARRIVE(j) and TIME. GOAL(j) is the current desired fraction of the action for job j, a function of the priority P(j) and the current time TIME. In the present embodiment, GOAL(j) is computed by assuming that a job of priority k should have a fraction of the action goal proportional to 1/k. This tends to ensure that a priority 1 job gets k times as much action as a priority k job, and that jobs of the same priority get a comparable amount of action. While this method is assumed for concreteness, it is important to note that the actual method for computing GOAL(j) is independent of the scheduling procedure itself. The scheduling procedure comprises four routines: MAIN, BOOKKEEP, SEARCH and D.sub.-- N.sub.-- A (disaffinity/neutrality/affinity). The main routine (MAIN) essentially acts as a switch which keeps calling itself until no further tasks are assigned to CECs. The routine is as follows:
TABLE 1
______________________________________
Main Routine
______________________________________
500 Procedure: MAIN
501 Set SUCCESS = 1
502 While SUCCESS = 1 do {
503 Call BOOKKEEP
504 Call SEARCH
505 }
506 End MAIN
______________________________________
Upon being invoked, the MAIN routine first sets the flag SUCCESS to 1 (step 501). The routine then alternatingly calls the BOOKKEEP and SEARCH routines (steps 503-504) until SUCCESS is reset to zero in the SEARCH routine upon the failure to assign a task 204 to a CEC 102. Upon such event, the MAIN routine suspends (step 506), to be restarted (step 500) whenever a task 204 completes or a new job 202 is submitted to the system 100. The bookkeeping routine (BOOKKEEP) is used in preparation for the searching routine, which does the actual task-to-CEC assignment. The bookkeeping routine is as follows:
TABLE 2
__________________________________________________________________________
Bookkeeping Routine
__________________________________________________________________________
600
Procedure: BOOKKEEP
601 Set TOTAL1 = 0
602 Set TOTAL2 = 0
603 For all j = 1 to J do {
604 Set ACTIVE(j) = 0
605 For all i = 1 to T(j) do {
606 If S(j,i) > 0 then do {
607 ACTIVE(j)++
608 TOTAL1++
609 }
610 }
611 Increment TOTAL2 by 1/P(j)
612 }
613 For all j = 1 to J do {
614 If TIME = ARRIVE(j) then set FRACTION(j) = 0
615 Else set FRACTION(j) = ((OTIME - ARRIVE(j)) .times.
FRACTION(j) + (TIME - OTIME) .times.
(ACTIVE(j)/TOTAL1))/(TIME - ARRIVE(j))
616 Set GOAL(j) = 1/(P(j) .times. TOTAL2)
617 }
618 For all p = 1 to P do {
619 Update M(p)
620 }
621 Update S, n, m
622 Set OTIME = TIME
623 Determine the number P.sub.-- POS of CECs p for which M(p) -
m(p) > 0
624 Index these P.sub.-- POS CECs by decreasing values of M(p) -
m(p)
625 Index the J jobs by increasing values of FRACTION(j) -
GOAL(j)
626 For all j = 1 to J do {
627 Index the T(j) tasks in job j by decreasing values
of t(j,i) except for that task, if any, required
to have index 1
628 }
629
End BOOKKEEP
__________________________________________________________________________
Upon being invoked, the bookkeeping routine first calculates the quantities TOTAL1 and TOTAL2, as well as ACTIVE(j) for each job currently being executed (steps 601-612). In this routine TOTAL1 represents the total number of tasks from all jobs currently being executed, while TOTAL2 represents the combined weights ##EQU1## accorded to the various jobs being processed. ACTIVE(j) represents the number of tasks of a job j that are currently being executed. To calculate these quantities, the routine first initializes quantities TOTAL1 and TOTAL2 at zero (steps 601-602), after which it enters a loop (603-612) which is iterated for each job j. On each iteration of this loop, the routine first sets ACTIVE(j) equal to zero (step 604). The routine then examines the assignment value S(j,i) for each task of job j (step 606), incrementing ACTIVE(j) and TOTAL1 for each task currently being executed (steps 607-608), as indicated by an assignment value S(j,i)>0. Finally, on each iteration of the loop (603-612) the routine increments TOTAL2 by the weight 1/P(j) accorded job j (step 611). The bookkeeping routine next recalculates the current fraction of the action FRACTION(j) and the desired fraction of the action GOAL(j) for each job j currently being executed (steps 613-617). If a job has just arrived (TIME=ARRIVE(j)), its cumulated fraction of the action FRACTION(j) is set at zero (step 614). Otherwise, the routine recalculates FRACTION(j) in accordance with the formula (step 615): ##EQU2## This calculation consists essentially of determining the fraction of the action ACTIVE (j)/TOTAL1 for the current time interval (OTIME-TIME) and weighting that fraction and the previous cumulative fraction (for the time interval ARRIVE(j)-OTIME) accordance with the respective durations of the time intervals. FRACTION(j) thus represents the fraction of all tasks currently being executed belonging to job j, averaged over the time job j has existed on the system 100. For each job j, the BOOKKEEP routine next sets the desired fraction of the action GOAL(j) in accordance with the formula (step 616): GOAL (j)=1/(P(j).times.TOTAL2) As may be seen from this formula, the desired fraction of the action GOAL(j) for a job j is a function of the external priority P(j) assigned to that job and is generally inversely proportional to the total number of jobs J in the system 100 (since TOTAL2 is generally inversely proportional to the total number of jobs J). GOAL(j) thus represents the weighted share (as determined by the external priority) of the total available processing resources allotted to job j. The bookkeeping routine then updates the maximum allowable MPL M(p) for each CEC p (steps 618-620). This update may include reducing the desired MPL for a selected CEC p to fence off that CEC and thereby speed the completion of tasks executing on that CEC, as described above. The routine thereafter updates S(j,i), n(d) and m(p) to account for newly assigned and newly completed tasks (step 621), and sets OTIME equal to the current value of TIME (step 622). The routine then determines the number P.sub.-- POS of CECs p that are operating at an MPL m(p) that is less than the maximum allowable MPL M(p) (step 623); these CECs are currently underutilized and can be assigned additional tasks. The routine prioritizes these CECs, by indexing them by decreasing values of M(p)-m(p), so that the most underutilized CEC (as measured by this criterion) has an index of 1, the next most underutilized CEC has an index of 2, and so forth, with the least underutilized CEC having an index of P.sub.-- POS (step 624). The routine thereafter prioritizes the J outstanding jobs, indexing them by increasing values of FRACTION(j)-GOAL(j), so that the most underserved job (as measured by this criterion) has an index of 1, while the most overserved job has an index of J (step 625). The routine then prioritizes the T(j) tasks in each job, by indexing the tasks by decreasing values of estimated task time t(j,i), so that the task with the largest estimated completion time has an index of 1, while the task with the smallest estimated completion time has an index of T(j) (step 626-628). This step of the procedure may be modified to give a particular task an index of 1, regardless of its estimated completion time; this might be desired in a database query application, for example, if a particular task results in the return of a first tuple. Finally, the BOOKKEEP routine terminates (step 629) and returns to the MAIN routine, which immediately thereafter calls the SEARCH routine (step 504). The searching routine (SEARCH) performs the actual task-to-CEC assignment. Basically, this routine searches through the jobs for a task with affinity for an underutilized CEC, or barring that, for a task which is neutral to an underutilized CEC, if any. The search ordering is on job (outermost loop), task within job (middle loop), and underutilized CEC (innermost loop). The search routine is as follows:
TABLE 3
______________________________________
Search Routine
______________________________________
700 Procedure: SEARCH
701 For all j = 1, . . . ,J do {
702 Set BEST.sub.-- I = 0
703 For all i = 1, . . . ,T(j) do {
704 If S(j,i) = 0 {
705 For all p = 1, . . . ,P.sub.-- POS do {
706 Switch (D.sub.-- N.sub.-- A(j,i,p)) {
707 Case 1: Set S(j,i) = p
708 Return
709 Case 0: If BEST.sub.-- I = 0 then do {
710 Set BEST.sub.-- I = i
711 Set BEST.sub.- P = p
712 }
713 }
714 }
715 }
716 }
717 If BEST.sub.-- I > 0 {
718 Set S(j,BEST.sub.-- I) = BEST.sub.-- P
719 Return
720 }
721 }
722 Set SUCCESS = 0
723 End SEARCH
______________________________________
In the SEARCH routine, all jobs are scanned in the order of their prioritization, and all tasks within jobs in the order of their prioritization, in an attempt to find an awaiting task with an affinity or, failing that, at least a neutrality for an available CEC. For each job, prior to scanning the tasks of that job, the quantity BEST.sub.-- I is set to zero (702); BEST.sub.-- I is a pointer used to identify a task having a neutrality for an available CEC, if such a task is found. When a task is scanned, the quantity S(j,i) corresponding to that task is first examined to determine whether the task is currently awaiting assignment to a CEC complex (703). If the task has already been assigned (S(j,i).noteq.0), the routine proceeds to the next task. Otherwise, if the task is awaiting assignment (S(j,i)=0) (704), the affinity value D.sub.-- N.sub.-- A(j,i,p) for each CEC p is examined in turn, in the order of their prioritization (by calling the D.sub.-- N.sub.-- A routine to be described below), in an attempt to find a CEC for which the task has an affinity or at least a neutrality. If a CEC is found for which the task has an affinity, the task is assigned to that CEC by setting S(j,i)=p (707) and the SEARCH routine is exited (708). If a CEC has been found for which the task has neutrality, and if this is the first such instance for the job being scanned (BEST.sub.-- I=0) (709), then the task pointer BEST.sub.-- I is set equal to the task number i (710), while a CEC pointer BEST.sub.-- P is set equal to the CEC number p (711), to mark the task-CEC pair. If all of the tasks of a given job have been scanned without finding an awaiting task having an affinity for an available CEC, the quantity BEST.sub.-- I is examined to determine whether an awaiting task was found with a neutrality for an available CEC (717). If so, that task (BEST.sub.-- I) is assigned to the available CEC (BEST.sub.-- P) (718) and the SEARCH routine is exited (719). If all jobs have been scanned without finding an awaiting task having either an affinity or a neutrality for an available CEC, then the variable SUCCESS is set equal to zero to indicate that the search has been unsuccessful (722), and control returns to the MAIN routine (724). As noted above, resetting SUCCESS to zero causes the MAIN routine to suspend until it is resumed upon the completion of a task or upon the arrival of a new job j. The disaffinity/neutrality/affinity (D.sub.-- N.sub.-- A) routine is invoked by the search routine (step 706) to determine whether a task i (204) in job j (202) has affinity, neutrality or disaffinity for a CEC p (102). A task is said to have an affinity for a given processor if it is expected to execute significantly faster if assigned to that processor than it would have otherwise. Similarly, a task is said to have a disaffinity for a given processor if it is expected to execute significantly slower if assigned to that processor than it would have otherwise. Finally, a task is said to have a disaffinity for a given processor if it is expected to execute at roughly the same speed if assigned to that processor as it would have otherwise. Any one of a number of criteria, including projected input/output (I/O) contention, possibilities for buffer reuse, and other criteria, may be used to determine the relative affinity of a task 204 for a CEC 102; the particular criteria used do not as such form part of the present invention. In its most general form, the D.sub.-- N.sub.-- A routine is as follows:
TABLE 4
______________________________________
D.sub.-- N.sub.-- A Routine (Generalized Version)
______________________________________
800 Procedure: D.sub.-- N.sub.-- A(j,i,p)
801 Set VALUE = 1 if task i in job j has affinity for CEC
p
802 Set VALUE = 0 if task i in job j is neutral to CEC p
803 Set VALUE = -1 if task i in job j has disaffinity for
CEC p
804 Return (VALUE)
805 End D.sub.-- N.sub.-- A
______________________________________
As is evident from the above listing, the D.sub.-- N.sub.-- A routine returns (step 804) a value (VALUE) of -1, 0 or 1, depending on whether task i (204) of job j (202) has a disaffinity (step 803), neutrality (step 802) or affinity (step 801) for CEC p (102). For the sake of an example, a particular version of a D.sub.-- N.sub.-- A routine is given in the following listing:
TABLE 5
______________________________________
D.sub.-- N.sub.-- A Routine (Particular Version)
______________________________________
900 Procedure: D.sub.-- N.sub.-- A(j,i,p)
901 Set VALUE = 0
902 For all d = 1 to D do {
903 If U(i,j,d) > 0 then do {
904 If n(d) + U(i,j,d) > N(d) then return (-1)
905 For all ii = 1 to T(j) do {
906 If .vertline.S(j,ii).vertline. .noteq. p or U(ii,j,d) = 0 then
continue
907 If S(j,ii) = -p then set VALUE = 1
908 }
909 }
910 }
911 Return (VALUE)
912 End D.sub.-- N.sub.-- A
______________________________________
The D.sub.-- N.sub.-- A routine of Table 5, like the generalized version of Table 4, returns an affinity value (VALUE) of 1, 0 or -1, depending on whether the task 204 has an affinity, neutrality or disaffinity for the processor complex 102. Upon being called, the D.sub.-- N.sub.-- A routine initially sets VALUE to the default value 0 (901). The routine then enters an outer iteration loop (902-910) which it repeats for each DASD d of the system. On each iteration of the outer loop for a particular DASD d, the routine first determines whether the number of accesses U(i,j,d) made to DASD d by task i of job j is greater than zero (903). If so, then the routine continues the iteration for the current DASD d (903-909); otherwise, the routine proceeds to the iteration for the next DASD. If the routine continues the current iteration, it next determines whether the sum of the current access level n(d) at DASD d and the number of accesses U(i,j,d) made to DASD d by task i of job j is greater than the maximum allowable access level N(d) at DASD d; if so, the routine returns a VALUE of -1, indicating that the task i has a disaffinity (904). The returned affinity value of -1 causes task i not to be assigned, since it would result in an excessive number of accesses to DASD d. The disaffinity here is for the system 100 as a whole rather than for a particular processor complex 102, since the same excessive number of accesses to DASD d will occur, regardless of the processor complex to which the task is assigned. If the sum of the current access level n(d) at DASD d and the number of accesses U(i,j,d) made to DASD d by task i of job j is greater than the maximum allowable access level N(d) at DASD d, the routine enters an inner iteration loop (905-908) which it repeats for each task ii of job j. 0n each iteration of the inner loop for a particular task ii of job j, the routine first determines whether the task ii has been assigned to processor p, as well as whether the task ii accesses DASD d (906). If the task ii has not been assigned to processor p (i.e., .vertline.S(j,ii.vertline.).noteq.p), or if the task ii does not access DASD d (i.e., U(i,j,d)=0), then the routine proceeds to the iteration for the next task. Otherwise, the routine completes the iteration for the current task and determines whether the task ii has already been completed by processor p (907). If it has, then the routine sets VALUE equal to 1 to indicate that task i has an affinity for processor complex p. Task i is deemed to have an affinity for the processor complex p since another task ii from the same job j has completed execution on that complex, thereby creating the possibility for buffer reuse. Unlike the disaffinity case described above, this is truly a case of affinity for a particular processor complex p as distinguished from affinity for the system 100 as a whole. The routine does not return at this point, however, but proceeds to the next outer loop iteration, since VALUE can still be reset to -1 on a subsequent outer loop iteration at step 904. Upon completing the outer loop for all DASD (provided it does not end prematurely at step 904), the D.sub.-- N.sub.-- A routine returns the finally determined VALUE to the SEARCH routine (911). To recapitulate, the returned VALUE is normally 0 (neutrality), but is -1 (disaffinity) if the task makes an excessive number of accesses to a DASD (904), and is 1 (affinity) in the absence of an excessive number of requests to a particular DASD if any other task of the same job has already completed on the same processor complex. Various additional features can be added to this basic scheme. For example, one could use the total estimated task times assigned to the various systems as a further load balancing constraint. If a task assignment would yield system loads which were too unbalanced, such assignments could be inhibited. Similarly, one could employ a comparable mechanism to guard against job starvation, perhaps increasing the priority of jobs which have received too little processing time.
|
Same subclass Same class Consider this |
||||||||||
