System for preventing periodic load balancing if processor associated with lightest local run queue has benefited from idle processor load balancing within a determined time period6993767Abstract An apparatus and methods for periodic load balancing in a multiple run queue system are provided. The apparatus includes a controller, memory, initial load balancing device, idle load balancing device, periodic load balancing device, and starvation load balancing device. The apparatus performs initial load balancing, idle load balancing, periodic load balancing and starvation load balancing to ensure that the workloads for the processors of the system are optimally balanced. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
If a local run queue meeting these criteria is found, the dispatcher 150 attempts to steal an unbound thread from that local run queue. A thread is stolen from the local run queue after obtaining the selected local run queue's lock. If the local run queue's lock cannot be obtained immediately, repeated attempts are not made. If the local run queue's lock is obtained, the dispatcher 150 verifies that an unbound thread is still available and the unbound thread with the most favored priority is chosen. The thread is stolen from the local run queue by obtaining the thread's lock and changing the thread's run queue pointer to the run queue pointer for the local run queue assigned to the potentially idle CPU. Again, if the thread's lock is not obtained immediately, the steal attempt is abandoned. If the thread's lock is obtained and the thread is stolen, the stolen thread is then immediately processed by the CPU and is not actually queued in the local run queue of the potentially idle CPU. This result follows naturally after the stolen thread has completed a dispatch cycle, assuming typical behavior. Idle load balancing is constrained by the node's steal threshold. The steal threshold is a fraction of the smoothed average load factor on all the local run queues in the node. This load factor is determined by sampling the number of threads on each local run queue at every clock cycle. For example, if the load factors of the CPUs is 5, 15 and 16 over a period of time, the smoothed average load factor might be 12. The steal threshold may be, for example, ¼ of the smoothed average load factor and thus, may be 3. The steal threshold (¼ in this example) is actually a tunable value. Accordingly, the local run queue from which threads are to be stolen must have more than 3 threads in the local run queue, at least one of which must be an unbound thread and thus, stealable. The local run queue must also have the largest number of threads of all of the local run queues and must not have had a maximum number of threads stolen from it over the current clock cycle. As an example of the above method, consider the node shown in FIG. 4 As shown in FIG. 4, CPU 420 is becoming idle and its associated local run queue 472 and global run queue have no assigned threads. Thus, the idle CPU 420 attempts to steal a thread from another local run queue 471, 473-476. Taking the above steal criteria into consideration, the local run queue satisfying the above criteria is local run queue 474. This is because local run queue 474 has the most threads of all of the local run queues 471-476 (5 threads). The local run queue 474 contains at least one unbound thread (this is assumed). The local run queue 474 has not reached its maximum number of stolen threads limit (this is also assumed). The local run queue 474 contains more threads than the node's current steal threshold assuming that the current local run queue workloads represent the average load factors of the local run queues. The steal threshold for the node 400 is currently approximately 1 and the local run queue 474 has 5 assigned threads. Thus, the local run queue 474 meets all of the above steal criteria. Hence, the first unbound thread in local run queue 474 is stolen and its run queue pointer reassigned to local run queue 472. Periodic Load Balancing Periodic load balancing is performed every N clock cycles and attempts to balance the workloads of the local run queues of a node in a manner similar to that of idle load balancing. However, periodic load balancing is performed when, in general, all the CPUs have been 100% busy. Periodic load balancing involves scanning a node's local run queues to identify the local run queues having the largest and smallest number of assigned threads on average, i.e., the local run queues with the highest and lowest load averages, hereafter referred to as the heaviest and lightest local run queues, respectively. If the lightest local run queue has stolen a thread through idle load balancing in the last N clock cycles, periodic load balancing may not performed. This is because periodic load balancing is directed to addressing the situation where idle load balancing is not occurring and all of the node's CPUs are busy. In addition, this prevents a local run queue that has benefited from idle load balancing from being locked for two consecutive cycles. If the difference in load factors between the heaviest and lightest local run queues is above a determined threshold, such as 1.5 for example, periodic load balancing may be performed. If the difference is less than the threshold, it is determined that the workloads of the CPUs are well balanced and periodic load balancing is not performed. If periodic load balancing is to be performed, the dispatcher 150 acquires the heaviest local run queue's lock. In this case, if the lock is not acquired immediately, the dispatcher 150 will make repeated attempts to acquire the local run queue's lock, i.e. the dispatcher 150 will spin on the local run queue's lock. Once the local run queue's lock is obtained, the dispatcher 150 scans the local run queue for an unbound thread to steal. The scan for stealable unbound threads starts at threads having a medium priority in order to increase the likelihood of stealing a thread that will use enough CPU time to have an impact on the system performance and also to leave high priority threads with their original CPUs. The thread is then stolen in the same manner as described above. As an example of periodic load balancing, consider the node 500 shown in FIG. 5. As shown in FIG. 5, each of the CPUs 510-560 are busy with dispatching threads in their respective local run queues 571-576. However, the workloads among the CPUs 510-560 are not balanced. Periodic load balancing finds the heaviest and lightest local run queues, which in this case are local run queues 574 and 572, for example. Assume that the load factor for local run queue 574 is 4 and the load factor for local run queue 572 is 1. The difference between the load factors is 3 which is higher than 1.5 indicating that the workloads of the local run queues 571-576 are not balanced. Accordingly, the dispatcher 150 obtains the lock for local run queues 574 and 572 and steals the first unbound thread in local run queue 574 and places it in local run queue 572. In order to avoid having to hold two local run queue 572 and 574 locks at the same time, the stolen thread may be temporarily dequeued and placed in a temporary queue (not shown). The lock on the local run queue 574 may then be released and the lock for the local run queue 572 acquired. The thread may then be requeued in local run queue 572. Starvation Load Balancing Starvation Load Balancing is directed to moving unbound threads which have not been dispatched within a predetermined period of time to a global run queue. In this way, undispatched threads from local run queues may be moved to the global run queue where there is a greater likelihood that they will be assigned to a local run queue for a CPU that may be able to dispatch them. With the starvation load balancing method, each thread is time stamped when it is assigned to a local run queue. At periodic intervals, the dispatcher 150 scans each of the threads in the system to find unbound threads that have been pending on a local run queue for greater than a threshold time amount, for example, greater than 1.5 seconds. If the dispatcher 150 finds any unbound threads meeting this criteria, the dispatcher 150 steals the thread from the local run queue and places it in the global run queue for the node. In this way, the thread will be dispatched by the next available CPU in the node, priority permitting. Thus, a low priority thread that may not be dispatched due to higher priority threads in one local run queue, may be requeued to a less busy local run queue and will have a greater likelihood of being dispatched. In addition, by moving threads that are not being dispatched to the global run queue, there is a greater likelihood that load balancing will achieve the desired effect. For example, if a local run queue has a large number of undispatched threads, load balancing will tend to cause dispatching threads to be placed in other local run queues. By removing the undispatched threads to the global run queue, dispatching threads will be spread more evenly among the local run queues. As an example of starvation load balancing, consider the node 600 in FIG. 6. As shown in FIG. 6, the local run queue 671 includes an unbound thread that has not been dispatched within a threshold amount of time. This unbound thread is located by the dispatcher 150 by scanning the threads of the system, in a single operation, for unbound threads in each of the local run queues 671-676 having time stamps that indicate they have been pending in the local run queue for a time longer than the threshold amount of time. Once the unbound thread is located, the dispatcher 150 obtains the lock for the local run queue 671 and steals the thread from the local run queue 671 and places it in the global run queue 681. The next available CPU 610-660 allowed to service a thread at the given thread's priority will dispatch the thread, after which it will be assigned to that local run queue 671-676. Thus, the present invention makes use of initial, idle, periodic and starvation load balancing to achieve an optimum load balance among CPU resources. In this way, CPU resources may be equally utilized and the overall throughput of the system may be increased substantially. FIG. 7 is an exemplary block diagram of the dispatcher 150 of FIG. 1. As described above, the dispatcher 150 is depicted as a centralized device. However, the invention may be implemented using a distributed dispatcher 150 where, for example, each node or group of nodes has a separate associated dispatcher 150. Furthermore, each CPU may have an associated dispatcher 150. In such an embodiment, certain load balancing functions may be performed by the dispatchers 150 of each CPU while others may be performed by only certain ones of the dispatchers 150. For example, each dispatcher 150 associated with each CPU may perform idle load balancing when the CPU becomes idle, whereas only the dispatcher 150 associated with a master CPU in a node (usually the lowest numbered CPU) may perform periodic load balancing and starvation load balancing. As shown in FIG. 7, the dispatcher 150 includes a controller 700, a memory 710, an initial load balancing device 730, an idle load balancing device 740, a periodic load balancing device 750, and a starvation load balancing device 760. These elements 700-760 communicate with one another via the signal/control bus 770. Although a bus architecture is shown in FIG. 7, the invention is not limited to such an architecture. Rather, any type of architecture that allows for communication among the elements 700-750 is intended to be within the spirit and scope of the present invention. The controller 700 controls the operation of the dispatcher 150 based on, for example, control programs stored in the memory 710. The controller 700 transmits and receives information to and from the nodes via the MP system interface 720. The controller 700 utilizes the initial load balancing device 730 to perform initial load balancing in the manner described above when new threads are generated by a process in the MP system 100. The controller 700 utilizes the idle load balancing device 740 to perform idle load balancing in the manner described above when information is received from a node that a CPU in the node is about to become idle. The controller 700 utilizes the periodic load balancing device 750 to perform periodic load balancing in the manner described above. The starvation load balancing device 760 is utilized to perform starvation load balancing also in the manner described above. The initial load balancing device 730, idle load balancing device 740, periodic load balancing device 750, and starvation load balancing device 760 may be, for example, programmed microprocessor devices or microcontroller and peripheral integrated circuit elements, an Application Specific Integrated Circuit (ASIC) or other integrated circuit, a hardware electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA or PAL, or the like. In short, any device capable of performing the functions described above and illustrated in the flowcharts of FIGS. 8-11, described hereafter, may be used without departing from the spirit and scope of the present invention. FIG. 8 is a flowchart outlining an exemplary operation of the dispatcher 150 when performing initial load balancing. The operation starts with the controller 700 receiving a new thread to be dispatched by a CPU (step 810). The controller 700 then determines if the new thread is a bound or unbound thread (step 820). This may be performed by reading attribute information associated with the thread indicating whether or not the thread is bound to a particular CPU or is unbound. If the thread is bound (step 820:YES), the controller 700 places the new thread in the local run queue associated with the bound CPU (step 830). If the new thread is unbound (step 820:NO), the controller 700 instructs the initial load balancing device 730 to perform initial load balancing. The initial load balancing device 730 determines if the new thread is part of an existing process (step 840). This may also be performed by reading attribute information associated with the thread. If the new thread is part of an existing process (step 840:YES), the initial load balancing device 730 performs a round robin search of the CPUs of the node to which the other threads from the existing process were assigned (step 850) looking for an idle CPU. If the new thread is not part of an existing process (step 840:NO), the initial load balancing device 730 performs a round robin search of all nodes and CPUs for an idle CPU (step 860). The initial load balancing device 730 determines whether or not an idle CPU is found (step 870) and places the new thread in the local run queue of the idle CPU if one is found (step 890). If an idle CPU is not found, the initial load balancing device 730 places the new thread in the global run queue (step 880). If the new thread is part of an existing process, the global run queue to which the new thread is added is the global run queue for the node to which the other threads of the existing process, or the thread which created the current thread, were assigned. If the new thread is not part of an existing process, the global run queue to which the new thread is added is the global run queue preferred based on, for example, a round robin search, although other load placement approaches may be used instead of the round robin search. This is generally the global run queue with the least number of threads. FIG. 9 is a flowchart outlining an exemplary operation of the dispatcher 150 when performing idle load balancing. As shown in FIG. 9, the operation starts when the controller 700 instructs the idle load balancing device 740 to perform idle load balancing. Accordingly, the idle load balancing device 740 scans the local run queues of the node of the potentially idle CPU looking for a local run queue meeting the above described idle load balancing criteria (step 910). If a local run queue meeting the idle load balancing criteria is found (step 920:YES), the idle load balancing device 740 steals a thread from the local run queue meeting the criteria (step 940). If a local run queue meeting the idle load balancing criteria is not found (step 920:NO), the idle load balancing device 740 allows the CPU to go idle (step 930). FIG. 10 is an outline of an exemplary operation of the dispatcher 150 when performing periodic load balancing. As shown in FIG. 10, the operation starts when the controller 700 instructs the periodic load balancing device 750 to initiate periodic load balancing (step 1010). This may be performed, for example, based on a periodic timing of the operation. The periodic load balancing device 750 identifies the heaviest and lightest loaded local run queues and determines the load factors for the heaviest and lightest loaded local run queues (step 1020). The periodic load balancing device 750 then determines if the lightest loaded local run queue has benefited from idle load balancing in the previous clock cycle (step 1030). This may be performed by determining the current setting of a flag in the internal structure representing the local run queue. If the lightest loaded local run queue did benefit from idle load balancing in the previous clock cycle (step 1030:YES), periodic load balancing is not performed (step 1070). If the lightest loaded local run queue did not benefit from idle load balancing in the previous clock cycle (step 1030:NO), the periodic load balancing device 750 determines the difference between these load factors (step 1040) and determines if the difference is higher than a threshold amount (step 1050). If the difference between the load factors is higher than a threshold amount (step 1050:YES), the periodic load balancing device 750 steals an unbound thread from the heaviest loaded local run queue and places it in the lightest loaded local run queue (step 1060). If the difference between the load factors is not higher than the threshold amount (step 1050:NO), the system is well balanced and load balancing is not performed (step 1070). FIG. 11 is a flowchart outlining an exemplary operation of the dispatcher 150 when performing starvation load balancing. As shown in FIG. 11, the operation starts when the controller 700 instructs the starvation load balancing device 760 to perform starvation load balancing (step 1110). This may be performed, for example, based on a periodic timing of the operation. The starvation load balancing device 760 scans each of the threads in the system for an unbound thread (step 1120). The starvation load balancing device 760 determines the time stamp for the unbound thread (step 1130) and determines if the time stamp indicates that the unbound thread has been pending in a local run queue for longer than a threshold amount of time (step 1140). If the unbound thread has been pending for longer than the threshold amount of time (step 1140:YES), the starvation load balancing device 760 requeues the unbound thread to the global run queue of the node containing the thread's local run queue. If the unbound thread has not been pending for longer than the threshold amount of time (step 1140:NO), then the unbound thread is left in the local run queue. The starvation load balancing device 760 then determines if there are more threads to search and if so (step 1160:YES), performs the operation repeatedly (steps 1120-1160). If there are no more threads to be searched (step 1160:NO), the operation is ended. With the present invention, load balancing is achieved in a multiple run queue system by using both global and local run queues. Initial load balancing, idle load balancing, periodic load balancing, and starvation load balancing are performed in conjunction with one another to ensure optimum load balancing among the local run queues. Fixed Priority Threads Under certain conditions, threads must be dispatched in a fixed priority order. For example, the in AIX (Advanced Interactive eXecutive) operating system, POSIX compliant processes require that the threads be dispatched in strict priority order. In a multiple run queue system, such as that of the prior art, dispatching threads in strict priority order may not be performed or may require that all of the threads be dispatched to a single CPU. The present invention avoids this problem by assigning all fixed priority threads, such as POSIX-compliant fixed priority threads, to the global run queue for the first node 120, for example, of the MP system 110. In this way, the threads are guaranteed to be dispatched in strict priority order because the threads are present in a single global run queue and not distributed among a plurality of local run queues. Automatically assigning fixed priority threads to a global run queue eliminates the benefits obtained by cache affinity since the next CPU that becomes available to dispatch a thread of that priority level will dispatch the next thread in the global run queue. Thus, regardless of possible cache affinity benefits, the fixed priority threads are assigned to whichever CPU becomes available first. However, the benefits of dispatching the fixed priority threads in strict priority order and dispatching them quickly by the next available CPU will tend to offset the loss in cache affinity benefits. The assumption is that fixed priority threads are highly favored threads, and that it is preferable to execute them as soon as possible. In order to identify the fixed priority threads, the threads must have attribute information that includes a fixed priority flag, such as a POSIX-compliant flag, that may be set when the thread is to be treated as a fixed priority thread. When this flag is set, the dispatcher 150 will assign the thread to the global run queue for the first node 120 of the MP system 110. Then, because each CPU services the global run queue, the CPUs associated with the node will dispatch the threads in strict priority order as the CPUs become available to dispatch the threads. In this way, fixed priority threads, such as POSIX compliant threads, may be utilized with the multiple run queue system according to this invention. It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media such a floppy disc, a hard disk drive, a RAM, and CD-ROMs and transmission-type media such as digital and analog communications links. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
|
Same subclass Same class Consider this |
||||||||||
