Dynamic compilation control6993755Abstract Modern programming languages have stimulated work on systems that dynamically compile or optimize frequently executed portions of programs. In practice, such systems typically rely on ad hoc heuristics. For example, a system may optimize (or compile) some code once its execution count exceeds a given threshold. An analytical model has been developed that expresses performance of such a system. In one embodiment, the model is based on a bytecode frequency histogram, which indicates (for a given program) how many bytecodes run for how many times. It predicts that the optimal compilation threshold will occur where the hazard rate falls through the reciprocal of the break-even point, the number of times a compiled bytecode must be executed to recoup its compilation time. Based on the insight provided by the model, a dynamic compilation control technique has been developed. Claims What is claimed is: Description BACKGROUND
Our analysis will be based on bytecode frequencies. That is, the analysis will be based not the relative frequencies of different kinds of bytecodes, but rather the number of bytecodes contained in the program that run a given number of times in the course of one run of a program:
The (complementary) distribution as well as the density are usefully defined: ##EQU1## ##EQU2## ##EQU3## ##EQU4## Note that another embodiment of the model would use summations instead of integrals because a bytecode cannot execute half a time, but the common statistical practice of approximating discrete functions with continuous ones is followed herein. The number of executions of bytecodes that run at least a certain number of times: ##EQU5## ##EQU6## ##EQU7## ##EQU8## ##EQU9## In the following discussion regarding the analytical model, a mathematical model for two cases is presented: one with perfect knowledge in which a program's profile is known a priori, and one with imperfect knowledge in which the profile is unknown. Compilation Policy in an Idealized Scenario: Perfect Knowledge Suppose that each program came with a profile that indicated how many times each of its bytecodes would run. Then, the obvious compilation policy would be to first compile every bytecode that was going to run more than a certain number of times, and then to run the program, interpreting the infrequently running bytecodes and executing the (compiled) frequently running bytecodes. What would an optimal compilation threshold in this case be? We start with
This time will be the sum of three components: the time to compile the compiled bytecodes, the time to execute the compiled bytecodes, and the time to interpret the non-compiled bytecodes. So Ttot(x)=TC{overscore (F)}(x)+TEE(x)+T1[E(0)-E(x)]=T1E(0)-[(T1-T The second line above rearranges the total time to be a sum of the time it would take to interpret all executions, T1E(0), minus the net time saved by the compiler, (T1-TE)E(x)-TC{overscore (F)}(x). Rather than minimizing the total time, it will prove more convenient to maximize this latter term. We normalize it with respect to T1-TE, the execution time saved by one compiled bytecode execution: ##EQU10## To find the optimal compilation threshold, we set the first derivative of the speedup to zero (using our definitions for E(x) and {overscore (F)}(x) above), and will check for a negative second derivative: ##EQU11## Setting to zero: ##EQU12##
This inequality shows that of the two possibilities for a zero first derivative, ##EQU16## only the former condition yields an optimal speedup. Thus, ##EQU17## Note that the optimal threshold does not depend on the frequency distribution of the program. It is simply the ratio of the compilation time per bytecode to the time saved by executing a bytecode instead of interpreting it. In other words, it represents a break-even point; if this value is 10, then a compiled bytecode must execute ten times in order to recoup the investment in compilation time. It is convenient to denote this quantity with a new symbol: ##EQU18## ##EQU19## Suppose a very fast (and stupid) compiler takes two milliseconds to compile a bytecode and speeds up the execution time from one millisecond for interpretation to half a millisecond. Then, TC=2 ms, T1=1 ms, TE=0.5 ms. So ##EQU20## and the compiler should be used on any bytecode executing four times or more. On the other hand, a more sophisticated compiler may take 12,000 cycles to compile a bytecode, reducing the interpretation time of 1,000 cycles to an execution time of 2 cycles. Then, TC=12,000, T1=1,000, TE=2, and ##EQU21## So any bytecode that runs more than 12 times (the answer is slightly greater than 12) should be compiled. Although the former compiler will be used on more bytecodes, it may not yield the best performance. In order to understand this better, we next examine the components of the speedup. Total Speedup from Compilation (With Perfect Knowledge) Armed with the above result for the optimal compilation threshold, we can examine the speedup obtained with the optimal policy. Recall that T1E(0)-Ttot(x)|x=x The first factor is merely the per-bytecode efficacy of the compiler. Regarding the second factor, we can substitute the definitions of E and F and simplify to obtain: ##EQU22## This integral can be interpreted as the expected amount by which the bytecode frequency exceeds the break-even frequency for the population of compiled bytecodes. Accordingly, the characteristics of the program finally make themselves felt: a program with more high-frequency bytecodes will have a larger expected excess. Since the speedup is the product of the factors, a compiler that produces better code is guaranteed to deliver a greater speedup, as long as its compile time increases no more than in proportion to its efficacy (i.e., constant break-even point). This is a sufficient, not necessary condition. Compilation Policy in the Non-Ideal Scenario: Less-Than-Perfect Information In the more typical case, a profile is not available at the outset of execution of a program. A profile may be obtained offline by using a practice run, or dynamically by instrumenting a virtual machine, for example. A virtual machine can observe the program's behavior as it unfolds in order to predict the future. One of the simplest ways to do this would be to count the number of times each bytecode executes, and to compile a bytecode when its count surpasses a threshold. We turn now to an analysis of this plausible heuristic and will seek to understand what the optimal compilation threshold might be. Once a profile is obtained, the hazard rate function can be computed, and the compilation threshold may be set at a point where the compiler's break-even point is equal to the reciprocal of the hazard rate. As before, we denote threshold as x. As a side note, we know that if at any time a bytecode will run β more times, it will be beneficial to recompile that bytecode. So, we can expect that the optimal compilation threshold will turn out to be the one that predicts this, and naively might expect it to be β. In this scenario, for any given compilation threshold, the system compiles the same set of bytecodes as before, but it compiles them later. Any bytecode that eventually gets compiled must first slog through x interpreted executions instead of breezing through x compiled executions. So the total execution time will be the same as it was with perfect knowledge, plus a penalty term of (T1-TE)x{overscore (F)}(x), the product of the additional time spent per execution, the number of executions before compilation, and the number of bytecodes involved: ##EQU23## As before, it has been rearranged into a term representing the time taken by a totally interpreted run, T1E(0) and a term representing the speedup thanks to the compiler. Also, as before, we denote the speedup normalized by the per-bytecode execution savings: ##EQU24## Comparing this with the perfect knowledge speedup, E(x)-β{overscore (F)}(x), we can observe that the price of ignorance is x{overscore (F)}(x). Optimal Compilation Strategy with Imperfect Knowledge To find the optimal threshold where the speedup will be maximized, we start by differentiating and substituting for {overscore (F)} and E ##EQU25## We will also need the second derivative: Speedup"imperfect(x)=βf′(x)+f(x) At maximal speedup (recalling that β is always positive for the third step): ##EQU26## ##EQU27## ##EQU28## At this point, the field of reliability engineering, a field used to study the reliability of light bulbs, semiconductors and the like. (See, e.g., Richard E. Barlow and Frank Proschan, Mathematical Theory of Reliability, SIAM Classics in Applied Mathematics, Philadelphia, 1996, originally published by John Wiley & Sons, Inc., New York, 1965.) For any probability density f(x), the hazard rate (or failure rate) is defined as: ##EQU29## Thus the requirement for optimality is just: ##EQU30## In reliability engineering, the hazard rate can be interpreted as the probability that a part will fail, given that it has operated without failure for x time units. In our case, we can interpret the hazard rate as the probability that a bytecode will stop being executed, given that it has already run x times. It makes sense that when the hazard rate is too high we don't want to compile the bytecode because it is too likely to stop being executed. Likewise, when the hazard rate is too low, we are too late in compiling the bytecode, it is so unlikely to stop being executed that it should have been compiled before. Thus, we can hypothesize that an optimal threshold will exist only in a region where the hazard rate is falling. Let's confirm this: ##EQU31## Substituting in the optimality conditions ##EQU32## Stating the full optimality conditions in terms of the hazard rate: ##EQU33## Thus, the optimal number of times a bytecode should be interpreted before compiling it will be the frequency where the hazard rate falls through the reciprocal of the break-even point. If the frequency distribution does not exhibit a falling hazard rate, there will be no such optimum; instead it will be better to either compile everything first, or to compile nothing ever. When thinking about this problem, one imagines that a bytecode executes, its counter trips (i.e., the bytecode is observed to execute x times), and the system must decide whether or not to compile it. One may think that the goal is to predict how many more times the bytecode will run given that knowledge that it has run for x times, and that the best value for the compilation threshold will be the one that maximized the expected number of future executions of the bytecode. The reality is slightly more complicated. In order to compute the expected number of additional executions of a bytecode, we will need a formula that gives the probability that the total number of executions of a bytecode is given that the bytecode executes at least x times: ##EQU34## Since we are interested in the expected number of additional executions, we must integrate and subtract: ##EQU35## ##EQU36## To maximize the expected number of additional executions (given that a bytecode has executed x times) this is what we would be optimizing. However, we need to factor in the cost of compilation. The additional number of executions determines the benefit from compilation, but what we really want to optimize is the benefit minus the cost, and (for our single bytecode) this is ##CHR1## -β)×{overscore (F)}(x): ##EQU37## This is the same expression we optimized based on speedup instead of expected additional executions. The two really are equivalent. Results with Analytic Frequency Distributions Armed with the above analytical results, we now examine two simple continuous distributions. First, assume that the bytecode frequency is uniformly distributed up to 1000 (unrealistic perhaps, but a useful example in its simplicity). FIGS. 3A through 3D respectively show plots of the bytecode frequency density f(x), the complementary cumulative distribution {overscore (F)}(x), hazard rate, and speedups, assuming a break-even point of 100. Recall that the density shows how many bytecodes have exactly a given number of repetitions, and that the complementary cumulative distribution shows how many bytecodes have at least a given number of repetitions. As expected, the perfect speedup peaks at 100, and the imperfect one has no optimum, except for compiling everything right away. (Of course, there is no profit in compiling any bytecode with a frequency less than one, a distortion which occurs due to the use of integrals instead of summations.) Next, we turn to a distribution with a decreasing hazard rate, the log normal distribution. This distribution assumes that the log of the frequencies is normally distributed (a more realistic assumption than the uniform distribution shown in FIG. 3A). FIGS. 4A through 3E respectively show plots of the density, complementary distribution, hazard rate, hazard rate with tail, and speedups. Since we have made no attempt to calibrate the parameters of the distribution, the scale of the graphs below has no significance. For example, the break-even point was set to 1.4 for these plots. Recall that the optimal compilation threshold given imperfect knowledge occurs where the hazard rate decreases through ##EQU38## In this distribution with a falling hazard rate, the imperfect speedup does display a peak, right where the hazard rate falls through ##EQU39## (on the y-axis of FIG. 4C), which corresponds to a compilation threshold of just less than 2 (on the x-axis of FIG. 4C). Therefore, an optimal compilation policy (according to our model) would be to compile any bytecode after it executes two times. Note that in a real life example, the compilation threshold may be much higher. Also, in real life, the hazard rate would likely begin increasing to form a typical hazard rate "U" curve if the program has a finite lifetime. Such "right halves" of the hazard rate curve will be positioned differently, but they do not affect the optimal compilation threshold calculation in any way. One may guess what the optimal imperfect-knowledge compilation threshold might be, since it might be necessary to observe β executions before we are willing to bet on β additional executions. If that were the case, then, since the optimal threshold occurs where the hazard rate equals the reciprocal of β, we would expect that ##EQU40## and our plots would be straight lines with slopes of -1. Now, the right hand sides of our curves must have positive slope because our programs have finite lifetimes. In other words, as approaches the maximum for the program, the probability that the bytecode will cease to be executed must increase. And on the left hand side of real life programs, the curves typically fall more steeply than log—log hyperbolae. Thus, the optimal compilation threshold typically lies at a fraction of the break-even point, and it may be too conservative to wait until a bytecode's frequency reaches β to compile it. Also, there may be a limit to the optimizations that a dynamic compiler may perform. Recall that ##EQU41## As the compiler gets more sophisticated, the T1 term will dominate the denominator, and since TC will be rising, β will have to rise eventually. However, typical plots show that if β goes much above 500, then since the hazard rates never drop that low, the model predicts there will be no optimal compilation threshold. With such a compiler, the best strategy would be to either compile all or nothing at all. When the compilation threshold is low (approaching unity), it is possible to squander half of the available speedup (e.g. for pep test) on premature optimization. In fact, many commercially available virtual machines with JITs do just this, compiling any code that executes even one time. Increasing the compilation threshold to as little as two would recover most of the available speedup. Revisiting Assumptions In system 100, any bytecode can be compiled independently of any other. In systems where whole methods are compiled at a time, the break-even points may be expected to be worse, since the net effect will be to increase the compile time above that required to merely compile an inner loop. However, the basic model will likely still hold. For example when one computes the compilation speed per bytecode for the above described embodiment, do not divide the total compiler time by the number of bytecodes compiled. Instead, divide the total compiler time by the number of bytecodes executed by the program (i.e., the sum of all the bytecodes in the interpreter histogram). That gives a more accurate break-even point. Even if a variation in the amount of dead or non-loop code in each method will tend to fuzz the results, the model is still effective. The foregoing discussion assumes that compilation speeds up the execution of each bytecode by the same amount. Of course, the amount of speedup will vary from bytecode to bytecode. As long as this variation is independent of frequency, the effect should merely be to add noise to the predictions. Finally the model assumes that the preparation unit 120 can count the executions of each bytecode. This data may be approximated in real life by less accurate, more efficient means such as strategically placed counters. A simple analytical model can illuminate the relationship of dynamic compilation policy to the total running time of a program. Given a bytecoded program with bytecode frequency distribution f(x), and an implementation whose compiler has a break-even point, it is possible to compute an optimal compilation threshold. If profile information is available, it is best to compile every bytecode whose frequency exceeds β. Without profile information, if a bytecode is compiled after it has executed x times, it is best to set x where the hazard rate, ##EQU42## falls through ##EQU43## A term taken from reliability engineering, the hazard rate can be interpreted as the probability that a bytecode will not be executed any more after it has already run times. The analytical model we have described can be used to evaluate multiple stages of dynamic compilation. Also, when deciding whether it would be advantageous to slow the compiler in order to improve code quality, these results provide a sufficient condition: if β does not increase, the slower compiler that produces better code will increase system performance. Also, the performance of a system that compiles every bytecode that runs can be improved by delaying compilation until a bytecode has run at least twice. However, if it takes more than about 500 executions to recoup the compilation time, there may be no optimal compilation threshold. The above description is intended to describe at least one embodiment of the invention, not to define the scope of the invention. Rather, the scope of the invention is defined in the claims below. Thus, other embodiments of the invention include other variations, modifications, additions, and/or improvements to the above description. Those skilled in the art will recognize that boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Moreover, alternative embodiments may combine multiple instances of a particular component. System 100, and any portions thereof, may be integrated on-chip, on-board, in-box or may be distributed. System 100 may be hardware or software or both. For example, the architectural blocks may be integrated circuits on a board, software implementations for execution on general purpose processors, a combination thereof, a virtual machine, or software representations of hardware for design and incorporation into other circuitry. In one embodiment, system 100 is a computer system such as a personal computer system. Other embodiments may include different types of computer systems. Computer systems are information handling systems which can be designed to give independent computing power to one or more users. Computer systems may be found in many forms including but not limited to mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, various wireless devices and embedded systems. A typical computer system includes at least one processing unit, associated memory and a number of input/output (I/O) devices. A computer system processes information according to a program and produces resultant output information via I/O devices. A program is a list of instructions such as a particular application program and/or an operating system. A computer program is typically stored internally on computer readable storage medium or transmitted to the computer system via a computer readable transmission medium. A computer process typically includes an executing (running) program or portion of a program, current program values and state information, and the resources used by the operating system to manage the execution of the process. A parent process may spawn other, child processes to help perform the overall functionality of the parent process. Because the parent process specifically spawns the child processes to perform a portion of the overall functionality of the parent process, the functions performed by child processes (and grandchild processes, etc.) may sometimes be described as being performed by the parent process. A personal computer system can usually be defined as a desk top, floor standing, or portable microcomputer that includes a system unit having a system processor and associated volatile and nonvolatile memory, a display monitor, a keyboard, one or more diskette drives, a fixed disk storage device and an optional printer. A system board is often used to electrically connect these components together. A personal computer system may also include one or a plurality of I/O devices (i.e. peripheral devices) which are coupled to the system processor and which perform specialized functions. Examples of I/O devices include modems, sound and video devices or specialized communication devices. Mass storage devices such as hard disks, CD-ROM drives and magneto-optical drives are also considered to be peripheral devices. Each of the blocks/steps of FIG. 2 may be executed by a module (e.g., a software module) or a portion of a module or a computer system user. Thus, the above described method(s), the steps thereof and modules therefor may be executed on a computer system configured to execute the operations of the method and/or may be executed from computer-readable media. The method and/or modules may be embodied in a machine-readable and/or computer-readable medium for configuring a computer system to execute the method. Thus, the software modules may be stored within and/or transmitted to a computer system memory to configure the computer system to perform the functions of the module. Software modules may include script, batch or other executable files, or combinations and/or portions of such files. Those skilled in the art will recognize that boundaries between the functionality of the above described steps merely illustrative. The functionality of multiple steps may be combined into a single step, and/or the functionality of a single step may be distributed in additional steps. Moreover, alternative embodiments may include multiple instances of a particular step, and the order of steps may be altered in various other embodiments. Likewise, those skilled in the art will recognize that boundaries between modules are merely illustrative and alternative embodiments may merge modules or impose an alternative decomposition of functionality of modules. For example, the modules discussed herein may be decomposed into submodules to be executed as multiple computer processes. Moreover, alternative embodiments may combine multiple instances of a particular module or submodule. Software modules may be received by system 100, for example, from computer readable media such as storage 110. Computer readable media may be permanently, removably or remotely coupled to system 100. Computer readable media may include, for example and without limitation, any number of the following: magnetic storage media including disk and tape storage media; optical storage media such as compact disk media (e.g., CD-ROM, CD-R, etc.) and digital video disk storage media; nonvolatile memory storage media including semiconductor-based memory units such as FLASH memory, EEPROM, EPROM, ROM; ferromagnetic digital memories; holographic media, volatile storage media including registers, buffers or caches, main memory, RAM, etc.; and data transmission media including computer networks, point-to-point telecommunication equipment, and carrier wave transmission media, just to name a few. Other new and various types of computer-readable media may be used to store and/or transmit the software modules discussed herein. It is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. In an abstract, but still definite sense, any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being "operably connected", or "operably coupled", to each other to achieve the desired functionality. The components and devices described herein are used as examples for sake of conceptual clarity. Consequently, as used herein these specific exemplars are intended to be representative of their more general classes. Furthermore, in general, the use of any specific exemplar herein is also intended to be representative of its class and the noninclusion of any specific devices in any exemplary lists herein should not be taken as indicating that limitation is desired. The above detailed description has been divided into sections with subheadings in order to highlight the invention described herein; however, those skilled in the art will appreciate that such sections are merely for illustrative focus, and that the invention herein disclosed typically draws its support from multiple sections. Consequently, it is to be understood that the division of the detailed description into separate sections is merely done as an aid to understanding and is in no way intended to be limiting. Because the above detailed description is exemplary, when "one embodiment" is described, it is an exemplary embodiment. Accordingly, the use of the word "one" in this context is not intended to indicate that one and only one embodiment may have a described feature. Rather, many other embodiments may, and often do, have the described feature of the exemplary "one embodiment." Thus, as used above, when the invention is described in the context of one embodiment, that one embodiment is one of many possible embodiments of the invention. Notwithstanding the above caveat regarding the use of the words "one embodiment" in the detailed description, it will be understood by those within the art that if a specific number of an introduced claim element is intended in the below claims, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such limitation is present or intended. For example, in the claims below, when a claim element is described as having "one" feature, it is intended that the element be limited to one and only one of the feature described. Furthermore, when a claim element is described in the claims below as including or comprising "a" feature, it is not intended that the element be limited to one and only one of the feature described. Rather, for example, the claim including "a" feature reads upon an apparatus or method including one or more of the feature in question. That is, because the apparatus or method in question includes a feature, the claim reads on the apparatus or method regardless of whether the apparatus or method includes another such similar feature. This use of the word "a" as a nonlimiting, introductory article to a feature of a claim is adopted herein by Applicants as being identical to the interpretation adopted by many courts in the past, notwithstanding any anomalous or precedential case law to the contrary that may be found. Similarly, when a claim element is described in the claims below as including or comprising an aforementioned feature (e.g., "the" feature), it is intended that the element not be limited to one and only one of the feature described merely by the incidental use of the definite article. Furthermore, the use of introductory phrases such as "at least one" and "one or more" in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases "one or more"or "at least one" and indefinite articles such as "a" or "an." The same holds true for the use of definite articles. While particular embodiments of the present invention have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, various modifications, alternative constructions, and equivalents may be used without departing from the invention claimed herein. Consequently, the appended claims encompass within their scope all such changes, modifications, etc. as are within the true spirit and scope of the invention. Furthermore, it is to be understood that the invention is solely defined by the appended claims. The above description is not intended to present an exhaustive list of embodiments of the invention. Unless expressly stated otherwise, each example presented herein is a nonlimiting or nonexclusive example, whether or not the terms nonlimiting, nonexclusive or similar terms are contemporaneously expressed with each example. Although an attempt has been made to outline some exemplary embodiments and exemplary variations thereto, other embodiments and/or variations are within the scope of the invention as defined in the claims below.
|
Same subclass Same class Consider this |
||||||||||
