Method and apparatus for transforming multiplications into product table lookup references5907711Abstract A compiler automatically determines when it is advantageous to perform multiply operations by using a table of product values (that is, a table that contains scale.sub.-- factor .times.0 as its first entry, scale.sub.-- factor .times.1 as its second entry, scale.sub.-- factor .times.2 as its third, and so on) and transforms the multiply operations into a table lookup indexed by the value of the non-constant multiplier. This transformation is only performed when it is found to be advantageous to do so. Performing the transformation requires that the product table be initialized when the program actually runs, but that can be done at a location that is much less-frequently visited than the location where the multiply operation takes place. Claims I claim: Description BACKGROUND OF THE INVENTION
______________________________________
void scale.sub.-- matrix (int A›ROWS! ›COLUMNS!,
(int scale.sub.-- factor)
int l, j;
for (l = 0; l < ROWS; l = l + 1)
for (j = 0; j < COLUMNS; j = j + 1)
A›l! ›j! = A ›l! ›j! * scale.sub.-- factor;
}
______________________________________
In this example, the value of scale.sub.-- factor is not readily known, but a sophisticated compilation system can determine that it remains constant across ROWS.times.COLUMNS consecutive multiplications. The invention finds such cases and determines when it is advantageous to perform the multiplications by using a table of product values (for example, a table that contains scale.sub.-- factor.times.0 as its first entry, scale.sub.-- factor.times.1 as its second entry, scale.sub.-- factor.times.2 as its third, and so on) and transform the multiply into a table lookup indexed by the value of the non-constant multiplier. The invention then proceeds to perform this transformation when it is found to be advantageous. Doing so requires that the product table be initialized when the program actually runs, but that can be done at a location that is much less-frequently visited than the location where the multiplies takes place. The invention alters the subroutine so that it behaves in manner shown by the following code:
______________________________________
void scale.sub.-- matrix (int A›ROWS! ›COLUMNS!,
(int scale.sub.-- factor)
int l, j;
int product.sub.-- table ›MAXINT + 1!
int x, product;
product = 0;
for (x = 0; x <= MAXINT; x = x + 1) (
product.sub.-- table›x! = product
product = product + scale.sub.-- factor;
)
for (i = 0; i < ROWS; i = i + 1)
for (j = 0; j < COLUMNS; j = j + 1)
if (A›l! ›j! >= 0)
A›l! ›j! = product.sub.-- table ›A›l! ›j!!;
else
A›l! ›j! = -product.sub.-- table ›A›l! ›j!!;
}
______________________________________
Thus, the multiply operation, which normally requires a long sequence of operations that take anywhere up to 80 processor cycles to execute, can be turned into a simple three-instruction sequence that provides each product in as little as 1 to 3 processor cycles, depending on the micro-architecture. There are two associated issues that the invention considers and are essential to its effectiveness. The first is that there is no need to re-initialize the product table every time that the subroutine is invoked, provided that it is allocated so that it persists across multiple invocations of the subroutine and the value passed in the scale.sub.-- actor argument is the same value that was last used to initialize the product table. A simple check can be added to the initialization code to bypass the initialization loop if the value in product.sub.-- table›1! is equal to scale.sub.-- factor. This technique is called memorization, and it is a well-known technique that is used to improve the effectiveness of the invention. A second issue is the amount of time and space required to initialize and hold a product table. In the above example, it is assumed that MAXINT (the largest value that fits into a signed integer variable) is less than ROWS.times.COLUMNS. On 16-bit architectures, this may be rather likely, on 32-bit and 64-bit machines, however, this is extremely unlikely and the amount of space required to hold such a table would be prohibitively large if it were. There are two different solutions to this problem, both of which are incorporated into the invention. The first is to try to examine each multiplier to determine if the range of values that one of them can take on is substantially smaller than the full range of representable values. This is trivially the case when the type of a multiplier is short or char. To increase the likelihood of finding such opportunities, the current implementation of the invention carefully and deliberately tracks variables and the effect that each operation can have on the range of values that they can hold. The second solution that the current implementation of the invention uses is to split a large range of values into two or more components and then individually access the product value associated with each component and assemble them back together. To better describe how this applies to the previous example, this example, which is semantically equivalent to the previous example, illustrates the use of this technique:
______________________________________
void scale.sub.-- matrix (int A›ROWS! ›COLUMNS!,
(int scale.sub.-- factor)
int l, j;
int product.sub.-- table ›(MAXINT + 1) / 65536!;
int x, product, low.sub.-- part, high.sub.-- part, multiplier;
product = 0;
if (product.sub.-- table ›l! |= scale.sub.-- factor) {
for (x = 0; x <= MAXINT / 65536; x = x + 1) {
product.sub.-- table›x! = product;
product = product + scale.sub.-- factor;
}
}
for (i = 0; i < ROWS; i = i + 1)
for (j = 0; j < COLUMNS; j = j + 1) {
if (A›l! ›j!) >= 0)
______________________________________
Although it appears that this code contains more multiplies, these are all by a constant power of 2 (65,536 is 2 to the 16th power) and a binary computer can perform these operations using very inexpensive shift operations. Thus, the size of the product table needed is only one 65,536th the size of the product table used in the first example. In operation, the compiler tells the program to reserve room for the table. The table itself cannot be built until the scale factors and multipliers are known. Thus, the compiler inserts code into the program that builds the table at run time. Thus, the invention provides a tool that is built into the low level optimizer of the compiler and that avoids repetitive multiplications by adding code to the run time program that causes the program to compute a table and reserve space in memory for this table. The invention optimizes the code to avoid expensive multiplications by using inexpensive memory lookups, e.g. cache lookups. Thus, the optimizer looks for and finds the opportunity to apply this particular transformation to the code that is being optimized and then substitutes code that performs the transformation. FIG. 3 is a high level, block diagram that shows the operations performed in the low level optimizer. In the preferred embodiment of the invention, the low level optimizer 24 receives a low level RISC code (300), i.e. code that is almost in a condition to be executed by a RISC processor-based machine. The optimizer takes the code and generates ancillary data structures that help the optimizer understand the code. The optimizer then starts transforming/modifying the code such that a final optimized risk instruction stream (400) is produced that runs significantly faster on the target computer architecture, e.g. a specific RISC-based architecture in the preferred embodiment of the invention. The RISC compiler discussed herein is implemented with a machine-specific optimizer that understands the nuances of a particular RISC architecture (in the preferred embodiment of the invention, the PA-RISC architecture, which is owned by Hewlett-Packard Company of Palo Alto, Calif.). While the invention is described in connection with a low level optimizer, it should be appreciated that the invention is also readily applicable as a preprocessor, which looks at source code at the source code level, i.e. a high level optimizer. However, it is generally preferred to implement the invention in a low level optimizer because such application is limited to a particular code sequence for a machine-specific optimizer that knows a lot about the particular target machine. Thus, the low level optimizer is more likely to know when it is beneficial to apply the transformation taught herein and when it is not beneficial to do so. Once the low level optimizer 24 receives the original low-level risk instruction stream (300), the code sequences are considered to be naive, i.e. they are not very smart about the operations that they are performing, such that they take a significant amount of time and energy to perform the function that they are intended to perform. The first step that is typically performed by the low level optimizer is to split the instruction stream up into basic blocks (310), i.e. into sequences of code where there is only an exit at the end of such sequence with no way to enter the code except at the beginning of the sequence. Thus, the code is encapsulated to allow a better understanding of the code to be obtained. The optimizer then performs various optimizations upon these basic blocks. For example, the optimizer examines instruction sequences that are redundant, e.g. where three or four instructions are used, and where it is possible to use just one instruction (320). This is type of pattern matching is referred to as common sub-expression elimination. Next, the optimizer performs an operation that is referred to as interval building (330), in which the basic block structure is examined to decipher the code structure in terms of higher level structures. For example, the optimizer tries to find such structures as, for example, loops and if/then statements. This operation also helps to perform data flow analysis, which determines for the code being optimized where values are generated, where values are consumed, and how long values persist. The optimizer next builds memory webs (340) that examine memory references. Thus, if there is a particular local or global variable that leaves a memory reference, it is often desirable to take, for example, such local variable and promote it to a register (i.e. perform register promotion), which keeps the variable in a register as opposed to having to perform a computationally expensive reference to memory for such variable. Accordingly, the optimizer builds webs, which are data structures that identify where variables are defined and where such variables are used. Next, the oplimizer performs loop analysis (350) by going individually to each loop to determine what that loop is doing. Loops are important in optimizations because most programs spend most of their time inside of loops. Because of that, most optimizers traditionally focus on loops to try to speed up the loops. In fact, the transformation taught herein in connection with the preferred embodiment of the invention typically operates under the assumption that the program is in a loop, where the program is trying to perform a particular operation again and again within the confines of the loop. In a sense, the transformation may be a part of the loop analysis phase, although it is not necessary that the optimizer see the loop structure. Rather, it is only necessary to know where variables are set and defined and the particular cost of performing an operation at one point versus another point. For example, it is desirable to build a table at a point where it is less computationally expensive to perform various operations than to perform an outright multiply. Thus, if an operation resides inside a loop, it is typically considered to be a more expensive operation. Traditionally, finding loops is accomplished by looking at the structure of the program. The optimizer herein disclosed actually has the ability to go in and build the program, allow the user to run the program, and as the program runs, the optimizer determines where the program is spending its time while running, e.g. which instructions are executed more often than others. The optimizer herein disclosed then feeds that information back in, such the program may be recompiled with more accurate information about where the program is really spending its time and which instructions are going to be more computationally expensive relative to others. By using this kind of feedback information, the optimizer herein disclosed can actually make wiser decisions about which particular multiply operations to transform and which multiply operations to ignore because they are not executed very often. In particular, although a segment of the program may look like a loop, it does not mean that program segment is going to be executed many times. It may only be executed once or even not at all. Thus, there is a risk of going after a loop and then later finding out at run time that the loop is not executed. A lot of resources may have been applied trying to optimize the loop, without any beneficial for such expenditure. This is one reason why the herein disclosed feedback scheme can enable the optimizer to generate much better code. A register webs builder (350) builds what are referred to as def-use chains (410) that identify every point at which a particular register is defined and used over the register's lifetime. FIG. 3 shows a register webs component of the compiler broken out into its subcomponents. The transformation herein disclosed preferably operates as a subcomponent of the register webs and optimizer phase (360) of the optimizer 24. First, the optimizer builds the def-use webs (410) for all the registers that are used in the program. The webs are then examined (420) to find those instances where a particular web has a limited set of values with which it can be associated. For example, if the program is loading out of a particular type of register that can only hold a character, for example where a character can only have a value between 0 and 255, which is fairly small relative to a word, then the optimizer knows and remembers that the character has this given set of values. When two characters are subsequently added together, it is known that the result can have a value between 0 and 510. The def-use register web thus provides a database/structure that allows ready propagation of this information to thereby expose the information to the rest of the optimizer. The optimizer also propagates constants and eliminates redundant operations (430). For example, if a particular web can only have a constant value in it, then everywhere the web is defined it is given the same constant. The optimizer can go to all of the points where that web is used and replace that constant at that point, and thereby come up with better code. The optimizer also eliminates redundant operations. For example, if a particular web has a value that can only be between 0 and 255, and a compare operation is performed on the web against another web that produces a value greater than 255, the optimizer knows that this is a useless operation because the web can never get a condition where it's values are greater than 255. So, the optimizer eliminates operations that are known to have no effect on an actual web. The optimizer then finds places in the code where a multiply operation is to be performed (440). The optimizer identifies a multiply operation, for example by identifying two source webs--web 1 and web 2. The optimizer looking for a web that is limited in range and that is producing some result. The optimizer then goes back and looks at the other web to see if this other web potentially has several different values. That is, is the other web much wider, i.e. can it hold values that are much greater than the first web over a much wider range. The optimizer then identifies locations in the code where this particular web might be defined. At this particular instruction, the optimizer can determine about how many times the instruction is going to be executed. This information is part of the loop analysis and the feedback obtained from running the code. The optimizer then knows about how often this particular instruction is going to be executed. For example, if the optimizer expects this instruction to be executed a million times, then the optimizer goes back to find where this web is actually defined. For example, if the web is only defined at one place and this place is only executed ten times, then the optimizer determines that there is a potential benefit to performing the transformation described above because, if the optimizer were to insert code into the program at this point to build a table that would only get executed ten times, the transformation would actually speed up a million multiply operations by performing only a few steps during the optimization process. An important consideration implementing the invention is the notion of cost analysis and benefit, i.e. "Would it be profitable to perform this transformation at this point" The optimizer determines where to assert the transformation by looking at locations where operations are defined within the webs and trying to determine how many times the code is to perform a multiply operation versus whether a product table is going to be inserted in a place that is less frequently visited than the actual number of multiply operations eliminated by the transformation. The optimizer also bases its decisions on knowing the size of the table, where a larger table requires more time to build. Thus, once the optimizer performs the transformations, it may go back and reexamine the webs and determine, for example if there are now constants exposed that were not exposed before and if more is known about widths than could be determined before. So there's a certain amount of iteration that goes on until you come to the point where you realize that you had nothing new to do. After the optimizer has determined the best possible code optimizations, the optimizer stops iterating and expands pseudo-instructions (450). For example, a multiply operation is a pseudo-instruction because it is not possible to take the code when such instruction is encountered and run it directly. For example, the integer multiply operation is not supported by the target machine. Such operation is put into the code as part of the instruction stream. When the optimizer gets to the point where there is nothing else it can do to this multiply, then the system is committed to performing this multiply operation. The optimizer then tries to determine whether it is necessary to perform such operation by calling a function that performs the multiply operation. If one of the values to be multiplied is a constant, then it is possible to find a better way of performing that multiply operation. Finally, the optimizer looks for copies and eliminates redundant copies (460) which tend to show up as a result of performing a transformation, such as that described herein. Thus, rather than devoting two registers to hold the same value, the optimizer stores only one instance of the value. The compiler then performs instruction scheduling (370), which reorders the instructions to eliminate some of the problems related to latency, i.e. where the program tries to use a value before it is produced. Next, the compiler performs a coloring register allocation (380). Up to this point, the optimizer operated upon pseudo-registers but the registers are now real registers. This approach is taken because it is possible to use an unlimited number of pseudo-registers during optimization, and then at the end of the optimization process this unlimited number of registers is mapped to a very limited set of registers available in the target machine. The last operation performed by the optimizer is that of peephole and branch optimization (390). Peephole optimizations apply simple, well known transformations that look for sequences of three or four instructions which can be replaced with just one instruction. Branch optimizations look for places, for example, where there is a branch to another branch which branches again. Rather than going through that middle step, the optimization goes directly to the last point. An optimized instruction stream which, in this example, is a RISC instruction stream, is then output by the optimizer (400). One advantage of the invention is that it increases the cases where a compilation system for a computer that performs relatively slow multiplication operations can dramatically improve the performance of the code. From the examples above, it is possible for a skilled programmer to rewrite the source code to take advantage of these opportunities. Even if a programmer understands the mechanics of this algorithmic transformation, they may either not consider it worth their time to implement, may do so and have a difficult time getting it right, are not willing to be obfuscated by the application of this technique, or may not know enough about the micro-architecture to determine if it is profitable to do so. By allowing a compilation system to perform this transformation automatically through the use of this invention, programmers can focus their attention on other issues, less-experienced programmers are able to obtain higher performance from their systems, the final source code is easier to understand, and the most effective product table size and multiplier split components are automatically selected by the compilation system for the specific micro-architecture for which the compilation system is asked to produce code. Although the invention is described herein with reference to the preferred embodiment, one skilled in the art will readily appreciate that other applications may be substituted for those set forth herein without departing from the spirit and scope of the present invention. Accordingly, the invention should only be limited by the Claims included below.
|
Same subclass Same class Consider this |
||||||||||
