Software profiler which has the ability to display performance data on a computer screen6311324Abstract A C-language program performance tuning advisor that helps a systems analyst to improve the performance of an application. The tuning advisor identifies critical regions (hot spots) of an application, and helps the user to analyze the region. Once the region has been identified and analyzed, the tuning advisor advises the user on how to rewrite the original C code to improve the performance of the overall application. When the compiler needs to be conservative to be semantically correct, the tuning advisor suggests code modifications to remove the semantic constraints. The tuning advisor recognizes most commonly used C code patterns which if modified could improve the performance. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
Original Code Optimized Code
void this_routine( void this_routine(
float *a, float *a,
float **b, float **b,
int n) int n)
{ {
short i; int i;
short k = n; int k = n;
for(i=0; i<k; i++) for(i=0; i<k; i++)
{ {
a[i]+=b[0][i]; a[i]+=b[0][i];
} }
} }
Post Increment/Decrement of a Loop Test Variable The loop index variable is tested and incremented or decremented inside the loop conditional expression. Two separate registers are used to store the current and the future values of the variable. Using two registers for computing the value of one variable reduces the number of free registers available. In the example given below, loop index variable lim is both tested and decremented in the loop conditional expression. Two registers are used to store the current and the future values of lim. Having two instructions in a conditional expression increases the loop overhead. The C Tuning Advisor will advise the user to not combine two operations in a loop conditional expression. It is better to modify the loop index variable after testing the loop conditional expression. PROGRAM EXAMPLE 2 Post Decrement of a Loop Test Variable
Original Code Optimized Code
void test.sub.-- post( void test.sub.-- post(
int n, int n,
int *a, int *a,
int b) int b)
{ {
int lim = 0; int lim = 0;
lim = n; lim = n;
while(lim--) while(lim)
{ {
*a += b; *a += b;
} lim--;
}
}
}
Loop Invariant Motion A pointer variable is used inside the loop. The target value changes but the value of the pointer itself does not change inside the loop. Using a loop invariant pointer generates redundant store to memory operations. In the example below, a pointer variable *a which does not change the location to which it is pointing is stored and used for computation inside the loop. Redundant store instructions are generated for the loop invariant pointer variable. The C Tuning Advisor will suggest the following: 1. Assign the target value to a temporary register variable. 2. Use the register variable instead of the pointer inside the loop. 3. Update the target value with the register variable after the loop. This will result in a code sequence which will have no redundant memory allocation instructions within the loop. PROGRAM EXAMPLE 3 Loop Invariant Motion
Original Code Optimized Code
void test_post( void test_post(
int n, int n,
int *a, int *a,
int b) int b)
{ {
int lim=0; int lim=0;
lim=n; register int tempa;
while (lim--) lim = n;
{ tempa =*a;
*a += b; while (lim--)
} {
tempa += b;
} }
*a = tempa;
}
Loop Invariant Motion with Conflicts A loop invariant pointer variable is used for computation inside the loop. This generates redundant store to memory operations. The pointer does not change; but the target value is added and loaded into consecutive memory locations of an array. Using a loop invariant variable generates redundant store to memory operations. In the example below, the pointer variable *a does not change the location to which it is pointing but is stored and used for computation inside the loop. Assuming that the pointer may be pointing to one of the elements of the array, the compiler stores the variable and loads it back again for every iteration of the loop. This generates several redundant store instructions within the loop. CTA will advise the user to do the following: 1. Assign the memory location of the pointer to a register variable. 2. Use the register variable instead of the pointer inside the loop. 3. Update the target value from the register variable after the loop. This results in fewer instructions being executed within the loop. Using the register variable instead of the pointer inside the loop indicates explicitly to the compiler that there is no conflict between the pointer and the range of values in the elements of the array. This prevents the compiler from generating redundant store instructions inside the loop. PROGRAM EXAMPLE 4 Loop Invariant Motion with Conflicts
Original Code Optimized Code
void this_routine( void this_routine(
float *a, float *a,
float **b, float **b,
int n) int n)
{ {
int i; int i;
int k = n; int k = n;
for(i=0; i<k; i++) register float
{ tempa;
*a += b[0][i]; tempa = *a;
} for(i=0; i<k; i++)
} {
tempa += b[0][i];
}
*a = tempa;
}
Instruction Scheduling The assembly language instructions generated for the source code cannot be reordered to improve instruction scheduling. Instructions with dependencies cannot be moved around and reordered by the compiler. This results in inefficient instruction scheduling. The following statement: *a+++=b[0][n]; may produce the following code: load a load b[0][n] add a+b[0][n] store a incr a If the following code is present, the compiler will have a hard time of scheduling: *a+++=b[0][n]; *a+=b[0][n-1]; If the compiler can move the load instructions past the store instructions, there could be better instruction scheduling. But, since a pointer variable *a is used inside the expression and can be pointing to b[0][n], the compiler does not move the load instructions past the store instructions. This results in inefficient instruction scheduling. CTA will advise the user to do the following: Modify the source code in order to generate independent instructions that can be reordered by the compiler. Use temporary variables to explicitly state to the compiler that there are no dependencies between the instructions. The following code sequence will be result of these code modifications: temp1=*a+b[0][n]; temp2=*(a+1)+b[0][n-1]; *a++=temp1; *a=temp2; This will result in more efficient instruction scheduling and pairing. PROGRAM EXAMPLE 5 Instruction Scheduling
Original Code Optimized Code
void this_routine( void this_routine(
float *a, float *a,
float **b, float **b,
int n) int n)
{ {
*a++ += b[0][n];
*a += b[0][n-1]; register float
} temp1;
register float
temp2;
temp1 = *a +
b[0][n];
temp2 = *(a+1) +
b[0][n-1];
*a++ = temp1;
*a = temp2;
}
Loop Unrolling The loop generates instructions that do not allow efficient instruction scheduling and pairing. The instructions generated are few and provide little scope for the Pentium.TM. processor to schedule and pair them in its dual pipelines. As a result, several redundant clock cycles are generated to execute these instructions. In the example below, the loop index variable is incremented by one every time the loop executes. This code generates only a few machine instructions. The few instructions give little scope for the Pentium.TM. processor to reorder, schedule, and pair the instructions in its dual pipelines. CTA will advise the user to unroll the loop by a certain amount, which is determined to be optimal for the loop for the Pentium.TM. architecture. This will provide the optimal scheduling and register allocation balance to this loop. Unrolling can be is done as follows: Replicate the body of the loop. Adjust the index expression if needed. Adjust the loop iteration's control statements. PROGRAM EXAMPLE 6 Loop Unrolling
Original Code Optimized Code
void test_it( void test_it(
int *a, int *a,
int* c, int *c,
int n) int n)
{
{
int i; int i;
for(i=0; i<n; i++) for(i=0; i<n-(n%3); i+=3)
a[i] = c[i]; {
a[i] = c[i] ;
a[i+1] = c[i+1];
}
a[i+2] = c[i+2];
}
for(i; i<n; i++)
a[i] = c[i];
}
If-Switching An if statement with a loop invariant condition is used inside a loop. Every time the loop executes, the if condition is evaluated and the branch code is executed. This generates several redundant instructions. In the example given below, the value of the variable in the if conditional expression putp==1 does not change inside the for loop so the if condition is loop invariant. Since the if statement is inside the loop, it is evaluated every time the for loop is executed. This generates several redundant instructions and increases the loop overhead. CTA will advise the user to move the if statement outside of the for statement, by doing the following: 1. Move the if statement out of the loop. 2. Copy the loop into the two branches of the if-else statement. This will result in the if statement becoming the main controlling statement. The if condition is evaluated only once. Depending upon how the condition evaluates, one of the loops in the if-else branch statements is executed. Since fewer statements are executed within the loop, the loop overhead is significantly reduced. PROGRAM EXAMPLE 7 If-Switching
Original Code Optimized Code
extern int putp; extern int putp;
void test_if( void test_if (
int *a, int *a,
int *p int *p,
int *q, int *q,
int n) int n)
{ {
int i; int i;
for(i=0; i<n; i++) if (putp==1)
if (putp==1) for(i=0; i<n; i++)
a[i] = p[i]+q[i]; a[i] = p[i] + q[i];
else else
a[i] = p[i]-q[i]; for(i=0; i<n; i++)
} a[i] = p[i]-q[i];
}
Loop Rerolling A loop has been manually unrolled. This may result in increased register pressure in the loop on the Pentium machines, resulting in increased code to store and re-load the register values. It will be better to let a compiler (in this case, the C Tuning Advisor) decide the unrolling factor, taking into account the scheduling and register allocation issues. In the example below, a loop has been manually unrolled. CTA will advise the user to re-roll the loop and let the compiler decide the unrolling factor. This optimization can be contrasted with the loop unrolling example, since there, the compiler has actually determined the optimal unrolling factor for the Pentium architecture. PROGRAM EXAMPLE 8 Loop Rerolling
Original Code Optimized Code
void test_it( void test_it(
int *a, int *a,
int *c) int *c)
{ {
int i; int i;
for(i=0; i<100; i+=5){ for(i=; i<100; i++){
a[i] = c[i] ;
a[i] .sup. = c[i] ; }
a[i+1] = c[i+1];
a[i+2] = c[i+2]; }
a[i+3] = c[i+3];
a[i+4] = c[i+4];
}
}
Logical ANDS to Bitwise ANDS A logical AND (&&) operation is used inside a loop. This operation causes branching in the generated code. Branching inside a loop increases the loop execution time. In the example below, a logical && operation is used for testing the pointer value for NULL before accessing it. If the user knows that a NULL pointer reference will not happen in this code, then it is possible to change the code to use a bitwise AND operation. CTA will advise the user to use AND (&) operation instead of && operation, if it is safe to do so. If this code is in a loop, the change will result in no branching code inside the loop. The code will execute much faster. Programmers are cautioned to replace the logical AND (&&) operator with the bitwise AND (&) operator only if the replacement can be done without problem. PROGRAM EXAMPLE 9 Logical ANDS to Bitwise ANDS
Original Code Optimized Code
struct ent{ struct ent{
struct enode *expr; struct enode *expr;
struct ent *next; struct ent *next;
}; };
extern struct ent extern struct ent
tbl[10][20]; tbl[10][20];
void test_it{ void test_it{
struct ent * p, struct ent * p,
int i, int i,
int j){ int j){
if[(p=tbl[i][j])&&(p->expr!=0)] p = tbl[i][j];
printf("Tested P..backslash.n"); if {(pl=0)&(p->expr!=0)}
} printf("Tested P..backslash.in");
}
Float to Integer A float to integer conversion is used. In C semantics, when a float value is assigned to int, the value is truncated. On x86 processors, the instruction Fist (float to int store) has a default mode of "round to nearest," and compilers generate code to explicitly set the control word to reflect C semantics. As a result, several instructions are generated to convert the float value to integer. This increases execution time. In the example below, the following code sequence requires a fist instruction for float to int conversion: int num2=(int)*num; According to C semantics, the value is truncated. On an X86 processor, the instruction Fist (float to int store) has a default mode of "round to nearest." The compiler generates several instructions to explicitly set the control word to reflect C semantics. This code takes more execution time. The C Tuning advisor will advise the user that if the C semantics are not required and it is okay to use the "round to nearest" or other default processor modes, the float to int conversion can be replaced with in-line assembly code or an assembly macro. This will result in avoiding the control word manipulation, and fewer instructions will be generated. The code executes much faster. PROGRAM EXAMPLE 10 Float to Integer
Original Code Optimized Code
int test_it( float *num, asm void
int num1) FloatToInt(value, num)
{ {
%mem value,num;
int num2 = (int) *num; flds value
int result; fistpl num
result = num1+num2; }
return result;
int test_it(
} float * num,
int num1) {
int num2 ;
int result;
FloatToInt(*num, num2);
result = num1+num2;
return result;
}
Logical OR Conversion A logical OR (.parallel.) operation is used to test a variable for equality with small integers. The operation generates several branches inside the code. Branching increases the program's execution time. In the example below, the value of the variable signif is tested for a very small range of integer values (less that 16, in fact!). CTA will advise the user to replace this with a table look up algorithm, where a small table of integers have the appropriate entries set to TRUE and others set to FALSE. The entries where the value have index values corresponding to the the original integer values. i.e., if we were testing for signif==4, then testtable[4] will be TRUE. The resulting code will have much less branch code, and the testing will be done in one small lookup. This is usually beneficial only for small ranges. PROGRAM EXAMPLE 11 Logical OR Conversion
Original Code Optimized Code
void sub(int *, int*); void sub(int *, int*);
void test_it( int
int * a, testtable[16]={0,1,0,
int *b, 0,1,0,
int signif) { 0,1,0,
0,1,0,
if (signif==1.vertline..vertline.signif==4
.vertline..vertline. 0,1,0,0};
signif==7 .vertline..vertline.
signif==10 .vertline..vertline. void test_it( int * a,
signif == 13){ int *b,
int signif)
sub(a,b); {
} if(testtable[signif])
sub(a,b);
else sub(b,a); else sub(b,a);
} }
Call to Error in a Stream of Code A call to an infrequently executed error statement is detected in the middle of a block of code. The call statement generates several machine instructions to be placed in the Pentium processor's Instruction Cache. Since these instructions take up most of the Instruction Cache, the more frequently executed code following the error statement cannot be accessed and executed immediately. This increases the program's execution time. CTA will advise the user to move the infrequently used error statement out of the block of code. This will result in the most frequently executed block of code remaining in the instruction cache and, ultimately, the program executes much faster. PROGRAM EXAMPLE 12 Call to Error in a Stream of Code
Original Code Optimized Code
void error(char *); void error(char *);
void test_it( void test_it(
char *mem, char *mem,
int flag) { int flag) {
if (flag < 0) if (flag < 0) goto errlab;
error("flag is dummy(flag);
negative"); dummy1(*mem);
dummy(flag); return;
dummy1(*mem); errlab;
return; error("flag is
} negative");
}
While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing and other changes in form and detail may be made therein without departing from the scope of the invention.
|
Same subclass Same class Consider this |
||||||||||
