Dynamic optimizing object code translator for architecture emulation and dynamic optimizing object code translation method6463582Abstract An optimizing object code translation system and method perform dynamic compilation and translation of a target object code on a source operating system while performing optimization. Compilation and optimization of the target code is dynamically executed in real time. A compiler performs analysis and optimizations that improve emulation relative to template-based translation and interpretation such that a host processor which processes larger order instructions, such as 32-bit instructions, may emulate a target processor which processes smaller order instructions, such as 16-bit and 8-bit instructions. The optimizing object code translator does not require knowledge of a static program flow graph or memory locations of target instructions prior to run time. In addition, the optimizing object code translator does not require knowledge of the location of all join points into the target object code prior to execution. During program execution, a translator records branch operations. The logging of information identifies instructions and instruction join points. When a number of times a branch operation is executed exceeds a threshold, the destination of the branch becomes a seed for compilation and code portions between seeds are defined as segments. A segment may be incomplete allowing for modification or replacement to account for a new flow of program control during real time program execution. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
The statically allocated part of the OOCT buffer.
Field Offset Contents
jump_table 0h An array of entry points in interpreter 110,
such as
IC_FETCHO2, IU_PGMxx. OOCT_INIT writes
them and compiler 104 reads them. Compiler
104
uses them to generate jumps to interpreter
110.
trans_master_target.sub.-- 1000h An array of pointers, one for each
page in ASP's
table.sup.1 address space. For a page that ASP does not
use,
the pointer is 0. For a page that ASP uses,
the
pointer points to an array in the
dynamically
allocated part of the OOCT buffer (see
below.)
unallocated 41004h A pointer which points to the first unused
byte in the
dynamically allocated part of the buffer.
Only used
during initialization.
length_left 41008h The number of bytes left in the dynamically
allocated
part of the buffer. Only used during
initialization.
num_execs 4100Ch The number of interpreter 110.
zones.sup.2 41010h A pointer to the zone memory, which is in
the
dynamically allocated part of the OOCT
buffer.
OOCT_INIT writes the pointer while compiler
104
reads the pointer. Compiler 104 uses the
zone
memory during compiling.
zones_length 41014h The amount of zone memory. Written by
OOCT_INIT and read by compiler 104
segments.sup.3 41018h A pointer to the segment memory, which is in
the
dynamically allocated part of the OOCT
buffer.
OOCT_INIT writes the pointer while compiler
104
reads the pointer. Compiler 104 uses the
segment
memory to store compiled code.
segments_length 4101Ch The amount of segment memory. Written by
OOCT_INIT and read by compiler 104
branch_11_tables.sup.4 41020h A pointer to level-one (L1) branch cache
structures,
which are in the dynamically allocated part
of the
OOCT buffer.
branch_record_free.sub.-- 41024h A list of unused BRANCH_RBCORD
structures,
list.sup.5 which are in the dynamically allocated part
of the
OOCT buffer.
branch_header_table.sup.6 41028h A hash table containing BRANCH_RECORD
structures. The table is dynamically
allocated in the
OOCT buffer.
branch_log_lock 4102Ch A lock which must be held to write to the
branch
log.
branch_seed_buffer 41030h A buffer which the interpreter 110 use to
send seeds
to compiler 104.
num monitor_seed_ 41060h A counter that tells how many messages the
messages interpreter 110 have sent to compiler 104,
but
compiler 104 has not finished.
seed_threshold_mode 41064h A flag that tells the interpreter 110 how to
pick a
seed. The seed is either OOCT_DEBUG_MODE or
OOCT_PERFORMANCE_MODE.
seed_production.sub.-- 41068h The threshold number of times a branch
must
threshold execute before its destination becomes a
seed for
compiler 104.
trickle_flush_11_rate 4106Ch The number of times a branch can be updated
in an
L1 cache before the branch is flushed from
the cache
and written back to memory.
seeds_sent 41070h UNUSED
seeds_handled 41074h UNUSED
exit 41078h Compiler 104 uses this flag to tell the
interpreter 110
that compiler 104 has shut down after
receiving a
signal.
segment_exit 4107Ch An entry point jn the interpreter 110, which
compiled code jumps to upon exit. The code
at this
entry point releases locks if necessary.
segment_exit_interp 41080h An entry point in the interpreter 110, which
compiled code jumps to upon ending with an
instruction that must be interpreted. The
code at this
entry point releases locks if necessary.
segment_exit_log 41084h An entry point in the interpreter 110, which
compiled code jumps to upon ending with a
non-
fixed branch instruction. The code at this
entry
point releases locks if necessary.
sbe_impl 41088h An entry point in the interpreter 110, which
compiled code calls to execute the SBE
instruction
cc_impl 4108Ch An entry point in the interpreter 110, which
compiled code calls to execute the CC
instruction
mv_impl 41090h An entry point in the interpreter 110, which
compiled code calls to execute the MV
instruction
mv_impl_same_size 41094h An entry point in the interpreter 110, which
compiled code calls to execute the MV
instruction
when the lengths of both strings are the
same.
segment_lock 41098h An entry point in the interpreter 110, which
mousetrap compiled code calls to verify that it still
holds a
lock. THIS IS ONLY USED FOR DEBUGGING.
breakpoint_trap 4109Ch An entry point in the interpreter 110, which
compiled code calls to stop in the debugger.
THIS
IS ONLY USED FOR DEBUGGING.
segment_gates 410A0h An array of SEGMENT_GATE structures. The
SEGMENT_GATEs are used to lock segments of
compiled code.
gate_free_list 710A0h A list of currently unused SEGMENT_GATEs.
ooct_stack_bottom.sup.7 710A4h The lowest address of compiler 104's
stack. Points
into the dynamically allocated part of the
OOCT
buffer.
ooct_stack_top.sup.7 710A8h The highest address of compiler 104's
stack. Points
into the dynamically allocated part of the
OOCT
buffer.
build_options 710ACh The options used to build the interpreter
110. In
ooct_compiler_start, compiler 104 checks
that it was
built with the same options.
code_zone.sup.2 710B0h A pointer to an area of dynamically
allocated
memory. Compiler 104 uses this memory to
temporarily create an array of target
instructions. At
the end of compilation, this array is copied
to the
segment memory area and then deleted.
In the dynamically allocated part of the OOCT buffer, the sizes of data structures depend on several variables. One is the number of system pages used by the operating system for the original processor, such as ASP by Fujitsu. For each page of ASP address space that contains instructions to be translated, there is one translated page in the translation table. Another variable is the number of branch instructions that the system expects to log. It currently expects 2.sup.20 branches which affects the size of the BRANCH_RECORD array and the branch header table. The number of interpreter 110 affects the size of the L1 branch logger cache, because there is one cache for each task. FIG. 4 illustrates a picture of the OOCT buffer for one setting of the variables. In FIG. 4, the number of ASP pages is 10 MB of ASP instructions, the number of interpreter 110 is 4 and the total size of the OOCT buffer is 128 MB.
TABLE 2
The dynamically allocated part of the OOCT buffer.
Name Contents
Translation Table.sup.1 For every page of address space used by ASP, there
is one 16
KB page allocated in the translation table. SIZE = Num
system pages * 16 KB.
BRANCH_RECORD We guess how many branch instructions occur in ASP
array.sup.5 (current guess is 2.sup.20) and allocate one
BRANCH_RECORD
for each one. SIZE = 2.sup.20 * 24 bytes = 24 MB.
Branch header table.sup.6 There is one pointer to a BRANCH_RECORD for each
estimated branch. SIZE = 2.sup.20 * 4 bytes = 4 MB.
Branch L1 caches.sup.4 For each interpreter 110, there is one cache with
32 sets, 4
BRANCH_L1_RECORDs per set. SIZE = Num execs * 32
* 4 * 24 bytes. Maximum SIZE = 16 * 32 * 4 * 24 bytes
=
49152 bytes.
OOCT stack.sup.7 A 1 MB stack.
Zone memory.sup.2 A percentage of the remaining memory is used for zone
memory. Currently 50% of memory is used.
Segment memory.sup.3 A percentage of the remaining memory is used for
segment
memory. Currently 50% of memory is used.
Branch Log (Branch Logger 112) The branch log data structures are the BRANCH_RECORD array, the branch header table and the branch L1 caches. Please see the section on interpreter modifications, set forth below, for an explanation of how branch logger 112 works. This section will describe how the branch log is used to communicate information from the interpreter 110 to compiler 104. FIG. 4 illustrates the OOCT buffer after initialization. The sizes of the regions are drawn to scale. For this example, the size of the OOCT buffer is 128 MB, the number of ASP pages is 2560, the number of interpreter 110 is 2 and the expected number of branch instructions is 220. Compiler 104 reads the branch log to find out how many times a conditional branch instruction was taken and how many times a conditional branch instruction was not taken. Compiler 104 uses this information in two ways. First, when compiler 104 parses instructions, compiler 104 tries to parse only the instructions that have been executed most frequently. If there arises a conditional branch instruction, it checks how many times it branched and how many times it fell through. Second, when compiler 104 generates code, the compiler tries to place the most likely successor instruction of a conditional branch immediately after the branch instruction. This makes the generated code run faster. In order to tell which successor is more likely, compiler 104 uses branch log information. Please reference compiler 104 information set forth below for more details. BRANCH_Get_Record (ooct/compiler/branch.c) When compiler 104 wants to read branch log information, it calls the procedure BRANCH_Get_Record with the address of the branch instruction. This procedure looks up the branch in the branch log and returns a pointer to one of the elements of the BRANCH_RECORD array. Compiler 104 can then see how many times the branch instruction was executed, how many times it branched and how many times it fell through. Translation Table (Trans Unit) The translation table contains information about every instruction in the ASP address space. The translation table records whether the instruction is the destination of a branch (JOIN), whether the instruction was sent to compiler 104 as a seed (BUFFERED) and whether there is a compiled code entry point for the segment (ENTRY). When OOCT is initialized, the translation table is empty. When branch instructions are logged, their destinations are marked as JOIN points. If the branch executes more times than the threshold, the destination will be sent as a seed to compiler 104 and the translation table entry will be marked BUFFERED. After compiler 104 finishes compiling the translated version, it stores the addresses of entry points in the translation table and marks them as ENTRYs. FIGS. 5a, 5b and 5c illustrate the structure of a translation table according to a preferred embodiment of the present invention. As illustrated in FIG. 5a, an ASP address is divided into two parts. The high 20 bits are the page number and the low 12 bits are the page offset. FIG. 5b illustrates that the page number is used as an index into the first level translation table. The pages that ASP act are in the first level table. The pages that ASP does not use have no pointers because there will never be an instruction with that page number. The pointers point into the second level translation table. Adding the page offset to the pointer gives a translation table entry. As illustrated in FIG. 5c, each entry is 32 bits long and its fields are shown at the bottom. The first bit says whether the ASP instruction is a join point. The second says whether there is a segment entry point for the instruction. The third says whether the instruction was sent to compiler 104 as a seed. The other bits of the translation table entry are the entry point address for the instruction if there is one or 0 if there is no entry point. Since the K machine architecture has variable length instructions, the translation table has an entry for every ASP address, including addresses that are in the middle of instructions and data addresses. This makes the table very large but it simplifies the job of locating the translation table entry for an address. The structure of the translation table is shown in FIGS. 5a, 5b and 5c. As mentioned above, the second level translation table has a 32 bit entry for every ASP address. So if ASP uses 10 MB of space, the second level translation table uses 40 MB. There are several procedures and macros that read and write the entries of the translation table: TRANS_Set_Entry_Flag (ooct/common/trcommon.h) The TRANS_Set_Entry_Flag macro turns on one of the flags, JOIN, ENTRY or BUFFERED, of the translation table entry. It uses an assembly language instruction with the lock prefix so that it sets the bit atomically. TRANS_Reset_Entry_Flag (ooct/common/trcommon.h) The TRANS_Reset_Entry_Flag macro turns off one of the flags, JOIN, ENTRY or BUFFERED, of the translation table entry. It uses an assembly language instruction with the lock prefix so that it resets the bit atomically. TRANS_Entry_FlagP (ooct/common/trcommon.h) The TRANS_Entry_FlagP macro reads and returns the state of one of the flags, JOIN, ENTRY or BUFFERED, of the translation table entry. TRANS_Test_And_Set_Entry_Flag (ooct/common/trcommon.h) The TRANS_Test_And_Set_Entry_Flag procedure atomically reads the state of one of the flags, JOIN, ENTRY or BUFFERED, and turns it on if it was not already on. It returns the state of the flag before calling the procedure. TRANS_Set_Entry_Address (ooct/common/trcommon.h) The TRANS_Set_Entry_Address procedure writes the entry point address of the translation table entry. It uses an assembly language instruction with the lock prefix so that it writes the address atomically. Note that an entry point address is the address of an target instruction if there is no segment locking, but it is the address of a SEGMENT_GATE data structure if there is segment locking. TRANS_Get_Entry_Address (ooct/common/trcommon.h) The TRANS_Get_Entry_Address procedure reads and returns the entry point address of the translation table entry. Note that an entry point address is the address of an target instruction if there is no segment locking, but it is the address of a SEGMENT_GATE data structure if there is segment locking. Segments A segment is a unit of compiled code that may be executed by the KOI system. Compiler 104 material set forth below describes how a segment is created and deleted. This section describes how compiler 104 tells the interpreter 110 about a segment, how the interpreter 110 enter and leave the segment and how compiler 104 tells the interpreter 110 to stop using one segment and switch to another. When a segment is created, there are several ASP instruction addresses where the interpreter 110 can enter the segment. For each of these addresses, compiler 104 creates an entry point to the segment. An entry point is a special point in the segment where the interpreter 110 is allowed to jump. At other points in the segment, the compiled code assumes that certain values are in registers, so it is not safe to jump there. To tell the interpreter 110 where the entry points are, compiler 104 calls TRANS_Set_Entry_Address for each nth TRANS_Get_Entry_Address. The interpreter 110 check for compiled code segments when they enter branch logger 112. They call TRANS_Entry_FlagP to see if the current ASP address has an entry point. If it does, they call TRANS_Get_Entry_Address to read the address. If segment locking is on, they lock the segment (see below) and then jump to the entry point. If segment locking is off, they just jump to the entry point. The compiled code decides when it should exit. Usually, this happens when it needs to execute an instruction that is not part of the same segment, so it jumps to interpreter 110. Compiler 104 can delete one compiled code segment and tell the interpreter 110 to use another one. Compiler 104 does this by turning off the ENTRY bit of the translation table entry, changing the entry point address and then turning on the ENTRY bit again. Segment Locking Segment locking is an optional feature of the OOCT system. Since branch logger 112 gains more information as the system runs, compiler 104 can produce a new version of a segment that is better than the old one. Segment locking permits compiler 104 to replace an old segment with a new one and reclaim the memory used by the old segment. Unfortunately, segment locking makes branch logger 112 and compiled code slower. So there is a tradeoff between the time to execute OOCT code and the space that it uses. This section describes how the segment locking works. The segment locking code has two main parts. The first part is an interface for all parts of the OOCT system except the segment locking implementation. This interface guarantees that a segment can only be in one of four well-defined states and will change states atomically in well-defined ways. The second part is the implementation of segment locking itself, which fulfills the guarantees made by the interface. Design The states that a segment may be in are shown in Table 3. A segment may be either reachable or unreachable and it may be either locked or unlocked. Segments are reachable when there are one or more entry points in the translation table. It is unreachable when there are no entry points to the segment in the translation table. An entry point is a structure that contains a lock and an instruction address. The lock, which may be used by more than one interpreter 110 at the same time, counts how many interpreter 110 are using the entry point and the segment containing it. A segment is locked when one or more of its entry points are locked. It is unlocked when all of its entry points are unlocked. Compiler 104 may reclaim and delete a segment if it is unreachable and unlocked, but it cannot reclaim it if it is reachable or locked. Every segment begins in state U/U when compiler 104 creates it. It moves to state R/U when compiler 104 writes its entry points to the translation table. It can move to state R/L and back to R/U as interpreter 110 enter and leave the segment. Compiler 104 may create a new segment that translates the same instructions as an old segment. In this case, it will overwrite the old segments entry points in the translation table, which makes it unreachable. When compiler 104 overwrites the segments last entry, it goes from state R/L to U/L if an interpreter 110 is using it, or from state R/U to U/U if no interpreter 110 was using it. Eventually, all interpreter 110 using the segment will release their locks and the segment will be in state U/U. Compiler 104 can then reclaim the segment and delete it because no interpreter 110 is using it and none can enter it.
TABLE 3
The states that a segment can be in
State Reachab Locke Description
U/U No No No interpreter 110 is using the segment and no
interpreter 110
can enter it. Compiler 104 can delete it at any
time.
R/U Yes No No interpreter 110 is using the segment but an
interpreter 110
R/L Yes Yes One or more interpreter 110 are using the segment
and other
U/L No Yes One or more interpreter 110 are using the segment
but no
FIG. 6 illustrates interpreter 110 for entering and leaving a segment 122 according to an embodiment of the present invention. The segment 122 in the middle of the drawing is the unit of code produced by compiler 104. Segment 122 must be locked at all times when used by interpreter 110. Accordingly, a lock counter (not shown) is incremented before entering segment 122 and the lock counter is decremented after leaving segment 122. Since the interpreter 110 cannot lookup the entry point and lock the entry point atomically, it must be determined whether the entry point did not changed after being locked. FIG. 7 illustrates a compiler 104 method for creating a segment, making the segment reachable by interpreter 110, making old segments unreachable, and deleting old segments. In step S200, compiler 104 creates a new segment and adds associated entry points to the translation table. When an entry point is added in step S200, an older entry point may be re-written. The older entry point is now unreachable, and accordingly may be reused if no task (such as interpreter 110 or compiler 104) holds a lock on it. The old entry point is put on a reclaim list (not shown). Step 202 illustrates how compiler 104 uses the reclaim list. Step 202 checks whether an entry point is locked. If the entry point is not locked, then the entry point is not being used by any interpreter 110, and therefore can be removed from the segment that owns it. However, if that segment does not have any more entry points, then the segment is not being used by a task (such as interpreter 110 and compiler 104) and no task can enter it. Therefore, the segment can be deleted. The segment locking interface allows most parts of OOCT to ignore the details of synchronization because a segment always appears to be in a well-defined state and all state transitions appear to happen atomically. However, within the segment locking code the transitions are not atomic because the Intel target does not support such complicated operations in hardware. Therefore, the segment locking code makes the transitions appear to be automatic. Implementation Procedures for execution of the interpreter 110 and compiler 104 are illustrated in FIG. 6 and FIG. 7, respectively. The two procedures cooperate to ensure that each transition appears automatic. The numbered references in the following description refer to FIG. 6 and FIG. 7. There are six possible transitions among the four states of the segment interface and they fall into four groups. The first transition is U/U to R/U, when compiler 104 makes a segment reachable by writing its entry points into the translation table (*6). Since compiler 104 is the only task allowed to write the translation table, no synchronization is necessary to make this transition automatic. The second group of transitions is R/U to U/U and the similar one from R/L to U/L. These happen when compiler 104 overwrites the last entry point of a segment in the translation table (*306). Although compiler 104 can atomically write a new entry point in the translation table, the interpreter 110 cannot atomically read and lock an entry point (*301, *302). The interpreter 110 has to read the entry point in one operation and lock it in another operation. This exposes a potential problem if an interpreter 110 reads an old entry point from the translation table, then compiler 104 writes a new one, and then the interpreter 110 locks the old entry point. In this case, compiler 104 assumes that the entry point is unreachable but the interpreter 110 is able to enter the segment, which is an error. To prevent this problem, the interpreter 110 checks that the translation table contains the same entry point after locking (*303). If the translation table contains the same entry point, then it is still reachable and it is safe to enter the segment. If the translation table does not contain the same entry, the interpreter 110 must release its lock and not enter the segment. The third group of transitions is R/U to R/L and its opposite from R/L to R/U. The first one happens when an interpreter 110 reads the entry point from the translation table and locks it (*302). The second one happens when the interpreter 110 leaves a segment at its exit (*304) and goes to the unlock procedure (*305). It is important that the locking and unlocking instructions are not themselves in the segment because any time the segment is unlocked, compiler 104 may delete it (*3011). The fourth transition is from U/L to U/U. It also happens when the interpreter 110 leaves a segment (*304) and goes to the unlock procedure (*305). After this transition occurs, the segment is unlocked and compiler 104 will pass the two tests (*309, *3010) and delete the segment (*3011). Since the interpreter 110 can hold the lock on a segment for an arbitrary amount of time, it is inefficient to make compiler 104 wait for a lock. Therefore, compiler 104 does not try to lock entry points to prevent interpreter 110 from using them. Instead, it just makes the segment unreachable and later checks whether the lock has been released (*309). Once the lock is released, the entry point can be freed and reused. Monitor Message Queues The interpreter 110 send seed addresses to compiler 104. They use two message queues to send them. The first one uses the KOI system calls ScMsgSnd and ScMsgRcv to send and receive seeds. The second queue uses a shared memory area in the OOCT buffer. The shared area is called the branch_Seed_Buffer. The reason for using two queues is that each has one advantage and one disadvantage. The KOI system call is expensive for the interpreter 110 to use so it should not be used very frequently. However, the AOI system call allows compiler 104 to block when there are no seeds to compile. This allows the KOI system to use compiler 104 CPU to do some other work. The advantage of the shared memory buffer is that it is very cheap for the interpreter 110 and the disadvantage is that compiler 104 cannot block when there are no seeds. By using both queues, OOCT gets the advantages of both methods. When compiler 104 is idle, it calls ScMsgRcv to block. In this case, the interpreter 110 sends the next seed with a ScMsgSnd call to wake compiler 104 up. When compiler 104 is working, the interpreter 110 sends seeds through the branch_Seed_Buffer area, which is faster. After compiler 104 finishes one compilation, it checks for sch_Seed_Buffer area. If there are any then it compiles them. When it finishes the all seeds, it calls ScMsgRcv again and blocks. V. Interpreter Modifications (Exec Unit) The design of OOCT includes three types of modifications to interpreter 110. First, OOCT needs to be initialized by interpreter 110. Second, interpreter 110 has been modified to use branch logging. Finally, interpreter 110 has been modified to allow transitions to and from compiled code. This document will describe the details of those modifications. The OOCT interpreter code can run in two modes, OOCT_PERFORMANCE_MODE and OOCT_DEBUG_MODE. This documentation describes all of the features of OOCT_PERFORMANCE_MODE and notes where OOCT_DEBUG_MODE is different. Initialization Before OOCT compiles any code or logs any branches, interpreter 110 calls OOCT_INIT to initialize the OOCT data structures. OOCT_INIT and the procedures that it calls perform the following steps. Initialize the translation table. The MCD instruction tells OOCT the pages in the systems address space. The procedure TRANS_Execution_Init creates the first level translation table so that the entries for system pages point to second level translation table arrays. These arrays are zeroed out at initialization. See the Communications section for more details about the translation table. Initialize branch logger 112. The procedure BRANCH_Execution_Init initializes memory in the OOCT_buffer for several data structures. First there is the branch log itself which contains profile information about branch instructions. Second there is a level-one (L1) cache which makes branch logger 112 operate faster. Third there is a seed buffer which contains seeds sent from branch logger 112 to compiler 104. Fourth there are several global functions which compiled code calls. Their addresses are stored in the OOCT_buffer during BRANCH_Execution_Init. See the above section on branch logger 112 for more information about the branch log and level-one cache. Allocate compiler 104s stack memory. Compiler 104 uses a special large stack that is allocated in the OOCT_buffer. 1. Allocate compiler 104's zone memory. Compiler 104 uses this memory in the OOCT_buffer during compilation. 2. Allocate the compiled segment memory. The compiled code is placed in this area of the OOCT_buffer. 3. Zero out statistical information. Most information in the OOCT statistics area is reset when OOCT is initialized. Branch Logger Interface with Interpreter When interpreter 110 executes a branch instruction in system code and the OOCT mode bit is set, interpreter 110 calls branch logger 112 through one of the following routines:
_declspec(naked) OOCT_Log_Unconditional_Fixed_Branch( )
Invoked by interpreter with a branch
Arguments: ecx: address of branch instruction
Returns: Does not return (acts like a jump to IC_FETCHO2)
_declspec(naked) OOCT_Log_Unconditional_Non_Fixed_Branch( )
Invoked by interpreter with a branch
Arguments: ecx: address of branch instruction
Does not return (acts like a jump to IC_FETCHO2)
_declspec(naked) OOCT_Log_Conditional_Fixed_Branch_Taken( )
Invoked by interpreter with a branch
Arguments: ecx: address of branch instruction
Returns: Does not return (acts like a jump to IC_FETCHO2)
_declspec(naked) OOCT_Log_Conditional_Fixed_Branch.sub.--
Not_Taken( )
Invoked by interpreter with a branch
Arguments: ecx: address of branch instruction
Returns: Does not return (acts like jump to IC_FETCHO2)
These four routines check for a compiled code entry point for the destination address and jump to the entry point if it exists. If it does not exist, then the routines update the branch log by calling branch_L1_Touch (see next section) and then jump to interpreter 110's fetch routine. Updating Branch Log Tables FIG. 8 illustrates a structure of a BRANCH_RECORD according to a preferred embodiment of the present invention. The branch logging code counts how many times a branch has executed. There are two data structures that branch logger 112 uses to store the counts. First, there is the branch log, which is shared by all simulated processors in a multi-processor system. Second, there is one level-one (L1) cache for each simulated processor in the system. The branch execution counts are first written to the cache and then written to the branch log. This section describes the structure of the L1 caches and the branch log. It also describes how branch logger 112 uses them. The information for each branch is stored in a structure called a BRANCH_RECORD. It includes the address of the branch, the destination of the branch, the fall through instruction following the branch, the approximate number of times the branch has executed and the approximate number of times the branch was taken. The last field of the BRANCH_RECORD is a pointer to another BRANCH_RECORD. It is used to connect BRANCH_RECORDs in a linked list. The hash table is organized as an array of linked lists. FIG. 9 illustrates the structure of the branch log. It is a large hash table that stores BRANCH_RECORDs. Each interpreter 110 has its own copy of the variable local_branch_header_table, but they all point to the same array in the OOCT buffer area. The elements of the local_branch_header_table are pointers to lists of BRANCH_RECORDs. The procedure for finding a BRANCH_RECORD for a branch has 3 steps. 1. Hash the destination address. (index=BRANCH_HASH(destination_address) % BRANCH_TABLE_SIZE.) 2. Get the head of the list. (list=local_branch_header_table[index].) 3. Walk down the list until you find a record with the same branch address. (while (list->branch_address!=branch_address) list=list->next.) FIG. 9 particularly illustrates that the variable local_branch_header_table is an array of pointers to lists. Each list contains BRANCH_RECORDs that have the same destination address. When there is no list, the pointer in local_branch_header_table is NULL. The branch log contains all of the information about branches, but it has two problems. First, looking up and inserting BRANCH_RECORDs are slow operations. They a re too slow to do every time interpreter 110 logs a branch. Second, every interpreter 110 uses the same branch log. In order to keep the lists of BRANCH_RECORDs consistent, only one Exec can access the branch log at one time. This slows down the multi-processor system even more than the single processor system. In order to fix these problems, there is an L1 cache for each interpreter 110. The L1 cache can be accessed quickly and the interpreter 110 can access their L1 caches in parallel. Each L1 cache is a 2-dimensional array of BRANCH_L1_RECORD structures. The base address of the array is stored in the variable branch_L1_table. FIG. 10 illustrates the structure of the L1 cache. The cache is a 2-dimensional array of BRANCH_L1_RECORDs. The first dimension is BRANCH_L1_SETS (currently 32) and the second dimension is BRANCH_L1_SETSIZE (currently 4.) Each row of the array is one set. The same branch instruction always uses the same set of the cache, but it can be at different places. As illustrated in FIG. 10, the L1 cache is organized into sets. The set number for a branch is equal to (branch_address+branch_destination) % BRANCH_L1_SETS. The 4 members of the set hold the 4 most recent branches with the same set number. This is called 4-way set associativity. It improves the performance of the cache when there are several branches executed at almost the same time that have the same set number. FIG. 11 illustrates a method for executing operation of the L1 cache by the interpreter 110 according to an embodiment of the present invention. In other words, FIG. 11 illustrates a branch logging method by using the L1 cache. The optimizing object code translation method utilizes two forms of memory to record non-compiled branches, namely 1. a branch log having a dynamically changing size in proportion to the number of recorded branches, and 2. a branch cache, entitled an L1 cache, in which a limited number of non-compiled recorded branches are stored according to an order which enhances access. The branch log and the L1 cache represent virtual memory locations which are managed by an operating system. Thus, the term "L1 cache" is arbitrarily given to the cache for storing non-compiled branches and should not be confused with the `L1 cache` which is generally found on a processor such as the Pentium Pro. The optimizing object code translator according to the present invention provides that interpreter 110 may call a plurality of different branch logging routines. However, each branch logging routine itself calls a subroutine which decides to jump to compiled code or to log a branch instruction. This subroutine is particularly illustrated in FIG. 11. In view of the above, to execute the branch logging method with the L1 cache, the method is first started in step S400. In step S401, the interpreter 110 first checks for a compiled code entry point for the branch destination (i.e. whether the segment at issue has been previously compiled). If there is an entry point, i.e. "yes," then there is a compiled segment and flow jumps to step S402 for immediate execution of the compiled code segment. Execution then proceeds with the compiled code segment until an end flag is reached, and flow then returns for execution of the next segment. Of course, the branch is not recorded in the branch log because the branch has already been compiled. If there is no entry point in step S401, i.e. "no", then there is no compiled code corresponding to the branch instruction. Flow then proceeds to step S404 and the interpreter 110 looks into the L1 cache to determine if there is a possible match between the branch and the plurality of branches stored in the L1 cache. Step S404 determines if there is a match between the branch and the plurality of branches stored in the L1 cache. The L1 cache is divided into a plurality of sets with each set being designated by a unique set number. According to an embodiment of the present invention, each set is contains four branches. Step S404 first determines a cache set number "S" corresponding to the current branch address, with S=(branch_address+branch_destination) % BRANCH_L1_SETS. Next, each element of the branch_L1_table[S] is sequentially checked against the current branch address and destination. If a match is detected, i.e. "yes", then flow proceeds to step S406 and the fields "encountered_sub_count" (a field which designates how many times the branch was encountered) and "taken_sub_count" (a field which designates how many times the branch was taken) are updated. Flow then proceeds to step S407. In step S407 it is determined if the current branch address has been encountered greater than a predetermined threshold number. The preferred threshold value is on the order of 1000 hits. Thus, the field encountered_sub_count is compared with the threshold value in step S407. If the threshold value is exceeded, i.e. "yes", then flow proceeds to step S410 and the cached information for this branch is written back to the branch log. On the other hand, if the threshold value is not exceeded, i.e. "no" then flow proceeds to step S412. Step S412 is an end of the current subroutine which jumps to IC-FETCHO2, i.e. the entry point of the interpreter 110. If the correct branch is not in the cache, i.e. "no" in step S404, then flow proceeds to step S408 and one BRANCH_L1_RECORD (i.e. the record containing all fields which may be updated, such as encountered_sub_count and taken_sub_count) in the set designated by "S" above is removed from the L1 cache and written to the branch log. Next, the current branch information is written into the set designated by "S". Moreover, during writing of the current branch record into the set "S", the current branch record is placed as the first element of the set. This is because the same branch will very likely be executed again, thereby increasing performance and efficiency of the system. In other words sets S404 will be executed faster. Even when the branch is in the cache, i.e. "yes", it may be copied to the branch log if it has been executed a large number of times since it was last flushed. When the L1 cache is used, the sequence of steps is almost always S400, S404, S406, S407, and S412. Accordingly, the present invention seeks to make those steps as fast as possible. When the current branch information is put in the first element of the set, the branch information makes step S404 faster because the interpreter 110 is likely to execute the same branch again. The branch logging method set forth above reduces a burden on the processor by executing code which has been previously compiled and enhancing access to often called branch instructions which have not yet reached the threshold level for compilation. In this regard, the main purpose of OOCT is to make step S400 take the "yes" branch almost every time. If a branch is executed frequently, then there should be a compiled code segment for its destination. A secondary goal is to make the "no" path following step S401 faster, so that branches which have not yet been compiled will not appreciably slow down program execution. The slowest part of the "no" path is referred to as "flush." In both steps S408 and S410, branch information is "flushed" from the L1 cache and written to the branch log. It become necessary to flush a branch's information in order to send a seed to the compiler, which will cause compiled code to be generated and cause step S400 to answer "yes" for this branch in the future. However, it is not necessary to flush the branch's information every time a non-compiled branch address is executed. Flushing every 100 executions or less is often O.K. Therefore, the present invention seeks to increase the speed of steps S400, S404, S406, S407, and S412, which include no flushes. Thus, the faster path is always taken unless one of two things happen. In step S404, it is possible for the branch information not to be found in the set, so we take the "no" path to S408. In step S407, if the branch was executed more than the "threshold" number of times, it will take the "yes" path to S410 which also includes a flush. In OOCT_DEBUG_MODE, the L1 cache method is still used, but the threshold for flushing the cache is set to 1, so the information is written to the branch log on every branch execution. This makes the OOCT_DEBUG_MODE much slower. Seed Selection When a branch instruction is executed very frequently, branch logger 112 sends its destination address to compiler 104. This address is called a `seed` and choosing seeds is a very important part of the OOCT system. Seeds should be addresses that are at the beginning of a procedure or at the head of a loop. Therefore, branch logger 112 only sends seeds that are the destination of an unconditional branch. Seeds should be addresses that are executed frequently, so a branch destination becomes a seed only when its encountered_count field is greater than a threshold. The threshold is stored in the OOCT buffer in the field named seed_production_threshold. The threshold can change over time, which is described in the next section. Threshold Setting There are two bad things about using a fixed threshold to decide whether to send a seed. First, the threshold might be too high while compiler 104 is idle. In this case, there is useful work for compiler 104 to do, but branch logger 112 does not tell compiler 104 what to do. Second, the threshold might be too low while the message queue is full. In this case, branch logger 112 will try to send a seed even though the seed will not fit in the queue, which is a waste of time. Fortunately, it is possible to detect the two situations, when compiler 104 is idle and when the message queue is full, and change the threshold. Branch logger 112 detects that compiler 104 is idle in the procedure branch_Update_Entry by reading the OOCT buffer field named num_monitor_seed_messages. If this field is 0, then compiler 104 has finished all of the seeds that were sent. The threshold is too high, so branch logger 112 lowers it. Branch logger 112 detects a full message queue in the procedure branch_Send_Seed when it tries to send a seed and gets an error code indicating that the message was not sent. The threshold is too low, so branch logger 112 raises it. In OOCT_DEBUG_MODE, the threshold never changes. Its value is set to the third argument of the OOCT_INIT procedure in this case. Handling Multitasking OOCT runs on a multiprocessor system with multiple interpreter 110. These tasks have individual branch L1 caches, but they use the same branch log table. When branch information is flushed from the L1 cache to the branch log table, the interpreter 110 acquires a log on the table so that it will not conflict with any other Exec. There are two possible ways to handle contention for the branch log lock. The first is to make an interpreter 110 wait until the lock is available and then get the lock and write its branch information. This makes the interpreter 110 run more slowly but makes the branch log more accurate. The second is to give up without writing the branch information if the interpreter 110 cannot get the lock. This way makes the interpreter 110 faster but loses some branch logging information. OOCT uses the second way because the speed of interpreter 110 is more important than the accuracy of the branch log. The branch log information only needs to be approximately correct for the system to function well. When OOCT is running with multiple interpreter 110, one of the tasks is the special master task that calls OOCT_INIT to initialize the OOCT buffer and the branch logging data structures. The other tasks are slave tasks that only have to initialize some local variables and their branch L1 caches. The slave tasks call SlaveOOCT_Init after the master task has finished initializing the OOCT_buffer. The synchronization between master and slave tasks uses the following methods. Master Method 1. Execute the MCD instruction to turn OOCT on. 2. Call OOCT_INIT, which initializes the OOCT buffer and branch logging data structures. 3. Wake up slave tasks. 4. Jump to interpreter. Slave Method 1. Go to sleep. Wake up when master task executes (step 3 above). 2. Call SlaveOOCT_Init, which initializes the task's individual branch L1 cache. 3. Jump to interpreter. User/System Space Transitions The OOCT system only compiles instructions from the system pages of the ASP address space. It ignores the user pages. The OOCTSTS bit of interpreter 110's individual area controls whether branch logger 112 is called or not. This bit is primarily controlled by the two macros NEXT_CO and NEXT_OUN. However, there is one case where the OOCT code has to set this bit. When a compiled code segment ends with a non-fixed branch instruction, it may cause the PSW_IA to move from system space to user space, which requires setting OOCTSTS to 0. So a compiled code segment that ends with a non-fixed branch jumps to the routine branch_Exit_Log which checks the destination address and sets the OOCTSTS bit correctly. Compiled Code Interface Transition to/from Compiled Code Interpreter 110 transfers execution to compiled code when interpreter 110 calls a branch logging routine and it finds a compiled code segment for the branch destination (see FIG. 11.) When segment locking is turned off, interpreter 110 jumps directly to the entry point. When segment locking is turned on, interpreter 110 must attempt to lock the segment before jumping to the entry point. If it locks the segment, then it jumps to the entry point. If it fails to lock the segment, then it jumps back to interpreter 110. There are several ways for execution to leave a compiled code segment, which are described in Table 4. In all cases, when control jumps back to interpreter 110, the ESI and EDI registers have correct values and the individual area of interpreter 110 has perfect K status.
TABLE 4
How control leaves a compiled code segment.
Final K opcode What the compiled code segment.
Fixed branch or Tests if the destination address has a compiled entry
straight-line K point. If it does, then it makes an intersegment jump
opcode to the entry point. If it does not, then control is
passed back to interpreter 110 at IC_FETCHO2,
or to branch_Exit when segment locking is on.
Non-fixed branch Jumps to branch_Exit_Log which sets the
OOCTSTS bit and then invokes branch logger 112
if the PSW_IA is still in a system page.
LPSW, SSM, Without segment locking: Jumps to IC_FETCHO2
STNSM, MCD, to execute the opcode
CALL, RRTN, With segment locking: Jumps to
SVC, MC, BPC, branch_Exit_Interpret.
LINK, LINKD,
LOAD, LOADD,
DELT, DELTD,
FBFCC
SAM opcode that Without segment locking: Jumps to IC_FETCHO2
switches to RISC to execute SAM opcode
mode With segment locking: Jumps to
branch_Exit_Interpret.
When segment locking is on, the interpreter 110 will be holding a lock on the compiled code segment while it is executing that code. It must release this lock after it leaves the segment, so the compiled code calls some procedures in branch logger 112 which release the lock and then jump to interpreter 110. Interrupts There are several interrupts that can occur while compiled code is executing, such as IO interrupts or MCHK interrupts. The compiled code checks the INTINF field of the individual area to detect whether an interrupt has occurred. It checks this field inside of any possibly infinite loop, which ensures that it does not ignore the interrupt forever. If an interrupt does occur, the compiled code calls interpreter 110 routine IU_OINTCHK with perfect K status. It expects that interpreter 110 will return to the compiled code. Interpreted Callbacks Some K opcodes are not translated by OOCT. Instead the compiled code calls interpreter 110 subroutine IC_OOCT to interpret the opcode and return back to the compiled code. The compiled code makes sure that the ESI and EDI registers have the correct values and that the individual area has perfect K status before calling IC_OOCT. If interpreter 110 detects an error while executing the IC_OOCT subroutine, it calls the procedure OOCT_EXCP and does not return to the compiled code. If segment locking is turned on, then OOCT_EXCP releases the segment lock. Exceptions When a translated opcode has an unmasked exception, such as an operation exception or a zero divisor exception, the compiled code calls an interpreter subroutine IC_PGMxx, where the xx is the error code number between 01 h and 21 h. Interpreter 110 tries to handle the exception and return. When interpreter 110 cannot return, it calls OOCT_EXCP, which releases any segment lock. Use of Global Functions Some K opcodes, such as character processing operations, translate into a large number of target opcodes. Making multiple translations of these opcodes would use too much of the segment memory re subroutines called global functions which the compiled code calls to execute these opcodes. These global functions are just like interpreter 110 routines that execute K opcodes, except that they are specially written to be called from compiled code and return to compiled code. There are global functions for five opcodes, SBE, CC, MV, TS and C. Experiments show that the global functions are much faster than calling the IC_OOCT entry point of interpreter 110 and they use much less memory than compiling the opcode into target instructions multiple times. VI. Compiler Overview Before delving into the details of compilation, it is important to understand at a high level the main purpose of compiler 104 and to understand the structure of compiler 104. The purpose of compiler 104 is to translate heavily executed portions of the currently executing program into optimized target code and to make this code available to interpreter 110 for execution. FIG. 12 particularly illustrates an overall structure of compiler 104. Compiler 104 receives seeds from the branch logger 112 (discussed above) which start the compilation process. The seed is the address of a original instruction that has been the target of a large number of branches in the currently executing program. This is intended to give a starting point for finding a heavily executed portion of the currently executing program. The block picker 114 uses this seed along with other information provided by branch logger 112 to pick sections of the program that should be compiled. Once the original code to be compiled has been chosen it goes through three major stages. The first stage is to convert the K opcodes into an intermediate language (IL) which used by the rest of compiler 104. The intermediate language is generated by IL generator 124. The second stage performs various analyses and optimizing transformations on the IL by way of optimization set forth above and designated for reference as optimizer 126. The final stage converts the IL into relocatable machine code and is designated as optimizing code generation unit 118. The final job of compiler 104 is to make the optimized code available to interpreter 110. A segment data structure is then created with a copy of the optimized code by way of segment installation unit. The segment is then installed into a shared area within the OOCT buffer (not shown). The translation table is finally updated so that any branches by interpreter 110 to the compiled K code will use the new target code instead. The rest of this section will discuss in detail each of the above compiler 104 stages. A number of other miscellaneous implementation details will also be discussed at the end of the section. Block Picking Compiler 104 receives a single seed address to start compilation. Beginning at the seed, it reads original instructions until it has read something like a procedure body. Then it passes this set of original instructions to the next compiler 104 stage, IL generation. The units of instructions that compiler 104 reads are called basic blocks, so this stage is called a block picker, i.e. block picker 114. A basic block is a sequence of instructions where control can only enter at the first instruction and can only exit at the last instruction. This means that only the first instruction can be the target of a branch and only the last instruction can be a branch instruction. It also means that if the first instruction of the block is executed then all of the instructions will be executed. Block Picker FIG. 13 illustrates an example of block picker 114 according to an embodiment of the present invention. The procedure OOCT_ParseFrom implements the block picker 114. It reads one basic block at a time. A basic block ends for one of five reasons. 1. If the parser reads a branch instruction, then the block ends with the branch. 2. If the next instruction was already parsed, then the block ends with the current instruction, because each K opcode should only appear one time in one segment. 3. If the next instruction is a join point, then the block ends with the current instruction because join points have to be at the beginning of a basic block. 4. If the current instruction is a factor on and it could be followed by data instead of instructions, then the block ends with the current instruction. 5. If the current instruction is an illegal instruction, then the block ends with the current instruction. After reading each block, block picker 114 decides what action to take next, depending on the way the block ended. The possible actions are illustrated in Table 5.
TABLE 5
Action after reading a block.
End of current Block picker 114 action
block
Conditional Continue parsing at the fall through instruction and
branch the branch destination instruction.
Unconditional Continue parsing at the branch destination
fixed branch instruction.
Non-fixed branch Stop parsing because branch destination is unknown.
Factor of end Stop parsing because the next byte might not be an
instruction or instruction.
Illegal instruction
Other instructions Continue parsing at the fall through instruction.
An example is illustrated in FIG. 13. Block picker 114 begins at the seed instruction, which is an LB instruction. Since that is not a branch or factor of end instruction, it continues to the next instruction. That one is a TH instruction, which is a conditional branch. Block picker 114 stops reading the current block because of the conditional branch. It continues reading new blocks at both the LH and LF instructions. When it reads the SVC instruction, block picker 114 ends that block because SVC is a factor of end instruction. When it reads the GO instruction, block picker 114 ends that block because GO is a branch instruction. It continues reading at the L8 instruction because it is a branch destination. After it reads the ST8 instruction, block picker 114 ends the block because it has already read the next instruction. There is an upper limit on the number of instructions that block picker 114 will read. The purpose of the limit is to prevent compiler 104 from running out of memory while compiling the source instructions. The limit is set by the constant MAX_KINST_NUM in OOCT_trace.c and it is currently 500. Block picker 114 can cause a page fault when it tries to read an instruction. If it gets a page fault, block picker 114 stops reading the current block but continues reading from any branch destinations that it has not tried yet. This allows compiler 104 to create a segment even if it cannot parse all of the instructions that can be reached from a seed. Block Layout After choosing the basic blocks to be block picker calls the procedure OOCT_GenerateIL to create the IL instructions that the rest of compiler 104 will use. At this time, it is possible to rearrange the order of blocks. This is called block layout and it helps compiler 104 produce better code for the Pentium Pro processor because the Pentium Pro runs faster if forward conditional branches are not taken. Consider the example in FIG. 13. It has one conditional branch, the TH instruction. In the original instructions, the fall through basic block is the one beginning with LH and the destination block is the one beginning with LF. If the conditional branch is taken 75% of the time, then it will run faster if the LF basic block is put in the fall through position and the LH basic block in the branch taken position. The OOCT_GenerateIL procedure lays out blocks according to the information in the branch log. It places the most common successors of conditional branches in the fall through position whenever it can. This procedure produces a list of IL instructions that are passed to the optimization phases of compiler 104. Intermediate Language (IL) Generation The section will discuss the process of generating compiler 104's intermediate language (IL) representation for the K opcodes. Before directly discussing how the IL is generated, an overview of the IL is given and data structures that are important to understand are described. IL Overview The main analysis and transformation passes of compiler 104 operate on an intermediate language that is a special machine independent instruction set. Using an intermediate language is a standard compiler 104 technique for two main reasons. First, an IL typically has an architecture that simplifies analysis and transformations. Second, an IL allows many different source languages to use the same optimization and code generation stages and eases retargeting to different platforms. The IL used by OOCT (referred to as just the IL from here on) is currently composed of 40 different opcodes listed in Table 6. The instructions fall into three main categories. First, there are functional opcodes such as ADD and LOAD that have a straightforward mapping to standard machine opcodes. Second, there are opcodes that handle control flow such as LABEL and CGOTO. Finally, there are a number of special opcodes that are used as special markers by compiler 104, which do not directly correspond to code that is generated by the back end. These special marker opcodes are described in a separate section. Since the IL represents a virtual machine, it is straightforward to add other opcodes to the IL if further functionality is required. The IL is composed of instructions, each of which specifies one of the opcodes, a type, and a number of pseudoregister arguments. The types supported by compiler 104 are signed and unsigned 8 bit, 16 bit and 32 bit values. Aside from immediate values used by the SET opcode and values loaded from memory with the LOAD opcode, all arguments are passed with pseudoregisters. Pseudoregisters are simply the IL virtual machine's registers. Compiler 104 allows an arbitrary number of pseudoregisters, each of which has a predefined size (e.g. 16 bits). Each pseudoregister directly corresponds to a specified memory location. For OOCT, these memory locations are in the OOCT specific parts of the individual area. This mapping of pseudoregisters to memory locations gives two benefits. First, it streamlines the IL. The IL operations to load commonly used values into temporaries and store them back to memory are not needed. Second, compiler 104 is often able to allocate commonly used values into machine registers, eliminating redundant loads or stores.
TABLE 6
IL Opcodes
OPCODE DESCRIPTION
LABEL Marks a place in the flow graph which could be the target
of jump operations
GOTO A jump to a label
CGOTO A conditional jump to a label based on the boolean value
of a pseudoregister
IGOTO An indirect jump to an address determined by the value of a
pseudoregister
SET Puts an immediate value into a pseudoregister
ASSIGN Moves the value in one pseudoregister into another
pseudoregister
OASSIGN A special marker instruction that shows where pseudo-
registers overlap, to make aliasing explicit
CVT Convert a pseudoregister from one type to another (e.g. sign
extension, zero extension)
NEG, Unary negation, logical complement, byte-swap
CMPL,
BSWAP
ADD, SUB, Binary add, subtract, multiplication, divide, remainder
MUL, DIV,
REM
ASL, ASR Arithmetic shift left, right
LSR Logical shift right
BAND, Binary logical and, or, xor
BOR,
BXOR
EQ, NE, LT, Compares two input operands and sets output operand
LE, GT, GE to true if op1 == op2, op1 != op2, op1 < op2, op1 <= op2,
op1 > op2, op1 >= op2
TESTZ, Compares two input operands and sets output operand to
true if (op1 & op2) = = 0, (op1 & op2) ! = 0
TBSTNZ
CMP Compares two input operands and sets output operand to
-1 if op1 < op2, to 0 if op1 == op2 and to 1 if
op1 > op2. This is not currently used by OOCT
LOAD Load a pseudoregister with a value from a specified
memory location
STORE Store the value of a pseudoregister to a specified memory
location
GCALL Performs a function call to one of a set of predetermined
global functions
ICALL Performs an indirect function call, similar to IGOTO
EXIT Exit the compiled block. This is not currently used
by OOCT
ENTRY Marks a point where control can enter the flow graph
SYNC Marks the points where a set of pseudoregisters are
flushed to memory
EXTMOD Marks a pseudoregister as externally modified. This is
used to handle modification of pseudoregisters by
function calls
SBTCC Sets a boolean to the value of a condition code based
upon an operation. This is used to represent places where
flags are used. Currently, all SETCC operations are folded
into the successor so they are not emitted, but the use of
SETCC makes the flow of the value of the condition code
explicit without requiring compiler 104 to represent
multiple destinations for a single IL operation.
Special IL OPCodes The OOCT IL contains a few opcodes that have special purposes. Most IL opcodes correspond to code that is generated in the back end. Instead, these special instructions act as sign posts to compiler 104 that something special is happening. The IL contains the following special opcodes: ENTRY, SYNC, EXTMOD, and OASSIGN. This section discusses the first three of these opcodes. OASSIGNs are fully set forth above. The ENTRY opcode marks a point where control can enter the flow graph. The code generated by OOCT may have multiple external entry points that represent external join points. Each of the external entries has a corresponding ENTRY IL instruction. The ENTRY instructions occur at the end of the code and are immediately followed by a GOTO instruction that jumps to a label within the main body of code. The reason that an entry is used instead of having the external entry jump directly to the label is to allow the code generator to insert fills between the ENTRY and the jump to the label. FIG. 14 illustrates an outline of code with two external entry points where a fill was inserted between the ENTRY instruction and the GOTO instruction. In other words, FIG. 14 particularly illustrates an entry example according to an embodiment of the present invention. The SYNC opcode is used to guarantee that a range of pseudoregisters is flushed to memory. In particular, OOCT uses the SYNC opcode to guarantee that all the K registers are in the memory locations where interpreter 110 expects to find them. The SYNC acts as a directive to the register allocator, indicating that a pseudoregister that is in a machine register that has been modified should be spilled. A SYNC also acts as a use of any live data, which prevents compiler 104 from dead code eliminating code that only has the effect of modifying K registers. The EXTMOD opcode is used to indicate that a pseudoregister is modified, but that compiler 104 does not have the details of how the register has been modified. Thus, the EXTMOD has two effects. First, it acts as a barrier to optimizations such as constant folding or copy propagation. Second, it forces compiler 104's register allocator to insert a fill before the next use of the pseudoregister. In OOCT, EXTMOD instructions are used after a call back to interpreter 110 to indicate which K registers may have been modified. IL Data Structures Before discussing how the IL is built from the K opcodes, it is useful to have familiarity with the main data structures used in compiler 104. ZONE (compiler/zone.[h,c]) Memory allocation in compiler 104 is handled with an abstraction called a ZONE. The ZONE abstraction is an efficient way of allocating memory such that it can be released all at once. With the ZONE abstraction, allocation is fast and the programmer does not have to worry about memory leaks since destroying the ZONE will reclaim all the memory used.2 In compiler 104, a ZONE is created, and all calls that allocate memory (i.e. what would normally be malloc calls) call ZONE_Alloc with the initially created ZONE. When compiler 104 is done, it calls ZONE_Destroy which de-allocates the entire ZONE (i.e. does the equivalent of a free for all the memory). The underlying implementation of ZONE uses `chunks` of memory. For example, when the ZONE is created, it might `malloc` a block of size 0.times.2000 bytes. Calls to ZONE_Alloc will use that `chunk` of memory until it is used up. When there is not room to service a ZONE_Alloc request from the initial 0.times.2000 bytes, a new `chunk` is created. Further ZONE_Alloc calls will use that `chunk` until it is also used up. In compiler 104, things are complicated a little bit by the fact that memory is all pre-allocated, and thus malloc can not be called. Instead, a special ZONE allocator unit (the ZALLOC unit) is used. The ZONE allocator is initialized with a large pool of memory (0.times.10000 bytes for example). It divides the memory into chunks of the same size that the ZONE will use for allocation, and keeps a list of free chunks. Thus, the `malloc` requests are replaced by a call to ZALLOC_get_chunk that gives back a free `chunk` of memory. Similarly, the calls to `free` in the ZONE_Destroy are replaced with calls to ZALLOC_free_chunk. In the current implementation, the maximum allocation size that can be handled by ZONE_Alloc is the initial chunk size. This limitation could be `fixed` by changing the ZALLOC unit to handle variable size allocations instead of simply handling one size (see the Segment Allocation unit for an example of this type of allocator). There are two reasons that this was not done here. First, a variable size allocator is much more complex and creates problems such as fragmentation. Second, the chunk size can be made very large with little to no penalty. When the chunk size is sufficiently large, compiler 104 will only request a single allocation larger than the chunk size if compiler 104 would have run out of memory any way. Thus, there is no real advantage to generalizing the ZALLOC unit to handle variable sized allocation. IL_CTXT (compiler/oc_common/include/il_internal.h) Compiler 104 maintains a single data structure, the IL_CTXT, to keep track of the current state of the compilation. The IL_CTXT data structure stores a pointer to a linked list of IL_NODEs that represent the code currently being compiled. The IL_CTXT also stores a number of miscellaneous fields that are used throughout the compilation process such as the ZONE and IL_FRAME structure being used. Each of the stages of compiler 104 has the IL_CTXT as an argument and makes modifications to that data structure, for example, a number of the stages add or remove IL_NODEs. IL_NODE (compiler/oc_common/include/il_internal.h) The IL_NODE data structure represents a single abstract instruction in compiler 104's intermediate language, as translated from a K opcode. The IL_NODEs that are generated from the K opcodes are maintained in a doubly-linked list. Pointers to the first and last elements in this list are maintained in the IL_CTXT. This list represents the code currently being worked on by compiler 104. Each pass of compiler 104 traverses this list and either generates information about the code in the list or transforms the list. Each IL_NODE contains an operation field `op` which indicates the basic nature of the instruction. It also contains a vector of operand fields representing the operands of the instruction. The interpretation of the operand fields is dependent on the operation type of the instruction. In addition to the operation and optrand fields, all IL_NODEs contain a number of fields that are shared by all node types, such as the K pc of the instruction from which the node was translated, the starting address of the target machine code generated for the node, etc. The number of operand fields in a node varies according to the operation type. In fact, in some cases two nodes of the same type may have different numbers of operands; the number of operands for a call operation, for example, will depend on the number of arguments passed to the target method. This variation in the number of operands means that IL_NODEs are not of a consistent size, and that the operand vector is the last item in the IL_NODE structure. The operand vector is declared to be one entry long, and IL_NODEs are allocated by calculating/allocating the total amount storage necessary for the common fields and the operand fields and by casting the allocated memory to an IL_NODE pointer. In most, but not all, cases each operand actually requires two consecutive entries in the operand vector. The entry operand[i] of the pseudo-register in which the operand will be found. If the operand is a destination operand, operand[i +1] will point to a list of nodes that use the value that is being defined by this operation; if the operand is a source operand, operand[I+1] will point to a list of nodes containing definitions for the value. If an operation has a destination operand, that operand will always be stored in operand[0] and operand[1]. If operand[i] is a source (input or use) operand, then operand[i+2] will be also; i.e., all source registers must come at the end of the list of operands. Operand fields in a node acre never accessed directly. Rather, access is by a large set of macros of the form ILOP_xxx(N), where N is a pointer to an IL_NODE. These macros which know how various operands are stored in the operands vector for all the it various instruction types. Some of the node types are as follows (this list is not all-inclusive): Unary operations These represent a variety of simple unary (1 source operand) instructions including assignment. type the type of the operation ILOP_DEST(N) destination register; where the result goes ILOP_DEST_use(N) list of instructions that use the destination register ILOP_SRC(N) source register ILOP_SRC_def(N) list of instructions that define the source Binary Operations A large number of binary (2 source operand) instructions are represented by this category. type the type of the operation ILOP_DEST(N) destination register; where the result goes ILOP_DEST_use(N) list of instructions that use the destination register ILOP_SRC1(N) first source register ILOP_SRC1_def(N) list of instructions that define the first source ILOP_SRC2(N) second source register ILOP_SRC2_def(N) list of instructions that define the second source ILOP_DIVEX(N) this operand appears only for the DIV and REM operations, and point to a (singleton) list containing the node that represents the start of the divide by zero exception if there is one. Label A LABEL instruction represents a point in the code where branches can branch to. It contains the following operands: ILOP_LABEL(N) a unique integer identifying the label ILOP_LABEL_refs(N) a list of instructions that refer to this label ILOP_LABEL_live(N) a BITSET showing which registers are live at this label ILOP_LABEL_rd(N) a vector of lists of the definitions of each register that reaches this label ILOP_LABEL_misc(N) a place for any pass to hang private info about the label Goto A GOTO instruction represents an unconditional branch to a label. ILOP_LABEL(N) unique integer identifying the target label ILOP_LABEL_refs(N) a singleton-list of the target LABEL instruction CGoto A CGOTO instruction represents a conditional branch to a label. It contains the same operands as a GOTO instruction as well as some additional operands. ILOP_COND(N) register containing the condition on which to branch. This register must contain a boolean (B1) type value. The branch will be taken if the condition is TRUE. ILOP_COND_def(N) list of instructions that define this register ILOP_COND_live(N) a BITSET showing which regs are live if the branch is not taken. In addition to the instruction-specific ILOP macros, there are a number of generic macros that can be used on any instruction ILOP_HasDEST Returns TRUE if the instruction has a destination register. In this case, the ILOP_DEST and ILOP_DEST_use macros can be used on this instruction. IL_OP_START, IL_OP_DONE, IL_OP_NEXT Used to iterate through the source registers of an instruction. IL_OP_START returns an IL_OP_INDEX referring to the first such source register. IL_OP_DONE tests an IL_OP_INDEX to see if it refers to a source register; it returns true if it does not. IL_OP_NEXT is used to go on to the next source register IL_OP, IL_OP_def These return the particular source register and the definition list for it for a given IL_OP_INDEX. These 5 macros are generally used in a loop of the form: for (op=IL_OP_START(n); !IL_OP_DONE(n,op); op=IL_OP_NEXT(n, op)) {use IL_OP(n, IL_FRAME (compiler/oc_common/include/il_frame.h, compiler/OOCT_Frame.c) The IL_FRAME data structure is used to give information about the context in which the compiled code will run. The frame defines the size and memory location for each of the pseudoregisters, how the pseudoregisters overlap with other pseudoregisters and which machine registers are legal to use in the register allocator. Additionally, the IL_FRAME structure defines whether or not a C stack frame is required for the code being compiled. In OOCT, C stack frames are not used. In compiler 104, the IL_FRAME structure is initialized by the functions in OOCT_Frame.c. These functions setup each of the pseudoregisters that correspond to the K registers and PSW locations. Additionally, compiler 104's temporary pseudoregisters are set to correspond to interpreter 110's work space area. Information about how the K registers overlap is also setup. NL_LIST (compiler/oc_common/[include, src]/nl_nodelist.h) In many places compiler 104 uses lists of IL_NODEs, the NL_LIST data structure provides an abstraction for manipulating these node lists. For example, the UseDef analysis, set forth above, creates lists of IL_NODEs that use a given definition and lists of IL_NODEs that may be the definition for a given use. The NL_LIST abstraction is straightforward, it provides the ability to create, add, remove, replace, search and iterate over node lists. K Opcode to IL Translation After block picker 114, set forth above, has chosen which K opcodes to compile, translating the K opcodes into IL involves three main steps. The first step is to determine the order in which code will be generated for the basic blocks. The block layout method is set forth above. Second, as basic blocks of K opcodes are chosen by the block layout method, the opcodes are examined to determine if they can be combined into a `logical opcode`. Finally, an IL generation procedure is called based on the K opcode and its arguments. Opcode Combination (compiler/ooct_opcode_combine.c). Some sequences of K opcodes can be described as a single `logical` opcode. For example, it was determined that a sequence of two TR instructions was used to test the value of a 32 bit register pair by testing each of the individual halves. These two TR instructions represent a logical 32 bit test opcode that is not available in the K architecture. The code that the IL building procedures would create for the two TR instructions is much less efficient than the code that could be created if this pattern was recognized. Fortunately, since OOCT is software, it is easy to add a new opcode, have a special unit that recognizes the patterns, and instead generate the efficient IL. Before generating the standard IL for a given opcode, the OOCT_opcode_combine routine is called. This routine iterates over all of the patterns that have been defined, trying to use a `logical` opcode if it is appropriate. Currently, only two patterns are defined, but it is straightforward to define additional combinations. If one of the patterns is matched, the IL building procedure for that logical opcode is used to create the IL instructions and OOCT_opcode_combine will return TRUE to indicate that the normal IL building procedure need not be called.3. IL Building Procedures (compiler/ooct_il_build.c) For each K opcode, there is a specific IL building procedure. The IL building procedures take two types of arguments, the address of the instruction, and a list of arguments that are the fields in the original instruction. The IL building procedures also use a shared global variable global_gen_state that is used to keep track of the pseudoregisters and the labels while generating the IL. Each of the IL building procedures adds IL instructions to the IL_CTXT structure. All of the IL generation routines create a LABEL IL_NODE with the address of the original instruction as the label's identifier (if the label is not the target of another instruction, it will be eliminated early in the optimization process) not in general attempt to perform optimizations, leaving that to later compiler 104 stages, but a few special cases such as checking for exceptions that can be detected at compile time are handled. Most of the IL building procedures are straightforward once the IL and the original instruction that code is being generated for become familiar. There are a few tips that help in understanding the code: The IL building has been designed so that the compilation of any given opcode can be easily turned off for debugging. This is mainly controlled with the REALLY_COMPILE macro, and the COMPILE_SECTION_XX macros. When REALLY_COMPILE is turned off, all of the IL building routines will simply build calls (or jumps) back to interpreter 110. When COMPILE_SECTION_X is turned off, all the IL building routines for opcodes in section number X will simply build calls (or jumps) back to interpreter 110. Since the IL is typed, it is critical to use the correct size pseudoregister with the correct type. For example, to load a 16 bit value into a 32 bit register, first a 16 bit load is done into a 16 bit pseudoregister, and then a CVT operation is used to cast the 16 bit value to a 32 bit value (the LOAD_CVT32 macro does this). Whenever a callback or jump to interpreter 110 is inserted, a SYNC must be added to make sure that interpreter 110 has the correct values for the K registers. The compiled code does not attempt to maintain the value of the ESI register as it goes (in fact it is used to hold other values). Thus, the code generated must put the correct value into ESI before calling or jumping to interpreter 110. When making a callback, the code must also contain an EXTMOD instruction for every pseudoregister that may have been modified by the callback (the MODIFIES_REG macro does this). Code to handle exception conditions (such as overflow) is not inlined. Instead, code is generated at the end of the list of IL instructions. This allows the common case to be compiled as a fall through, which typically improves the performance of the generated code. Entry Points, Interrupt Checks In addition to the IL that is generated for each K opcode chosen by block picker 114, IL is generated for entry points, interrupt checks. In order to allow more optimizations to occur, every branch destination is not included as an external entry point (external entry points act as a barrier to optimizations). In particular, the only destinations which should be made into external entry points are those which are jumped to from outside of the segment. When compiling a given segment, partial information is available about which destinations fit this criterion in the branch log (see above for information on the branch log). Compiler 104 uses this information to chose which basic blocks should have external entries. For each of these entries, an ENTRY IL_NODE is generated along with a GOTO IL_NODE that jumps to the generated IL for the entry original instruction. The OOCT specifications indicate that compiler 104 should insert interrupt checks within any loop. When generating the IL, a conservative estimate is made by inserting interrupt checks within any backward branch within the segment and before any computed jump instruction. The interrupt check is inserted after the label for the last original instruction in the basic block. As with other exception conditions, the IL code for the interrupt is generated out of line so that the normal case is simply the fall through of the condition branch. Compiler Middle End Description Middle End Overview The main goal of compiler 104's `middle end` is to improve the quality of the IL so that better code will be generated in the code generation stage. The rest of compiler 104 is structured as a series of passes that either perform an analysis of the IL or perform a transformation that modifies the IL. The passes can be applied multiple times although there are some dependencies between passes. From this point on, the rest of compiler 104 does not have any special knowledge about K instructions, it only deals with the IL. The remainder of this section is divided as follows. First, the stage that performs OASSIGN insertion is discussed. Second, compiler 104's analysis passes are discussed. Finally, compiler 104's transformation passes (that perform the main optimizations) are discussed. FIG. 15 particularly illustrates an OASSIGN insertion example. OASSIGN INSERTION (compiler/ooct_add_overlap_defs.c). The OASSIGN opcode is a special marker instruction that makes aliasing between pseudoregisters explicit. The need for OASSIGN arises in OOCT because some K opcodes use 16 bit registers while other operations use 32 bit registers that alias the 16 bit registers. In OOCT, separate pseudoregisters are used for all of the 16 bit and 32 bit registers. Thus, some of the pseudoregisters implicitly overlapped with each other. This creates two problems. The first problem is with optimization passes performing incorrect transformations. For each pseudoregister definition compiler 104 keeps track of the instructions which use that definition, and for each pseudoregister use compiler 104 keeps track of its definitions. This information is called use/def information. Compiler 104 uses use/def information in passes such as the Constant Folding pass. When pseudoregisters can alias each other, this requires the use/def computation and compiler 104 passes that use that information to be much more complex. A second problem created by overlapping pseudoregisters is in register allocation. When the register allocator assigns two overlapping pseudoregisters into machine registers at the same time, a modification to one register may require that the other register be invalidated. In general, keeping track of that information is very difficult and creates unneeded complexity. Instead of tackling these difficult problems and adding significantly to compiler 104's complexity, a method for inserting special marker OASSIGN instructions was designed which would allow compiler 104 to ignore the problem. A special compiler pass immediately after IL generation inserts OASSIGNs. After this compiler 104 pass, other analysis passes are allowed to assume that pseudoregisters do not overlap (with regard to use/def analysis). Additionally, register allocation is fairly easily handled by using OASSIGNs. Whenever the register allocator comes to an OASSIGN, it spills the source at its definition and fills the destination after the OASSIGN. This method uses the aliased memory to guarantee that any use of the overlap definition uses the correct value. The OASSIGN insertion is handled in two stages. First, a special version of the UseDef analysis is run. This version of UseDef is aware of pseudoregister overlaps, and creates use lists and definition lists that contain overlapping pseudoregisters. The rest of compiler 104 is not prepared to handle use/def lists that contain overlapping pseudoregisters, so this option for UseDef should not be used in general. After this analysis is performed, the procedure OOCT_Add_Overlap_Defs performs the actual insertion of OASSIGNs. An OASSIGN is inserted for every use that has an overlap definition (i.e. a definition that defines a pseudoregister that overlaps with the use's pseudoregister) and for overlapping reaching definitions at labels. FIG. 15 illustrates an example of a case where an OASSIGN would be inserted. In the example, the pseudoregisters GRPAIR1 and GR1 overlap, so that the assignment to GRPAIR1 in the first line of the code is an implicit modification of GR1. The OASSIGN makes this explicit. Analysis Passes UseDef (compiler/oc_common/src/oc_usedef.c) Computing the uses of a given definition and the potential definitions for a given use is one of the most fundamental compiler 104 analyses. Every compiler 104 optimization pass uses the use/def information. Each of the IL instructions may have one pseudoregister argument which is being written to (a dest) and one or more pseudoregister arguments which are read from (a src). After UseDef analysis, each dest has a list associated with it that stores pointers to all IL instructions which might use that value (called a du chain). Similarly, each src has a list as | ||||||
