Monitoring program execution

Hardware and software co-simulation including simulating a target processor using binary translation

6751583

Abstract

A co-simulation design system to simulate on a host an electronic system that includes target digital circuitry and a target processor with an accompanying user program. The system includes a processor simulator to simulate execution of the user program by executing host software that includes an analyzed version of the user program. The system further includes a hardware simulator to simulate the target digital circuitry and an interface mechanism that couples the hardware simulator with the processor simulator. The user program is provided in binary form. Determining the analyzed version of the user program includes decomposing the user program into linear blocks, translating each linear block of the user program into host code that simulate the operations of the linear block, storing the host code of each linear block in a host code buffer for the linear block, and adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program. The timing information incorporates target processor instruction timing. Adding of timing information includes inserting dynamic hooks into the host code that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution such that while the processor simulator executes the analyzed version of the user program, the processor simulator accumulates simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing as if the user program was executing on the target processor.


Claims

What is claimed is:

1. A co-simulation design system for testing by simulation an electronic system on a host computer system, the electronic system including target digital circuitry, and at least one target processor each with an accompanying user program to be executed on its target processor, the design system comprising:

a processor simulator for each target processor using software executing on the host computer system for simulating execution of the user program by the target processor, the software including an analyzed version of the user program;

a hardware simulator to simulate the target digital circuitry using software executing on the host computer system; and

an interface mechanism that couples the hardware simulator with the processor simulator including controlling communication between the processor simulator and the hardware simulator,

wherein the processor simulator includes

a communication mechanism to communicate with the hardware simulator using the interface mechanism when an event requires interaction of the user program with the target digital circuitry,

wherein the user program is provided in binary form, and wherein determining the analyzed version of the user program includes:

decomposing the user program into linear blocks,

translating each linear block of the user program into host code that simulate the operations of the linear block,

storing the host code of each linear block in a host code buffer for the linear block, and

adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing, the adding of timing information including inserting of dynamic hooks into the corresponding host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution,

such that while the processor simulator executes the analyzed version of the user program, the processor simulator accumulates simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing as if the user program was executing on the target processor.

2. The design system of claim 1,

wherein a target processor includes a cache, the processor simulator for the target processor with the cache includes a cache simulator, and determining the analyzed version of the user program further includes, for each linear block,

identifying those parts in the linear block of the user program that include one or more memory references that might require a cache lookup, and inserting hooks into the corresponding host code in the corresponding host code buffer to invoke, at run time, the cache simulator for any simulated memory reference that might require a cache lookup;

such that executing the analyzed version of the user program:

causes the cache simulator to be invoked for the memory references, the cache simulator accounting for the effect of cache lookups on timing, and

accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for cache lookup effects.

3. The design system of claim 2, wherein the cache simulator includes a simulated cache containing simulated cache entries, and a cache search mechanism for searching the simulated cache for an entry that matches an address.

4. The design system of claim 3, wherein the cache search mechanism includes a multi-level lookup table search mechanism.

5. The design system of claim 4, wherein the multi-level lookup table search mechanism requires the same number of table lookups independent of whether the cache lookup is successful or not.

6. The design system of claim 3, wherein the cache simulator stores, at execution time, for an instruction that might require a cache lookup, a pointer to the simulated cache entry that results from a lookup of the simulated cache the first time execution of the target instruction is simulated such that the cache simulator can avoid looking up the simulated cache the next time the target instruction is executed in simulation.

7. The design system of claim 2, wherein a linear block has a maximum length equal to the length of a cache line of the cache used for instructions such that when the processor simulator executes the host code in the host code buffer for the linear block, the processor simulator can operate under that assumption that all the target code of that linear block is in cache so that no instruction cache misses will occur.

8. The design system of claim 7, wherein determining the analyzed version of the user program further includes, for each linear block, inserting a hook to determine once at executing time whether or not the linear block of target code simulated is in cache.

9. The design system of claim 8,

wherein the processor simulator simulates allocated blocks of target memory, wherein each allocated block of simulated target memory is partitioned into sub-blocks equal to the length of a cache line of the cache used for instructions, and wherein state information is defined for each sub-block to automatically ensure that executing the analyzed code avoids re-executing analyzed code that in simulation has subsequently been modified.

10. The design system of claim 1,

wherein a target processor includes a memory management unit (MMU) for translating virtual addresses to physical addresses, the processor simulator for the target processor with the MMU includes an MMU simulator, and determining the analyzed version of the user program further includes, for each linear block,

identifying those parts in the linear block of the user program that include one or more memory references that might require accessing the MMU, and inserting hooks into the corresponding host code in the corresponding host code buffer to invoke, at run time, the MMU simulator for any simulated memory reference that might require an MMU access;

such that executing the analyzed version of the user program:

causes the MMU simulator to be invoked for the memory references, the MMU simulator accounting for the effect of MMU accesses on timing, and

accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for MMU access effects.

11. The design system of claim 10, wherein the MMU includes a TLB and the MMU simulator includes a TLB simulator containing simulated TLB entries and a TLB search mechanism for searching the simulated TLB for an entry that matches a virtual address and a page size.

12. The design system of claim 11, wherein the TLB search mechanism includes a multi-level lookup table search mechanism.

13. The design system of claim 12, wherein the multi-level lookup table search mechanism requires the same number of table lookups independent of whether the TLB lookup is successful or not.

14. The design system of claim 11, wherein the TLB simulator stores at execution time, for an instruction that might include accessing the MMU, a pointer to the simulated TLB entry that results from a lookup of the simulated TLB the first time execution of the target instruction is simulated such that the TLB simulator can avoid looking up the simulated TLB the next time the target instruction is executed in simulation.

15. The design system of claim 10,

wherein a target processor includes a cache, the processor simulator for the target processor with the cache includes a cache simulator, and determining the analyzed version of the user program further includes, for each linear block,

identifying those parts in the linear block of the user program that include one or more memory references that might require a cache lookup, and inserting hooks into the corresponding host code in the corresponding host code buffer to invoke, at run time, the cache simulator for any simulated memory reference that might require a cache lookup;

such that executing the analyzed version of the user program:

causes the cache simulator to be invoked for the memory references, the cache simulator accounting for the effect of cache lookups on timing, and

accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for cache lookup effects.

16. The design system of claim 15, wherein the cache simulator includes a simulated cache containing simulated cache entries, and a cache search mechanism for searching the simulated cache for an entry that matches an address, and wherein the cache simulator stores, at execution time, for an instruction that might require a cache lookup, a pointer to the simulated cache entry that results from a lookup of the simulated cache the first time execution of the target instruction is simulated such that the cache simulator can avoid looking up the simulated cache the next time the target instruction is executed in simulation.

17. The design system of claim 10,

wherein the target processor includes a pipeline, and wherein the timing information added into the host code in the host code buffer incorporates pipeline effects, such that the simulation time accumulated during execution of the analyzed version of the user program also accounts for pipeline effects.

18. The design system of claim 11,

wherein the processor simulator further includes a memory mapper that translates between target memory addresses and host memory addresses, the translation using memory mapping information.

19. The design system of claim 18,

wherein invoking the TLB simulator for a memory reference further includes invoking the memory mapper to translate the target memory address for the memory reference to a host memory address for the memory reference.

20. The design system of claim 18,

wherein parts of the memory mapper are incorporated in the simulated TLB entries such that the MMU simulator both translates a virtual address to a physical address during simulation, and automatically also provides the host address for the physical address.

21. The design system of claim 18,

wherein the target digital circuitry including one or more devices coupled to the target processor, each device having a target address,

wherein the memory mapper also translates addresses to the target addresses of each of the devices,

wherein the identifying step of determining the analyzed version of the user program includes identifying those parts of the user program that include one or more references that each is either a memory reference or a reference that require a read or write to a device, and inserting hooks in the user program to invoke, at run time, a reference process for each of the references, the reference process including: determining if the reference is a memory reference or a device reference, and if a device reference, determining the physical address of the device, and causing the processor simulator to communicate with the hardware simulator via the communication mechanism to cause the device to be written to or read from, and if a memory reference, invoking the cache simulator for the memory reference.

22. The design system of claim 10, wherein determining the analyzed version of the user program further includes inserting one or more instructions into the host code buffer for accumulating the simulation time for the code translated in the host code buffer.

23. The design system of claim 10,

wherein the processor simulator simulates allocated blocks of target memory, wherein each allocated block of simulated target memory is partitioned into sub-blocks, and wherein state information is defined for each sub-block to automatically ensure that executing the analyzed code avoids re-executing analyzed code that in simulation has subsequently been modified.

24. The design system of claim 23,

wherein the state information is further to indicate the memory state in the case that the electronic system includes a plurality of target processors sharing the same memory.

25. The design system of claim 1,

wherein the processor simulator and the hardware simulator process independently of each other.

26. The design system of claim 1,

wherein the processor simulator communication mechanism communicates information associated with the event to the hardware simulator, and

wherein the hardware simulator receives the associated event information.

27. The design system of claim 26,

wherein the hardware simulator processes the associated event information.

28. The design system of claim 27,

wherein the event information includes time delay information indicating an amount of simulated time since a previous event, and

wherein, upon receiving the time delay information, the hardware simulator executes an appropriate amount of hardware simulation time.

29. The design system of claim 1,

wherein the host computer system includes a computer network containing a first and a second host computer,

wherein the processor simulator operates on the first host computer,

wherein the hardware simulator operates on the second host computer, and

wherein the processor simulator is coupled to the hardware simulator by a computer network connection of the computer network, and

wherein the interface mechanism controls communications over the network connection.

30. The design system of claim 27, further comprising

a suspend mechanism coupled to the processor simulator that temporarily halts execution of the user program on the processor simulator while the hardware simulator processes the event information.

31. The design system of claim 30, wherein the interface mechanism includes the suspend mechanism.

32. The design system of claim 27,

wherein the hardware simulator processing the event information produces an event result, and,

wherein the hardware simulator includes a mechanism to communicate the event result to the processor simulator using the interface mechanism.

33. The design system of claim 32, wherein the event result is an interrupt, and is processed upon receipt of the event result by the processor simulator.

34. The design system of claim 32, further including

a resumption mechanism coupled to the processor simulator to resume execution of the user program upon receipt of the event result.

35. The design system of claim 1, wherein the event requiring the user program to interact with the target digital circuitry is an input/output instruction to the hardware simulator.

36. The design system of claim 1, wherein the processor simulator uses a first data format and the hardware simulator uses a second data format, the system further including a translator to convert the associated event information from the first data format to the second data format.

37. The design system of claim 32,

wherein the hardware simulator contains a processor model shell to access at least some of the external hardware signals of the target processor connected to the digital circuitry in the electronic system, and

wherein the processor simulator uses a first data format and the hardware simulator uses a second data format,

the design system further including a mapper to map an event result in the second data format to the first data format.

38. The design system of claim 1, wherein the hardware simulator operates in a hardware description language, and at least some of the digital circuitry is specified in the hardware description language.

39. The design system of claim 1, wherein the interface mechanism includes a message passing kernel.

40. The design system of claim 39, wherein the processor simulator and the hardware simulators are tasks under the kernel.

41. A method of simulating an electronic system that includes target digital circuitry and one or more target processors, each with an accompanying user program to be executed on its target processor, the simulation on a host computer system including one or more host processors, the method comprising:

for each target processor, providing the user program in binary form, and executing a processor simulation process including determining an analyzed version of the user program, and simulating execution of the target processor's user program by executing the analyzed version of the user program on a host processor, determining the analyzed version of the user program including:

decomposing the user program into linear blocks,

translating each linear block of the user program into host code that simulate the operations of the linear block,

storing the host code of each linear block in a host code buffer for the linear block,

adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing, the adding of timing information including inserting of dynamic hooks into the corresponding host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution,

such that while the processor simulation process executes the analyzed version of the user program, the processor simulator accumulates simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing as if the user program was executing on the target processor;

simulating the target digital circuitry on a hardware simulator operating on the host computer system, the simulating of the target digital circuitry including accumulating timing information; and

passing communication between the processor simulation process and the hardware simulator at significant events, including events that require interaction between the user program and the target digital circuitry.

42. The method of claim 41,

wherein a target processor includes a cache, wherein the processor simulation process of the target processor with the cache includes storing a simulated cache containing simulated cache entries, and wherein determining the analyzed version of the user program further includes, for each linear block,

identifying those parts in the linear block of the user program that include one or more memory references that might require a cache lookup, and inserting hooks into the corresponding host code in the corresponding host code buffer to lookup, at run time, the simulated cache for any simulated memory reference that might require a cache lookup;

such that executing the analyzed version of the user program:

causes simulating the cache for the memory references to account for the effect of cache lookups on timing, and

accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for cache lookup effects.

43. The method of claim 42, wherein looking up the cache for at least one of the memory references includes searching the simulated cache for an entry that matches an address.

44. The method of claim 43, wherein searching the simulated cache includes partitioning the address to be lookup up into a set of address parts, and sequentially looking up the address parts in a set of multi-level lookup tables wherein each valid lookup table element points to the next lower-level lookup table to look up, valid last lookup table elements pointing to the simulated cache entries of the simulated cache.

45. The method of claim 44, wherein looking up the address parts in the set of multi-level lookup tables requires the same number of table lookups independent of whether the cache lookup is successful or not.

46. The method of claim 45, wherein the set of multi-level lookup tables includes a null table at each level but the highest, each null table element pointing to the next level null table, the last null lookup table elements pointing to a null simulated cache entry guaranteed never to match an address, so that the validity or not of any lookup table entry never needs to be tested.

47. The method of claim 43, the processor simulation process of the target processor with the cache further includes storing, at execution time, for an instruction that might require a cache lookup, a pointer to the simulated cache entry that results from a lookup of the simulated cache the first time execution of the target instruction is simulated such that searching the simulated cache can be avoided the next time the target instruction is executed in simulation.

48. The method of claim 42, wherein a linear block of the user program of the target processor having a cache has a maximum length equal to the length of a cache line of the cache used for instructions such that executing the host code in the host code buffer for the linear block can operate under that assumption that all the target code of that linear block is in cache so that no instruction cache misses will occur.

49. The method of claim 48, wherein determining the analyzed version of the user program further includes, for each linear block, inserting a hook to determine once at executing time whether or not the linear block of target code simulated is in cache.

50. The method of claim 41,

wherein a target processor includes a memory management unit (MMU) having a TLB for translating virtual addresses to physical addresses, wherein the processor simulation process of the target processor with the MMU includes storing a simulated TLB having simulated TLB entries, and wherein determining the analyzed version of the user program further includes, for each linear block,

identifying those parts in the linear block of the user program that include one or more memory references that might require accessing the MMU, and inserting hooks into the corresponding host code in the corresponding host code buffer to lookup, at run time, the simulated TLB for the TLB entry that matches any simulated memory reference that might require an MMU access;

such that executing the analyzed version of the user program:

causes looking up the simulated TLB to find an entry that matches a virtual address and a page size for the memory references to account for the effect of MMU accesses on timing, and

accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for MMU access effects.

51. The method of claim 50, wherein looking up the simulated TLB for at least one of the memory references includes searching the simulated TLB for an entry that matches an address.

52. The method of claim 51, wherein searching the simulated TLB includes partitioning the address to be lookup up into a set of address parts, and sequentially looking up the address parts in a set of multi-level lookup tables wherein each valid lookup table element points to the next lower-level lookup table to look up, valid last lookup table elements pointing to the simulated TLB entries of the simulated TLB.

53. The method of claim 52, wherein looking up the address parts in the set of multi-level lookup tables requires the same number of table lookups independent of whether the TLB lookup is successful or not.

54. The method of claim 53,

wherein the simulated TLB entries include memory mapping information for mapping physical addresses to host addresses, such that looking up the simulated TLB both translates a virtual address to a physical address during the execution simulation, and automatically also provides the host address for the physical address.

55. The method of claim 54,

wherein the one or more entries of the simulated TLB simulating each TLB entry are represented by memory allocation data structure that represents memory mapping information for mapping physical addresses to host addresses, each element of the memory allocation data structure for a simulated TLB entry representing a block of physical memory addresses, and each element of the memory allocation data structure for a simulated TLB entry representing allocated physical memory including a pointer to where in the memory space of the host processor that physical block of memory is stored, such that looking up the simulated TLB both translates a virtual address to a physical address during the execution simulation, and automatically also provides the host address for the physical address.

56. The method of claim 55,

wherein the memory allocation data structure is a linked list.

57. The method of claim 53, wherein the set of multi-level lookup tables includes a null table at each level but the highest, each null table element pointing to the next level null table, the last null lookup table elements pointing to a null simulated TLB entry guaranteed never to match an address, so that the validity or not of any lookup table entry never needs to be tested.

58. The method of claim 53,

wherein the entries of the simulated TLB include a replica lookup table having an element pointing back to the simulated TLB entry, such that a simulated TLB entry is able to act as one of the lookup tables,

wherein a lookup table at any level may include an element pointing to a simulated TLB entry, and

wherein the simulated TLB includes a null simulated TLB entry guaranteed never to match an address, so that the validity or not of any lookup table entry never needs to be tested.

59. The method of claim 51, wherein the processor simulation process of the target processor with the TLB further includes storing, at execution time, for an instruction that might require a TLB lookup, a pointer to the simulated TLB entry that results from a lookup of the simulated TLB the first time execution of the target instruction is simulated such that searching the simulated TLB can be avoided the next time the target instruction is executed in simulation.

60. The method of claim 50,

wherein the processor simulation process of the target processor with the MMU further includes storing memory mapping information for translating between target memory addresses and host memory addresses.

61. The method of claim 60,

wherein the memory mapping information is in the form of a memory allocation data structure, each element of the memory allocation data structure representing a block of physical memory addresses that includes, in the case that the block of physical addresses is allocated, a pointer to where in the memory space of the host processor that physical block of memory is stored.

62. The method of claim 61,

wherein the memory allocation data structure is in the form of an ordered linked list, each element of the linked list representing a block of physical memory addresses, and each linked list element representing allocated physical memory including a pointer to where in the memory space of the host processor that physical block of memory is stored.

63. The method of claim 60,

wherein looking up the simulated TLB for a TLB entry matching a memory reference further includes searching the memory mapping information to translate the target memory address for the memory reference to a host memory address for the memory reference.

64. The method of claim 62,

wherein looking up the simulated TLB for a memory reference further includes searching the set of linked list elements for a matching address to translate the physical address of the TLB entry matching the memory reference to a host memory address for the memory reference.

65. The method of claim 62,

wherein the one or more entries of the simulated TLB simulating each TLB entry are represented by an ordered linked list that is a subset of the ordered linked list representing the memory mapping information, each element of the simulated TLB linked list representing a block of physical memory addresses, and each simulated TLB linked list element representing allocated physical memory including a pointer to where in the memory space of the host processor that physical block of memory is stored, such that looking up the simulated TLB both translates a virtual address to a physical address during the execution simulation, and automatically also provides the host address for the physical address.

66. The method of claim 50, wherein determining the analyzed version of the user program further includes inserting one or more instructions into the host code buffer for accumulating the simulation time for the code translated in the host code buffer.

67. The method of claim 41,

wherein the passing communication communicates information associated with the event to the hardware simulator, and

wherein the hardware simulator receives the associated event information.

68. The method of claim 67,

wherein the hardware simulator processes the associated event information.

69. The method of claim 68,

wherein the event information includes time delay information indicating an amount of simulated time since a previous event, and

wherein, upon receiving the time delay information, the hardware simulator executes an appropriate amount of hardware simulation time.

70. The method of claim 68, further comprising

a suspend mechanism coupled to the processor simulation process that temporarily halts the processor simulation process while the hardware simulator processes the event information.

71. The method of claim 68,

wherein the hardware simulator processing the event information produces an event result, and,

wherein the hardware simulator includes a mechanism to communicate the event result to the processor simulation process.

72. The method of claim 71, further including

a resumption mechanism coupled to the processor simulation process to resume execution of the user program upon receipt of the event result.

73. A method of simulating on a host computer system the execution of a user program on a target processor having a TLB, the method comprising:

decomposing the user program into linear blocks of target code;

translating each linear block into host code that simulates the operations of the linear block;

storing the host code of each linear block in a host code buffer for the linear block;

for each linear block, inserting dynamic hooks into the host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution, the inserting of dynamic hooks including inserting hooks into the user program to invoke a TLB simulation process for any reference that includes a memory reference that might require a TLB lookup, the TLB simulation process including looking up a simulated TLB having simulated TLB entries;

for each linear block, adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing and TLB effects;

executing the code in the host code buffers; and

accumulating simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing and TLB effects as if the user program was executing on the target processor,

the TLB simulation process including storing, at execution time, for a target instruction that might require a TLB lookup, a pointer to the simulated TLB entry that results from a lookup of the simulated TLB the first time execution of the target instruction is simulated such that searching the simulated TLB can be avoided the next time the target instruction is executed in simulation.

74. The method of claim 73, further including:

simulating target digital circuitry on a hardware simulator running on the host computer system,

wherein executing the code in the host code buffers includes communicating with the hardware simulator when an event requires interaction of the user program with the target digital circuitry.

75. A method of simulating on a host computer system the execution of a user program on a target processor having a TLB, the method comprising:

decomposing the user program into linear blocks of target code;

translating each linear block into host code that simulates the operations of the linear block;

storing the host code of each linear block in a host code buffer for the linear block;

for each linear block, inserting dynamic hooks into the host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution, the inserting of dynamic hooks including inserting hooks into the user program to invoke a TLB simulation process for any reference that includes a memory reference that might require a TLB lookup, the TLB simulation process including looking up a simulated TLB having simulated TLB entries;

for each linear block, adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing and TLB effects;

executing the code in the host code buffers; and

accumulating simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing and TLB effects as if the user program was executing on the target processor,

wherein the attributes of a TLB entry are encoded in the corresponding simulated TLB entry such that the testing of whether or not a virtual address and a page size match an entry of the simulated TLB automatically also checks permission without a separate permissions check required.

76. The method of claim 75, wherein the attributes of a TLB entry are encoded in the corresponding simulated TLB entry such that the testing of whether or not a virtual address and a page size match an entry of the simulated TLB automatically also checks alignment without a separate alignment check required.

77. The method of claim 76, wherein the simulated TLB entry is encoded such that the address and page size match test fails when there is no permission or the alignment is incorrect.

78. The method of claim 77, wherein the simulated TLB entry includes a set of versions of the TLB entry virtual address and TLB entry page size information, each version corresponding to a different mode, and wherein testing a virtual address and page size pair for a match in a simulated TLB entry also includes indexing to the version of the of the TLB entry virtual address and TLB entry page according to the mode.

79. A method of simulating on a host computer system the execution of a user program on a target processor having a cache, the method comprising:

decomposing the user program into linear blocks of target code;

translating each linear block into host code that simulates the operations of the linear block;

storing the host code of each linear block in a host code buffer for the linear block;

for each linear block, inserting dynamic hooks into the host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution, the inserting of dynamic hooks including inserting hooks into the user program to invoke a cache simulation process for any reference that includes a memory reference that might require a cache lookup, the cache simulation process including looking up a simulated cache having simulated cache entries;

for each linear block, adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing and cache effects;

executing the code in the host code buffers; and

accumulating simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction

timing and cache effects as if the user program was executing on the target processor,

the cache simulation process including storing, at execution time, for a target instruction that might require a cache lookup, a pointer to the simulated cache entry that results from a lookup of the simulated cache the first time execution of the target instruction is simulated such that searching the simulated cache can be avoided the next time the target instruction is executed in simulation.

80. The method of claim 79, further including:

simulating target digital circuitry on a hardware simulator running on the host computer system,

wherein executing the code in the host code buffers includes communicating with the hardware simulator when an event requires interaction of the user program with the target digital circuitry.

81. A method of simulating on a host computer system the execution of a user program on a target processor, the method comprising:

decomposing the user program into linear blocks of target code;

translating each linear block into host code that simulates the operations of the linear block;

storing the host code of each linear block in a host code buffer for the linear block;

for each linear block, inserting dynamic hooks into the host code in the host code buffer that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution;

for each linear block, adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program, the timing information incorporating target processor instruction timing and cache effects;

for each host code buffer, looking up the next host code buffer to execute and executing the code in the next code buffers in sequence; and

accumulating simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing and cache effects as if the user program was executing on the target processor,

wherein the looking up the next host code buffer at the conclusion of processing the code in a present host code buffer includes searching for the next host code buffer the first time the present host code buffer is processed, and, after the next host code buffer is found or newly created, storing for the present host code buffer a pointer to the next host code buffer such that the search for the next host code buffer can be avoided the next time the present host code buffer is processed and its next host code buffer is looked up.

82. The method of claim 81, wherein for each host code buffer, two next host code buffer pointers are stored, one pointing to a first host code buffer for a taken branch and the other pointing to a second host code buffer for an untaken branch in the case that the last target instruction translated in the present host code buffer is a conditional branch.

83. The method of claim 81, wherein a return address is stored for the present HCB in the case that the next HCB pointer does not point to the correct next HCB, wherein the present host code buffer includes an instruction to jump unconditionally to the location pointed to by the next pointer stored for the next host code buffer, and wherein the next host code buffer pointer initially points to a process to search for the next host code buffer, so that executing the code in a host code buffer of a sequence of host code buffers always ends in an unconditional indirect jump to the next location.


Description

FIELD OF THE INVENTION

The present invention relates to computer software and hardware simulators, and more specifically, to a system and method to simulate an electronic system that includes one or more target processors executing software and interacting with hardware.

BACKGROUND

Computer simulation of digital hardware systems has become a common technique to reduce the cost and time required for the design of such hardware systems. Simulating digital hardware allows a designer to predict the functioning and performance of the hardware prior to fabricating the hardware.

More and more digital systems incorporate a processor, including a microprocessor, a digital signal processor, or other special purpose computer processor. There has been increased effort to develop a simulation system that includes simulating the hardware and simulating the running of software on one or more processors that are included in the digital system. Having such a simulation system allows a designer to test the operation of software on the processor(s) before a physical processor is available. Thus, for example, a designer may be able to start designing a system incorporating a new microprocessor before the manufacturer actually releases physical samples of the microprocessor. In addition, a system designer designing an integrated circuit or a system on a printed circuit board that includes a processor can, for example, use the simulation system to test the integrated circuit or printed circuit board implementation, including operation of software on the processor part, and any testing interactions between the processor and the other digital circuit elements of the integrated circuit or board, before the integrated circuit or board is fabricated. This clearly can save time and money.

Nomenclature

A simulation system for simulating both the digital hardware that includes one or more target processors and the running of software on the processor(s) is called a co-simulation design system, a co-simulation system, or simply a design system herein, and the environment for operating such a co-simulation system is called a design environment. The processor is called a target processor and the computer system on which the environment operates is called the host computer system or simply the host. The host computer system includes one or more host processors. The hardware other than the target processor is called digital circuitry. The computer software program that is designed by a user to operate on the target processor is called the user program or the target code.

The target processor typically includes memory and one or more caches, for example a data cache (or D-cache) and an instruction cache (or I-cache). The target processor typically may also include a memory management unit (MMU) that converts virtual addresses into physical memory addresses and possibly physical I/O device addresses. The MMU may include a translation lookaside buffer (TLB) to improve address translation performance. A TLB is a hardware element that acts as a cache of recent translations and stores virtual memory page to physical memory page translations. Given a memory address (an instruction to fetch, or data to load or store), the target processor first looks in the TLB to determine if the mapping of virtual page to physical page is already known. If so (a "TLB Hit"), the translation can be done quickly. But if the mapping is not in the TLB (a "TLB Miss"), the correct translation needs to be determined.

The target processor may be a separate microprocessor with the digital circuitry being external to the microprocessor (e.g., on a printed circuit board or elsewhere in the system), or may be a processor embedded in an application specific integrated circuit (ASIC) or a custom integrated circuit (IC) such as a very large scale integrated (VLSI) device, with the digital circuitry including some components that are part of the ASIC or IC, and other components that are external to the ASIC or IC.

The host processor also includes memory, and the host memory is referred to as "host memory" herein. The physical address of the host memory is referred to as the "host address" herein. When the word "address" is used without specifying the host, then it refers to the target address.

Desirable Aspects for a Co-Simulation Design System

A design environment capable of co-simulation requires the capability of accurately simulating the digital circuitry, including timing, and the capability of accurately simulating the running of the user program (i.e., the target code) on the target processor, including the accurate timing of operation of the user program and of any software/hardware interaction. The first requirement is available today in a range of simulation environments using hardware description languages (HDLs) such as Verilog and VHDL. It also is available as a set of constructed libraries and classes that allows the modeling of hardware in a higher-level language such as `C` or `C++.` The second requirement pertains to a processor simulator using an executable processor model that both accurately simulates the execution of a user program on the target processor, and can interact with the digital circuitry simulation environment. Such a processor simulator should provide timing information, particularly at times of software/hardware interaction, i.e., at the software/hardware interface. A processor model that includes such accurate timing information is called a "quantifiable" model herein.

One known way of providing such processor simulation is to simulate the actual hardware design of the processor, for example by specifying a processor model in a hardware description language (HDL). The main but great disadvantage of so simulating the operation of the processor is the slow execution speed, typically in the range of 0.1-100 instructions per second.

Another known way of accurately simulating the execution of software on a processor for inclusion in a co-simulation environment is an instruction set simulator (ISS), wherein both the function and the sequencing of the microprocessor is mimicked in software. An instruction set simulator still executes relatively slowly, compared for example to how fast a program would be executing on the target processor. An ISS executes in the range of 1,000 to 50,000 instructions per second depending on the level of timing and operational detail provided by the model.

Real systems execute 50-1000 million instructions per second or more, so that the ISS or full hardware simulation techniques have a disparity of a factor between about 10,000 to 200,000 in performance; 3 to 60 hours of simulation may be needed to model 1 second of real-time target processor performance.

One solution to the slow speed of simulating a processor is to use a hardware processor model. This device includes a physical microprocessor and some circuitry for interfacing and interacting with the design environment simulating the digital circuitry. The memory for the target processor is simulated as part of the digital circuitry. Such an approach is fairly expensive. Another limitation is due to having two definitions of time operating on the same simulation system: simulation time of a hardware simulator, and processor time, which is real time for the hardware processor. Correlating these is difficult.

Another solution is to use an emulator as the target processor model. An emulator, like a hardware processor model, is a hardware device, typically the target processor, and usually includes some memory. The emulator is designed to emulate the operation of the microprocessor. Such a processor emulator when it includes memory can execute the user program directly, but again is expensive and may require the development of external circuitry to interact with the hardware simulator simulating the digital circuitry. U.S. Pat. No. 5,838,948 describes an environment that uses an emulator for speeding up the running of a user program in the design environment.

While sometimes it is desired to run a simulation with great precision at a high level of detail, at other times, less detail may suffice, enabling faster execution of the simulation. There therefore is a need in the art for an executable and quantifiable processor model that can be used in a co-simulation system and that models the operation of the target processor at an elected level of detail, including an elected level of detail at the hardware/software interface.

Computer networks are becoming ubiquitous, and it is desired to be able to operate a co-simulation design system on a computer network, with different elements of the design system running on different processors of the computer network to speed execution. Similarly, multiprocessor computers are also becoming commonplace, and it would be desirable to be able to operate a co-simulation design system on a multiprocessor computer, with different elements running on different processors of the multiprocessor computer.

Electronic systems nowadays may include more than one target processor. It is therefore desirable to have a co-simulation design system that provides for rapidly simulating such an electronic system, including simulating respective user programs executing on the target processors, such processor simulation providing timing detail that takes into account instruction timing and pipeline effects for target processors that include a pipeline.

The Parent Applications

The Parent Applications describe a method and system for rapidly simulating a target processor executing a user program. Described is a processor model for the target processor that operates up to the host processor speed on a host computer system and yet takes into account instruction timing and pipeline effects such as pipeline hazards and/or cache effects such as cache misses. The model can be incorporated into a design system that simulates an electronic circuit that includes the target processor and digital circuitry. The Parent Applications also describe using more than one such processor models in a design system that simulates an electronic circuit that includes more than one target processor and digital circuitry. A further feature described in the Parent Applications is how a user can modify the processor model to include more or less detail.

The Parent Applications' design system operates on a host computer system and simulates an electronic system that contains target digital circuitry and a target processor. The design system includes a hardware simulator simulating the target digital circuitry, a processor simulator simulating the target processor by executing a user program substantially on the host computer system, and an interface mechanism that couples the hardware simulator with the processor simulator including passing information between the hardware simulator and the processor simulator. The hardware processor provides a simulation time-frame for the design system.

In one version, at significant events including events that require the user program to interact with the target digital circuitry, the operation of the processor simulator is suspended and associated event information is passed from the processor simulator to the hardware simulator. The operation of the processor simulator then is resumed when the hardware simulator processes information and passes an event result back to the processor simulator.

The processor simulator described in the Parent Applications accumulates a simulation time delay when operating, the simulation time delay determined using timing information that accounts for instruction timing including pipeline effects and/or cache effects. A static analysis process is performed on the user program, i.e., a process obtained by analyzing the user program prior to running the analyzed version of the user program on the processor simulator, determines timing information in accordance to characteristics of the target processor including instruction timing characteristics and, in one aspect, pipeline characteristics. The static analysis process comprises decomposing the user program into linear blocks of one or more instructions; determining the time delay for each linear block of the user program using characteristics of the target processor; and combining the linear block timing information with the user program to determine the timing information for the processor simulator.

Any timing effects, such as cache misses in a D-cache or an I-cache, are dependent on the current state of the cache, and cannot be known until runtime. Static analysis cannot easily account for such timing. Another aspect of the Parent Applications is dynamic analysis by including code to simulate a cache or other dynamic components. In one aspect of the Parent Applications, the processor simulator includes a cache simulator that simulates operation of the cache to account for the effects of cache misses on timing.

The hardware simulator provides a simulation time-frame for the design system. In one version, at significant events, including events that require the user program to interact with the target digital circuitry, the operation of the processor simulator is suspended and associated event information is passed from the processor simulator to the hardware simulator. The operation of the processor simulator then is resumed when the hardware simulator processes information and passes an event result back to the processor simulator.

The static analysis process comprises decomposing the user program into linear blocks of one or more instructions; determining, using characteristics of the target processor, the time delay for each linear block of the user program that would be incurred by executing the linear block with no cache misses, and combining the linear block timing information with the user program to determine the timing information for the processor simulator. In the case that the processor model includes a cache simulator, the analysis process also includes determining those parts of the user program that include one or more references that might require a cache lookup, and inserting hooks into the analyzed user program to invoke, at run time, the cache simulator for the references that includes a memory reference that requires a cache lookup in order to account for cache misses in timing.

In one embodiment, the hardware simulator runs on a HDL and at least some of the digital circuitry is specified in the HDL. In another embodiment, all or some of the digital circuitry is described to the hardware simulator in a higher-level language such as such as `C` or `C++.`One implementation described in the Parent Applications is when the user program includes statements in a higher-level language. An alternate implementation for which the present invention is particularly applicable is the case that the user program is provided as executable (binary) code in the target processor's machine language.

Other features and aspects of the invention will become clear from the detailed description that follows.

SUMMARY

One aspect of the invention is a method and system for rapidly simulating on a host computer system an electronic system that includes both digital circuitry and one or more target processors each executing a user program, with the target processor including a cache or an MMU or both. One feature of the invention is providing a processor model for each target processor that operates fast--potentially even faster than the target processor speed--and yet takes into account instruction timing and cache effects when a cache is included and MMU effects when an MMU is included. As an additional feature, the processor model also takes into account pipeline effects such as pipeline hazards for target processors that have a pipeline. Another feature of the invention is providing such a processor model that is modifiable by a user to include more or less detail. Another feature of the invention is providing such a processor model that can be incorporated into a design system that simulates an electronic circuit that includes the target processor and digital circuitry.

Described herein is a co-simulation design system to simulate on a host processor an electronic system that includes target digital circuitry and a target processor with an accompanying user program. The system includes a processor simulator to simulate execution of the user program by executing host software that includes an analyzed version of the user program. The system includes a hardware simulator to simulate the target digital circuitry and an interface mechanism that couples the hardware simulator with the processor simulator including controlling communication between the processor simulator and the hardware simulator.

The user program is provided in binary form. Determining the analyzed version of the user program includes decomposing the user program into linear blocks, translating each linear block of the user program into host code that simulates the operations of the linear block, storing the host code of each linear block in a host code buffer for the linear block, and adding timing information into the code in the host code buffer on the time it would take for the target processor to execute the user program. The timing information incorporates target processor instruction timing. The adding of timing information includes inserting dynamic hooks into the corresponding host code that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution such that while the processor simulator executes the analyzed version of the user program, the processor simulator accumulates simulation time according to a simulation time frame, the accumulated simulation time accounting for the target processor instruction timing as if the user program was executing on the target processor.

In one version, the target processor includes a cache and the processor simulator includes a cache simulator. Determining the analyzed version of the user program further includes, for each linear block, identifying those parts in the linear block of the user program that include one or more memory references that might require a cache lookup, and inserting hooks into the corresponding host code in the corresponding host code buffer to invoke, at run time, the cache simulator for any simulated memory reference that might require a cache lookup. Executing the analyzed version of the user program causes the cache simulator to be invoked for the memory references, the cache simulator accounting for the effect of cache lookups on timing, and accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for cache lookup effects.

The cache simulator includes a simulated cache containing simulated cache entries, and a cache search mechanism for searching the simulated cache for an entry that matches an address. One embodiment of the cache search mechanism includes a multi-level lookup table search mechanism that requires the same (small) number of host processor operations independent of whether the lookup is successful or not. This avoids tests that might slow down the simulation in the host code that implements the search mechanism.

The cache simulator stores, at execution time, for an instruction that might require a cache lookup, a pointer to the simulated cache entry that results from a lookup of the simulated cache the first time the execution of the target instruction is simulated such that the cache simulator can avoid looking up the simulated cache the next time the target instruction is executed in simulation.

One version is for a target processor that includes an MMU for translating virtual addresses to physical addresses. For this version, the processor simulator includes an MMU simulator, and determining the analyzed version of the user program further includes, for each linear block, identifying those parts in the linear block of the user program that include one or more memory references that might require accessing the MMU, and inserting hooks into the corresponding host code in the corresponding host code buffer to invoke, at run time, the MMU simulator for any simulated memory reference that might require an MMU access. Executing the analyzed version of the user program causes the MMU simulator to be invoked for the memory references, the MMU simulator accounting for the effect of MMU accesses on timing, and accumulates simulation time as if the user program was executing on the target processor, the accumulated simulation time also accounting for MMU access effects.

The MMU includes a TLB and the MMU simulator includes a TLB simulator containing simulated TLB entries and a TLB search mechanism for searching the simulated TLB for an entry that matches a virtual address and a page size. The TLB search mechanism includes a multi-level lookup table search mechanism that requires the same number of table lookups, and thus host processor operations independent of whether the lookup is successful or not.

The TLB simulator stores at execution time, for an instruction that might include accessing the MMU, a pointer to the simulated TLB entry that results from a lookup of the simulated TLB the first time execution of the target instruction is simulated such that the TLB simulator can avoid looking up the simulated TLB the next time the target instruction is executed in simulation.

The attributes of a TLB entry are encoded in the corresponding simulated TLB entry such that the testing of whether or not a virtual address and a page size match an entry of the simulated TLB automatically also checks permission without a separate permissions check required. In one version, the simulated TLB entry is encoded such that the address and page size match test fails when there is no permission or the alignment is incorrect. The simulated TLB entry includes a set of versions of the TLB entry virtual address and TLB entry page size information, each version corresponding to a different mode, and wherein testing a virtual address and page size pair for a match in a simulated TLB entry automatically also includes indexing to the version of the TLB entry virtual address and TLB entry page according to the mode.

Running the processor simulator includes, for each host code buffer, looking up the next host code buffer to execute and executing the code in the next code buffers in sequence. The looking up the next host code buffer at the conclusion of processing the code in a present host code buffer includes searching for the next host code buffer the first time the present host code buffer is processed, and, after the next host code buffer is found or newly created, storing for the present host code buffer a pointer to the next host code buffer such that the search for the next host code buffer can be avoided the next time the present host code buffer is processed and its next host code buffer is looked up.

Other features and aspects of the invention will become clear from the detailed description that follows.

DESCRIPTION OF THE FIGURES

The present invention will be more fully understood from the detailed embodiments of the invention, which, however, should not be taken to limit the invention to any specific embodiment but are for explanation and better understanding only. The various embodiments in turn are explained with the aid of the following drawings:

FIG. 1 shows a single processor embodiment of a co-simulation design system according to the invention. Multiple processor embodiments are similar but include more than one processor simulator.

FIG. 2 shows a timing diagram of an example two-processor simulation according to an embodiment of the invention.

FIG. 3 shows the main loop of the processor simulator according to an embodiment of the invention.

FIG. 4 shows an embodiment of a linked list to represent the physical address to host memory mapping and a search for the host address of a physical address according to an embodiment of the invention.

FIG. 5A shows a three-level lookup table search mechanism to search a set of linked list elements representing the physical to host address mapping according to an embodiment of the invention.

FIG. 5B shows a lookup table embodiment for a multi-level lookup table search mechanism according to an embodiment of the invention.

FIG. 5C shows a linked list element that includes a replica lookup table according to an embodiment of the invention.

FIG. 6 shows another embodiment of a linked list to represent the physical to host address memory mapping together with a multi-level lookup table search mechanism according to an embodiment of the invention.

FIG. 7 shows a search mechanism to search for the next host code buffer (HCB) according to an embodiment of the invention.

FIG. 8 shows a simple example of the structure of an HCB and the mechanism for chaining HCBs according to an embodiment of the invention.

FIG. 9A shows one representation of a translation lookaside buffer (TLB) according to an embodiment of the invention.

FIG. 9B shows how the virtual address and the page size information may be encoded in a simulated TLB entry according to an embodiment of the invention.

FIG. 9C shows a search mechanism to search for a simulated TLB entry in a TLB simulator according to an embodiment of the invention.

FIG. 10A shows an embodiment of a TLB simulator that includes a search mechanism according to an embodiment of the invention.

FIG. 10B shows an element representing one or more blocks of physical memory together with a physical to host address mapping for use in a TLB simulator according to an embodiment of the invention.

FIGS. 11A-D show how information may be encoded in a simulated TLB entry so that no separate permissions and alignment check are needed according to one embodiment of the invention.

FIG. 12A shows a typical target address that may need to be searched in a cache.

FIG. 12B shows one representation of a cache according to an embodiment of the invention, including how a cache entry may be encoded in a simulated cache according to an embodiment of the invention.

FIG. 12C shows a search mechanism to search for a simulated cache entry in a cache simulator according to an embodiment of the invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The first one or two digits in a reference numeral indicate on which figure that reference numeral is first introduced. For example, any reference numerals between 100 and 199 are first introduced in FIG. 1, any between 200 and 299 are first introduced in FIG. 2, any between 1000 and 1099 are first introduced in FIG. 10, and so forth.

The method, processor model, and system embodiments of the invention include components that operate on a host computer system. The host computer system may be a single computer, for example, a computer workstation. Such workstations are readily available, and may operate under any operating system (OS) such as any variants of the UNIX operating system (including LINUX.TM.), or any variants of Microsoft Windows.RTM. (e.g., Windows NT, Windows 98, or Windows 2000 from Microsoft Corporation, Redmond, Wash.), or the Mac OS.RTM. (Apple Computer, Cupertino, Calif.). Some embodiments operate under a computer network that includes a plurality of host processors interconnected as a network, while other embodiments run on a multiprocessor computer that includes a plurality of host processors. The term "host computer system" thus means a computer system with a single host processor, or a plurality of interconnected host processors that may be interconnected as a network, or a multiprocessor computer.

The Co-Simulation System

FIG. 1 shows a design system embodiment of the present invention. Design system 100 operates on a host computer system and simulates an electronic system that contains target digital circuitry and at least one target processor executing a user program. In one embodiment, the user program is provided as executable binary code for execution on the target processor. The target processor typically may or may not have a pipeline, and may or may not include a memory management unit (MMU), also called a virtual memory system, that accepts virtual addresses and determines physical addresses or a cache system, or both a cache system and an MMU. The cache system may be either a single cache or a separate data cache and instruction cache. For the remainder of this description, "instruction cache" or "I-cache" shall refer to the separate instruction cache when the target processor has a separate I-cache, or the cache when there is only one cache. Similarly, any reference to a "data cache" or "D-cache" shall refer to the appropriate target cache, depending, for example, on whether or not there is a separate D-cache.

A hardware simulator 103 simulates the target digital circuitry. In one embodiment, the hardware simulator operates in a hardware description language, in particular Verilog, and so the description 105 of the target digital circuitry is provided by the user in the form of Verilog code. The invention can also work with other hardware description languages such as VHDL, and with hardware descriptions in terms of libraries, or libraries and classes written in a higher-level language such as `C,` or `C++.` Thus, the invention does not depend on the particular hardware models used in the hardware simulator 103.

Co-simulation design system 100 also includes a processor simulator for each processor that simulates the target processor executing the user program, and one such processor simulator is shown as 107. The processor simulator 107 executes the user program substantially on the host computer system, which provides for extremely rapid simulation of the software. While only one processor is shown, the simulation system can accommodate is additional processor simulators of additional target processors to simulate a multiprocessor system.

Processor simulator 107 simulates execution of a user program 109 on the target processor by executing an analyzed version 111 of the user program 109. The analyzed version of the user program is thus a program derived from the user program by an analysis process carried out by block 112.

In one embodiment, determining the analyzed version includes performing dynamic binary translation of the target code (the user program in binary form) into host instructions that carry out the same operation. The dynamic binary translation is carried out linear block by linear block by an analyzer/binary translator 112. The translated code, i.e., host code 114 for each linear block, is stored in a buffer 111 called the host code buffer (HCB) for that block to provide for reusing that translated code if the same linear block of target code is encountered again during the simulation. Thus, the analysis and translation process carried out by block 112 produces a plurality of HCBs 111. The host code in each HCB 111 is executed, and then the next linear block of target code is either translated into an HCB prior to execution, or, if already available in a previously filled HCB, directly executed.

The dynamic binary translation aspects of the processor simulator are described in more detail below.

The analysis of analyzer/binary translator 112 includes adding timing information 110 for the code in each HCB on the time it would take for the target processor to execute the user program 109 such that while the host processor executes the analyzed/translated version 114 of the user program in HCB 111, the processor simulator 107 generates accurate execution timing information incorporating the target processor instruction timing as if the user program 109 was executing on the target processor. For processors that have a pipeline, the timing information incorporates pipeline effects. Furthermore, for a processor that includes a cache and an MMU, the processor simulator includes a cache simulator 121 executing a cache model, and an MMU simulator 123 that simulates the operation of the target MMU. In one implementation for a target processor that includes a translation lookaside buffer (TLB) in the MMU, the MMU simulator 123 includes a TLB simulator. One embodiment of the processor simulator 107 further includes a memory mapper 125 that translates between target memory addresses and host memory addresses and using memory mapping information 108 that relates target addresses to host addresses.

An interface mechanism 119 is coupled to both the processor simulator 107 and the hardware simulator 103 and enables communication between processor simulator 107 and hardware simulator 103. Processor simulator 107 includes a communication mechanism 141 to pass information to the hardware simulator 103 using the interface mechanism when an event requires interaction of user program 109 with the target digital circuitry. Such events include times when user program 109 encounters an input/output instruction, or when the program has an arithmetic exception during execution, and other significant events.

In one embodiment, the target digital circuitry includes a target memory for the target processor, and the hardware simulator provides for simulating at least some of the operations of the target memory by running a hardware model 122 of the target memory, with the contents of the simulated target memory stored in the host computer system. Typically, the user selects to simulate only some bus transactions that may occur in executing the user program by running bus hardware model 124 on the hardware simulator.

In another embodiment, the co-simulation design system 100 provides for accurately simulating bus transactions. In such an embodiment, the description 105 of the target digital circuitry includes a bus hardware model 124 of the bus of the target processor. At least some of the operations of the target processor bus may be simulated by running bus hardware model 124 on the hardware simulator. Typically, the user may select to simulate only some bus transactions that may occur in executing the target processor by running bus hardware model 124 on the hardware simulator.

When both the target processor bus and the target processor memory are simulated by target memory model 122 and target bus model 124, a significant event may include, for example, the cache simulator's determining that a cache miss has occurred that requires a number of bus cycles and memory accesses to be simulated in the bus model 124 and memory model 122 of the target digital circuitry. In such an example, the user may choose to simulate these bus and memory transactions using the target memory model 122 and target bus model 124. Note that the memory model 122 preferably does not store actual data but rather uses the memory of the host computer system for data storage. Similarly, the bus model 124 preferably does not move actual data but rather simulates the timing of the bus cycles required to move data.

The hardware simulator 103 also includes a communication mechanism 143 to pass information to processor simulator 107 using the interface mechanism at events significant to the hardware simulator 103 that need to be communicated to the processor simulator. Such an event includes when a signal in the target digital circuitry connected to the target processor is asserted, for example, an interrupt.

The interface mechanism 119 passes the information across the hardware/software boundaries. One interface mechanism embodiment 119 includes a message passing kernel. Thus, in one embodiment, both the processor simulator and the hardware simulator communication mechanisms 141 and 143 are included in interface mechanism 119. Also, the processor simulator and the hardware simulator are tasks under the kernel, and the kernel provides the mechanism for the tasks to communicate whenever one or the other task requires it. When several processor simulators operate, each runs independently as a task under the kernel.

Those in the art will appreciate that other types of interface mechanisms are possible, including using multiple threads, and using a complete or partial operating system.

The hardware simulator and the processor simulator each has its own definition of time, i.e., its own time domain (time frame), with the interface mechanism providing a mechanism to synchronize time whenever processor simulator 107 and hardware simulator 103 need to communicate. Similarly, when several processor simulators operate, each processor simulator has its own concept of time, i.e. each accumulates simulation time according to a time frame, as does the hardware simulator.

The analyzed/translated version of the user program is obtained linear-block by linear block by the analysis and binary translation process performed on the target code--user program 109--by analyzer/binary translator 112. The analysis/binary translation process includes dynamic binary translation of the target code into corresponding host code linear block by linear block and insertion of dynamic hooks 106 into the corresponding host code in each HCB that during execution invoke dynamic mechanisms that may effect timing and that cannot be determined ahead of execution. For example, dynamic hooks 106 may include invoking the cache simulator 121 when there are memory references in the user program 109 that may produce a cache miss. In the case the target processor includes an MMU and the processor simulator 107 includes an MMU simulator 123, analysis further includes identifying those parts in the linear block of the user program that include one or more memory references that might require accessing the MMU, and inserting hooks into the corresponding host code in the corresponding HCB to invoke the MMU simulator 123 during execution of the analyzed (host) program to simulate dynamic address translations that would occur if the user program was being executed on the target processor. The analysis/translation process of unit 112 may further include calculating for each linear block the time delay that would be incurred by executing that linear block on the target processor and adding host code to advance the simulation time.

The time calculating uses characteristics 117 of the particular target processor, including instruction timing and characteristics of the processor. Such processor characteristics may include pipeline characteristics for a target processor that includes a pipeline so that the result is the analyzed program which includes the instruction of user program 109, and timing information 110 that includes pipeline effects.

When an event occurs that requires the processor simulator to communicate to the hardware simulator, the processor simulator's communication mechanism 141 sends information to hardware simulator 103 associated with the event through the interface mechanism 119. The hardware processor receives the associated event information and processes it. Typically, the event may be an input/output instruction in the user program to send information or to poll a port or to execute a number of bus cycles, or otherwise to interact with the hardware simulator.

The associated event information preferably includes time delay information indicating an amount of simulated time since a previous event occurred, such as when the processor last started or resumed operation, or when the processor simulator last sent event information, or when the hardware simulator last received event information. The hardware simulator 103, upon receiving the time delay information, executes for an appropriate amount of hardware simulation time.

Typically, the processor simulator 107 operates much faster than the hardware simulator 103. That is, simulation time is consumed much faster (in real time) on a processor simulator than on a hardware simulator because hardware simulator 103 of design system 100 models the digital circuitry 105 in detail, while the processor simulator 107 does not model the architectural detail of the target processor, but rather runs the user program substantially on the host computer system. The timing detail comes as a result of the analysis process 113 and in accumulating the delay during processing using timing information 110.

In one embodiment, the hardware simulator provides a simulation time-frame for the design system. That is, simulation time is started and maintained by the hardware simulator, and whenever synchronization is required, all times are synchronized to the hardware simulation time, which is the simulation time for the system.

The design system also includes a suspend mechanism 149 and a resume mechanism 151 coupled to the processor simulator that allow the processor simulator to suspend and resume operation. In one embodiment, the suspend and resume mechanisms are in the interface mechanism 119 and provide for suspending and resuming operation of any task. In one embodiment, when the processor simulator sends associated event information which includes time delay information, it passes a message to the kernel in the interface mechanism that causes the processor simulator to be suspended. The resumption mechanism uses the interface mechanism to place events on an event queue in the hardware processor. Thus, when the processor simulator suspends, the kernel also restarts the hardware simulator and places instruction in the hardware simulator's event queue to resume the processor simulator at some later time. The hardware processor then continues until an event is reached which causes the processor simulator to resume, for example, a previously scheduled resumption of the processor simulator in its event queue.

Thus, in one embodiment, the suspend and resume mechanisms of the interface mechanism 119 use an event queue which is in the hardware simulator. Those in the art will appreciate that other interface mechanisms and resume and suspend mechanisms may be used. For example, in an alternate embodiment, the processor simulator and the hardware simulator are independent tasks running under the interface mechanism, and the interface mechanism schedules all tasks by maintaining its own one or more event queues.

Thus, in one embodiment, when associated event information including time delay information is sent by processor simulator 107 to hardware simulator 103, the suspend mechanism suspends operation of processor simulator 107 while hardware simulator 103, upon receiving the time delay information, executes for an appropriate amount of hardware simulation time. Once hardware simulator 103 processes the event information and produces an event result, such as a signal being asserted, or simply the time delay being consumed, it typically sends the event result to processor simulator 107. The resume mechanism 151 resumes operation of processor simulator 107 upon the processor simulator receiving the event result.

Note that if no time delay needs to be executed by the hardware simulator, such as when the processor simulator is already in time synchronization with the hardware simulator and does not have any internal events that need to be processed in that simulation time, the processor simulator need not suspend operation. As another example, the user program may encounter a command that asks only for the current hardware simulation time. Or the user program may encounter an input/output command before the processor simulator has accumulated any delay since the last access to the hardware simulator. There would not be any need to suspend operation under such circumstances.

With one embodiment of the suspend/resume mechanisms, when the processor simulator's execution is suspended, the delay time passed to the hardware simulator is used to schedule the resumption of the suspended task, by placing a delayed event on the hardware simulator queue to have the interface mechanism to resume executing the suspended user program task running on the processor simulator.

One event result may be an interrupt that occurs in the digital circuitry during the execution of the time delay. The interrupt is communicated to the processor simulator 107, and upon receipt of this event result, on resumption of the processor simulator, processes the interrupt by calling an interrupt handler.

The design system 100 also includes a processor shell 153 in hardware simulator 103 that simulates activity of at least some of the external hardware entities of the target processor, in particular, those signals that are connected to the digital circuitry of the target electronic system which affect a user program. Included are those hardware variables and other hardware entities the user program may access or that may generate asynchronous events such as interrupts. As an example, the hardware shell provides access to the reset line or the pause line of a processor. The processor shell normally would provide an interface to the hardware simulator in the hardware description language (e.g., Verilog). Note that by "signal" we mean a signal or a hardware variable or an event or any other general entity defined within the hardware simulator.

In one embodiment, the design system 100 also includes a hardware-to-software mapper 147 that translates information from the second format understandable in hardware simulator domain, such as a signal assertion to indicate some asynchronous event, or register contents, or simulation time, to the first data format understandable in the processor simulator domain, for example, to one or more software variables accessible to the user program.

Since simulation speed is extremely important, and since a single host processor can only process a single task at a time, the invention also provides for carrying out the simulation in a multiprocessor computer that includes several host processors. In such a system, the processor simulator operates on one or more of the host processors, while the hardware simulator operates on one or more other host processors. The interface mechanism is programmed to handle the communication between the processor simulator host processor, and the other host processors executing the processor simulator. How to implement such an arrangement would be clear to those in the art.

The invention also provides for carrying out the simulation in a host computer system that includes several host processors interconnected using a network connection. In such a system, the processor simulator operates on one or more of the host processors, while the hardware simulator operates on one or more other host processors. The mapper and the translator also may operate on a separate host processor of the network. That is, the processor simulator is coupled to the mapper and the translator by a first computer network connection, with the interface mechanism controlling communication between the processor simulator and the mapper and translator over the first network connection. Also the hardware simulator is coupled to the mapper and to the translator by a second network connection, with the interface mechanism controlling communication between the mapper and the translator, and the hardware simulator over the second network connection.

Note that the tasks of an individual processor simulator can be split across several host processors of the host computer system. Similarly, the tasks of the hardware simulator can be split across more than one host processors of the host computer system. Other networked or standalone multiprocessor combinations and permutations of operating the elements of the design system will be clear to those in the art.

Timing

Typical operation will now be explained with the aid of FIG. 2 which shows an example of the timing of execution of a design system such as that of FIG. 1, but including two processor simulators, processor simulator 1 and processor simulator 2 simulating two processors, processor 1 and processor 2, respectively, and the hardware simulator. Each processor simulator is similar to the processor simulator 107 of FIG. 1

The hardware simulator provides the simulation time-frame. Any units of time may be used, and clock cycles will be assumed to be the unit of time. Each of processor 1 and processor 2 may have different speeds and thus its own simulation time. Assume that the first task is some execution for a time .DELTA.T1 until time T1. At this time, a start signal in the digital circuitry starts the processor simulator for processor 1. Processor 1 executes for a time .DELTA.T2 until time T2 (measured in processor simulator 1's simulation time). Suppose at this point, processor simulator 1 encounters a memory reference that causes the cache simulator to perform a cache lookup, and the cache lookup determines that there has been a cache miss. This cache miss event may cause processor simulator 1 to use its communication mechanism to send the event information to the hardware simulator. This is necessary if the memory being accessed is determined to be shared by the two processors. This in turn causes the suspend mechanism to suspend operation of processor simulator 1.

Note that while processor simulator 1 has consumed .DELTA.T2 of simulation time, the hardware simulator has barely moved because the processor simulator executes so much faster than the hardware simulator on the host computer system. Thus when the information is communicated to the hardware simulator, it is still at time T1, and constrained to be less than T2.

The hardware simulator now processes the associated event information, which in this example is to execute a required number of bus cycles on the target bus model of processor 1 included in the digital circuitry. The hardware simulator returns to processor simulator 1 when it has executed the required number of bus cycles, say time delay .DELTA.T2 at time T2.

Starting from T2, processor 1 executes for a time .DELTA.T6 until time T5 (measured in processor simulator 1's simulation time). Suppose at this point, processor simulator 1 encounters an interface function to send a signal to the digital circuitry. It now uses its communication mechanism to send the event information to the hardware simulator. This in turn causes the suspend mechanism to suspend operation of processor simulator 1. While processor simulator 1 has consumed .DELTA.T6 of simulation time, the hardware simulator again has hardly moved, so is still at time T2. The hardware simulator now processes the associated event information, which may be to determine a variable and return its value to processor simulator 1 when it has executed the time delay .DELTA.T6 at time T5. However, before reaching T5, after only .DELTA.T3 of simulation time has been consumed, at T3 (<T5), a signal in the digital circuitry causes the second processor simulator (processor simulator 2) to start executing. It processes for .DELTA.T4 and encounters an interface function at time T4, at which time it sends the information associated with the interface function encountering event (e.g., an input/output instruction) to the hardware simulator, which has not progressed much beyond T3, and is constrained to be less than T4 and T5.

The hardware simulator now continues to execute, including processing the new event information, until it reaches time T4, at which time the hardware simulator task in an event queue of the interface mechanism causes the resume mechanism to re-start the suspended process. Processor simulator 2 now processes for time .DELTA.T8 at which time another significant event occurs. This causes the hardware simulator to process until the next time in its queue. This occurs after .DELTA.T9 at time T5 when the processor simulator 1 recommences operation. The processor 1 continues operation until the next significant event, which occurs at time T6. The significant event is to wait .DELTA.T11 units of simulation time.

Note that one aspect of the invention is the capability of modeling processing to a selected level of accuracy, and in this instant, the user has selected to "behaviorally" model rather than accurately model hardware known to require .DELTA.T11 units of simulation time to operate as a means of saving host computer simulation time. So the software task is now suspended and the interface mechanism returns to the hardware simulator not long after T5 in the hardware simulator's time-frame.

Starting from T5, the hardware simulator executes for .DELTA.T10 until T6. The hardware simulator now reaches the time when the first processor simulator's operation was suspended (in hardware simulation time). Note that the hardware simulator does not pass control to the software task, but rather continues to process for the .DELTA.T11 delay requested. That is, the event queue information on the processor simulator 1 is to restart at time T7. When the hardware simulator reaches T7, the processor 1 simulator indeed resumes operation for .DELTA.T 12, and so forth.

Dynamic Binary Translation

Instruction interpretation is known for running binary target code on a host processor. Each instruction is interpreted and run on the host processor. One embodiment of the processor simulator achieves high speed through the aggressive use of dynamic binary translation, i.e., on-the-fly binary translation. Rather than simulating the target processor by interpreting the binary instructions of the user program one-by-one, or by cross compiling the higher language user program, the analyzer/binary translator 112 of the processor simulator translates linear blocks of target binary code into host code 114 that, when executed, each simulates the execution of the linear block of target code. This use of binary translation allows the processor simulator to eliminate, for example, most of the overhead of instruction interpretation.

The processor model that operates in the processor simulator includes simulation models for the hardware components commonly present on modern computers including processors, memory subsystems, DMA and interrupt controllers, consoles, network interfaces, disks, frame buffers, mice, and keyboards. Each of these components may be simulated to any desired degree of accuracy. For example, an initial run may not include accurate time simulation, so that many of the hardware components may be simulated by using behavioral simulation. If more accuracy is required in any of the hardware components, the model may be modified to include more detail. For example, it may be desirable to run some of the target computer subsystem as a hardware model, for example the memory model 122 and the bus model 124.

Thus the invention provides for having a range of compatible simulation models of processors, memory systems, and I/O devices that vary greatly in simulation speed and detail. The user is able to select the simulator that provides the desired level of simulation detail with the greatest possible speed. For example, a processor pipeline study might use a processor model that includes pipeline effects.

In one embodiment, Dynamic binary translation is the main technique used to achieve speed. Dynamic binary translation is known and used for several applications. See for example Richard L. Sites, Anton Chernoff, Matthew B. Kerk, Maurice P. Marks, and Scott G. Robinson, "Binary Translation," Communications of the ACM, vol. 36, no. 2, pp. 69-81, February 1993, Bob Cmelik and David Keppel: "Shade: A Fast Instruction Set Simulator for Execution Profiling," Proceedings of the 1994 ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, pp. 128-137, Nashville, Tenn., 1994. See also Emmett Witchel and Mendel Rosenblum: "Embra: Fast and Flexible Machine Simulation," Proceedings of the 1996 ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems, pp. 68-79, Philadelphia, Pa., 1996.

Adapting dynamic binary translation to implement the processor simulator 107 required extensions to the known dynamic binary translation techniques, for example, to include modeling the MMU of the target processor and the cache hardware of the target processor.

Processor simulator 107 operates in three phases:

1. Configuration phase: In this phase, the target processor simulator is configured according to the target processor being simulated. Configuration includes defining one or more registers of the host processor or one or more host memory locations or both to simulate the target processor registers, and assigning one or more host memory blocks to simulate the physical memory.

2. Translate phase: In this phase, the target code is analyzed and translated to produce host code to simulate the target code together with timing information and hooks for dynamic calls to determine time dynamically. The translated and analyzed code is stored in one or more host code buffers (HCBs).

3. Execution phase: During the run phase, the host code in an HCB is run to simulate the execution of the linear block of target code. The results of execution are stored in the simulated registers as appropriate.

The Configuration Phase

One embodiment simulates the MIPS R3000 and R4000 32-bit instruction set architecture (MIPS Technologies, Inc., Mountain View, Calif.). Another simulates an ARM processor (ARM Ltd., Cambridge, United Kingdom). The invention is easily adapted to other processors, for example CISC types, other RISC types, and DSP processors.

A physical model of the target processor system is created in the configuration phase including the processor's internal registers and its view of the MMU, cache, and memory subsystems. The level of accuracy is user determined. For example, for very high speed performance, one embodiment estimates rather than simulates each resulting bus transaction. The target processor may have several busses, e.g., one for the cache, coprocessor, memory, debug, etc. The configuration file--called the Target Information File (TIF) herein--includes a description of the system level timing interactions with the MMU, cache, and memory subsystems, and each of their respective busses. Alternate implementations might include a database for the configuration information and a graphical user interface.

Another embodiment includes one or more bus simulators as part of the digital circuitry simulated by the hardware simulator, so that when full bus level accuracy is required, actual bus cycles are simulated, and hooks are provided during the analysis/translation phase to invoke the relevant bus simulator to simulate the appropriate number of bus cycles.

The processor model set up in the configuration phase includes:

1. A model comprising one or more host registers and possibly one or more host memory of the target processor registers and any co-processor interfaces.

2. An MMU simulator that includes a fast TLB simulator that provides a fast means of target virtual to target physical address translation. In one embodiment, the MMU simulator also includes translation to host virtual addresses.

3. A model comprising host memory of the physical target memory that includes the memory containing the target binary code and any target data regions.

Target Memory Image Data

The processor simulator accepts target binary images of code and data, together with the TIF that includes information relating to the address mapping, performance and level of accuracy (e.g., high-speed bus model vs. full bus model) on accesses to particular address accesses.

Target registers.

The simulated target processor registers in the processor simulator include the target processor's registers. Also included are storage locations (host registers or host memory or both) that provide fast access to some other information, such as instruction and clock counters, at run time. The size of the host address space is dependent on the host virtual address space. An x86 host processor (e.g., INTEL Pentium) uses 32 bits and other hosts might use other sizes.

The program counter (PC): The program counter (PC) is also simulated, and in one embodiment is updated anytime its value can be observed by a user of the co-design system, either within the user program, or by a debugger. Most modern processors support multiple virtual address spaces in order to support multi-user operating systems and multi-process user programs. Furthermore, many modern processors support concurrent but disjoint virtual address spaces. Thus, in addition to the target PC, modern target CPUs may use additional information to generate a virtual address. For example, the ARM processor includes a "process ID" (PID) to modify the virtual address to generate a "modified virtual address." Other processors may use other information to generate a "complete" virtual address. In one embodiment, information such as the PID is used in addition to the PC to generate simulated virtual addresses in the processor simulator.

The Physical Memory Simulation

The physical memory of the target processor is specified in the TIF file. The configuration stage includes allocating blocks of host memory to model the blocks of target memory and building a mapping 125 from physical memory to host memory that provides for translating a target processor physical address to the location or locations in host memory where that target address is stored.

In one embodiment, the memory mapping information 108 is stored in the form of a memory allocation data structure, with each element of the memory allocation data structure representing a block of physical memory, whether allocated or not, i.e., a block of physical memory addresses, and each memory allocation data structure element representing allocated physical memory including a pointer to where in the memory space of the host processor that physical block of memory is stored. In one embodiment, the memory allocation data structure is an ordered linked list, with each element of the linked list representing a block of physical memory addresses that, if allocated, includes a pointer to where in the memory space of the host processor that physical block of memory is stored. Each element of the linked list is labeled by the physical address of the block of physical memory, and an indication of whether that block has or has not been allocated, i.e., is part of physical memory. In one implementation, physical address ranges that are not mapped to host memory (to simulate target memory) are mapped to either a function that is used to access the virtual bus or bus simulated in the hardware processor, or to a function that reports an attempted access to unallocated memory.

Thus in one embodiment, the linked list contains every possible physical address. Some of the linked list elements point to host memory, and some, including those blocks that are unallocated, do not.

Other data structures may be used for the memory allocation data structure. One alternate embodiment, for example, uses an array.

In the embodiment using the linked list, the linked list elements are ordered by their respective end addresses, so that a linked list element that represents a physical block that ends at address 3M, for example, appears earlier in the list than the linked list element that represents a block of memory that ends at address 4M. If the end addresses are the same, then the smaller linked list element comes first. In one embodiment, the last linked list element covers the full physical address space.

A lookup mechanism is used to look up a physical address to determine the corresponding host address using the linked list. The lookup mechanism searches for the linked list element that represents the physical address, and an element of the found linked list element points to the host address. In one embodiment, each linked list element represents (and describes) a physical block of target memory of length some power of 2, and also points to the allocation of host memory to represent that physical block of memory in the processor memory. The physical blocks that the linked list elements represent are partitioned in a naturally aligned manner, so that 1 MB long blocks, for example, must be aligned to a 1M natural boundary in the target address space.

In one embodiment, in the case of the target including a paged memory system, the smallest block size is the smallest page size. So for an MMU that allows page sizes from 4 KB to 4 MB, the smallest block is 4 KB.

FIG. 4 shows by way of example a first implementation of the ordered linked list 400 for the memory map for a target processor that has the following two blocks of physical memory specified in the TIF:

At physical address 3M there is a block of 2 MB of memory

At physical address 10M there is a block of 4 KB of memory.

According to the first implementation, the linked list is of contiguous non-overlapping blocks of memory addresses. There are linked list elements for both allocated and unallocated memory blocks. The first unallocated space is from 0 to 3M physical address. Because a block cannot be 3 MB long, the first linked list element 403 represents physical addresses from 0 to 2M. The next linked list element 405 represents physical addresses from 2M to 3M. There now is a need to represent a 2 MB block of allocated memory. Because of the natural alignment requirement, a linked list element starting at address 3M cannot represent a block longer than 1 MB, so the next linked list element 407 represents physical addresses from 3M to 4M, and this is followed by another linked list element 409 representing physical addresses from 4M to 5M. The next linked list element 410 again represents a 1 MB block for physical addresses from 5M to 6M, and this is followed by a linked list element 411 representing a 2 MB block of physical addresses from 6M to 8M. The next block represented by linked list element 413 is 2 MB long for physical addresses from 8M to 10M. At physical address 10 MB, there is an allocated block of 4 KB, so the next linked list element 415 represents physical addresses from 10M to 10M4K. The next linked list element 417 is also 4 KB long representing physical addresses from 10M4K to 10M8K. There now follow a sequence of linked list elements that represent larger and larger blocks. Linked list element 419, for example, represents a 16 MB block of physical memory addresses from 16M to 32M, then linked list element 421 represents a block of addresses from 32M to 64M, and so forth until the last linked list element represents a block ending in the largest physical address, 4G in this example.

Those linked list elements that represent physical addresses that are allocated each includes a pointer to the block of host memory 425 that represents that allocated target memory in the processor simulator. Suppose, for example, that the 2 MB block of target memory is represented by the two contiguous blocks of host memory 429 and 433 in the host address space starting at a first host address and a second host code address, respectively. Suppose furthermore that the 4 KB block of target memory is simulated by a block of host code 437 starting at a third host code address. Then linked list element 407 includes a pointer 427 to the block 429 of host memory at the first host memory address, linked list element 409 includes a pointer 431 to the block 433 of host memory at the second host memory address, and linked list element 415 includes a pointer 435 to the block 437 of host memory at the third host memory address.

A lookup mechanism 447 searches for a match of a physical address 445 in the linked list 400. That is, given a physical address 445, the lookup mechanism 447 compares the address range of each linked list element to find the linked list element that represents the given physical address. In the invention, the lookup mechanism uses a speed-up technique for searching the linked list 400 for a physical address match.

The Search Mechanism

Many methods to look up an address 445 can be used in lookup mechanism 447 to find the linked list element that has an address 445, included sorted array methods, binary tree search techniques, and so forth. In one embodiment, a multi-level table lookup method is used. FIG. 5A shows one three-level table lookup method 500. The part of the address to be looked up is partitioned into three parts PART1, PART2, and PART3 in one embodiment. A three-level sequence of lookup tables is used to lookup the contents of PART1, PART2, and PART3, respectively. So ADD1 (503), the contents of PART1 is used to index TABLE1513. The indexed element of TABLE 1 includes a pointer to the next level table--TABLE2515--to use. ADD2 (505), the contents of PART2 is used to index TABLE2515 and the indexed element of TABLE2 includes a pointer to the next level table--TABLE3517--to use. Finally, ADD3 (507), the contents of PART3 is used to index TABLE3517 and the indexed element of TABLE3 includes a pointer to the linked list element that contains the physical address 445. Thus each valid lookup table element points to the next lookup table to use; the last lookup table valid elements point to the linked list elements.

There may be a part--PART4 (509) of the address 445 containing the bits ADD4 that is never used for the address look up for the linked list element. Suppose, for example, that the minimum page size is 4K. The smallest allocation is then in blocks of 4 KB, so that no linked list element represents less than a 4 KB block of addresses. Looking up the linked list element of a physical address can then ignore the 11 lowest order bits of the address, shown as PART4509 in FIG. 5A. The remaining bits of the address are then looked up in three steps, one for each partition of the address.

Suppose further that the largest possible block is 4 MB long. Then bits of address 445 from bit 11 to bit 21 (with the lowest order bit being bit 0) are sometimes used, depending of the block size, and the bits from bit 22 and higher are always used. In one embodiment, the lowest level table, TABLE3, is indexed by the contents 507 of PART3 of the address 445 that may or may not be used depending on the block (e.g., page) size. The part of the address that is always used is indexed in two steps by the first and second level tables. Thus, for a 4 BM block, all entries in the lowest-level table TABLE3 point to the same linked list element, while for 4 KB blocks, each entry in a lowest level tableTABLE3 point to different linked list elements. If each block is 2 MB, then half the entries in the lowest-level table point to one linked list element, and the other half point to another linked list element.

In one embodiment, each table lookup operation is carried out by a shift, mask, and indexing operation as follows:

shift right by number of bits corresponding to PART2, PART3and PART4.

mask to give ADD1

index TABLE1 with ADD1

shift right by number of bits corresponding to PART3 and PART4.

mask to give ADD2

index TABLE2 with ADD2

shift right by number of bits corresponding to PART4.

mask to give ADD3

index TABLE3 with ADD3

Thus thethree-level lookup operation may be carried out in about 9 instructions. Looking up the host address or determining that this is unallocated is provided by the link in the linked list element. The low order bits PART4 together with the linked list element link to host memory, for example, may then be used to access a particular host memory location corresponding to the physical memory address 445.

Note that while a three-level lookup mechanism is described herein, any number of levels may be used.

Conserving Lookup Table Space

The multi-level tables in the lookup mechanism embodiment described above are fully populated. That is, every entry in the set of lookup tables is populated, which potentially requires a huge amount of storage. Many lookup tables however are redundant. For example, in the case of 4 MB blocks, since every entry in the lowest level lookup table points to the same linked list element, that lookup table is not needed--it is redundant; the second-level lookup table can simply point directly to the linked list element.

This however introduces special cases; there would need to be a test to determine if the search has arrived at the linked list element, or is at a lookup table.

An improved embodiment provides for eliminating redundant lookup tables while eliminating the need for special cases. According to the improved embodiment, each lookup is identical, requiring three sets of shift-mask-lookup operations. FIG. 5B shows an improved embodiment lookup table 521. Each lookup table such as 521 includes, in addition to the pointer fields (one for each table entry) that each has a pointer such as pointer 527 to either a next level lookup table or a linked list element, two fields 523 and 525 that contain the shift and mask, respectively, to apply to the address field 445 prior to the lookup. Furthermore, a replica lookup table that includes the shift and mask fields together with a single pointer field included in each linked list element. See for example FIG. 5C which shows an improved version 531 of the 5M-6M linked list element of the example of FIG. 4 that includes a single entry replica lookup table with mask and shift fields (533 and 535 in this example) and a pointer field 537. The pointer in pointer field 537 includes a pointer 539 that points back to the linked list element 531. The mask and shift fields (533 and 535 in this example) are irrelevant, and in one embodiment are set to 0. Any address part indexing the replica lookup table in the linked list element finds the pointer 539 back to the linked list element.

For a second level lookup table, when the lookup tables are set up, in the case of the largest (e.g., 4 MB) page size, rather than the element pointing to a third-level lookup table that has every element pointing to the same linked list element, the second-level lookup table element points directly to the linked list element. This saves the storage for the third level lookup table. The lookup is still carried out with three sets of shift-mask-lookup operations as if there was a third-level lookup table. The replica lookup table in the linked list element acts as that third level lookup table.

Similarly, it may be that many of the first level lookup table elements all ultimately lead to the same linked list element. For example, consider the example of FIG. 4. Many of the first level lookup table elements may lead, after three table lookups, to the last linked list element 423. All the second- and third-level lookup tables that follow such a first lookup table may be eliminated by having the first level lookup table element point directly to the linked list element. The replica lookup table in the linked list element then acts first as the second level lookup table then as the third level lookup table in the three-step lookup process, without requiring any special tests.

Improved Linked List Representation

In the above-described implementation, the linked list includes blocks of physical memory addresses, each of a block size and that each starts and ends at a natural block boundary for that block size. Some linked list elements point to where in the host memory space that block of physical addresses is simulated in the processor simulator, and others are null, representing physical addresses that are not allocated in the target processor.

There typically, therefore, is a large number of blocks, and many of these blocks are null, i.e., represent unallocated addresses.

In an improved embodiment, a single linked list element is used to represent the whole address space. Every linked list element but the last represents a block of allocated physical addresses. A single--the last--linked list element replaces: 1) the set of linked list elements for addresses lower than the first block of allocated addresses, 2) the set of linked list elements for addresses higher than the highest allocated physical address, and 3) the sets of one or more linked list elements that represent addresses in between blocks of allocated physical addresses.

FIG. 6 shows the linked list 600 for the same example as FIG. 4, but using the compressed linked list representation and linked list elements that each includes a replica one-element lookup table. Linked list element 611 is for a 1M block of physical addresses from 3M to 4M and includes a pointer to the location in the host memory space 425 used to simulate this 1 MB block of physical memory. Linked list element 613 is also for a 1M block of physical addresses from 4M to 5M and includes a pointer to the location in the host memory space 425 used to simulate this 1 MB block of physical memory. Linked list element 615 is for the 1K block of physical addresses starting at address 10M and includes a pointer to the location in the host memory space 425 used to simulate this 4 KB block of physical memory. The linked list element 617 is for all unallocated physical addresses.

The linked list 600 is constructed during the configuration phase. Once constructed, the set of lookup tables 650 is constructed. One level 1 table, TABLE1603, one level-2 table TABLE2 (605) and two level 3 tables, TABLE3-1 (607) and TABLE3-2 (609) are shown. Suppose in TABLE1 there is an entry 621 that points to TABLE2. Suppose further than in TABLE2, there is an entry (625) pointing to TABLE3-1. TABLE3-1 may contain one or more entries (all shown as 631) pointing to linked list element 611 and one or more entries (all shown as 635) pointing to linked list element 613. Suppose further that there is another element 623 in TABLE2 that points to TABLE3-2, and that there is an element 627 is TABLE3-2 that points at the linked list element 615. All other elements in TABLE3-1 and TABLE3-2 represent unallocated addresses so point at the linked list element 617. The links from these third level tables are not shown in the drawing.

To avoid having redundant level 3 tables, e.g., tables whose elements all point to linked list element 617, many of the TABLE2 elements point directly to linked list element 617. Similarly, many of the TABLE 1 elements may point to level-2 tables whose elements point to level-3 tables whose elements all point to linked list element 617. These level-2 and level-3 tables are redundant, and to avoid building such redundant tables, the links in TABLE1 point directly to linked list element 617.

There also may be other level-2 tables with an element pointing to a level-3 table whose elements all point to one of the linked list elements that represent allocated addresses. In such a case, the level-3 table is redundant, and the link to it is replaced with a link to link directly to the linked list element. The replica lookup table of that linked list element than acts as the lookup table for the third lookup.

In one embodiment, the linked list representing the physical memory and the physical memory-to-host relationship is used with the lookup mechanism to determine a host address from a virtual (target) address during simulation.

In an improved embodiment, the linked list representing the physical memory and the physical memory-to-host relationship is built during the configuration phase, and used in the MMU simulator so that the MMU simulator both translates virtual addresses to physical addresses during simulation, and automatically also provides the host address.

Modeled Memory State

In one embodiment, each allocated block of modeled target memory (i.e., host memory) is partitioned into sub-blocks called "chunks" herein, and an element of a sub-block state buffer is assigned to each sub-block of modeled target memory. Thus, during the configuration stage, a sub-block state buffer (or "chunk state buffer") is defined, with each element in the sub-block state buffer providing a state for the corresponding sub-block of the physical memory modeled in host memory. In one embodiment, each modeled target memory sub-block can have one of the states shown in the following table that also includes the code stored in the sub-block state buffer element in one embodiment.

            State                            Code Value
            uninitialized (U)                    -1
            loaded (L)                           -2
            modified (M)                         -3
            translated by multiple                0
            translated processor N (T:N)             N


The state information provides for automatically ensuring that translated code that has subsequently been modified, for example by self-modifying target code, is not run again. The state information is further used to indicate the memory state in the case that the electronic system includes a plurality of target processors sharing the same memory.

In one embodiment, each modeled memory sub-block is the same size as an equivalent line of the instruction cache.

Initially, during the configuration stage, all the elements of the sub-block state buffer are set to uninitialized.

The Translate Phase

The translate phase is now described in more detail. In one embodiment, linear blocks are the unit of translation, and each translated linear code block is stored in an HCB. A linear block is a target code sequence that ends with a jump or branch instruction or any instruction that may change the mode of the target processor. In one embodiment, the linear block is selected have a maximum length equal to the length of a cache line of the cache used for instructions (the I-cache). This provides for fast execution because when executing the translated code, the processor simulator may assume all the target code of that linear block is in cache, and no instruction cache misses will occur. Furthermore, in one embodiment, the linear block ends at any end-of-page boundary in the case that an MMU with paged memory is included. That is, the target code is translated no further than where there is a change in address which might result in a change of permissions from the current location in memory. This further provides for fast execution because the translated host code of a linear block of code may be executed without requiring permissions check. Note that by definition a cache line cannot cross a page boundary.

The translated code is maintained in the HCB even after execution of the code in case that code is executed again. This avoids having to re-translate the same linear block of target code. That is, the HCB is used as a translated code cache.

A basic translation includes accessing the source registers of an instruction from the simulated registers, setting up the HCB data structure for an HCB, translating the linear block of target code into the HCB, and adding hooks for any required memory references, MMU lookups, and cache lookups. In the case that a full-accuracy bus modeling is used, the translate phase includes adding the hooks to invoke the appropriate number of bus cycles for the particular bus. The translate phase also includes adding host code to the HCB to accumulate simulation time according to the characteristics of the target processor instructions being simulated in the HCB.

Because the code in an HCB is guaranteed to be in the cache during execution of the linear block of the HCB, one of the hooks inserted in an HCB is to determine at run time whether or not the target code simulated by the HCB is in the I-cache, i.e., whether an instruction-cache miss will be generated. This I-cache lookup is only needed once. In one embodiment, the hooks to check the I-cache are inserted at the start of the HCB so that during execution of the code in the HCB, a check is first made to determine whether or not there is an I-cache miss generated.

Data references are identified and hooks inserted to simulate data cache lookups and also to check and as necessary modify the state of the chunk of target memory being modified to provide for automatically identifying self-modifying code. The state is defined in an ele