Method and apparatus for handling cache misses in a computer system6272516Abstract A method for handling cache misses in a computer system. A prefetch unit fetches an instruction for execution by one of a plurality of coprocessors. When the preferred embodiment of the present invention experiences a cache miss in a prefetch unit, the process for which an instruction is being fetched is passed off to a memory processor which executes a read of the missing cache line in memory. While the process is executing in memory processor, or queued by the scheduler for execution of the same instruction, the prefetch unit continues to dispatch other processes from the its queue to the other processors. Thus, the computer system, including the processors, do not stall. Processors continue to execute processes. The prefetch unit continues to dispatch processes. When the memory read is completed, the process in which the cache miss occurred is rescheduled by the scheduler. The prefetch again attempts to fetch and decode the instruction and arguments. If another cache miss occurs, the process is again dispatched to the memory processor. Upon reading the cache line, the memory processor again sends the process to the scheduler's queue. Claims What is claimed is: Description COPYRIGHT NOTICE
Instruction Set
Instruction Operation
add adds two numbers
subtract subtracts two numbers
and performs Boolean AND of two
numbers
or performs Boolean OR of two
numbers
move moves data from one
register/memory location to
another
shift shifts a number a specified number
of places left or right
add_complex adds two complex numbers
load reads the contents of a memory
location into a register
store writes the contents of a register into
a memory location
atadd atomically adds to a memory
location
atdeli waits for a memory location to be
equal to a specified value
lookup searches for a key in a table
branch branches to a segment of code
specified by an address
The data set comprises the following as shown in the table below. Each of the data items is first class, meaning the data items can be manipulated directly.
Data Set
Data Item Definition
Dn A general purpose register. In the
preferred embodiment, there may
be 16 to 32 general purpose 64-bit
registers.
An A pointer register. In the preferred
embodiment, there may be 4 32-bit
pointer registers.
M A memory location pointed to by a
pointer register.
[field] any contiguous bit field of a D or A
register, or M memory location.
As provided by the instruction and data sets shown above, the contents of a memory location can be manipulated directly. Thus, there is no need for a load and store instruction when memory is being accessed. Furthermore, accessing memory is no slower than accessing a general purpose register. This functionality is supported by a caching method discussed below and by encoding the M operands within an instruction rather than through the use of instruction extensions. The preferred embodiment of the present invention is able to encode a memory operand within an instruction because the typical offset from an A register is small enough to be specified by three bits. This is so because in a data packet processing environment, a data structure for storing a data packet is generally less than 50 bytes. Since memory is 64 bits wide, it is the exception when an offset of a memory operand from the A register is more than eight words, which can be specified by three bits. Each address register A effectively maps eight memory locations onto a flat register space. If an instruction specifies D0, then the general register D0 is used. If the instruction specifies A0 then the first word in memory starting at the memory location specified by A0 is used. Cache As discussed above in connection with the prior art, having a cache to take advantage of temporal locality is of less importance in the computer system embodied by the present invention for use in a switching hub. However, spatial locality is relevant to the data packet processing environment and is, therefore, the underlying rationale for the use of cache in the preferred embodiment. The cache in the preferred embodiment is small, for example, on the order of 64 to 128 bytes, due to the fact that the working set of data, i.e., the header information in a data packet, is on the order of tens of bytes or less in size. In one embodiment, the data cache 160 provides for storage of four cache lines for each process, where each cache line is 32 bytes wide. The lines may be allocated and discarded using a least recently used algorithm. Moreover, the cache lines are fully associative. Allocation Policy As in the case of a general purpose computer system, the preferred embodiment of the present invention relies on the fact that nearby data items in memory are often times related. Thus, there is a strong correlation that if a data item is read from memory, a nearby data item in memory will likely also be read. Similarly, writing a data item into a data packet is a strong indication that additional data nearby in memory will be written into the data packet as well. Based on this fact, the preferred embodiment of the present invention utilizes an allocate on write caching policy so that a copy back cache can be used to improve memory throughput. An allocate on write policy is burdensome in a general purpose computer system architecture because it typically reads the cache line before performing a write into the cache due to the fact that only a cache line may be dirty rather than an individual word or bytes. An allocate on write policy is desirable if multiple writes to the same cache line occurs, but is undesirable if isolated writes generally occur. However, in a data packet processing environment, the write spatial locality is a much stronger relationship than in a general computing environment. As a result, an allocate on write policy is the most appropriate policy. Furthermore, a data packet processing application benefits from a larger cache line than does a general purpose computing application. Cache State In a data packet processing application program, if a data item is written to a cache line that has not previously been read, the line should be allocated because additional writes are highly likely. Further, and more importantly, it is very likely that no data will be read out of the cache line in which case the allocate should not read the line from memory. Moreover, it is very likely that the entire cache line will be written. The cache in the preferred embodiment of the present invention maintains a number of cache lines. The state for each of the cache lines is maintained as well. Each cache line in the preferred embodiment is 32 bytes. Each byte can be in one of three states: invalid, indicating the data is meaningless; valid, indicating the byte can be read and has not yet been written; and dirty, indicating the byte can be read and has already been written. When a read misses in the data cache 160, the cache line is read from memory and all bytes are marked valid. When a write misses, the cache line is allocated but not read from memory, and only those bytes actually written are marked as dirty. Then, when the cache line is pushed out of the cache, only those bytes marked as dirty are actually written to memory 183. Even though the preferred embodiment of the present invention provides a relatively small cache for each process, the cache represents a significant fraction of the SRAM space needed to maintain the state of the process. Moreover, the cache allocated to each process is generally more than needed in most situations. Combined with the fact that the preferred embodiment of the present invention has a relatively large number of processes executing concurrently, it is possible to take advantage of a low rate of cache utilization. Rescheduling a Process Upon a Cache Miss In prior art general purpose computer systems, when a cache miss occurs, the execution of the central processor is generally stalled until memory is accessed and the appropriate page from memory is placed in the cache. Unlike the prior art, when the preferred embodiment of the present invention experiences a cache miss in the prefetch unit 113, for example, when two cache lines are accessed and one is missing, the process is passed off to the memory processor 180, which executes a read of the missing cache line in memory 183. While the process is in an input queue 181 waiting to be executed by memory processor 180, or executing in memory processor 180, or in output queue 182 waiting to be requeued in the execute queue 110 by scheduler 110, the prefetch unit continues to dispatch other processes from the execute queue to the other processors. Thus, the computer system, including the processors, do not stall. Processors continue to execute processes. The prefetch unit continues to dispatch processes. When the memory read is completed, the process in which the cache miss occurred is rescheduled by the scheduler. The instruction and argument fetch units again attempt to fetch and decode the instruction and arguments. If another cache miss occurs, the process is again dispatched to the memory processor. Upon reading the cache line, the memory processor again sends the process to the execute queue via bus 115, where it is rescheduled by the scheduler. Memory Interface The memory processor 180 interfaces with memory 183. Memory processor 180 receives load, store, and cache requests from dispatcher 150 as described above. Also as described above, when a process requires a memory read, the process identification (PID) of the process is transferred to the memory processor and the process is not considered for execution by the scheduler 120 until the memory operation is completed. Furthermore, so long as there are processes not waiting to access memory, the scheduler 120 continues to dequeue other processes to the instruction fetch unit 130. Thus, a long latency for a process when accessing memory does not stop any other process from executing on another appropriate processor. In the preferred embodiment of the present invention, because the prefetch unit 113 and processors 170, 190 and 195 operate significantly faster than the external memory processor 180, requests from the dispatcher 150 are queued in the memory processor's queue 181 as shown in FIG. 1. The final status and data read are queued in a status queue 182. Status is queued because the memory processor 180 may execute a series of load instructions and the prefetch unit and other processors may not be able to satisfy the write backs immediately. Because memory operates at a different speed than the processors, queues 181 and 182 provide an asynchronous boundary between the memory and the processors. Additionally, memory and the lookup and atomic processors share the same write port to the D registers. The queues 181 and 182 provide memory a place to store results when the port is in use by the processors. It is important to note that the number of operands (i.e., request and status replies) can not be more than the number of processes supported by the architecture embodied by the present invention. Thus, the same SRAM is used for both queues 181 and 182. Finally, since the architecture embodied by the present invention is latency tolerant, the queues between the memory processor (or other processors) and the prefetch unit provide an asynchronous boundary between the processors and clock domains to allow the architecture the ability to optimize the clock frequency of the processors independently of the memory access speed. Atomic Processor The atomic processor 195 provides for execution of an atomic operation, such as atomic add (atadd) as discussed above. An atomic operation synchronizes the operations performed by the multiple processors. While most data packets are independent of one another, there are cases where synchronization between processing of data items in a data packet processing environment is necessary. For example, if many data packets need to go out a single transmit I/O channel, the processes must cooperate when controlling that channel. Additionally, there is occasionally the need to order data packets. If two data packets arrive at the same input port of a switching hub and are destined for the same output port of the switching hub, the data packets should be transmitted out the output port in the same order in which they were received. Thus, while the processing of each data packet is independent, the ultimate queuing of the data packets for transmission out the output port is not independent. The atdeli instruction provides for this functionality. Lookup Processor The lookup processor 190 executes the lookup instruction to find a key in a table. More specifically, the lookup instruction finds a sequence of symbols comprising a key in a table. As an example, if the prefetch unit 113 examines an instruction and the state of the cache and determines the instruction is a lookup instruction, it sends the instruction to the lookup processor 190. The execution of the lookup instruction results in a key being found, or a key not being found. If the key is not found, the process is requeued by the scheduler 120 on the execute queue 10, with the program counter pointing at the instruction immediately following the lookup instruction. For example, given the instruction sequence: lookup D0, D1 <next instruction> the lookup processor searches for a key specified in register D0 in a table in the memory of the computer system, wherein the starting address of the table is pointed to by register D1. If the lookup processor executing the lookup instruction does not find the key, the scheduler requeues the process in which the lookup instruction was executed, and the next instruction in the process following the lookup instruction is fetched by the instruction fetch unit 130. Additionally, any arguments associated with the next instruction are fetched by the argument fetch unit 140, and the dispatcher 150 sends the next instruction to the appropriate processor for execution. In the event the key is found in the table, the entry in the table associated with the key contains the memory address of the next instruction to be executed. This memory address is loaded into the program counter (PC) associated with the process in which the lookup instruction was executed. Thus, if the key is found, the result of the lookup instruction is to provide a memory address to be loaded into the PC, causing the process to jump to an instruction at the specified memory location in the process and continue execution of the process. In a switching hub, one of the most common operations is to look up data in a table. For example, in a switching hub connected to heterogeneous data networks transmitting data packets according to multiple protocols, data packet processing in the switching hub involves, among other things, determining the format of a data packet received on an input port of the switching hub. The lookup instruction described above is used to branch to the appropriate portion of code for processing the data packet. Below is a simple example of a routine for determining the format of a data packet and branching to the appropriate code for processing the data packet:
;determine packet format
;register A0 points to the memory location in which the data packet is
stored
load pkt frmat_tbl_base, D1 ;load the ptr to a table of keys into
D1
lookup A0[format], D1 ;search the format table using the
field
"format" from the data packet as the
key
<unrecognized format>
.
.
.
packet_format_0:
<instructions related to processing packet of format 0>
packet_format_1:
<instructions related to processing packet of format 1>
packet_format_2:
<instructions related to processing packet of format 2>
.
.
.
The lookup processor 190 searches for a key in a table using a novel radix search method as described in U.S. patent application Ser. No. 08/690,225 entitled, "METHOD FOR STORING A TABLE OF KEYS IN A MEMORY OF A COMPUTER SYSTEM," r filed on Jul. 19, 1996. As described in more detail therein, the radix search method traverses a tree data structure searching for a sequence of symbols comprising the key. The search continues until a symbol in the sequence of symbols is not found, in which case the search fails, or the end of a key is found, in which case the search is considered successful. The end of a search is detected when the last symbol in the sequence of symbols comprising the key is found. The last symbol has associated with it a memory address of a program code segment to which to vector execution of the process. The memory address is placed in the program counter. The process is then requeued on the execute queue, where the scheduler 120 dequeues the process and the dispatcher 150 transfers the process to the appropriate processor for execution of the instruction pointed at by the program counter. Searching for Keys of Arbitrary Width In the manner described above, the preferred embodiment of the present invention can search for keys of arbitrary width by repeatedly executing lookup instructions on the lookup processor. As stated above, the lookup processor 190 executes a lookup instruction to find a key in a table. When the scheduler dequeues a process from the execute queue 110, if the next instruction to be executed in the process is the lookup instruction, the prefetch unit 113 transfers the process to the lookup processor 190. The execution of the lookup instruction results in a key being found, or a key not being found. If the key is not found, the process is requeued by the scheduler 120 on the execute queue 110, with the program counter for the process pointing to the instruction immediately following the lookup instruction, i.e., the next instruction. In the event the key is found in the table, the entry in the table associated with the key contains the memory address of the next instruction to be executed. This memory address is loaded into the program counter (PC) register associated with the process in which the lookup instruction was executed. The scheduler requeues the process, later dequeues it, and the instruction pointed to by the PC register is fetched by the instruction fetch unit 130. Additionally, any arguments associated with the instruction are fetched by the argument fetch unit 140, and the dispatcher 150 sends the process to the appropriate processor for execution of the next instruction. In this way, a branch to any instruction in the process can be performed. Importantly, the instruction pointed to by the PC register can be another lookup instruction in the process. In this way, multiple lookup instructions can be sequentially executed, each specifying a different portion of a key of arbitrary width, or each specifying a particular key in a sequence of keys being searched for. Thus, for example, if the maximum size key that the lookup instruction can process is 32 bits in length, a 64 bit key can be searched for in a table by sequentially executing two lookup instructions, each processing 32 bits of the key: the first lookup instruction, if it finds the first portion of the key, specifies the memory address of a second lookup instruction to execute. The process is rescheduled and dispatched again by the prefetch unit to the lookup processor wherein the second lookup instruction is executed to find the second portion of the key. The same technique can be applied to finding a sequence of keys. The first lookup instruction is executed by the lookup processor. If the lookup processor finds the first key, the lookup instruction returns a memory address specifying the next instruction to be executed by the process. The memory address is loaded in to the program counter and the process queued in the execute queue 110. The prefetch unit fetches the next instruction and its associated arguments (a lookup instruction, a key value to search for, and the base address of the table) and dispatcher forwards the process again to the lookup processor. The lookup processor executes the second lookup instruction to find the second key, and so on. The ability of the prefetch unit and lookup processor of the preferred embodiment of the present invention to iteratively dispatch and execute lookup instructions in this manner enables the ability to search for a key of arbitrary width or to search for a sequence of keys. This ability to search for a key of arbitrary width or to search for a sequence of keys is particularly useful in a switching hub, for example, when performing data packet filtering on multiple fields in the data packet or when searching for large keys, e.g., a 128 bit IP address, in which the key size is greater than that which can be searched for in the execution of a single lookup instruction. Distributed Memory Architecture In addition to the main memory 183, each processor has its own memory, strategically distributed along the stages of an execution pipeline of the processor, to provide fast access to often used information, such as the contents of the A and D registers, the program counter, etc. When a processor executes an instruction, it must first access the program counter (PC) to determine the next instruction to be executed. The processor then might, for example, access A or D registers as required according the argument associated with the instruction pointed at by the program counter. After executing the instruction, the processor may store the result in an A or D register, and increment the program counter. Thus, it is readily apparent that certain memory locations, e.g., memory locations providing storage for registers, may be repeatedly accessed by each instruction in a process. The processor may access these memory locations in the initial stage of execution and carry the contents of the PC, A and D registers with the process as it moves through subsequent stages of the execution pipeline. Using this approach, if memory access times are slow relative to the clock cycle time of the execution pipeline, the pipeline may stall while waiting for a memory access (either a read or a write) to complete. Alternatively, a processor may access the information in the registers on an as needed basis, i.e., at the moment in time that the information is required to continue instruction execution. Here, too, if the access time of the memory is such that the execution pipeline must wait before continuing execution of the instruction, the pipeline may stall. Moreover, depending on the physical location of the memory locations in which often needed data is stored, a further delay might be incurred when accessing memory. The preferred embodiment of the present invention strategically locates memory storage in close physical proximity to a stage in an execution pipeline at which memory is commonly or repeatedly accessed. For example, with reference to FIG. 2, a block diagram of a processor execution pipeline 200 is illustrated. The processor execution pipeline illustrated in FIG. 2 is indicative of the execution pipeline embodied by any of the processors in the preferred embodiment of the present invention. Coupled to the pipeline at various stages are small memory cells for storing information that is consistently and repeatedly requested at that stage in the execution pipeline. For example, memory cells 204L, 204R, 207L and 207R are physically located in close proximity to stage one of the pipeline. The memory cells are used, for example, as a PC register for storing a program counter. In stage one, the program counter is read to determine the memory address in memory 183 of the instruction to be executed by the processor. Memory cells 204L or 204R provide a random access memory (RAM) cell for the PC register. The memory cells are immediately available to stage one of the processor's execution pipeline due to their physical proximity to stage one, thus increasing the speed with which the memory access of often used information is performed. Likewise, memory cells 207L and 207R provide access to another register. Alternatively, memory cells 207L and 207R provide access to the same PC register as cells 204L and 204R, but via a different port, so that the execution pipeline has immediate access to the PC register at a latter point within stage one of the execution pipeline. As another example, memory cells 212L and 212R each represent a RAM cell for an A register. The A register is accessed by stage n of the pipeline, thus, the memory cells are situated in close physical proximity to stage n. Even/Odd Memory Contexts The speed of the execution pipeline in a processor is critical to overall performance of the processor and the computer architecture of the present invention as a whole. To that end, the clock cycle time at which the pipeline is operated is increased as much as the operating characteristics of the logic and associated circuitry will allow. Generally, access times for memory are slower than the clock cycle times at which the pipeline logic can operate. Thus, there is a point of diminishing return at which increasing the clock cycle time of the pipeline is less advantageous if the pipeline must wait for memory access to complete. Thus, the preferred embodiment of the present invention provides two sets of strategically located memory cells distributed along the execution pipeline of a processor, and alternately accesses the memory cells. With reference to FIG. 2, a first set of even memory cells is shown by memory cells 204L, 207L and 212L. Each of the even memory cells have an odd counterpart, 204R, 207R and 212R, respectively. As processes are dispatched to the processor, the process identification (PID) associated with the process determines whether the odd or even set of memory cells is accessed by the pipeline as the process moves along the pipeline. If the PID is an odd number, the odd set of memory cells is accessed. If the PID is an even number, the even set of memory cells is accessed. An example of the operation of the pipeline utilizing the even/odd memory cells follows. As discussed above, each process has a process ID (PID). The execute queue 110 in FIG. 1 is divided into two separate queues--an odd execute queue 111 and an even execute queue 112. As processes are queued by the scheduler 120, they are placed into either the even or odd execute queue according to the PID associated with the process. If the PID associated with a process is even, the process is queued in the even execute queue 112. Likewise, if the PID associated with a process is odd, the process is queued in the odd execute queue 111. The prefetch unit 113 dispatches processes from the dispatcher 115 in an even/odd fashion. That is, a process having an even PID is dispatched, then a process having an odd PID is dispatched, then a process having an even PID is dispatched again, and so on. Indeed, bus 114 is actually two buses in the preferred embodiment. One bus is the even bus, and a second bus is the odd bus. Both buses form the connections shown by bus 114 in FIG. 1. With reference to FIG. 2, a multiplexer 201 at the top of a processor's execution pipeline receives both buses. Input 216 is coupled to the even bus, while input 215 is coupled to the odd bus. The selector input 217 on multiplexer 20 is driven by the clock signal so that each clock cycle, a different input is selected, thereby forwarding processes on to the pipeline in an even/odd manner. As discussed above, each process has it's own context, i.e., it's own set of A and D registers, it's own program counter (PC), etc. The context is loaded into the memory cells strategically located near the appropriate stages of the execution pipeline that is currently executing the process. If the process has an even PID, the context for the process is stored in the even memory cells. If the process has an odd PID, the context for the process is stored in the odd memory cells. As a process moves through a pipeline, the PID associated with the process is passed from register to register, from stage to stage, so that the pipeline knows whether to access the appropriate even or odd memory cells provided. For example, with reference to FIG. 2, the PID associated with a process moves through execution pipeline registers 202, 205, 206, and so on. At register 205, the contents of the PC are loaded into the pipeline. Multiplexer 203 determines, based on the PID, which memory cell, 204L or 204R to access. If the PID associated with the process currently being processed in stage one is even, the selector input to multiplexer 203 signals the multiplexer to select the input coupled to memory cell 204L. If the PID is odd, the selector input selects the input coupled to memory cell 204R. Likewise, the PID passed on to multiplexer 208 and register 209 is used to select memory cell 207L or 207R for input into register 209 via multiplexer 208. Stage two shows a similar configuration for accessing the contents of an A register associated with a process and placing the contents in register 214. Even though the preferred embodiment utilizes two sets of identical memory cells in the pipeline for even and odd processes to account for the delay incurred in accessing a memory cell, registers 202, 206 and 211 act as place holders in the pipeline since the memory access time and delays incurred by the logic in selecting the appropriate memory cells prevent the contents of the memory cells being immediately available to the next register in the pipeline. Conclusion There are, of course, alternatives to the described embodiment which are within the understanding of one of ordinary skill in the relevant art. The present invention is intended to be limited only by the claims presented below.
|
Same subclass Same class Consider this |
||||||||||
