Compressed instruction format for use in a VLIW processor and processor for processing such instructions5878267Abstract Software creates a compressed instruction format for a VLIW processor which allows greater efficiency in use of cache and memory. Instructions are byte aligned and variable length. Branch targets are uncompressed. Format bits specify how many issue slots are used in a following instruction. NOPS are not stored in memory. Individual operations are compressed according to features such as whether they are resultless, guarded, short, zeroary, unary, or binary. Instructions are stored in compressed form in memory and in cache. Instructions are decompressed on the fly after being read out from cache. Claims We claim: Description BACKGROUND OF THE INVENTION
TABLE I
______________________________________
Format ›2i!
Format ›2i + 1!
lsb msb meaning
______________________________________
0 0 Issue slot i is used and an
operation for it is available in
the instruction. The operation
size is 26 bits. The size of the
extension is 0 bytes
1 0 Issue slot i is used and an
operation for it is available in
the instruction. The operation
size is 34 bits. The size of the
extension is 1 byte.
0 1 Issue slot i is used and an
operation for it is available in
the instruction. The operation
size is 42 bits. The size of the
extension is 2 bytes.
1 1 Issue slot i is unused and no
operation for it is included in the
instruction.
______________________________________
Operations correspond to issue slots in left to right order. For instance, if 2 issue slots are used, and Format={1, 0, 1, 1, 1, 1, 1, 0, 1, 1}, then the instruction contains two 34 bit operations. The left most is routed to issue slot 0 and the right most is routed to issue slot 3. If Format={1, 1, 1, 1, 1, 0, 1, 0, 1, 0}, then the instruction contains three 34 bit operations, the left most is routed to issue sot 2, the second operation is intended for issue slot 3, and the right most belongs to issue slot 4. The format used to decompress branch target instructions is a constant. Constant.sub.-- Format={0, 1, 0, 1, 0, 1, 0, 1, 0, 1} for the preferred five issue slot machine. 6. Operation Formats The format of an operation depends on the following properties zeroary, unary, or binary; parametric or non-parametric. Parametric instructions contain an immediate operand in the code. Parameters can be of differing sizes. Here there are param7, i.e. seven bit parameters, and param32, i.e. 32 bit parameters. result producing or resultless; long or short op code. The short op codes are the 32 most frequent op codes and are five bits long. The long op codes dre eight bits long and include all of the op codes, including the ones which can be expressed in a short format. Op codes 0 to 31 are reserved for the 32 short op codes guarded or unguarded. An unguarded instruction has a constant value of the guard of TRUE. latency. A format bit indicates if operations have latency equal to one or latency larger than 1. signed/unsigned. A format bit indicates for parametric operations if the parameter is signed or unsigned. The guarded or unguarded property is determined in the uncompressed instruction format by using the special register file address of the constant 1. If a guard address field contains the address of the constant 1, then the operation is unguarded, otherwise it is guarded. Most operations can occur both in guarded and unguarded formats. An immediate operation, i.e. an operation which transfers a constant to a register, has no guard field and is always unguarded. Which op codes are included in the list of 32 short op codes depends on a study of frequency of occurrence which could vary depending on the type of software written. The table II below lists operation formats used by the invention. Unless otherwise stated, all formats are: not parametric, with result, guarded, and long op code. To keep the tables and figures as simple as possible the following table does not list a special form for latency and signed/unsigned properties. These are indicated with L and S in the format descriptions. For non-parametric, zeroary operations, the unary format is used. In that case the field for the argument is undefined.
TABLE II
______________________________________
OPERATION TYPE SIZE
______________________________________
<binary-unguarded-short>
26
<unary-param7-unguarded-short>
26
<binary-unguarded-param7-
26
resultless-short>
<unary-short> 26
<binary-short> 34
<unary-param7-short> 34
<binary-param7-resultless-
34
short>
<binary-unguarded> 34
<binary-resultless> 34
<unary-param7-unguarded>
34
<unary> 34
<binary-param7-resultless>
42
<binary> 42
<unary-param7> 42
<zeroary-param32> 42
<zeroary-param32-resultless>
42
______________________________________
For all operations a 42-bit format is available for use in branch targets. For unary and binary-resultless operations, the <binary> format can be used. In that case, unused fields in the binary format have undefined values. Short 5-bit op codes are converted to long 8-bit op codes by padding the most significant bits with 0's. Unguarded operations get as a guard address value, the register file address of constant TRUE. For store operations the 42 bit, binary-param7-resultless> format is used instead of the regular 34 bit <binary-param7-resultless short> format (assuming store operations belong to the set of short operations). Operation types which do not appear in table II are mapped onto those appearing in table II, according to the following table of aliases:
TABLE II'
______________________________________
FORMAT ALIASED TO
______________________________________
zeroary unary
unary.sub.-- resultless
unary
binary.sub.-- resultless.sub.-- short
binary.sub.-- resultless
zeroary.sub.-- param32.sub.-- short
zeroary.sub.-- param32
zeroary.sub.-- param32.sub.-- resultless.sub.-- short
zeroary.sub.-- param32.sub.-- resultless
zeroary.sub.-- short
unary
unary.sub.-- resultless.sub.-- short
unary
binary.sub.-- resultless.sub.-- unguarded
binary.sub.-- resultless
unary.sub.-- unguarded
unary
binary.sub.-- param7.sub.-- resultless.sub.-- unguarded
binary.sub.-- param7.sub.-- resultless
unary.sub.-- unguarded
unary
binary.sub.-- param7.sub.-- resultless.sub.-- unguarded
binary.sub.-- param7.sub.-- resultless
zeroary.sub.-- unguarded
unary
unary.sub.-- resultless.sub.-- unguarded.sub.-- short
binary.sub.-- unguarded.sub.-- short
unary.sub.-- unguarded.sub.-- short
unary.sub.-- short
zeroary.sub.-- param32.sub.-- unguarded.sub.-- short
zeroary.sub.-- param32
zeroary.sub.-- parame32.sub.-- resultless.sub.-- un-
zeroary.sub.-- param32.sub.-- resultless
guarded.sub.-- short
zeroary.sub.-- unguarded.sub.-- short
unary
unary.sub.-- resultless.sub.-- unguarded.sub.-- short
unary
unary.sub.-- long binary
binary.sub.-- long binary
binary.sub.-- resultless.sub.-- long
binary
unary.sub.-- param7.sub.-- long
unary.sub.-- param7
binary.sub.-- param7.sub.-- resultless.sub.-- long
binary.sub.-- param7.sub.-- resultless
zeroary.sub.-- param32.sub.-- long
zeroary.sub.-- param32
zeroary.sub.-- param32.sub.-- resultless.sub.-- long
zeroary.sub.-- param32.sub.-- resultless
zeroary.sub.-- long binary
unary.sub.-- resultless.sub.-- long
binary
______________________________________
The following is a table of fields which appear in operations:
TABLE III
______________________________________
FIELD SIZE MEANING
______________________________________
src1 7 register file
address of first
operand
src2 7 register file
address of second
operand
guard 7 register file
address of guard
dst 7 register file
address of result
param 7/32 7 bit parameter or
32 bit immediate
value
op code 5/8 5 bit short op code
or 8 bit long op
code
______________________________________
FIG. 5 includes a complete specification of the encoding of operations. 7. Extensions of the instruction format Within the instruction format there is some flexibility to add new operations and operation forms, as long as encoding within a maximum size of 42 bits is possible. The format is based on 7-bit register file address. For register file addresses of different sizes, redesign of the format and decompression hardware is necessary. The format can be used on machines with varying numbers of issue slots. However, the maximum size of the instruction is constrained by the word size in the instruction cache. In a 4 issue slot machine the maximum instruction size is 22 bytes (176 bits) using four 42-bit operations plus 8 format bits. In a five issue slot machine, the maximum instruction size is 28 bytes (224 bits) using five 42-bit operations plus 10 format bits. In a six issue slot machine, the maximum instruction size would be 264 bits, using six 42-bit operations plus 12 format bits. If the word size is limited to 256 bits, and six issue slots are desired, the scheduler can be constrained to use at most 5 operations of the 42 bit format in one instruction. The fixed format for branch targets would have to use 5 issue slots of 42 bits and one issue slot of 34 bits. COMPRESSING THE INSTRUCTIONS FIG. 8 shows a diagram of how source code becomes a loadable, compressed object module. First the source code 801 must be compiled by compiler 802 to create a first set of object modules 803. These modules are linked by linker 804 to create a second type of object module 805. This module is then compressed and shuffled at 806 to yield loadable module 807. Any standard compiler or linker can be used. Appendix D gives some background information about the format object modules in the environment of the invention. Object modules II contain a number of standard data structures. These include: a header; global & local symbol tables; reference table for relocation information; a section table; and debug information, some of which are used by the compression and shuffling module 807. The object module II also has partitions, including a text partition, where the instructions to be processed reside, and a source partition which keeps track of which source files the text came from. A high level flow chart of the compression and shuffling module is shown at FIG. 9. At 901, object module II is read in. At 902 the text partition is processed. At 903 the other sections are processed. At 904 the header is updated. At 905, the object module is output. FIG. 10 expands box 902. At 1001, the reference table, i.e. relocation information is gathered. At 1002, the branch targets are collected, because these are not to be compressed. At 1003, the software checks to see if there are more files in the source partition. If so, at 1004, the portion corresponding to the next file is retrieved. Then, at 1005, that portion is compressed. At 1006, file information in the source partition is updated. At 1007, the local symbol table is updated. Once there are no more files in the source partition, the global symbol table is updated at 1008. Then, at 1009, address references in the text section are updated. Then at 1010, 256-bit shuffling is effected. Motivation for such shuffling will be discussed below. FIG. 11 expands box 1005. First, it is determined at 1101 whether there are more instructions to be compressed. If so, a next instruction is retrieved at 1102. Subsequently each operation in the instruction is compressed at 1103 as per the tables in FIGS. 5a and 5b and a scatter table is updated at 1108. The scatter table is a new data structure, required as a result of compression and shuffling, which will be explained further below. Then, at 1104, all of the operations in an instruction and the format bits of a subsequent instruction are combined as per FIGS. 4a-4e. Subsequently the relocation information in the reference table must be updated at 1105, if the current instruction contains an address. At 1106, information needed to update address references in the text section is gathered. At 1107, the compressed instruction is appended at the end of the output bit string and control is returned to box 1101. When there are no more instructions, control returns to box 1006. Appendices B and C are source code appendices, in which the functions of the various modules are as listed below:
TABLE IV
______________________________________
Name of module identification of function performed
______________________________________
scheme.sub.-- table
readable version of table of FIGS. 5a
and 5b
comp.sub.-- shuffle.c
256-bit shuffle, see box 1010
comp.sub.-- scheme.c
boxes 1103-1104
comp.sub.-- bitstring.c
boxes 1005 & 1009
comp.sub.-- main.c
controls main flow of FIGS. 9 and 10
comp.sub.-- src.c,
miscellaneous support routines for
comp.sub.-- reference.c,
performing other functions listed in
comp.sub.-- misc.c,
FIG. 11
comp.sub.-- btarget.c
______________________________________
The scatter table, which is required as a result of the compression and shuffling of the invention, can be explained as follows. The reference table contains a list of locations of addresses used by the instruction stream and corresponding list of the actual addresses listed at those locations. When the code is compressed, and when it is loaded, those addresses must be updated. Accordingly, the reference table is used at these times to allow the updating. However, when the code is compressed and shuffled, the actual bits of the addresses are separated from each other and reordered. Therefore, the scatter table lists, for each address in the reference table, where EACH BIT is located. In the preferred embodiment the table lists, a width of a bit field, an offset from the corresponding index of the address in the source text, a corresponding offset from the corresponding ndex in the address in the destination text. When object module III is loaded to run on the processor, the scatter table allows the addresses listed in the reference table to be updated even before the bits are deshuffled. {??} DECOMPRESSING THE INSTRUCTIONS In order for the VLIW processor to process the instructions compressed as described above, the instructions must be decompressed. After decompression, the instructions will fill the instruction register, which has N issue slots, N being 5 in the case of the preferred embodiment. FIG. 12 is a schematic of the decompression process. Instructions come from memory 1201, i.e. either from the main memory 104 or the instruction cache 105. The instructions must then be deshuffled 1201, which will be explained further below, before being decompressed 1203. After decompression 1203, the instructions can proceed to the CPU 1204. Each decompressed operation has 2 format bits plus a 42 bit operation. The 2 format bits indicate one of the four possible operation lengths (unused issue slot, 26-bit, 34-bit, or 42-bit). These format bits have the same values is "Format" in section 5 above. If an operation has a size of 26 or 34 bits, the upper 8 or 16 bits are undefined. If an issue slot is unused, as indicated by the format bits, then all operation bits are undefined and the CPU has to replace the op code by a NOP op code (or otherwise indicate NOP to functional units). Formally the decompressed instruction format is <decompressed instruction> ::{<decompressed operation>}N <decompressed operation> ::=<operation:42><format:2> Operations have the format as in Table III (above). Appendix A is VERILOG code which specifies the functioning of the decompression unit. VERILOG code is a standard format used as input to the VERILOG simulator produced by Cadence Design Systems, Inc. of San Jose, Calif. The code can also be input directly to the design compiler made by Synopsys of Mountain View, Calif. to create circuit diagrams of a decompression unit which will decompress the code. The VERILOG code specifies a list of pins of the decompression unit these are
TABLE V
______________________________________
# of pins name of group
in group of pins description of group of pins
______________________________________
512 data512 512 bit input data word from
memory, i.e. either from the
instruction cache or the main
memory
32 PC input program counter
44 operation4 output contents of issue slot 4
44 operation3 output contents of issue slot 3
44 operation2 output contents of issue slot 2
44 operation1 output contents of issue slot 1
44 operation0 output contents of issue slot 0
10 format.sub.-- out
output duplicate of format bits
in operations
32 first.sub.-- word
output first 32 bits pointed to
by program counter
1 format.sub.-- ctr10
is it a branch target or not?
1, each reissue1 input global pipeline control
stall.sub.-- in
signals
freeze
reset
clk
______________________________________
Data512 is a double word which contains an instruction which is currently of interest. In the above, the program counter, PC is used to determine data512 according to the following algorithm: A:={PC›31:8!, 8'b0} if PC›5!=0 then data512':={M(A) , M(A+32)} else data512':={M(A+32),M(A)} where A is the address of a single word in memory which contains an instruction of interest; 8'b0 means 8 bits which are zeroed out M(A) is a word of memory addressed by A; M(A+32) is word of memory addressed by A+32; data512' is the shuffled version of data 512 This means that words are swapped if an odd word is addressed. Operations are delivered by the decompression unit in a form which is only partially decompressed, because the operation fields are not always in the same bit position. Some further processing has to be done to extract the operation fields from their bit position, most of which can be done best in the instruction decode stage of the CPU pipeline. For every operation field this is explained as follows: src1 The src1 field is in a fixed position and can be passed directly to the register file as an address. Only the 32-bit immediate operation does not use the src1 field. In this case the CPU control will not use the src1 operand from the register file. src2 The src2 field is in a fixed position if it is used and can be passed directly to the register file as address. If it is not used it has an undefined value. The CPU control makes sure that a "dummy" src2 value read from the register file is not used. guard The guard field is in a fixed position if it is used and can be passed directly to the register file as an address. Simultaneously with register file access, the CPU control inspects the op code and format bits of the operation. If the operation is unguarded, the guard value read from the RF (register file) is replaced by the constant TRUE. op code Short or long op code and format bits are available in a fixed position in the operation. They are in bit position 21-30 plus the 2 format bits. They can be fed directly to the op code decode with maximum time for decoding. dst The dst field is needed very quickly in case of a 32-bit immediate operation with latency 0. This special case is detected quickly by the CPU control by inspecting bit 33 and the formal bits. In all other cases there is a full clock cycle available in the instruction decode pipeline state to decode where the dst field is in the operation (it can be in many places) and extract it. 32-bit immediate If there is a 32-bit immediate it is in a fixed position in the operation. The 7 least significant bits are in the src2 field in the same location as a 7-bit parameter would be. 7-bit parameter If there is a 7-bit parameter it is in the src2 field of the operation. There is one exception: the store with offset operation. For this operation, the 7-bit parameter can be in various locations and is multiplexed onto a special 7-bit immediate bus to the data cache. BIT SWIZZLING Where instructions are long, e.g. 512 bit double words, cache structure becomes complex. It is advantageous to swizzle the bits of the instructions in order to simplify the layout of the chip. Herein, the words swizzle and shuffle are used to mean the same thing. The following is an algorithm for swizzling bits, see also comp.sub.-- shuffle.c in the source code appendix. for (k-=0; k<4; k=k+1) for (i=0; i<8; i=i+1) for (j=0; j<8; j=j+1) begin word.sub.-- shuffled›k*64+j*8+i!=word.sub.-- unshuffled›(4*i+k)*8+j! end where i, j, and k are integer indices; word.sub.-- shuffled is a matrix for storing bits of a shuffled word; and word.sub.-- unshuffled is matrix for storing bits of an unshuffled word. CACHE STRUCTURE FIG. 6a shows the functioning on input of a cache structure which is useful in efficient processing of VLIW instructions. This cache includes 16 banks 601-616 of 2k bytes each. These banks share an input bus 617. The caches are divided into two stacks. The stack on the left will be referred to as "low" and the stack on the right will be referred to as "high". The cache can take input in only one bank at a time and then only 4 bytes at a time. Addressing determines which 4 bytes of which bank are being filled. For each 512 bit word to be stored in the cache, 4 bytes are stored in each bank. A shaded portion of each bank is illustrated indicating corresponding portions of each bank for loading of a given word. These shaded portions are for illustration only. Any given word can be loaded into any set of corresponding portions of the banks. After swizzling according to the algorithm indicated above, sequential 4 byte portions of the swizzled word are loaded into the banks in the following order 608, 616, 606, 614, 604, 612, 602, 610, 607, 615, 605, 613, 603, 611, 601, 609. The order of loading of the 4 byte sections of the swizzled word is indicated by roman numerals in the boxes representing the banks. FIG. 6b shows how the swizzled word is read out from the cache. FIG. 6b shows only the shaded portions of the banks of the low stack. The high portion is analogous. Each shaded portion 601a-608a has 32 bits. The bits are loaded onto the output bus, called bus256low, using the connections shown, i.e in the following order: 608a-bit 0, 607a-bit 0, . . . , 601a-bit 0; 608a-bit 1, 607a-bit 1, . . . , 601a-bit 1; . . . ; 608a-bit 31, 607a-bit 31, . . ., 601a-bit 31. Using these connections, the word is automatically de-swizzled back to its proper bit order. The bundles of wires, 620, 621, . . . , 622 together form the output bus256 low. These wires pass through the cache to the output without crossing On output, the cache looks like FIG. 7. The bits are read out from stack low 701 and stack high 702 under control of control unit 704 through a shift network 703 which assures that the bits are in the output order specified above. In this way the entire output of the 512 bit word is assured without bundles 620, 621, . . . 622 and analogous wires crossing.
|
Same subclass Same class Consider this |
||||||||||
