Encryption processor with shared memory interconnect6434699Abstract An encryption chip is programmable to process a variety of secret key and public key encryption algorithms. The chip includes a pipeline of processing elements, each of which can process a round within a secret key algorithm. Data is transferred between the processing elements through dual port memories. A central processing unit allows for processing of very wide data words from global memory in single cycle operations. An adder circuit is simplified by using plural relatively small adder circuits with sums and carries looped back in plural cycles. Multiplier circuitry can be shared between the processing elements and the central processor by adapting the smaller processing element multipliers for concatenation as a very wide central processor multiplier. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
Sample Instruction Set
Instruction Description
load rn, addr Load register n with memory
store rn, addr Store register n in memory
xor r1, r2, r3 r1 = r2 xor r3
add r1, r2, r3 r1 = r2 + r3
rol r1, r2, r3 r1 = r2 <<< r3 (<<< is the Java operator
for a rotate
instruction. For a 32-bit operand, bit 31 rotates to bit
position 0)
xor r1, addr r1 = r1 xor memory[addr]
add r1, addr r1 = r1 + memory[addr]
rol r1, addr r1 = r1 <<< memory[addr]
moda r1, r2 Modulo adjust: r1 = r1 mod r2, where r1 is result
of modular add or multiply
moda r1, addr r1 = r1 mod memory[addr
mul r1, r2, r3 Multiply: r1 = r2 .times. r3, performed in 32 bits.
mulm r1, r2, Modular multiply: r1 = r2 .times. r3 mod r4.
r3, r4
Jump label Transfer control unconditionally to label
sync label Pipeline sync: wait until all PEs have arrived at a
"sync" instruction, then branch to label
Dbra rn, label rn = rn - 1; if rn != 0 then jump to label
cbra r1 cond r2, Compare and branch: compare r1 and r2, and branch to
label label if condition is true. "Cond" is one of
==, !=, <, >, <= or >=.
Layout Issues A general layout of the encryption chip is illustrated in FIG. 4, assuming 16 processing elements and a 512-bit wide public key PK core unit. The PK core word width of 512 bits was chosen due to its layout convenience. A width of 1024 bits for example, would require more silicon area but would also double performance. The individual elements can be compared to the elements of FIGS. 2 and 3. Sixteen of the processing elements are linearly arranged in a column in the large region of the layout to the lower left, one being shown in detail. The shared multiplier element is shown associated with the illustrated processing element. As previously described, a 32.times.32 multiplier segment 70 is associated with each processing element for performing 32-bit multiplications within the respective processing elements. Alternatively, the multiplier elements 70 can be concatenated to serve as a wide 512.times.32 bit multiplier for the public-key ALU 50. The public-key PK ALU 50 is located to the right of the secret-key SK elements, made up of the processing elements as described above. Next to the PK ALU is the PK register file 48. Together, the PK ALU 50 and the PK register file 48 make up the PK processing core, identified in FIG. 2 as 46. To the right of the PK core is located the global memory (RAM) 44. Along the top of the chip are the control CPU 52, the communication logic 54 and the input and output processing blocks 40 and 42. The global data bus 38 links the SK elements, the PK core 46, the global RAM 44, the communication logic 54 and the control CPU 52. The layout of a typical processing element with the local bus connections is shown in FIG. 5. All components of a processing element may communicate via a local processing element data bus 72, which handles all the memory-register transfers. Note that the next-neighbor shared PE memory 68 is laid out in line with the other elements of the illustrated processing element, whereas, the previous-neighbor shared PE memory 66 is laid out in line with the elements of the previous-neighbor processing element. For programming and testing purposes, all PE memories are accessible from the global bus 38. A switch 74 normally disconnects the local bus 72 from the global bus 38 but may be selectively closed to enable data transfer between the local RAM 64 and the global RAM 44. Another switch 76 allows the local bus 72 to be segmented into independent segments such that the control unit 60 can read instructions from the RAM 62 simultaneously with transfer of data on the bus 72. As such, operations within the processing element may be pipelined with one instruction being processed in the control unit while a prior instruction is executed in the PE ALU 56. During execution of encryption code, switches 74 and 76 are normally open, so that instruction fetches from the instruction RAM may proceed concurrently with data fetches from the data memories and register file. Many multiprocessor architectures have been proposed. Most of them are designed for general-purpose multiprocessing, so communication between processing elements is usually done using a switching matrix that can be dynamically configured to switch data from any one PE to any other. These switch designs are extremely complex. Since they are not required for encryption, an embodiment of the present invention uses a simpler linear arrangement of the PEs with much less switching circuitry. In addition, the use of shared memory as the interconnect technique rather than I/O ports as documented in the literature produces a much simpler and more powerful programming model. Consider two PEs, A and B, connected with a single 32-bit I/O port. In order for A to transfer multiple words of data to B, A must write each word to the I/O port and wait for B to read it. In contrast, if A and B are connected by a shared memory large enough to hold all words of the communication, then A may write out its data without waiting for B to read any. Furthermore, PE B has the freedom to read the words out in any order, or to pick and choose from the data as required by the job at hand. Finally, it should be noted that if some of the shared memory is not required for communication, then it may be used as an extension of the local memory to provide additional local work space. Public Key Support Efficient public-key encryption requires efficient modular exponentiation, provided by the public-key co-processor. This unit comprises the following items: PK register file 48, consisting of 16 512-bit wide registers PK 512.times.32-bit multiplier 70 made up of concatenated SK multiplier elements (this unit can perform a 512.times.512 multiply in only 32 clock cycles) PK 512-bit adder ALU 50 which can perform addition in 2-16 cycles, typically no more than 2 global memory 44 organized for 512-bit parallel access from the PK coprocessor for loading and storing 512-bit words in a single clock cycle The PK core processor accelerates modular multiplication by performing it using 512-bit words. A 512.times.512 multiply operation using the PK unit of the invention would be implemented by performing 16 512.times.32 multiplies using the concatenated multiplier elements of the 16 processing elements described below. Assuming each multiply requires 2 clock cycles and 16 such multiplies are required, a 512.times.512 multiply would require 32 clock cycles and a 2048.times.2048 multiply would require only 512 clock cycles. The full modular exponentiation operation, requiring 4096 multiplies, would take a total of 2 million clock cycles. This represents an 80 fold improvement to the Pentium example discussed earlier. The performance improvement of PK algorithms is expected to be similar. This represents a significant performance gain compared to the prior art, and will enable more frequent changing of session keys, thereby increasing security. 512-bit Adder Adders are not shared between the public key PK and secret key SK units. Rather, since addition and logic operations are common for both PK and SK, each unit has its own adder, so that operations may proceed concurrently. Within the public-key PK ALU 50, a 512-bit single cycle adder would be extremely complex and would add substantially to the critical path time of the ALU. Accordingly, the 512-bit adder in the ALU 50 is formed of 16 32-bit adders as illustrated in FIG. 6. In operation, the AND-gate 78 and multiplexer 80 initially apply two 32-bit operand segments to each of the 32-bit adders A0-A15. Note that the AND-gate gate 78 represents a 32-bit wide operation. Each 32-bit adder computes a 32-bit sum along with a carry output. The carry output of one adder is connected through a D flip-flop flop 79 to the carry input of the next. If a carry is generated in the first cycle, then it is clocked into the flip-flop where it is available as a carry input for the next clock cycle. Each sum is returned to one input of the same adder through a D flip-flop 81 and multiplexer 80; the other input of the adder is held at zero using the AND gate 78 during successive clock cycles. The step of adding back the sum as a carry input for each adder is repeated as long as any carry results at the output of any of the 32-bit adders. The operation of the 512-bit adder can be better understood with reference to the following example using four 4-bit binary words instead of the 16 32-bit words of the actual implementation.
Add: 1101 0110 1001 011
0001 0101 1100 1011
01110 01011 10101 10110 carry-outs are 0, 0, 1, 1
1110 1011 0101 0110
0 1 1 0 previous carries
1110 1100 0110 0110 final sum
Note that two additions were required to arrive at the final sum where no further carries resulted. This is a typical case. Since the adder is being used for encryption operations, it is safe to assume that the numbers being added are more or less randomly distributed. The probability of a carry-out after the first add is quite high. However, the probability that a carry, added back in as a least significant bit, will result in another carry from the most significant bit is quite low. For this reason, most add operations are expected to take only two clock cycles. Returning to the original problem of constructing a 512-bit adder, if a standard carry lookahead or carry bypass adder design were used, the critical path through the adder would be quite long, since the carry must propagate through some optimized circuitry that operates on 512 bits. This adder would be quite large and slow. In contrast, in one embodiment of the present invention, a 512-bit adder is composed of 32-bit adders, whose design is well-known today and has been well optimized. The maximum clock speed of an individual 32-bit adder is expected to be more than twice that of a 512-bit carry lookahead design. Thus, the two-or-more cycle adder according to the invention, would on average operate faster than a large 512-bit adder, while consuming less chip area. In a worst case, as illustrated below, it is possible that 16 cycles would be required to completely compute the final sum without carries, for 16 32-bit adders implementation. Using the 4-bit binary word example once again for illustrative purposes:
Add: 1111 1111 1111 1111
0000 0000 0000 0001
01111 01111 01111 10000 1.sup.ST carry-outs are 0, 0, 0, 1
1111 1111 1111 0000
0 0 1 0
01111 01111 10000 00000 2.sup.nd carry-outs are 0, 0, 1, 0
1111 1111 0000 0000
0 1 0 0
01111 10000 00000 00000 3.sup.rd carry-outs are 0, 1, 0, 0
1111 0000 0000 0000
1 0 0 0
10000 0000 0000 0000
Four additions were required. In general, for n groups of numbers, at most, n additions will be required. 512.times.32 Multiplier Multipliers are large in area. Each secret-key processing element must contain its own multiplier in order to implement any secret key algorithm requiring multiplication, for example IDEA which will be discussed in more detail below. The area taken by each PE multiplier collectively is significant and as a result, use of this area is made in implementing the 512.times.32-bit public key multiplier. To save area, the large 512.times.32 multiplier is implemented by concatenating the 16 32.times.32 multipliers in each secret key processing element. In other words, the secret and public key units can share the multiplier elements, as is illustrated in the layout of the chip in FIG. 4. Use of the multiplier elements must therefore be coordinated between the secret key processing elements and the PK core processor since the PK core processor cannot perform a multiply operation when any one of the secret key processing elements is itself independently performing a multiply operation. To illustrate the concatenation of the multipliers, a simple design of a combination 4.times.4/4.times.N multiplier is illustrated below. Note that more advanced techniques of multiplier design such as Booth encoding and 4:2 compressors are available. The following example provides a simple presentation:
1011
.times. 0100
0000
1011 Partial Products
1011
0000
1000010
Single-digit multiplication can be easily implemented by using an AND gate. The result, when using two 4-bit operands, consists of 16 bits of partial product. These partial products must be added together efficiently. The partial products could, for example, be added using two 4-bit and one 6-bit full adder, but they would take a substantial time to perform the addition of the partial products, since the carry may have to propagate through several adders. The overall result of such an adder implementation would be too slow. A better approach would consist of an adder whose carry has to go through fewer stages. The basic component of the preferred multiplier is a full adder, a circuit that takes three inputs and outputs the two-bit sum of their inputs. A full adder is illustrated using the symbol in FIG. 7. The use of squares instead of binary numbers is for generality and convenience. The three squares at the top show the three inputs of the full adder. The two squares at the bottom show the sum and carry outputs. The carry is to the lower left to indicate its place value is twice that of the sum. The first stage of the addition of a 4.times.4 multiplier is illustrated in FIG. 8. Above the summing line, 16 squares, some shown in black, some as white boxes, represent the bits of a partial product which has to be added. The bits shown in black will be added in the first stage using four full adders 82. The bits shown as white boxes are not to be added in the first stage, but simply transferred down as shown with the arrows in FIG. 8, in preparation for the next addition stage. The sum of the adders in the first stage is shown below the summing line. The second stage is shown in FIG. 9. Arrows once again indicate bits which are not operated on during this current stage and simply transferred down, while black boxes denote bits that are to be added in this current (second) stage. Once again, four full adders 84 are used to add the black box elements. At the output of the second stage produced by the full adders 84, there are two numbers which now must be added with a regular 4-bit carrying adder 86. A comparison of the performance of various adder and multiplier architectures helps to illustrate the advantages of the multiplier according to the present invention. A naive implementation of a 4-bit adder consists of four full adders A0-A3 in series as shown in FIG. 12. In this design, the carry out C.sub.out of the rightmost adder could potentially affect every one of the adder stages to the left of it. The critical path in this design is therefore four adder stages. Since the typical full adder consists of two or more logic stages, the total gate delay of an four-bit adder could exceed 8 stages. An improved four-bit adder is a carry lookahead design. A three-bit carry lookahead adder is shown in FIG. 13. A four-bit design is only a little more complex. A detailed discussion of operation of the AND gates 102, OR gates 104 and Exclusive-OR gates 106 is not presented since this is a well known circuit. The advantage of a carry lookahead adder is that carries propagate to final sum bits in only four logic gates. More complex designs for wider numbers have more stages of logic, but they are still faster than a carry chain design. In the full 4.times.4 multiplier, the carry save design creates a critical path through two full adders, plus the final carry lookahead adder. An implementation using only full adders would have a longer critical path since a naive adder using chained carries is slower than the carry lookahead adder. Finally, if we used full carry lookahead adders in the first two stages of the partial product summation, then the resulting multiplier would again be slower, since carry lookahead adders are slower than individual full adders. Note that the multiplier design according to the present invention never propagates a carry from one adder to another at the same partial product level. In this manner, the critical path through the multiplier is sure to include no more than two full adders in the first two stages of partial product summation. FIG. 10 illustrates a much wider 4.times.N multiplier. The large black boxes 82, 84,86 denote the same full adder hardware that was used in FIG. 9. Note that full adders are necessary in this case, since in each circumstance, three inputs are being added together. In FIG. 9, simpler circuitry could have been used since not all circumstances required the addition of three inputs. However, in order to create a system which can handle a 4.times.N case, full adders are preferably used for all stages with some additional circuitry to determine how to treat a case with fewer than three inputs. Dual mode adders are therefore created, some having a multiplexer feeding one of their inputs to select between the output of a previous stage or a single bit partial product. FIG. 11 illustrates the full adders A required to implement the boxed regions 82,84,86 of FIG. 10 with respective carry outputs at the lower left. In the preferred implementation, each adder A is a full adder. Some of the adders have only two inputs in the 4.times.4 case (i.e. the secret key case), while other adders have three inputs in the 4.times.N case (i.e. the public key case). The two input adders must have their third input gated with an enable signal. Some adders also require a multiplexer to provide one of their inputs to select between the output of a previous stage or a single-bit partial product. The carry-lookahead adder 86 on the bottom requires a carry-out every four positions to generate the last bit of the product in the 4.times.4 case. In FIG. 11, the partial products of a 4.times.4 multiplier have been labeled to correspond with the following partial product scenario: ##EQU1## For a 4.times.N multiplier, the neighboring partial products must also be considered. They are labeled in FIG. 11 according to the following scenario: ##EQU2## Note that D' is the neighboring (either to the left or to the right) equivalent of D. The 8-bit final sum is indicated as S7,S6,S5,S4,S3,S2,S1,S0 and the three lower-order bits of the left-hand neighbor's sum are S2',S1',S0'. The 2:1 multiplexer 88 has a selection signal Sel. In general, if Sel is logic 1, then the left input is passed to the output of the multiplexer; otherwise, if Sel is logic 0, the right input is passed to the output of the multiplexer. The Sel signal is also used to gate the AND gates 90. When Sel is logic 1, the other input to the AND gate is passed to the output; otherwise, with Sel at logic 0, the AND gates 90 are disabled and pass a logic 0 regardless of the value of the other input. Thus, in the implementation of FIG. 11, if Sel is logic 1, a segment of a 4.times.N multiplier can be implemented, with product appearing on outputs S6-S3. If Sel is logic 0, a 4.times.4 multiplier is implemented, with the 8-bit product appearing on outputs S7-S0. As such, the implementation of FIG. 11 illustrates a secret key multiplier element which can also be utilized as an multiplier element, which when concatenated with other like multiplier elements, will be used to implement the much wider width public key multiplier. Example Implementations Example implementations of common encryption algorithms will be now described with reference to the encryption chip preferred embodiment discussed above. RC5 is perhaps one of the simplest encryption algorithms to implement. It basically utilizes three types of operations: XOR, additions and rotations, all supportable by any one of the processing elements discussed above, as shown in Table 1. Although RC5 has a variable length block, most commonly, each round of the RC5 algorithm operates on a 64-bit data block plus a key value stored in Si1 and Si2 which are constants within each processing element, depending only on the round and the key. To encrypt data, a 64-bit input block is split into two 32-bit words which are then stored in locations A and B in the previous-neighbor memory, the output block is to be written to A_next and B_next in the next-neighbor memory. An example of a round of the RC5 encryption algorithm follows: Loop: load r1, A xor r1, B rol r1, B add r1, Si1 store r1, A_next load r2, B xor r2, r2, r1 rol r2, r2, r1 add r2, Si2 store r2, B_next sync Loop Each round requires 11 clock cycles. If the encryption chip is designed using a logic process that can run up to 400 MHZ, then 36 million blocks can be encrypted per second, or 288 MB/s in ECB mode. If we assume 12 rounds (a typical case for RC5), then compared to a conventional CPU running at the same clock speed, the concurrent execution of multiple PEs according to an embodiment of the present invention results in a 12-fold performance improvement over the conventional software implementation. IDEA is one of the most secure block algorithms available and has a substantially more complex structure. It operates on 64-bit plaintext blocks. A 128-bit key is used. The same algorithm is used both for encryption and decryption. The main philosophy of the algorithm is to mix operations from different algebraic groups, operations such as XOR, addition modulo 2.sup.16, and multiplication modulo 2.sup.16 +1. These operations are used to operate on 16-bit blocks. IDEA therefore makes use of both modular multiplication and addition, which are expensive operations in software. The multiplication is complicated by DEA's treatment of zero: in a multiply, a zero is interpreted as (-1) modulo 65537. Assuming that the value 65537 has been pre-loaded into register r8 of the processing element's register file, and that register r0 contains zero, the following multiplication macro is presented for illustrative purposes: MACRO MMULT(A,B,RESULT) cbraA==r0, L1 load RESULT, #1 sub RESULT, B, RESULT jump L2 L1: cbra B==r0, L3 load RESULT, #1 sub RESULT, A, RESULT jump L2 L3: mulm A, B, RESULT, r8 andi #0xFFFF,RESULT L2: ENDMACRO Each round of IDEA consists of modular multiplications, modular addition and exclusive-OR. The 128-bit key is broken down into subkeys. Each processing element's subkey is a function solely of the key and the processing element, and therefore can be computed in advance and stored in the PE. The plaintext input to IDEA consists of four 16-bit sub-blocks X1 through X4, as indicated earlier. Each round uses six subkeys K1 through K6, and can be coded as follows: Loop: load r1, X1 load r9, K1 MMULT r1, r9, r1 load r2, X2 load r9, K2 MMULT r2, r9, r2 load r3, X3 load r9, K3 MMULT r3, r9, r3 load r4, X4 load r9, K4 MMULT r4, r9, r4 xor r5, r1, r3 xor r6, r2, r4 load r9, K5 MMULT r5, r9, r5 add r6, r5 and r6, #0xFFFF load r9, K6 MMULT r6, r9, r6 add r5, r6 and r5, #0xFFFF xor r1, r6, r1 xor r3, r6, r3 xor r2, r5, r2 xor r4, r3, r4 store r1, X1_next store r2, X3_next store r3, X2_next store r4, X4_next sync Loop Since IDEA has eight rounds, the encryption chip hardware implementation according to an embodiment of the present invention accelerates the execution by a factor of eight or more. Additional acceleration is provided by the modular multiply instruction which is not available on most microprocessors. The above code requires roughly 50 clock cycles to perform one round. At 400 MHZ, the encryption chip can encrypt with IDEA at a rate of 64 MB/s, about three times faster than a 25 MHZ hardware implementation developed at ETH University in Zurich. Data Encryption Standard, or DES, was originally designed for hardware implementation, and is therefore the most difficult to implement in software. Nevertheless, it can easily be coded in the encryption chip, according to an embodiment of the present invention. Like the previous two algorithms, DES is also a block cipher encrypting data in 64-bit blocks. A 64-bit block of plaintext is the input and a 64-bit ciphertext is the output. Once again, both encryption and decryption use the same algorithm, making DES a symmetrical algorithm. DES creates subkeys from a single key, in this case 56 bits. The subkeys are a function of the PE and the 56-bit key, so they can be computed in advance. The basic concept behind DES, as illustrated in FIG. 14, consists of a substitution followed by a permutation on the text based on the key. The following operations make up the core of DES: Expansion: The 64-bit block is divided into two 32-bit pieces 108,110. One piece is unaffected by the encryption. (The pieces are operated on every other round.) The piece that is affected is divided into eight groups of four bits. Each group is expanded by copying the two bits adjacent to it. Each expanded group is XOR'ed at 112 with a subkey. The six-bit result of the XOR is used to index a 64-entry.times.4-bit lookup table 114 called an S-box. Each of the eight groups uses its own S-box. The output from the S-boxes is permuted at 116: the bits are scrambled. Eight outputs yield 32 bits. The 32-bit output is XOR'ed at 118 with the other 32-bit half of the block. The operations can be coded as follows: expansion is performed by copying the input word, and masking bits such that there are two words: one representing even-numbered S-box inputs and one representing odd-numbered S-box inputs. The two words are XOR'ed with key information. The result is used to index the S-box lookup table. The data in each S-box is pre-permuted, so that the output of the S-box is 32-bit data. The final value is the logical OR of all components. Sample code follows: Loop: load r1,A load r2,B load r3,r2 store r2, A_next and r2,#0xF9F9F9F9 and r3,#0x9F9F9F9F xor r2,K1 xor r3,K2 load r5,r0 load r4,r3 rol r4,#1 and r4,#0x3f or r5,[r4+S1] load r4,r2 ror r4,#3 and r4,#0x3f or r5,[r4+S2] load r4,r3 rol r4,#7 and r4,#0x3f or r5,[r4+S3 ] load r4, r2 ror r4, r#11 and r4, r#0x3f or r5,[r4+S4] load r4,r3 rol r4,#15 and r4,#0x3f or r5,[r4+S5] load r4,r2 ror r4, r#19 and r4,#0x3f or r5,[r4+S6] load r4,r3 rol r4,#23 and r4,#0x3f or r5,[r4+S7] load r4,r2 ror r4,#27 and r4,#0x3f or r5,[r4+S8] xor r5,r1,r5 store rS,B_next sync Loop This sample code requires 44 clock cycles to perform one round. At 400 MHZ, a data rate of 72 MB/s can be achieved. This rate compares favorably with hardware implementations of DES available in the mid-1990's, that encrypt at rates ranging from 1-35 MB/s. VLSI Technology's VM007 can encrypt up to 200 MB/s. In each of the above cases, performance has been shown to be much faster than a software implementation on a conventional CPU, but slower than a dedicated hardware implementation. The advantage of this invention over the hardware implementations is that the encryption chip is programmable, so that it may implement any algorithm, including those that have yet to be conceived. Although no specific public key algorithm examples have been given, it should be noted that similar improvements over existing approaches will result by employing the techniques as discussed in the preferred embodiment of the present invention. EQUIVALENTS While this invention has been particularly shown and described with references to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described specifically herein. Such equivalents are intended to be encompassed in the scope of the claims.
|
Same subclass Same class Consider this |
||||||||||
