Method and apparatus for custom operations of a processor
5963744
Abstract
Custom operations are useable in processor systems for performing functions including multimedia functions. These custom operations enhance a system, such as a PC system, to provide real-time multimedia capabilities while maintaining the advantages of a special-purpose, embedded solution (low cost and chip count) and the advantage of a general-purpose processor (reprogrammability). These custom operations work in a computer system which supplies input data having operand data, performs operations on the operand data, and supplies result data to a destination register. Operations performed may include audio and video processing including clipping or saturation operations. The present invention also performs parallel operations on select operand data from input registers and stores results in the destination register.
Claims
What is claimed is:
Description
BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Custom operations listed by function type
Function             Custom Op       Description
______________________________________
DSP absolute value   dspiabs         Clipped signed 32-bit absolute value
                     dspidualabs     Dual clipped absolute values of signed 16-bit halfwords
DSP add              dspiadd         Clipped signed 32-bit add
                     dspuadd         Clipped unsigned 32-bit add
                     dspidualadd     Dual clipped add of signed 16-bit halfwords
                     dspuquadaddui   Quad clipped add of unsigned/signed bytes
DSP multiply         dspimul         Clipped signed 32-bit multiply
                     dspumul         Clipped unsigned 32-bit multiply
                     dspidualmul     Dual clipped multiply of signed 16-bit halfwords
DSP subtract         dspisub         Clipped signed 32-bit subtract
                     dspusub         Clipped unsigned 32-bit subtract
                     dspidualsub     Dual clipped subtract of signed 16-bit halfwords
Sum of products      ifir16          Signed sum of products of signed 16-bit halfwords
                     ifir8ii         Signed sum of products of signed bytes
                     ifir8ui         Signed sum of products of unsigned/signed bytes
                     ufir16          Unsigned sum of products of unsigned 16-bit halfwords
                     ufir8uu         Unsigned sum of products of unsigned bytes
Merge                mergelsb        Merge least-significant bytes
                     mergemsb        Merge most-significant bytes
Pack                 pack16lsb       Pack least-significant 16-bit halfwords
                     pack16msb       Pack most-significant 16-bit halfwords
                     packbytes       Pack least-significant bytes
Byte averages        quadavg         Unsigned byte-wise quad average
Byte multiplies      quadumulmsb     Unsigned quad 8-bit multiply most significant
Motion estimation    ume8ii          Unsigned sum of absolute values of signed 8-bit differences
                     ume8uu          Unsigned sum of absolute values of unsigned 8-bit differences
Clipping             iclipi          Clip signed to signed
                     uclipi          Clip signed to unsigned
                     uclipu          Clip unsigned to unsigned
______________________________________
An example is presented to illustrate use of a custom operation of the present invention. This example, a byte-matrix transposition, provides a simple illustration of how custom operations can significantly increase processing speed in small kernels of applications. As in most uses of custom operations, the power of custom operations in this case comes from their ability to operate on multiple data items in parallel. Consider, for example, the task of transposing a packed, four-by-four matrix of bytes in memory. The matrix might, for example, contain eight-bit pixel values. FIG. 3(a) illustrates the organization of the matrix in memory, and FIG. 3(b) illustrates, in standard mathematical notation, the task to be performed.
Performing this operation with traditional microprocessor instructions is straightforward but time-consuming. One method is to perform 12 load-byte instructions (since only 12 of the 16 bytes need to be repositioned) and 12 store-byte instructions to store the bytes back in memory in their new positions. Another method is to perform four load-word instructions, reposition the bytes of the loaded words in registers, and then perform four store-word instructions. Unfortunately, repositioning the bytes in registers requires a large number of instructions to properly shift and mask the bytes. Performing twenty-four loads and stores makes implicit use of the shifting and masking hardware in the load/store units and thus yields a shorter instruction sequence. The problem with performing twenty-four loads and stores is that loads and stores are inherently slow operations: they must access at least the cache and possibly slower layers in the memory hierarchy. Further, performing byte loads and stores when 32-bit word-wide accesses run just as fast wastes the power of the cache/memory interface.
A fast algorithm that takes full advantage of cache/memory bandwidth while not requiring an inordinate number of byte-manipulation instructions is desired. The present invention has instructions that merge (mergemsb and mergelsb) and pack (pack16msb and pack16lsb) bytes and 16-bit halfwords directly and in parallel. Four of these instructions can be applied in the present example to speed up manipulation of bytes packed into words. FIG. 4 illustrates application of these instructions to the byte-matrix transposition example. FIG. 5(a) shows a list of the operations needed to implement a matrix transpose. When assembled into actual instructions, these custom operations would be packed as tightly as dependencies allow, for example, up to five operations per instruction. The low-level code in FIG. 5(a) is shown here for illustration purposes only.
A first sequence of four load-word operations (ld32d) in FIG. 5(a) brings the packed words of the input matrix into registers r10, r11, r12, and r13. A next sequence of four merge operations (mergemsb and mergelsb) produces intermediate results in registers r14, r15, r16, and r17. A next sequence of four pack operations (pack16msb and pack16lsb) may then replace the original operands or, if the original matrix operands are needed for further computations, place the transposed matrix in separate registers (a TM-1 optimizing C compiler could perform such an analysis automatically). In this example, the transposed matrix is placed in separate registers r18, r19, r20, and r21. A final sequence of four store-word operations (st32d) puts the transposed matrix back into memory. Thus, using the custom operations of the present invention, the byte-matrix transposition requires four load-word operations and four store-word operations (the minimum possible) and eight register-to-register data manipulation operations. The result is 16 operations, or byte-matrix transposition at a rate of one operation per byte. FIG. 5(b) illustrates an equivalent C-language fragment.
While the advantage of the custom-operation-based algorithm over brute-force code that uses 24 load- and store-byte instructions seems to be only eight operations (a 33% reduction) for the present example, the advantage is actually much greater. First, using custom operations, the number of memory references is reduced from twenty-four to eight, i.e., a reduction by a factor of three. Since memory references are slower than register-to-register operations (such as those performed using the custom operations in this example), the reduction in memory references is significant. Further, the ability of the compiling system of the present system (the TM-1 system) to exploit the performance potential of the TM-1 microprocessor hardware is enhanced by the custom-operation-based code. Specifically, the compiling system more easily produces an optimal schedule (arrangement) of the code when the number of memory references is in balance with the number of register-to-register operations. Generally, high-performance microprocessors have a limit on the number of memory references that can be processed in a single cycle. As a result, a long sequence of code that contains only memory references can cause empty operation slots in the long TM-1 instructions and thus waste the performance potential of the hardware.
As this example has shown, use of the custom operations of the present invention may reduce the absolute number of operations needed to perform a computation and can also help a compiling system produce code that fully exploits the performance potential of the respective CPU. Other applications, such as MPEG image reconstruction (for example, in a complete MPEG video decoding algorithm) and motion-estimation kernels, could also benefit from the custom operations of the present invention, although this list is not exhaustive. The present invention includes those custom operations listed in Table 1.
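The merge-then-pack transposition described in this example can be sketched in portable C. This is only an illustrative model under stated assumptions: each matrix row is taken to occupy one 32-bit word with its first element in the most-significant byte, the helper functions are software stand-ins for the custom operations (not TM-1 intrinsics), and the operand order passed to each operation is a guess, since FIG. 4 and FIG. 5(a) are not reproduced here. The name transpose4x4 is hypothetical.

```c
#include <stdint.h>

/* Software models of the custom operations (guard assumed true). */
static uint32_t mergemsb(uint32_t rsrc1, uint32_t rsrc2) {
    return (rsrc1 & 0xff000000u)            /* rdest<31:24> <- rsrc1<31:24> */
         | ((rsrc2 >> 8) & 0x00ff0000u)     /* rdest<23:16> <- rsrc2<31:24> */
         | ((rsrc1 >> 8) & 0x0000ff00u)     /* rdest<15:8>  <- rsrc1<23:16> */
         | ((rsrc2 >> 16) & 0x000000ffu);   /* rdest<7:0>   <- rsrc2<23:16> */
}

static uint32_t mergelsb(uint32_t rsrc1, uint32_t rsrc2) {
    return ((rsrc1 << 16) & 0xff000000u)    /* rdest<31:24> <- rsrc1<15:8> */
         | ((rsrc2 << 8) & 0x00ff0000u)     /* rdest<23:16> <- rsrc2<15:8> */
         | ((rsrc1 << 8) & 0x0000ff00u)     /* rdest<15:8>  <- rsrc1<7:0>  */
         | (rsrc2 & 0x000000ffu);           /* rdest<7:0>   <- rsrc2<7:0>  */
}

static uint32_t pack16msb(uint32_t rsrc1, uint32_t rsrc2) {
    return (rsrc1 & 0xffff0000u) | (rsrc2 >> 16);
}

static uint32_t pack16lsb(uint32_t rsrc1, uint32_t rsrc2) {
    return (rsrc1 << 16) | (rsrc2 & 0x0000ffffu);
}

/* Hypothetical helper: transpose a packed 4x4 byte matrix held in four
   words, using four merges followed by four packs. */
static void transpose4x4(const uint32_t in[4], uint32_t out[4]) {
    uint32_t r14 = mergemsb(in[0], in[1]);  /* columns 0,1 of rows 0,1 */
    uint32_t r15 = mergemsb(in[2], in[3]);  /* columns 0,1 of rows 2,3 */
    uint32_t r16 = mergelsb(in[0], in[1]);  /* columns 2,3 of rows 0,1 */
    uint32_t r17 = mergelsb(in[2], in[3]);  /* columns 2,3 of rows 2,3 */
    out[0] = pack16msb(r14, r15);           /* column 0 */
    out[1] = pack16lsb(r14, r15);           /* column 1 */
    out[2] = pack16msb(r16, r17);           /* column 2 */
    out[3] = pack16lsb(r16, r17);           /* column 3 */
}
```

In this model the eight register-to-register steps correspond to the four merge and four pack operations counted in the text; the loads and stores are left to the caller.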
The specifics of each of these custom operations are set forth below. In the function code given below, standard symbols, syntax, etc. are used. For example, temp1 and temp2 represent temporary registers. Further, as an example, the function temp1 ← sign_ext16to32(rsrc1<15:0>) means that temp1 is loaded with bits 15:0 (bits 0 to 15) of the rsrc1 register, with the sign bit (in this example, bit 15) being extended into bits 16 to 31 (sign-bit extension). Similarly, temp2 ← sign_ext16to32(rsrc1<31:16>) indicates that bits 16 to 31 of rsrc1 are extracted (and, for calculation purposes, 'placed' in bits 0 to 15) and the sign bit, which in this example is bit 31 of rsrc1, is extended into bits 16 to 31. This sign extension is used for signed values, in this example, signed integers. For unsigned values, zero fill is used. The notation for zero fill is very similar to that of sign extend. For example, zero_ext8to32(rsrc1<7:0>) indicates that bits 7 to 0 are to be operated on and bits 8 to 31 are filled with zeros. rsrc1, rsrc2 and rdest may be any of the available registers as discussed above. For each of the operations listed below, the operation optionally takes a guard, specified in rguard. If a guard is present, in this example its LSB controls modification of the destination register: if the LSB of rguard is 1, rdest is written; otherwise, rdest is not changed.
dspiabs
dspiabs is a clipped signed absolute value operation, a pseudo-op for h_dspiabs (hardware dspiabs). This operation has the following function:
______________________________________
if rguard then {
    if rsrc1 >= 0 then
        rdest ← rsrc1
    else if rsrc1 = 0x80000000 then
        rdest ← 0x7fffffff
    else
        rdest ← -rsrc1
}
______________________________________
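As a rough illustration, the clipped absolute value and the guard behaviour described in this section can be modelled in C as follows. The function names are chosen to mirror the operation; guarded_dspiabs is a hypothetical wrapper that only shows how the LSB of rguard gates the write to rdest.

```c
#include <stdint.h>

/* Clipped signed 32-bit absolute value: |-2^31| does not fit in a signed
   32-bit word, so that one input is clipped to 0x7fffffff. */
static int32_t dspiabs(int32_t rsrc1) {
    if (rsrc1 >= 0) return rsrc1;
    if (rsrc1 == INT32_MIN) return INT32_MAX;
    return -rsrc1;
}

/* Hypothetical guarded form: rdest is written only when bit 0 of rguard
   is set; otherwise it keeps its previous value. */
static void guarded_dspiabs(uint32_t rguard, int32_t rsrc1, int32_t *rdest) {
    if (rguard & 1u) {
        *rdest = dspiabs(rsrc1);
    }
}
```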
The dspiabs operation is a pseudo operation transformed by the scheduler into an h_dspiabs with a constant first argument of zero and a second argument equal to the dspiabs argument. Pseudo operations generally are not used in assembly source files. h_dspiabs performs the same function; however, this operation requires a zero as its first argument. The dspiabs operation computes the absolute value of rsrc1, clips the result into the range [2^31 - 1 ... 0] or [0x7fffffff ... 0], and stores the clipped value into rdest (a destination register). All values are signed integers.
dspidualabs
dspidualabs is a dual clipped absolute value of signed 16-bit halfwords operation, a pseudo-op for h_dspidualabs (hardware dspidualabs). This operation has the following function:
______________________________________
if rguard then {
    temp1 ← sign_ext16to32(rsrc1<15:0>)
    temp2 ← sign_ext16to32(rsrc1<31:16>)
    if temp1 = 0xffff8000 then temp1 ← 0x7fff
    if temp2 = 0xffff8000 then temp2 ← 0x7fff
    if temp1 < 0 then temp1 ← -temp1
    if temp2 < 0 then temp2 ← -temp2
    rdest<31:16> ← temp2<15:0>
    rdest<15:0> ← temp1<15:0>
}
______________________________________
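The dual form applies the same clipped absolute value independently to each 16-bit halfword. A minimal C sketch follows, assuming the guard is true; sign_ext16to32 here is a portable stand-in for the sign-extension notation used in the function code, and clipped_abs16 is a hypothetical helper.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a word to 32 bits without relying on
   implementation-defined narrowing conversions. */
static int32_t sign_ext16to32(uint32_t halfword) {
    int32_t v = (int32_t)(halfword & 0xffffu);
    return (v ^ 0x8000) - 0x8000;
}

/* Absolute value clipped into [0 ... 0x7fff]. */
static int32_t clipped_abs16(int32_t v) {
    if (v == -32768) return 0x7fff;
    return v < 0 ? -v : v;
}

static uint32_t dspidualabs(uint32_t rsrc1) {
    int32_t temp1 = clipped_abs16(sign_ext16to32(rsrc1));        /* low halfword  */
    int32_t temp2 = clipped_abs16(sign_ext16to32(rsrc1 >> 16));  /* high halfword */
    return ((uint32_t)temp2 << 16) | (uint32_t)temp1;
}
```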
The dspidualabs operation is a pseudo operation transformed by the scheduler into an h_dspidualabs with, in this example, a constant zero as a first argument and the dspidualabs argument as a second argument. The dspidualabs operation performs two 16-bit clipped, signed absolute value computations separately on the high and low 16-bit halfwords of rsrc1. Both absolute values are clipped into the range [0x0 ... 0x7fff] and written into the corresponding halfwords of rdest. All values are signed 16-bit integers. h_dspidualabs performs the same function; however, this operation requires a zero as its first argument.
dspiadd
dspiadd is a clipped signed add operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← sign_ext32to64(rsrc1) + sign_ext32to64(rsrc2)
    if temp < 0xffffffff80000000 then
        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp<31:0>
}
______________________________________
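The widen-then-clip pattern used by the clipped add can be sketched in C as below. This is an illustrative model only (guard assumed true), using a 64-bit temporary in the same role as temp in the function code.

```c
#include <stdint.h>

/* Clipped (saturating) signed 32-bit add. */
static int32_t dspiadd(int32_t rsrc1, int32_t rsrc2) {
    int64_t temp = (int64_t)rsrc1 + (int64_t)rsrc2;  /* exact 64-bit sum */
    if (temp < INT32_MIN) return INT32_MIN;          /* clip to 0x80000000 */
    if (temp > INT32_MAX) return INT32_MAX;          /* clip to 0x7fffffff */
    return (int32_t)temp;
}
```

Computing the sum at 64-bit width avoids signed overflow in the model itself, so the range tests see the true mathematical result.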
As shown in FIG. 6, the dspiadd operation computes a signed sum rsrc1 + rsrc2, clips the result into the 32-bit signed range [2^31 - 1 ... -2^31] or [0x7fffffff ... 0x80000000], and stores the clipped value into rdest. All values are signed integers.
dspuadd
dspuadd is a clipped unsigned add operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← zero_ext32to64(rsrc1) + zero_ext32to64(rsrc2)
    if (unsigned) temp > 0x00000000ffffffff then
        rdest ← 0xffffffff
    else
        rdest ← temp<31:0>
}
______________________________________
As shown in FIG. 7, the dspuadd operation computes an unsigned sum rsrc1 + rsrc2, clips the result into the unsigned range [2^32 - 1 ... 0] or [0xffffffff ... 0], and stores the clipped value into rdest.
dspidualadd
dspidualadd is a dual clipped add of signed 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    temp1 ← sign_ext16to32(rsrc1<15:0>) +
            sign_ext16to32(rsrc2<15:0>)
    temp2 ← sign_ext16to32(rsrc1<31:16>) +
            sign_ext16to32(rsrc2<31:16>)
    if temp1 < 0xffff8000 then temp1 ← 0x8000
    if temp2 < 0xffff8000 then temp2 ← 0x8000
    if temp1 > 0x7fff then temp1 ← 0x7fff
    if temp2 > 0x7fff then temp2 ← 0x7fff
    rdest<31:16> ← temp2<15:0>
    rdest<15:0> ← temp1<15:0>
}
______________________________________
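A C sketch of the dual clipped add may help make the per-halfword clipping concrete. It assumes the guard is true; sext16 and clip16 are hypothetical helper names, and both halfword sums are clipped with the same lower and upper bounds.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of a word to 32 bits. */
static int32_t sext16(uint32_t h) {
    int32_t v = (int32_t)(h & 0xffffu);
    return (v ^ 0x8000) - 0x8000;
}

/* Clip into the signed 16-bit range [-2^15 ... 2^15 - 1]. */
static int32_t clip16(int32_t v) {
    if (v < -32768) return -32768;
    if (v > 32767) return 32767;
    return v;
}

static uint32_t dspidualadd(uint32_t rsrc1, uint32_t rsrc2) {
    int32_t temp1 = clip16(sext16(rsrc1) + sext16(rsrc2));              /* low  */
    int32_t temp2 = clip16(sext16(rsrc1 >> 16) + sext16(rsrc2 >> 16));  /* high */
    return (((uint32_t)temp2 & 0xffffu) << 16) | ((uint32_t)temp1 & 0xffffu);
}
```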
As shown in FIG. 8, the dspidualadd operation computes two 16-bit clipped, signed sums separately on the two respective pairs of high and low 16-bit halfwords of rsrc1 and rsrc2. Both sums are clipped into the range [2^15 - 1 ... -2^15] or [0x7fff ... 0x8000] and written into the corresponding halfwords of rdest. All values are signed 16-bit integers.
dspuquadaddui
dspuquadaddui is a quad clipped add of unsigned/signed bytes operation. This operation has the following function:
______________________________________
if rguard then {
    for (i ← 0, m ← 31, n ← 24; i < 4; i ← i + 1, m ← m - 8, n ← n - 8) {
        temp ← zero_ext8to32(rsrc1<m:n>) + sign_ext8to32(rsrc2<m:n>)
        if temp < 0 then
            rdest<m:n> ← 0
        else if temp > 0xff then
            rdest<m:n> ← 0xff
        else rdest<m:n> ← temp<7:0>
    }
}
______________________________________
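The byte loop above can be modelled in C roughly as follows (guard assumed true). The shift variable plays the role of the m:n byte window; this is a sketch of the stated semantics, not of any particular hardware datapath.

```c
#include <stdint.h>

/* Quad clipped add: unsigned bytes of rsrc1 plus signed bytes of rsrc2,
   each sum clipped into [0 ... 0xff]. */
static uint32_t dspuquadaddui(uint32_t rsrc1, uint32_t rsrc2) {
    uint32_t rdest = 0;
    for (int shift = 24; shift >= 0; shift -= 8) {
        int32_t u = (int32_t)((rsrc1 >> shift) & 0xffu);  /* zero_ext8to32 */
        int32_t s = (int32_t)((rsrc2 >> shift) & 0xffu);
        s = (s ^ 0x80) - 0x80;                            /* sign_ext8to32 */
        int32_t temp = u + s;
        if (temp < 0) temp = 0;
        else if (temp > 0xff) temp = 0xff;
        rdest |= (uint32_t)temp << shift;
    }
    return rdest;
}
```

An operation of this shape is a natural fit for adding signed correction terms to unsigned pixel bytes, which is one reason the mixed unsigned/signed form is useful.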
As shown in FIG. 9, the dspuquadaddui operation computes four separate sums of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2. Bytes in rsrc1 are considered unsigned values; bytes in rsrc2 are considered signed values. The four sums are clipped into the unsigned range [255 ... 0] or [0xff ... 0]; thus, the resulting byte sums are unsigned. All computations are performed without loss of precision.
dspimul
dspimul is a clipped signed multiply operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← sign_ext32to64(rsrc1) × sign_ext32to64(rsrc2)
    if temp < 0xffffffff80000000 then
        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp<31:0>
}
______________________________________
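For comparison with the clipped add, a C model of the clipped signed multiply is sketched below (guard assumed true). The product of two 32-bit signed values always fits in 64 bits, so the 64-bit temporary is exact.

```c
#include <stdint.h>

/* Clipped (saturating) signed 32-bit multiply. */
static int32_t dspimul(int32_t rsrc1, int32_t rsrc2) {
    int64_t temp = (int64_t)rsrc1 * (int64_t)rsrc2;  /* exact 64-bit product */
    if (temp < INT32_MIN) return INT32_MIN;
    if (temp > INT32_MAX) return INT32_MAX;
    return (int32_t)temp;
}
```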
As shown in FIG. 10, the dspimul operation computes a product rsrc1 × rsrc2, clips the result into the range [2^31 - 1 ... -2^31] or [0x7fffffff ... 0x80000000], and stores the clipped value into rdest. All values are signed integers.
dspumul
dspumul is a clipped unsigned multiply operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← zero_ext32to64(rsrc1) ×
           zero_ext32to64(rsrc2)
    if (unsigned) temp > 0x00000000ffffffff then
        rdest ← 0xffffffff
    else
        rdest ← temp<31:0>
}
______________________________________
As shown in FIG. 11, the dspumul operation computes an unsigned product rsrc1 × rsrc2, clips the result into the unsigned range [2^32 - 1 ... 0] or [0xffffffff ... 0], and stores the clipped value into rdest.
dspidualmul
dspidualmul is a dual clipped multiply of signed 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    temp1 ← sign_ext16to32(rsrc1<15:0>) ×
            sign_ext16to32(rsrc2<15:0>)
    temp2 ← sign_ext16to32(rsrc1<31:16>) ×
            sign_ext16to32(rsrc2<31:16>)
    if temp1 < 0xffff8000 then temp1 ← 0x8000
    if temp2 < 0xffff8000 then temp2 ← 0x8000
    if temp1 > 0x7fff then temp1 ← 0x7fff
    if temp2 > 0x7fff then temp2 ← 0x7fff
    rdest<31:16> ← temp2<15:0>
    rdest<15:0> ← temp1<15:0>
}
______________________________________
As shown in FIG. 12, the dspidualmul operation computes two 16-bit clipped, signed products separately on the two respective pairs of high and low 16-bit halfwords of rsrc1 and rsrc2. Both products are clipped into the range [2^15 - 1 ... -2^15] or [0x7fff ... 0x8000] and written into the corresponding halfwords of rdest. All values are signed 16-bit integers.
dspisub
dspisub is a clipped signed subtract operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← sign_ext32to64(rsrc1) -
           sign_ext32to64(rsrc2)
    if temp < 0xffffffff80000000 then
        rdest ← 0x80000000
    else if temp > 0x000000007fffffff then
        rdest ← 0x7fffffff
    else
        rdest ← temp<31:0>
}
______________________________________
As shown in FIG. 13, the dspisub operation computes a difference rsrc1 - rsrc2, clips the result into the range [0x80000000 ... 0x7fffffff], and stores the clipped value into rdest. All values are signed integers.
dspusub
dspusub is a clipped unsigned subtract operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← zero_ext32to64(rsrc1) -
           zero_ext32to64(rsrc2)
    if (signed) temp < 0 then
        rdest ← 0
    else
        rdest ← temp<31:0>
}
______________________________________
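The unsigned subtract clips only at the lower bound, since a difference of two unsigned 32-bit values can never exceed 0xffffffff. A short C model, assuming the guard is true:

```c
#include <stdint.h>

/* Clipped unsigned 32-bit subtract: negative differences become 0. */
static uint32_t dspusub(uint32_t rsrc1, uint32_t rsrc2) {
    int64_t temp = (int64_t)rsrc1 - (int64_t)rsrc2;  /* zero-extended operands */
    return temp < 0 ? 0u : (uint32_t)temp;
}
```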
As shown in FIG. 14, the dspusub operation computes an unsigned difference rsrc1 - rsrc2, clips the result into the unsigned range [0 ... 0xffffffff], and stores the clipped value into rdest.
dspidualsub
dspidualsub is a dual clipped subtract of signed 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    temp1 ← sign_ext16to32(rsrc1<15:0>) -
            sign_ext16to32(rsrc2<15:0>)
    temp2 ← sign_ext16to32(rsrc1<31:16>) -
            sign_ext16to32(rsrc2<31:16>)
    if temp1 < 0xffff8000 then temp1 ← 0x8000
    if temp2 < 0xffff8000 then temp2 ← 0x8000
    if temp1 > 0x7fff then temp1 ← 0x7fff
    if temp2 > 0x7fff then temp2 ← 0x7fff
    rdest<31:16> ← temp2<15:0>
    rdest<15:0> ← temp1<15:0>
}
______________________________________
As shown in FIG. 15, the dspidualsub operation computes two 16-bit clipped, signed differences separately on the two respective pairs of high and low 16-bit halfwords of rsrc1 and rsrc2. Both differences are clipped into the range [2^15 - 1 ... -2^15] or [0x7fff ... 0x8000] and written into the corresponding halfwords of rdest. All values are signed 16-bit integers.
ifir16
ifir16 is a sum of products of signed 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← sign_ext16to32(rsrc1<31:16>) ×
            sign_ext16to32(rsrc2<31:16>) +
            sign_ext16to32(rsrc1<15:0>) ×
            sign_ext16to32(rsrc2<15:0>)
______________________________________
As shown in FIG. 16, the ifir16 operation computes two separate products of two respective pairs of corresponding 16-bit halfwords of rsrc1 and rsrc2; the two products are summed, and the result is written to rdest. All halfwords are considered signed; thus, the products and the final sum of products are signed. All computations are performed without loss of precision.
ifir8ii
ifir8ii is a signed sum of products of signed bytes operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← sign_ext8to32(rsrc1<31:24>) ×
            sign_ext8to32(rsrc2<31:24>) +
            sign_ext8to32(rsrc1<23:16>) ×
            sign_ext8to32(rsrc2<23:16>) +
            sign_ext8to32(rsrc1<15:8>) ×
            sign_ext8to32(rsrc2<15:8>) +
            sign_ext8to32(rsrc1<7:0>) ×
            sign_ext8to32(rsrc2<7:0>)
______________________________________
As shown in FIG. 17, the ifir8ii operation computes four separate products of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2; the four products are summed, and the result is written to rdest. All values are considered signed; thus, the products and the final sum of products are signed. All computations are performed without loss of precision.
ifir8ui
ifir8ui is a signed sum of products of unsigned/signed bytes operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← zero_ext8to32(rsrc1<31:24>) ×
            sign_ext8to32(rsrc2<31:24>) +
            zero_ext8to32(rsrc1<23:16>) ×
            sign_ext8to32(rsrc2<23:16>) +
            zero_ext8to32(rsrc1<15:8>) ×
            sign_ext8to32(rsrc2<15:8>) +
            zero_ext8to32(rsrc1<7:0>) ×
            sign_ext8to32(rsrc2<7:0>)
______________________________________
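The four-term sum of products can be written as a short loop in C. The sketch below models ifir8ui under the assumption that the guard is true; each product lies between -32640 and 32385, so the sum of four fits comfortably in 32 bits, consistent with the statement that no precision is lost.

```c
#include <stdint.h>

/* Signed sum of products of unsigned bytes (rsrc1) and signed bytes (rsrc2). */
static int32_t ifir8ui(uint32_t rsrc1, uint32_t rsrc2) {
    int32_t sum = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int32_t u = (int32_t)((rsrc1 >> shift) & 0xffu);  /* zero_ext8to32 */
        int32_t s = (int32_t)((rsrc2 >> shift) & 0xffu);
        s = (s ^ 0x80) - 0x80;                            /* sign_ext8to32 */
        sum += u * s;
    }
    return sum;
}
```

A typical use would be a four-tap filter step in which rsrc1 holds four unsigned samples (for example, pixels) and rsrc2 holds four signed coefficients.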
As shown in FIG. 18, the ifir8ui operation computes four separate products of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2; the four products are summed, and the result is written to rdest. Bytes from rsrc1 are considered unsigned, but bytes from rsrc2 are considered signed; thus, the products and the final sum of products are signed. All computations are performed without loss of precision.
ufir16
ufir16 is a sum of products of unsigned 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    rdest ← zero_ext16to32(rsrc1<31:16>) ×
            zero_ext16to32(rsrc2<31:16>) +
            zero_ext16to32(rsrc1<15:0>) ×
            zero_ext16to32(rsrc2<15:0>)
}
______________________________________
As shown in FIG. 19, the ufir16 operation computes two separate products of two respective pairs of corresponding 16-bit halfwords of rsrc1 and rsrc2; the two products are summed, and the result is written to rdest. All halfwords are considered unsigned; thus, the products and the final sum of products are unsigned. All computations are performed without loss of precision. The final sum of products is clipped into the range [0xffffffff ... 0] before being written into rdest.
ufir8uu
ufir8uu is an unsigned sum of products of unsigned bytes operation. This operation has the following function:
______________________________________
if rguard then {
    rdest ← zero_ext8to32(rsrc1<31:24>) ×
            zero_ext8to32(rsrc2<31:24>) +
            zero_ext8to32(rsrc1<23:16>) ×
            zero_ext8to32(rsrc2<23:16>) +
            zero_ext8to32(rsrc1<15:8>) ×
            zero_ext8to32(rsrc2<15:8>) +
            zero_ext8to32(rsrc1<7:0>) ×
            zero_ext8to32(rsrc2<7:0>)
}
______________________________________
As shown in FIG. 20, the ufir8uu operation computes four separate products of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2; the four products are summed, and the result is written to rdest. All bytes are considered unsigned. All computations are performed without loss of precision.
mergelsb
mergelsb is a merge least-significant byte operation. This operation has the following function:
______________________________________
if rguard then {
    rdest<7:0> ← rsrc2<7:0>
    rdest<15:8> ← rsrc1<7:0>
    rdest<23:16> ← rsrc2<15:8>
    rdest<31:24> ← rsrc1<15:8>
}
______________________________________
As shown in FIG. 21, the mergelsb operation interleaves the two respective pairs of least-significant bytes from arguments rsrc1 and rsrc2 into rdest. The least-significant byte from rsrc2 is packed into the least-significant byte of rdest; the least-significant byte from rsrc1 is packed into the second-least-significant byte of rdest; the second-least-significant byte from rsrc2 is packed into the second-most-significant byte of rdest; and the second-least-significant byte from rsrc1 is packed into the most-significant byte of rdest.
mergemsb
mergemsb is a merge most-significant byte operation. This operation has the following function:
______________________________________
if rguard then {
    rdest<7:0> ← rsrc2<23:16>
    rdest<15:8> ← rsrc1<23:16>
    rdest<23:16> ← rsrc2<31:24>
    rdest<31:24> ← rsrc1<31:24>
}
______________________________________
As shown in FIG. 22, the mergemsb operation interleaves the two respective pairs of most-significant bytes from arguments rsrc1 and rsrc2 into rdest. The second-most-significant byte from rsrc2 is packed into the least-significant byte of rdest; the second-most-significant byte from rsrc1 is packed into the second-least-significant byte of rdest; the most-significant byte from rsrc2 is packed into the second-most-significant byte of rdest; and the most-significant byte from rsrc1 is packed into the most-significant byte of rdest.
pack16lsb
pack16lsb is a pack least-significant 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    rdest<15:0> ← rsrc2<15:0>
    rdest<31:16> ← rsrc1<15:0>
}
______________________________________
As shown in FIG. 23, the pack16lsb operation packs the two respective least-significant halfwords from arguments rsrc1 and rsrc2 into rdest. The halfword from rsrc1 is packed into the most-significant halfword of rdest and the halfword from rsrc2 is packed into the least-significant halfword of rdest.
pack16msb
pack16msb is a pack most-significant 16-bit halfwords operation. This operation has the following function:
______________________________________
if rguard then {
    rdest<15:0> ← rsrc2<31:16>
    rdest<31:16> ← rsrc1<31:16>
}
______________________________________
As shown in FIG. 24, the pack16msb operation packs the two respective most-significant halfwords from arguments rsrc1 and rsrc2 into rdest. The halfword from rsrc1 is packed into the most-significant halfword of rdest and the halfword from rsrc2 is packed into the least-significant halfword of rdest.
packbytes
packbytes is a pack least-significant byte operation. This operation has the following function:
______________________________________
if rguard then {
    rdest<7:0> ← rsrc2<7:0>
    rdest<15:8> ← rsrc1<7:0>
    rdest<31:16> ← 0
}
______________________________________
As shown in FIG. 25, the packbytes operation packs the two respective least-significant bytes from arguments rsrc1 and rsrc2 into rdest. The byte from rsrc1 is packed into the second-least-significant byte of rdest and the byte from rsrc2 is packed into the least-significant byte of rdest. The two most-significant bytes of rdest are filled with zeros.
quadavg
quadavg is an unsigned byte-wise quad average operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← (zero_ext8to32(rsrc1<7:0>) +
            zero_ext8to32(rsrc2<7:0>) + 1) / 2
    rdest<7:0> ← temp<7:0>
    temp ← (zero_ext8to32(rsrc1<15:8>) +
            zero_ext8to32(rsrc2<15:8>) + 1) / 2
    rdest<15:8> ← temp<7:0>
    temp ← (zero_ext8to32(rsrc1<23:16>) +
            zero_ext8to32(rsrc2<23:16>) + 1) / 2
    rdest<23:16> ← temp<7:0>
    temp ← (zero_ext8to32(rsrc1<31:24>) +
            zero_ext8to32(rsrc2<31:24>) + 1) / 2
    rdest<31:24> ← temp<7:0>
}
______________________________________
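The rounding behaviour of the byte-wise average (the "+ 1" before halving rounds halves upward) is easy to check with a small C model, sketched here with the guard assumed true:

```c
#include <stdint.h>

/* Unsigned byte-wise average with round-half-up, four bytes per word. */
static uint32_t quadavg(uint32_t rsrc1, uint32_t rsrc2) {
    uint32_t rdest = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (rsrc1 >> shift) & 0xffu;
        uint32_t b = (rsrc2 >> shift) & 0xffu;
        uint32_t temp = (a + b + 1u) / 2u;   /* at most 255, so no overflow */
        rdest |= (temp & 0xffu) << shift;
    }
    return rdest;
}
```

Because each average is computed at full width before being written back, carries never leak between neighbouring bytes.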
As shown in FIG. 26, the quadavg operation computes four separate averages of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2. All bytes are considered unsigned. The least-significant 8 bits of each average are written to the corresponding byte in rdest. No overflow or underflow detection is performed.
quadumulmsb
quadumulmsb is an unsigned quad 8-bit multiply most significant operation. This operation has the following function:
______________________________________
if rguard then {
    temp ← (zero_ext8to32(rsrc1<7:0>) ×
            zero_ext8to32(rsrc2<7:0>))
    rdest<7:0> ← temp<15:8>
    temp ← (zero_ext8to32(rsrc1<15:8>) ×
            zero_ext8to32(rsrc2<15:8>))
    rdest<15:8> ← temp<15:8>
    temp ← (zero_ext8to32(rsrc1<23:16>) ×
            zero_ext8to32(rsrc2<23:16>))
    rdest<23:16> ← temp<15:8>
    temp ← (zero_ext8to32(rsrc1<31:24>) ×
            zero_ext8to32(rsrc2<31:24>))
    rdest<31:24> ← temp<15:8>
}
______________________________________
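A C model of the byte-wise multiply, again only a sketch with the guard assumed true, keeps bits 15:8 of each 16-bit product:

```c
#include <stdint.h>

/* Unsigned quad 8-bit multiply, keeping the most-significant byte of
   each 16-bit product. */
static uint32_t quadumulmsb(uint32_t rsrc1, uint32_t rsrc2) {
    uint32_t rdest = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t a = (rsrc1 >> shift) & 0xffu;
        uint32_t b = (rsrc2 >> shift) & 0xffu;
        uint32_t temp = a * b;                      /* at most 0xfe01 */
        rdest |= ((temp >> 8) & 0xffu) << shift;    /* temp<15:8> */
    }
    return rdest;
}
```

Keeping only the upper byte is equivalent to scaling each byte by a fraction b/256 and truncating, which is one way such an operation might be used for per-pixel weighting.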
As shown in FIG. 27, the quadumulmsb operation computes four separate products of four respective pairs of corresponding 8-bit bytes of rsrc1 and rsrc2. All bytes are considered unsigned. The most-significant 8 bits of each 16-bit product are written to the corresponding byte in rdest.
ume8ii
ume8ii is an unsigned sum of absolute values of signed 8-bit differences operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← abs_val(sign_ext8to32(rsrc1<31:24>) -
                    sign_ext8to32(rsrc2<31:24>)) +
            abs_val(sign_ext8to32(rsrc1<23:16>) -
                    sign_ext8to32(rsrc2<23:16>)) +
            abs_val(sign_ext8to32(rsrc1<15:8>) -
                    sign_ext8to32(rsrc2<15:8>)) +
            abs_val(sign_ext8to32(rsrc1<7:0>) -
                    sign_ext8to32(rsrc2<7:0>))
______________________________________
As shown in FIG. 28, the ume8ii operation computes four separate differences of four respective pairs of corresponding signed 8-bit bytes of rsrc1 and rsrc2; the absolute values of the four differences are summed, and the sum is written to rdest. All computations are performed without loss of precision.
ume8uu
ume8uu is an unsigned sum of absolute values of unsigned 8-bit differences operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← abs_val(zero_ext8to32(rsrc1<31:24>) -
                    zero_ext8to32(rsrc2<31:24>)) +
            abs_val(zero_ext8to32(rsrc1<23:16>) -
                    zero_ext8to32(rsrc2<23:16>)) +
            abs_val(zero_ext8to32(rsrc1<15:8>) -
                    zero_ext8to32(rsrc2<15:8>)) +
            abs_val(zero_ext8to32(rsrc1<7:0>) -
                    zero_ext8to32(rsrc2<7:0>))
______________________________________
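The sum of absolute differences is the core step of block-matching motion estimation. A minimal C model of ume8uu follows (guard assumed true); row_sad8 is a hypothetical helper, not part of the disclosure, showing how two calls could score one 8-pixel row held in two words per block.

```c
#include <stdint.h>

/* Sum of absolute differences of four unsigned byte pairs. */
static uint32_t ume8uu(uint32_t rsrc1, uint32_t rsrc2) {
    uint32_t sum = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int32_t a = (int32_t)((rsrc1 >> shift) & 0xffu);
        int32_t b = (int32_t)((rsrc2 >> shift) & 0xffu);
        int32_t d = a - b;
        sum += (uint32_t)(d < 0 ? -d : d);
    }
    return sum;
}

/* Hypothetical helper: score one 8-pixel row of a candidate block against
   the reference row, with each row packed into two 32-bit words. */
static uint32_t row_sad8(const uint32_t ref[2], const uint32_t cand[2]) {
    return ume8uu(ref[0], cand[0]) + ume8uu(ref[1], cand[1]);
}
```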
As shown in FIG. 29, the ume8uu operation computes four separate differences of four respective pairs of corresponding unsigned 8-bit bytes of rsrc1 and rsrc2. The absolute values of the four differences are summed, and the sum is written to rdest. All computations are performed without loss of precision.
iclipi
iclipi is a clip signed to signed operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← min(max(rsrc1, -rsrc2 - 1), rsrc2)
______________________________________
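The min/max formulation can be modelled in C as shown below, assuming the guard is true. The sketch relies on the stated precondition that rsrc2 lies between 0 and 0x7fffffff; behaviour outside that range is not modelled.

```c
#include <stdint.h>

/* Clip a signed value into [-rsrc2 - 1 ... rsrc2].
   Precondition (from the text): 0 <= rsrc2 <= 0x7fffffff. */
static int32_t iclipi(int32_t rsrc1, uint32_t rsrc2) {
    int32_t hi = (int32_t)rsrc2;
    int32_t lo = -hi - 1;          /* cannot overflow given the precondition */
    if (rsrc1 < lo) return lo;
    if (rsrc1 > hi) return hi;
    return rsrc1;
}
```

For example, a bound of 127 clips into the signed 8-bit range and a bound of 32767 into the signed 16-bit range, so one operation covers every two's-complement width.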
The iclipi operation returns the value of rsrc1 clipped into the signed integer range (-rsrc2 - 1) to rsrc2, inclusive. The argument rsrc1 is considered a signed integer; rsrc2 is considered an unsigned integer and must have a value between 0 and 0x7fffffff inclusive.
uclipi
uclipi is a clip signed to unsigned operation. This operation has the following function:
______________________________________
if rguard then
    rdest ← min(max(rsrc1, 0), rsrc2)
______________________________________
The uclipi operation returns the value of rsrc1 clipped into the unsigned integer range 0 to rsrc2, inclusive. The argument rsrc1 is considered a signed integer; rsrc2 is considered an unsigned integer.
uclipu
uclipu is a clip unsigned to unsigned operation. This operation has the following function:
______________________________________
if rguard then {
    if rsrc1 > rsrc2 then
        rdest ← rsrc2
    else
        rdest ← rsrc1
}
______________________________________
The uclipu operation returns the value of rsrc1 clipped into the unsigned integer range 0 to rsrc2, inclusive. The arguments rsrc1 and rsrc2 are considered unsigned integers.
By use of the above custom multimedia operations, an application can take advantage of highly parallel microprocessor implementations of multimedia functions at low cost. From the above disclosure, one may clearly understand that the present invention may be used with many highly parallel microprocessor implementations using VLIW, RISC, superscalar, etc. instruction formats. Additionally, one skilled in the art may easily add additional operations based on the above concepts. For example, a quad clipped subtract of bytes is not specifically described; however, one skilled in the art could easily develop this operation based on the above disclosure.
There accordingly has been described a system and method for custom operations for use in performing multimedia functions. In this disclosure, there is shown and described only the preferred embodiment of the invention, but, as aforementioned, it is to be understood that the invention is capable of use in various other combinations and environments and is capable of changes or modifications within the scope of the inventive concept as expressed herein.
