Programmatic access to the widest mode floating-point arithmetic supported by a processor
6748587
Abstract
A software mechanism for enabling a programmer to embed selected machine instructions into program source code in a convenient fashion, and optionally restricting the re-ordering of such instructions by the compiler without making any significant modifications to the compiler processing. Using a table-driven approach, the mechanism parses the embedded machine instruction constructs and verifies syntax and semantic correctness. The mechanism then translates the constructs into low-level compiler internal representations that may be integrated into other compiler code with minimal compiler changes. When also supported by a robust underlying inter-module optimization framework, library routines containing embedded machine instructions according to the present invention can be inlined into applications. When those applications invoke such library routines, the present invention enables the routines to be optimized more effectively, thereby improving run-time application performance. A mechanism is also disclosed using a "_fpreg" data type to enable floating-point arithmetic to be programmed from a source level where the programmer gains access to the full width of the floating-point register representation of the underlying processor.
Claims
We claim:
Description
TECHNICAL FIELD OF THE INVENTION
#include <inline.h>
int g1, g2, g3; /* global integer variables */
main ( )
{
g1 = _Asm_ADD (g2, g3);
}
ii) For a "LOAD" machine instruction of the form:
LOAD.<size> value=[mem_addr]
where
<size> is an opcode completer encoding the bit-size of the object being loaded, which may be one of "b" (for byte or 8-bits), "hw" (for half-word or 16-bits) or "word" (for word or 32-bits),
"value" corresponds to a 32-bit general-purpose machine register whose value is to be set by the load instruction, and
"mem_addr" corresponds to a 32-bit memory address that specifies the starting location in memory of the object whose value is to be loaded into "value",
the function prototype for the inline intrinsic can be defined as follows:
UInt32 _Asm_LOAD (_Asm_size size, void *mem_addr)
where "UInt32" corresponds to a 32-bit unsigned integer data-type, "void *" is a generic pointer data type, and "_Asm_size" is an enumeration type that encodes one of 3 possible symbolic constants. For example, in the C language, _Asm_size may be defined as follows:
typedef enum { _b=1, _hw=2, _w=3 } _Asm_size;
Alternatively, _Asm_size may be defined to be a simple integer data type with pre-defined symbolic constant values for each legal LOAD opcode completer. Using language neutral "C" pre-processor directives,
#define _b (1)
#define _hw (2)
#define _w (3)
Note that the declarations associated with "_Asm_size" would be placed in the "inline.h" system header file, and would be read in by the compiler when parsing the source program. The LOAD machine instruction can then be embedded into a "C" program thusly:
#include <inline.h>
int g; /* global integer variable */
int *p; /* global integer pointer variable */
main ( )
{
g = _Asm_LOAD(_w, p);
}
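The behavior this intrinsic exposes can be modeled in portable C. The following is only a sketch under stated assumptions: `sim_size`, `SIM_B`, `SIM_HW`, `SIM_W` and `sim_load` are hypothetical stand-ins for `_Asm_size`, its constants and `_Asm_LOAD`, and the zero extension to 32 bits and the use of host byte order are assumptions, since the text does not specify them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for _Asm_size: same three completers, same values. */
typedef enum { SIM_B = 1, SIM_HW = 2, SIM_W = 3 } sim_size;

/* Hypothetical software model of UInt32 _Asm_LOAD(_Asm_size, void *).
 * Reads 1, 2 or 4 bytes starting at mem_addr and zero-extends the value
 * to 32 bits. Zero extension and host byte order are assumptions. */
uint32_t sim_load(sim_size size, const void *mem_addr)
{
    uint8_t  b  = 0;
    uint16_t hw = 0;
    uint32_t w  = 0;

    assert(mem_addr != NULL);
    switch (size) {
    case SIM_B:  memcpy(&b,  mem_addr, sizeof b);  return b;
    case SIM_HW: memcpy(&hw, mem_addr, sizeof hw); return hw;
    case SIM_W:  memcpy(&w,  mem_addr, sizeof w);  return w;
    }
    assert(!"illegal size completer");
    return 0;
}
```

A real compiler would emit a single load instruction here; the model only shows how the size completer selects the width of the access.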
Certain inline assembly opcodes, notably those that may be considered as privileged "system" opcodes, may optionally specify an additional argument that explicitly indicates the constraints that the compiler must honor with regard to instruction re-ordering. This optional "serialization constraint" argument is specified as an integer mask value. The integer mask value encodes what types of (data independent) instructions may be moved past the inline assembly opcode in either direction in a dynamic sense (i.e. before to after, or after to before) in the current function body. If omitted, the compiler will use a default serialization mask value.
For the purposes of specifying serialization constraints in a preferred embodiment, the instruction opcodes may advantageously, but not mandatorily, be divided into the following categories:
1. Memory Opcodes: load and store instructions
2. ALU Opcodes: instructions with general-purpose register operands
3. Floating-Point Opcodes: instructions with floating-point register operands
4. System Opcodes: privileged "system" instructions
5. Branch: "basic block" boundary
6. Call: function invocation point
With respect to serialization constraints, an embedded machine instruction may act as a "fence" that prevents the scheduling of downstream instructions ahead of it, or a "fence" that prevents the scheduling of upstream instructions after it. Such constraints may be referred to as a "downward fence" and "upward fence" serialization constraint, respectively.
Given this classification, the serialization constraints associated with an inline system opcode can be encoded as an integer value, which can be defined by ORing together an appropriate set of constant bit-masks. For a system opcode, this encoded serialization constraint value may be specified as an optional final argument of the _Asm_opcode intrinsic call. For example, for the C language, the bit-mask values may be defined to be enumeration constants as follows:
typedef enum {
_NO_FENCE = 0x0,
_UP_MEM_FENCE = 0x1,
_UP_ALU_FENCE = 0x2,
_UP_FLOP_FENCE = 0x4,
_UP_SYS_FENCE = 0x8,
_UP_CALL_FENCE = 0x10,
_UP_BR_FENCE = 0x20,
_DOWN_MEM_FENCE = 0x100,
_DOWN_ALU_FENCE = 0x200,
_DOWN_FLOP_FENCE = 0x400,
_DOWN_SYS_FENCE = 0x800,
_DOWN_CALL_FENCE = 0x1000,
_DOWN_BR_FENCE = 0x2000
} _Asm_fence;
(Note: The _Asm_fence definition would advantageously be placed in the "inline.h" system header file.)
So, for example, to prevent the compiler from scheduling floating-point operations across an inlined system opcode that changes the default floating-point rounding mode, a programmer might use an integer mask formed as (_UP_FLOP_FENCE | _DOWN_FLOP_FENCE).
The _UP_BR_FENCE and _DOWN_BR_FENCE relate to "basic block" boundaries. (A basic block corresponds to the largest contiguous section of source code without any incoming or outgoing control transfers, excluding function calls.) Thus, a serialization constraint value formed by ORing together these two bit masks will prevent the compiler from scheduling the associated inlined system opcode outside of its original basic block.
Note that the compiler must automatically detect and honor any explicit data dependence constraints involving an inlined system opcode, independent of its associated serialization mask value. So, for example, just because an inlined system opcode intrinsic call argument is defined by an integer add operation, it is not necessary to explicitly specify the _UP_ALU_FENCE bit-mask as part of the serialization constraint argument.
The serialization constraint integer mask value may be treated as an optional final argument to the inline system opcode intrinsic invocation. If this argument is omitted, the compiler may choose to use any reasonable default serialization mask value (e.g. 0x3D3D, i.e. full serialization with all other opcode categories except ALU operations). Note that if a system opcode instruction is constrained to be serialized with respect to another instruction, the compiler must not schedule the two instructions to execute concurrently.
To specify serialization constraints at an arbitrary point in a program, a placeholder inline assembly opcode intrinsic named _Asm_sched_fence may be used.
This special intrinsic just accepts one argument that specifies the serialization mask value. The compiler will then honor the serialization constraints associated with this placeholder opcode, but omit the opcode from the final instruction stream. The scope of the serialization constraints is limited to the function containing the inlined system opcode. By default, the compiler may assume that called functions do not specify any inlined system opcodes with serialization constraints. However, the _Asm_sched_fence intrinsic may be used to explicitly communicate serialization constraints at a call-site that is known to invoke a function that executes a serializing system instruction. EXAMPLE If a flush cache instruction ("FC" opcode) is a privileged machine instruction that is to be embedded into source code and one that should allow user-specified serialization constraints, the following inline assembly intrinsic may be defined: void _Asm_FC ([serialization_constraint_specifier]) where the return type of the intrinsic is declared to be "void" to indicate that no data value is defined by the machine instruction. Now the FC instruction may be embedded in a C program with serialization constraints that prevent the compiler from re-ordering memory instructions across the FC instruction as shown below:
#include <inline.h>
int g1, g2; /* global integer variables */
main ( )
{
g1 = 0; /* can't be moved after FC instruction */
_Asm_FC(_UP_MEM_FENCE | _DOWN_MEM_FENCE);
g2 = 1; /* can't be moved before FC instruction */
}
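The mask arithmetic behind these serialization constraints can be checked in ordinary C. The sketch below reuses the bit values of the _Asm_fence enumeration but drops the leading underscore from the names, because such identifiers are reserved in standard C. The helpers `default_mask`, `blocks_sinking` and `blocks_hoisting` are hypothetical names introduced only for illustration.

```c
#include <assert.h>

/* Same bit values as _Asm_fence in the text; names lose the leading
 * underscore because such identifiers are reserved in standard C. */
enum {
    NO_FENCE        = 0x0,
    UP_MEM_FENCE    = 0x1,    UP_ALU_FENCE    = 0x2,
    UP_FLOP_FENCE   = 0x4,    UP_SYS_FENCE    = 0x8,
    UP_CALL_FENCE   = 0x10,   UP_BR_FENCE     = 0x20,
    DOWN_MEM_FENCE  = 0x100,  DOWN_ALU_FENCE  = 0x200,
    DOWN_FLOP_FENCE = 0x400,  DOWN_SYS_FENCE  = 0x800,
    DOWN_CALL_FENCE = 0x1000, DOWN_BR_FENCE   = 0x2000
};

/* Every fence bit in both directions. */
#define ALL_FENCES 0x3F3Fu

/* Example default from the text: serialize against every opcode
 * category except ALU operations. */
unsigned default_mask(void)
{
    return ALL_FENCES & ~(unsigned)(UP_ALU_FENCE | DOWN_ALU_FENCE);
}

/* Hypothetical query: must an upstream instruction of the category
 * identified by up_bit stay ahead of the fenced opcode? */
int blocks_sinking(unsigned mask, unsigned up_bit)
{
    assert(up_bit != 0);
    return (mask & up_bit) != 0;
}

/* Hypothetical query: must a downstream instruction of the category
 * identified by down_bit stay behind the fenced opcode? */
int blocks_hoisting(unsigned mask, unsigned down_bit)
{
    assert(down_bit != 0);
    return (mask & down_bit) != 0;
}
```

With these definitions, clearing the two ALU bits from the full mask yields 0x3D3D, the example default value mentioned in the text.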
Note that the _Asm_FC instruction specifies memory fence serialization constraints in both directions, preventing the re-ordering of the stores to global variables g1 and g2 across the FC instruction.
Use of Table-driven Approach
A table-driven approach is advantageously used to help the compiler handle assembly intrinsic operations. The table contains one entry for each intrinsic, with the entry describing the characteristics of that intrinsic. In a preferred embodiment, although not mandatorily, those characteristics may be tabulated as follows:
(a) The name of the intrinsic
(b) A brief textual description of the intrinsic
(c) Names and types of the intrinsic arguments (if any)
(d) Name and type of the intrinsic return value (if any)
(e) With momentary reference back to FIG. 1, additional information for code generator 105 to perform the translation from high level intermediate representation 103 to low level intermediate representation 106
It will be appreciated that this table-driven approach enables the separation of the generation of the assembly intrinsic header file, parsing support library, and code generation utilities from the compiler's mainstream compilation processing. Any maintenance to the table may be made (such as adding to the list of supported inlined instructions) without affecting the compiler's primary processing functionality. This makes performing such maintenance easy and predictable. The table-driven approach is also user programming language independent, extending the versatility of the present invention.
On a more detailed level, at least three specific advantages are offered by this table-driven approach:
1. Header File Generation
The table facilitates generation of a file that documents intrinsics for user programmers, providing intrinsic function prototypes and brief descriptions. Using table elements (a), (b), (c) and (d) as itemized above, and with reference again to the preceding discussion accompanying FIG.
3, a software tool 302 generates an "inline.h" system header SH_2 from inline assembly descriptor table 301. Furthermore, "inline.h" system header SH_2 also defines and contains an enumerated set of symbolic constants, registers, completers, and so forth, that the programmer may use as legal operands to inline assembly intrinsic calls in the current program. Further, in cases where an operand is a numeric constant, "inline.h" system header SH_2 documents the range of legal values for the operand, which is checked by the compiler.
2. Parsing Library Generation
The table facilitates generation of part of a library that assists, with reference again now to FIG. 1, front end processing ("FE") 102 in recognizing intrinsics specified by the programmer in source code 101, validating first that the programmer has written such intrinsics legally, and then translating the intrinsics into high-level intermediate representation 103. Note that in accordance with the present invention, it would also be possible to generate intrinsic-related front-end processing directly. In a preferred embodiment, however, library functionality is used.
Table-driven front-end processing enables an advantageous feature of the present invention, namely the automatic syntax parsing and semantics checking of the user's inline assembly code by FE 102. This feature validates that code containing embedded machine instructions is semantically correct when it is incorporated into source code 101 in the same way that a front end verifies that an ordinary function invocation is semantically correct. This frees other processing units of the compiler, such as code generator 105 and low level optimizer 107, from the time-consuming task of error checking.
This front-end validation through reference to a partial library is enabled by generation of a header file as illustrated on FIGS. 5 and 6. Turning first to FIG.
5, in which it should again be noted that blocks 506, 507 and 508 are explanatory items and not part of the information flow, inline assembly descriptor table 301 provides elements (a), (c) and (d) as itemized above to software tool 501. This information enables software tool 501 to generate language-independent inline assembly parser header file ("asmp.h"), which may then be included into corresponding source code "asmp.c" 503 and compiled 504 into corresponding object code "asmp.o" 505. It will thus be seen from FIG. 5 that "asmp.o" 505 is a language-independent inline assembly parser library object file in a form suitable for assisting FE 102 on FIG. 1.
With reference now to FIG. 6, it will be seen that "inline.h" system header SH_2 provides legal intrinsics for a programmer to invoke from source code 601. On FIG. 6, exemplary illustration is made of C source code 601_c, C++ source code 601_p, and FORTRAN source code 601_f, although the invention is not limited to these particular programming languages, and will be understood to be also enabled according to FIG. 6 on other programming languages. It will be noted that each of the illustrated source codes 601_c, 601_p and 601_f have compiler operations and sequences 601-608 analogous to FIG. 1. Further, "asmp.o" library object file 505, being language independent, is universally available to C FE 602_c, C++ FE 602_p and FORTRAN FE 602_f to assist in front-end error checking.
Front end processing FE 602 does this checking by invoking utility functions defined in "asmp.o" library object file 505 to ensure that embedded machine instructions encountered in source code 601 are using the correct types and numbers of values. This checking is advantageously performed before actual code for embedded machine instructions is generated in high-level intermediate representation 603.
In this way, it will be appreciated that various potential errors may be checked in a flexible, table-driven manner that is easily maintained by a programmer. For example, errors that may be checked include:
whether the instruction being inlined is supported.
whether the number of arguments passed is correct.
whether the arguments passed are of the correct type.
whether the values of numeric integer constant arguments, if any, are within the allowable range.
whether the serialization constraint specifier is allowed for the specified instruction.
Furthermore, the table also allows the system to compute the default serialization mask for the specified instruction if one is needed but not supplied by the user.
3. Code Generation
The table 301 facilitates actual code generation (as shown on FIG. 1) by assisting CG 105 in translation of high level intermediate representation ("HIL") 103 to low level intermediate representation ("LIL") 106. Specifically, the table assists CG 105 in translating intrinsics previously incorporated into source code 101. The table may also, when processed into a part of CG 105, perform consistency checking to recognize certain cases of incorrect HIL 103 that were not caught by error checking in front end processing ("FE") 102. Note that according to the present invention, it would also be possible to generate a library of CG object files to assist CG 105 in processing intrinsics, similar to library 505 that assists FE 102, as illustrated on FIG. 5.
Turning now to FIG. 4, and again noting that blocks 405 and 406 are explanatory items and not part of the information flow, inline assembly descriptor table 301 provides elements (c), (d) and (e) as itemized above to software tool 400. Using this information, software tool 400 generates CG source file 401_1, which in turn is compiled along with ordinary CG source files 401_2-401_n (blocks 402) into CG object files 403_1-403_n.
Archiver 404 accumulates CG object files 403_1-403_n into CG library 407.
In more detail now, the foregoing translation from HIL 103 to LIL 106 for intrinsics includes the following phases:
A. Generation of data structures
Automation at compiler-build time generates, for each possible intrinsic operation, a data structure that contains information on the types of the intrinsic arguments (if any) and the type of the return value (if any).
B. Consistency checking
At compiler-run time, a portion of CG that performs consistency checking on intrinsic operations can consult the appropriate data structure from A immediately above. This portion of CG does not need to be modified when a new intrinsic operation is added, unless the language in which the table 301 is written has changed.
C. Translation from HIL to LIL
Most intrinsic operations can be translated from HIL to LIL automatically, using information from the table. In a preferred embodiment, an escape mechanism is also advantageously provided so that an intrinsic operation that cannot be translated automatically can be flagged to be translated later by a hand-coded routine. The enablement of the escape mechanism does not affect automatic consistency checking.
The representation of an intrinsic invocation in HIL identifies the intrinsic operation and has a list of arguments; there may be an implicit return value. The representation of an intrinsic invocation in LIL identifies a low-level operation and has a list of arguments. The translation process must retrieve information from the HIL representation and build up the LIL representation. There are a number of aspects to this mapping:
i. The identity of the intrinsic operation in HIL may be expressed by one or more arguments in LIL. Information in element (e) in the inline assembly descriptor table set forth above is used to generate code expressing this identity in LIL.
ii. The implicit return value (if any) from HIL is expressed as an argument in LIL.
iii. Arguments of certain types in HIL must be translated to arguments of different types in LIL. The translation utility for any given argument type must be hand-coded, although the correct translation utility is invoked automatically by the translation process for the intrinsic operation.
iv. The serialization mask (if any) from HIL is a special attribute (not an argument) in LIL.
v. The LIL arguments must be emitted in the correct order. Information in element (e) in the inline assembly descriptor table as set forth above describes how to take the identity arguments from (i), the return value argument (if any) from (ii), and any other HIL arguments, and emit them into LIL in the correct order.
For each possible intrinsic operation, the tool run at compiler-build time creates a piece of CG that takes as input the HIL form of that intrinsic operation and generates the LIL form of that intrinsic operation.
In a preferred embodiment, the tool run at compiler-build time advantageously recognizes when two or more intrinsic operations are translated using the same algorithm, and generates a single piece of code embodying that algorithm that can perform translations for all of those intrinsic operations. When this happens, information on the identity of the intrinsic operation described in (i) above is stored in the same data structures described in A further above, so that the translation code can handle the multiple intrinsic operations. In the preferred embodiment, translation algorithms for two intrinsic operations are considered "the same" if all of the following hold:
The HIL forms of the operations have the same number of arguments of the same types in the same order.
The HIL forms of the operations either both lack a return value or have the same return type.
The identity information is expressed in the LIL forms of the operations using the same number of arguments of the same types.
The LIL arguments for the operations occur in the same order.
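The four "same algorithm" criteria amount to a structural comparison of two table entries. The following is a minimal sketch, assuming a hypothetical descriptor layout: `xlate_desc`, `arg_type`, `same_translation` and the `lil_order` string encoding ('i' for an identity argument, 'r' for the return value, a digit for a HIL argument index) are illustrative names and conventions, not taken from the patent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 4

/* Hypothetical argument/return types; T_NONE marks "no return value". */
typedef enum { T_NONE, T_INT32, T_PTR, T_FPREG, T_OPCODE } arg_type;

/* Hypothetical per-intrinsic descriptor built from the table. */
typedef struct {
    const char *name;
    size_t      hil_argc;            /* number of HIL arguments         */
    arg_type    hil_args[MAX_ARGS];  /* HIL argument types, in order    */
    arg_type    ret;                 /* HIL return type or T_NONE       */
    size_t      id_argc;             /* number of LIL identity args     */
    arg_type    id_args[MAX_ARGS];   /* LIL identity argument types     */
    const char *lil_order;           /* emission order, e.g. "ir01"     */
} xlate_desc;

static int same_types(const arg_type *x, const arg_type *y, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}

/* Returns 1 when both operations could share one generated translator. */
int same_translation(const xlate_desc *a, const xlate_desc *b)
{
    assert(a != NULL && b != NULL);
    return a->hil_argc == b->hil_argc                         /* criterion 1 */
        && same_types(a->hil_args, b->hil_args, a->hil_argc)
        && a->ret == b->ret                                   /* criterion 2 */
        && a->id_argc == b->id_argc                           /* criterion 3 */
        && same_types(a->id_args, b->id_args, a->id_argc)
        && strcmp(a->lil_order, b->lil_order) == 0;           /* criterion 4 */
}
```

Under this encoding, two binary integer operations with identical signatures would share one translator, while an operation with no arguments and no return value would not.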
In summary, within the internal program representations used by the compiler, the inlining of assembly instructions may be implemented as special calls in the HIL that the front end generates. Every assembly instruction supported by inlining is defined as part of this intermediate language. When an inlined assembly instruction is encountered in the source, after performing error checking, the FE would emit, as part of the HIL, a call to the corresponding dedicated HIL routine. The CG then replaces each such call in the HIL with the corresponding machine instruction in the LIL which is then subject to optimizations by the LLO, without violating any associated serialization constraint specifiers (as discussed above). In addition to facilitating code generation from HIL to LIL, the table-driven approach advantageously assists code generation in other phases of the compiler. For example, and with reference again to FIG. 1, the table could also be extended to generate part of HLO 104 or LLO 107 for manipulating assembly intrinsics (or to generate libraries to be used by HLO 104 or LLO 107). This could be accomplished, for instance, by having the table provide semantic information on the intrinsics that indicates optimization freedom and optimization constraints. Although the greatest benefit comes from using the table for as many compiler stages as possible, this approach applies equally well to a situation in which only some of the compiler stages use the table--for example, where neither HLO 104 nor LLO 107 use the table. Although the preferred embodiment does Library Generation and Partial Code Generator Generation (as described above) at compiler-build time, it would not be substantially different for FE 102, CG 105, or some library to consult the table (or some translated form of the table) at compiler-run time instead. 
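The front-end checks enumerated earlier (supported instruction, argument count, argument types, constant ranges, and whether a serialization specifier is permitted) can be driven from a single table. The sketch below is illustrative only: the `intrinsic_desc` layout, the `check_call` and `effective_fence` helpers, the three sample entries, and the assumption that only `_Asm_FC` accepts a serialization specifier are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { A_INT, A_PTR, A_SIZE } arg_kind;
typedef enum { CHK_OK, CHK_UNKNOWN, CHK_ARGC, CHK_ARGTYPE,
               CHK_RANGE, CHK_FENCE } chk_result;

/* Hypothetical table entry; comments map fields to elements (a)-(d). */
typedef struct {
    const char *name;           /* (a) intrinsic name                 */
    const char *description;    /* (b) brief description              */
    size_t      argc;           /* (c) argument count ...             */
    arg_kind    args[3];        /*     ... and argument kinds         */
    int         has_return;     /* (d) simplified return information  */
    int         allows_fence;   /* serialization specifier permitted? */
    unsigned    default_fence;  /* mask used when none is supplied    */
} intrinsic_desc;

static const intrinsic_desc table[] = {
    { "_Asm_ADD",  "integer add",      2, { A_INT, A_INT },  1, 0, 0 },
    { "_Asm_LOAD", "load from memory", 2, { A_SIZE, A_PTR }, 1, 0, 0 },
    { "_Asm_FC",   "flush cache",      0, { A_INT },         0, 1, 0x3D3D },
};

/* One actual argument as seen by the front end. */
typedef struct { arg_kind kind; long const_value; } call_arg;

const intrinsic_desc *find_intrinsic(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}

chk_result check_call(const char *name, const call_arg *args,
                      size_t argc, int has_fence)
{
    const intrinsic_desc *d = find_intrinsic(name);
    size_t i;

    if (d == NULL)       return CHK_UNKNOWN;
    if (argc != d->argc) return CHK_ARGC;
    for (i = 0; i < argc; i++) {
        if (args[i].kind != d->args[i])
            return CHK_ARGTYPE;
        /* Size completers must be one of the three legal constants. */
        if (args[i].kind == A_SIZE &&
            (args[i].const_value < 1 || args[i].const_value > 3))
            return CHK_RANGE;
    }
    if (has_fence && !d->allows_fence)
        return CHK_FENCE;
    return CHK_OK;
}

/* Default serialization mask is taken from the table when omitted. */
unsigned effective_fence(const intrinsic_desc *d, int has_fence,
                         unsigned user_mask)
{
    assert(d != NULL);
    return has_fence ? user_mask : d->default_fence;
}
```

Adding support for another instruction in this sketch means adding one row to `table`; `check_call` itself does not change, which is the maintenance property the table-driven approach is meant to provide.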
Furthermore, although this approach has been disclosed to apply to assembly intrinsics, it could equally well be applied to any set of operations where there is at least one compiler stage that takes a set of operations in a regular form and translates them into another form, where the translation process can occur in a straightforward and automated fashion. Each time a new intrinsic operation needs to be added to the compiler, a new entry is added to the table of intrinsic operations. A compiler stage that relies on the table-driven approach usually need not be modified by hand in order to manipulate the new intrinsic operation (the exception is if the language in which the table itself is written has to be extended--for example, to accommodate a new argument type or a new return type; in such a case it is likely that compiler stages and automation that processes the table will have to be modified). Reducing the amount of code that must be written by hand makes it simpler and quicker to add support for new intrinsic operations, and reduces the possibility of error when adding new intrinsic operations. A further advantageous feature enabled by the present invention is that key library routines may now access machine instruction-level code so as to optimize run-time performance. Performance-critical library routines (e.g. math or graphics library routines) often require access to low-level machine instructions to achieve maximum performance on modern processors. In the current art, they are typically hand-coded in assembly language. As traditionally performed, hand-coding of assembly language has many drawbacks. It is inherently tedious, it requires detailed understanding of microarchitecture performance characteristics, it is difficult to do well and is error-prone, the resultant code is hard to maintain, and, to achieve optimal performance, the code requires rework for each new implementation of the target architecture. 
In a preferred embodiment of the present invention, performance-critical library routines may now be coded in high-level languages, using embedded machine instructions as needed. Such routines may then be compiled into an object file format that is amenable to cross-module optimization and linking in conjunction with application code that invokes the library routines. Specifically, the library routines may be inlined at the call sites in the application program and optimized in the context of the surrounding code.
With reference to FIG. 7, intrinsics defined in "inline.h" system header file SH_2 enable machine instructions to be embedded, for example, in math library routine source code 702_s. This "mathlib" source code 702_s is then compiled in accordance with the present invention into equivalent object code 702_o. Meanwhile, source code 701_s wishing to invoke the functionality of "mathlib" is compiled into object code 701_o in the traditional manner employed for cross-module optimization. Cross-module optimization and linking resources 704 then combine the two object codes 701_o and 702_o to create optimized executable code 705.
In FIG. 7, it should be noted that the math library is merely used as an example. There are other analogous high-performance libraries for which the present invention brings programming advantages, e.g., for graphics, multimedia, etc. In addition to easing the programming burden on library writers, the ability to embed machine instructions into source code spares the library writers from having to re-structure low-level hand-coded assembly routines for each implementation of the target architecture.
Floating Point ("_fpreg") Data Type
The description of a preferred embodiment has so far centered on the inventive mechanism disclosed herein for inlining machine instructions into the compilation and optimization of source code.
It will be appreciated that this mechanism will often be called upon to compile objects that include floating-point data types. A new data type is also disclosed herein, named "_fpreg" in the C programming language, which allows general programmatic access (including via the inventive machine instruction inlining mechanism) to the widest mode floating-point arithmetic supported by the processor. This data type corresponds to a floating-point representation that is as wide as the floating-point registers of the underlying processor. It will be understood that although discussion of the inventive data type herein centers on "_fpreg" as named for the C programming language, the concepts and advantages of the inventive data type are applicable in other programming languages via corresponding data types given their own names. A precondition to fully enabling the "_fpreg" data type is that the target processor must of course be able to support memory access instructions that can transfer data between its floating-point registers and memory without loss of range or precision. Depending on the characteristics of the underlying processor, the "_fpreg" data type may be defined as a data type that either requires "active" or "passive" conversion. The distinction here is whether instructions are emitted when converting a value of "_fpreg" data type to or from a value of another floating-point data type. In an active conversion, a machine instruction would be needed to effect the conversion whereas in a passive conversion, no machine instruction would be needed. In either case, the memory representation of an object of "_fpreg" data type is defined to be large enough to accommodate the full width of the floating-point registers of the underlying processor. The type promotion rules of the programming language are advantageously extended to accommodate the _fpreg data type in a natural way. 
For example, for the C programming language, it is useful to assert that binary operations involving this type shall be subject to the following promotion rules: 1. First, if either operand has type _fpreg, the other operand is converted to _fpreg. 2. Otherwise, if either operand has type long double, the other operand is converted to long double. 3. Otherwise, if either operand has type double, the other operand is converted to double. Note that in setting the foregoing exemplary promotion rules, it is assumed that the _fpreg data type which corresponds to the full floating-point register width of the target processor has greater range and precision than the long double data type. If this is not the case, then the first two rules may need to be swapped in sequence. Note also that in general, assuming type _fpreg has greater range and/or precision than type long double, it may be that the result of computations involving _fpreg values cannot be represented precisely as a value of type long double. The behavior of the type conversion from _fpreg to long double (or to any other source-level floating-point type) must therefore be accounted for. A preferred embodiment employs a similar rule to that used for conversions from double to float: If the value being converted is out of range, the behavior is undefined; and if the value cannot be represented exactly, the result is either the nearest higher or the nearest lower representable value. It will be further appreciated that the application and availability of the _fpreg data type is not required to be universal within the programming language. Depending on processor architecture and programmer needs, it is possible to limit availability of the _fpreg data type to only a subset of the operations that may be applied to other floating-point types. 
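The extended promotion rules can be written down as a small decision function. This is a sketch only: `fp_type` and `promote_binary` are hypothetical names, and the ordering assumes, as the text does, that `_fpreg` has greater range and precision than `long double`. The final fall-through case, where both operands are `float`, is not one of the three listed rules and simply follows ordinary C behavior.

```c
#include <assert.h>

/* Hypothetical tags for the source-level floating-point types. */
typedef enum { FP_FLOAT, FP_DOUBLE, FP_LONG_DOUBLE, FP_FPREG } fp_type;

/* Result type of a binary operation under the extended rules. */
fp_type promote_binary(fp_type lhs, fp_type rhs)
{
    assert(lhs <= FP_FPREG && rhs <= FP_FPREG);

    /* Rule 1: _fpreg wins over every other floating-point type. */
    if (lhs == FP_FPREG || rhs == FP_FPREG)
        return FP_FPREG;
    /* Rule 2: otherwise long double wins. */
    if (lhs == FP_LONG_DOUBLE || rhs == FP_LONG_DOUBLE)
        return FP_LONG_DOUBLE;
    /* Rule 3: otherwise double wins. */
    if (lhs == FP_DOUBLE || rhs == FP_DOUBLE)
        return FP_DOUBLE;
    /* Both operands are float (standard C behavior, not in the list). */
    return FP_FLOAT;
}
```

On a target where `long double` is wider than the floating-point registers, the first two tests would be swapped, as the text notes.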
To illustrate general programming use of this new data type, consider the following C source program that computes a floating-point `dot-product` (a·b+c):
double a, b, c, d;
main ( )
{
d = (a * b) + c;
}
where the global variable d is assigned the result of the dot-product. For this example, according to the standard "usual arithmetic conversion rule" of the C programming language, the floating-point multiplication and addition expressions will be evaluated in the "double" data type using double precision floating-point arithmetic instructions. However, in order to exploit greater precision afforded by a processor with floating-point registers whose width exceeds that of the standard double data type, the _fpreg data type may alternatively be used as shown below:
double a, b, c, d;
main ( )
{
d = ((_fpreg) a * b) + c;
}
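The effect of evaluating in a wider type can be observed in standard C by letting `long double` play the role of `_fpreg`. This is only an analogy under stated assumptions: `wide_t`, `madd_double` and `madd_wide` are hypothetical names, `long double` is not guaranteed to be wider than `double` on every platform, and the `volatile` temporaries are there solely to stop the compiler from fusing the multiply and add or keeping excess precision.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical stand-in for _fpreg. On some platforms long double has
 * the same format as double, in which case both functions agree. */
typedef long double wide_t;

/* Product is rounded to double before the addition. */
double madd_double(double a, double b, double c)
{
    volatile double product = a * b;
    return product + c;
}

/* Product is kept in the wide type; the result is converted to double
 * only at the end, mirroring d = ((_fpreg) a * b) + c. */
double madd_wide(double a, double b, double c)
{
    volatile wide_t product = (wide_t)a * b;
    return (double)(product + c);
}
```

For instance, with a = 1 + 2^-30, b = 1 - 2^-30 and c = -1, the exact product 1 - 2^-60 rounds to 1.0 in double, so `madd_double` returns 0, whereas a `long double` with at least a 64-bit significand retains the product and `madd_wide` returns -2^-60.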
Note here that the variable "a" of type double is "typecast" into an _fpreg value. Hence, based on the previously mentioned extension to the usual arithmetic conversion rule, the variables "a", "b", and "c" of "double" type are converted (either passively or actively) into "_fpreg" type values and both the multiplication and addition operations will operate in the maximum floating-point precision corresponding to the full width of the underlying floating-point registers. In particular, the intermediate maximum precision product of "a" and "b" will not need to be rounded prior to being summed with "c". The net result is that a more accurate dot-product value will be computed and round-off errors are limited to the final assignment to the variable "d".
Applying the foregoing features and advantages of the _fpreg data type to the inventive mechanism disclosed herein for inlining machine instructions, it will be seen that the parameters and return values of intrinsics specified in accordance with that mechanism may be declared to be of this data type when such intrinsics correspond to floating point instructions. For example, in order to allow source-level embedding of a floating-point fused-multiply add instruction:
fma fr4=fr1, fr2, fr3
that sums the product of the values contained in 2 floating-point register source operands (fr1 and fr2) with the value contained in another floating-point register source operand (fr3), and writes the result to a floating-point register (fr4), the following inline assembly intrinsic can be defined:
fr4=_fpreg _Asm_fma (_fpreg fr1, _fpreg fr2, _fpreg fr3)
Now, following the general programmatic example used above, this intrinsic can be used to compute a floating-point "dot-product" (a·b+c) in a C source program as follows:
double a, b, c, d;
main ( )
{
d = _Asm_fma (a, b, c);
}
where d is assigned the result of the floating-point computation ((a*b)+c). Note that the arguments to _Asm_fma (a, b, and c) are implicitly converted from type double to type _fpreg when invoking the intrinsic, and that the intrinsic return value of type _fpreg is implicitly converted to type double for assignment to d.
As discussed above, if type _fpreg has greater range and/or precision than type double, it may be that the result of the intrinsic operation (or indeed any other expression of type _fpreg) cannot be represented precisely as a value of type double. The behavior of the type conversion from _fpreg to double (or to any other source-level floating-point type, such as float) must therefore be accounted for. In a preferred embodiment, a similar rule is employed to that used for conversions from double to float: If the value being converted is out of range, the behavior is undefined; and if the value cannot be represented exactly, the result is either the nearest higher or nearest lower representable value.
If the result of the dot-product were to be used in a subsequent floating-point operation, it would be possible to minimize loss of precision by carrying out that operation in type _fpreg as follows:
double a, b, c, d, e, f, g;
main ( )
{
_fpreg x, y;
x = _Asm_fma (a, b, c);
y = _Asm_fma (e, f, g);
d = x + y;
}
Note that the results of the two dot-products are stored in variables of type _fpreg; the results are summed (still in type _fpreg), and this final sum is then converted to type double for assignment to d. This should produce a more precise result than storing the dot-product results in variables of type double before summing them. Also, note that the standard binary operator `+` is being applied to values of type _fpreg to produce an _fpreg result (which, as previously stated, must be converted to type double for assignment to d).
Conclusion
It will be further understood that the present invention may be embodied in software executable on a general purpose computer including a processing unit accessing a computer-readable storage medium, a memory, and a plurality of I/O devices.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
