System for compiling parallel communications instructions including their embedded data transfer information5355492Abstract The present invention is directed towards a compiler for processing parallel communication instructions on a data parallel computer. The compiler of the present invention comprises a front end, a middle end, an optimizer, and a back end. The front end constructs a parse tree which includes nodes representative of parallel communication instructions. The middle end generates an intermediate representation (IR) tree from the parse tree. The IR tree includes general parallel communication IR nodes representative of target code to carry out parallel communication with general communication. An efficient parallel communication module of the optimizer replaces general parallel communication IR nodes with grid parallel communication IR nodes where doing so would result in more efficient target code. The grid parallel communication IR nodes represent target code to carry out parallel communication instructions with grid communication. The back end generates target code from the optimized IR tree. Claims We claim: Description CROSS-REFERENCE TO OTHER APPLICATIONS
______________________________________
shape [16384]S;
int:S source1, source2, *destinationp;
int: i;
main( ) {
with(S)
*(destinationp + 2)=[. + (1+i)](source1*source2);
______________________________________
In the first example, the shape S has one dimension and 16,384 positions. Parallel variables source1 and source2 are of shape S and have integer elements. The third variable, destinationp, is a pointer to a parallel variable of shape S which has integer elements. The "with(S)" statement selects shape S as current. In the get instruction "*(destinationp+2)=[.+(1+i)](source1 * source2)", the destination expression is "*(destinationp+2)". Adding 2 to the pointer "destinationp" has the effect of offsetting the pointer by two positions. When used as a unary operator, "*" is a dereference operator. The dereference operator treats its operand as the address of the ultimate target, and accesses that address to fetch the contents. The destination expression therefore represents a parallel value which is the third element of the parallel array pointed to by destinationp. The left index expression consists of a single left index: ".+(1+i)". The "." is the dot token explained above. The parameter to the pcoord intrinsic function generated is the identifier of the left index in which the dot token occurred. The identifier of the first (or only) left index is 0. The function thus returns a parallel value of the current shape in which each element is initialized to its self coordinate along the 0 axis. In the expression "(1+i)", "i" is a scalar variable whose value cannot be determined at compile time. At run time, the sum "(1+i)" will be added to each element of the parallel value generated by the pcoord intrinsic function. Accordingly, the left index specifies an offset of 1+i. The source expression of the first example 810 is "source1 * source2", where source1 and source2 are parallel values. When used as a binary operator, "*" is the multiplication operator. When the source operands are parallel values, "*" returns a parallel value having in each position the product of the corresponding elements of its source operands. The source parallel value therefore represents a parallel value in which each element is the product of the corresponding element of source1 and of source2. The lexical analyzer 420 carries out the steps 612 and 613 of FIG. 6 to break the first example 810 into tokens. The syntax analyzer 422 caries out the steps 613 and 614 of FIG. 6 to group the tokens into phrases and check the grammar of the phrases. The lexical analyzer 420 and the syntax analyzer 422 generate the first parse tree 812 from the first example 810. The first, second and third parse trees 812, 814 and 816, respectively, have the following structure. Non-leaf nodes represent operators or function calls. Leaf nodes represent source operands (i.e., identifiers or constants), operations with no operands, or function parameters. The root is an assignment operator. The child of a node represents its operand, the operation performed to determine its operand, or its parameter. The left descendant of a node which represents an operator represents the left operand of the operator. The right descendant of a node which represents an operator represents the right operand of the operator. The central descendant of a node which represents an operand represents a sole operand of the operation. Descendants of a node which represents a function call represent parameters to the function. The root of the parse tree 812 is an "=" node 819, which represents the assignment operator "=" of the first example 810. The left descendants of the "=" node 819 represent the left-hand-side of the first example 810, and the right descendants of the "=" node 819 represent the right-hand-side. The operation performed to determine the right operand associated with the "=" node 819 is the left index of instruction. The right child of the "=" node 819 is therefore a left index node 820. The operation performed to determine the left operand associated with the left index node 820 is addition. The left child of the left index node 820 is therefore a "+" node 824. The left operand associated with the "+" node 824 is a dot token. The left child of the "+" node 824 is therefore a dot node 826. Because there are no operands to the dot token, the latter node is a leaf node. The operation performed to determine the right operand associated with the "+" node 824 is addition. The right child of "+" node 824 is therefore a "+" node 828. The left operand associated with the "+" node 832 is the number "1". The left child of the "+" node 832 is therefore a "1" node 834. The right operand associated with the "+" node 832 is the variable "i". The right child of the "+" node 832 is therefore an "i" node 836. Because they represent source operands, the "1" node 834 and the "i" node 836 are leaf nodes. The operation performed to determine the right operand associated with the left index node 820 is multiplication. The right child of the "+" node 819 is therefore a "*" node 844. The left operand associated with the "*" node 844 is the parallel variable source1. The left child of the "*" node 844 is therefore a "source1" node 846. Because it represents a source operand, the latter node is a leaf node. The right operand associated with the "*" node 844 is the parallel variable source2. The right child of the "*" node 844 is therefore a "source2" node 848. Because it represents a source operand, the latter node is a leaf node. The operation performed to determine the right operand associated with the "=" node 819 is dereferencing a pointer. The left child of the left "=" node 819 is therefore a "*" node 850. The dereference operation has a single operand. The operation performed to determine the operand is addition. The central child of the "*" node 850 is therefore a "+" node 852. The left operand associated with the "+" node 852 is the pointer destinationp. The left child of the "+" node 852 is therefore a destinationp node 854. Because it represents a source operand, the latter node is a leaf node. The right operand associated with the "+" node 852 is the number "2". The right child of the "+" node 852 is therefore a "2" node 854. Because it represents an source operand, the latter node is a leaf node. The dot handler 424 carries out steps 615 and 616 of FIG. 6 to detect the dot token in the first example 810 and replace it with an IR node for a call to the pcoord intrinsic function. The dot handler 424 generates the second parse tree 814 from the first parse tree 812 as follows. In the second parse tree 814, the dot node 826 has been replaced by a pcoord node 858. The parameter associated with the pcoord node 858 is the number 0. The central child of the pcoord node 850 is therefore a "0" node 860. Because it represents a function parameter, the latter node is a leaf node. The shape caster 426 carries out steps 618 and 620 of FIG. 6 to detect the combination of scalar and parallel operands in the left index of the first example 810 and promote "(1+i)" to enable it to be added to the value returned by the pcoord intrinsic function. Specifically, the shape caster 426 generates a promote operation having a left operand of "(signed int: current)" and a right operand of "(1+i)". The shape caster 326 generates the third parse tree 816 from the second parse tree 814 as follows. The third parse tree 816 is identical to the second parse tree 814 except for the right descendants of the "+" node 824. The right child of the "+" node 824 is a promote node 862. The left child of the type cast node 862 is a "(signed int: current)" node 864. Because it represents an operand, the latter node is a leaf node. The right child of the type cast node 862 is the "+" node 828. The right and left child of the "+" node 828 are described above. 7. IR Code Generation a. Structure and Method FIGS. 9-11 show flow charts which illustrate the details of step 516 of FIG. 5 (traversing the parse tree 318 to generate the IR tree 332). Looking at FIG. 9, in a step 910 the middle end 312 traverses the parse tree 318 until an expression is detected. In a step 912, the middle end 312 determines the type of expression detected. If the syntax of the expression matches a template for the send instruction, then in step 914 the middle end 312 generates the appropriate IR nodes for the send, as described in detail in FIG. 10 and the accompanying text. The template for the send instruction is a parallel left index expression which occurs on the left-hand side of an assignment. If the syntax of the expression matches a template for the get instruction, then in step 916 the middle end 312 generates the appropriate IR nodes for the get, as described in detail in FIG. 11 and the accompanying text. The template for the get instruction is a parallel left index expression which occurs as a source operand. If the expression is neither a send nor a get, then in step 918 the middle end 312 evaluates the operands of the expression to the extent possible at compile time. This evaluation could involve a recursive call to the step 910. Then, in a step 920, the middle end 312 generates the appropriate IR nodes for the instruction indicated by the expression. After the step 914, 916 or 920, C* compiler 154 flow of control returns to the calling routine (the step 515 of FIG. 5), as indicated by a step 922. FIG. 10 is a flow chart of the detailed method of processing the send instruction (step 914). In a step 1010, the middle end 312 evaluates the destination expression of the send instruction to the extent possible at compile time. The evaluation could involve a recursive call to the step 910 of FIG. 9. In step 1012, the middle end 312 evaluates the left indices of the left index expression of the send instruction to the extent possible at compile time. This evaluation also could involve a recursive call to the step 910. In a step 1014, the middle end 312 generates, for each left index of the send instruction, an IR node for a function of the Paris library 162 for computing a parallel value called "send.sub.-- address." The operands of each IR node are the shape of the evaluated destination expression and the evaluated left index for which send.sub.-- address is being computed. Each element of send.sub.-- address identifies an address on one of the parallel processors 112. Accordingly, each element indicates where to send the element at the corresponding position of the evaluated source expression. In a step 1016, the middle end 312 evaluates the source expression of the send instruction to the extent possible at compile time. Again, the evaluation may involve a recursive call to the step 910 of FIG. 9. In a step 1018, the middle end 312 generates an IR node for a function of the Paris library 162 for carrying out the send instruction with general communication. The operands of the send IR node are the evaluated destination expression, the send.sub.-- address(es) and the evaluated source expression. In step 1020, the middle end 312 generates an IR node for a function of the Paris library 162 for carrying out the get instruction with general communication. The source operand of the get instruction is the evaluated destination expression. The send.sub.-- address operand(s) of the get instruction is the same as that for the send instruction. The destination operand for the get instruction is a compiler-generated-temporary variable. The get instruction is performed to make the results of the send instruction available (through the compiler-generated temporary variable) to any other operation in which the send instruction is a source operand. C* compiler 154 flow of control then returns to the calling routine. FIG. 11 is a flow chart of the detailed method of processing the get instruction (step 916). In a step 1110, the middle end 312 evaluates the destination expression of the get instruction to the extent possible at compile time. The evaluation could involve a recursive call to the step 910 of FIG. 9. In a step 1112, the middle end 312 evaluates the source expression of the get instruction to the extent possible at compile time. Again, the evaluation may involve a recursive call to the step 910 of FIG. 9. In step 1114, the middle end 312 evaluates the left indices of the send instruction to the extent possible at compile time. This evaluation also could involve a recursive call to the step 910. In a step 1116, the middle end 312 generates, for each left index of the send instruction, an IR node for a function of the Paris library 162 for computing a send.sub.-- address. The operands of each of these IR nodes are the shape of the evaluated destination expression and the evaluated left index for which the send.sub.-- address is being computed. In a step 1118, the middle end 312 generates an IR node for the function of the Paris library 162 for carrying out the get instruction with general communication. The operands of the IR node are the evaluated destination expression, the send.sub.-- address(es) and the evaluated source expression. In step 1120, the middle end 312 generates an IR node for an operation to set the results of the get instruction to a compiler-generated temporary variable. Any instruction to which the get instruction is a source operand can then access the results through the compiler-generated temporary variable. C* compiler 154 flow of control then returns to the calling routine. b. Example Once the C* compiler 154 has constructed the third parse tree 816 for the first example 810, the middle end 312 traverses the third parse tree 816 in order to generate the nodes of the IR tree 332. For the purposes of illustration, the IR nodes generated for the first example 810 are classified into a destination evaluation IR block, a source evaluation IR block, a send IR block and a left index evaluation IR block, a NEWS to send translation IR block, a get IR block, and a set-to-temporary IR block. These IR blocks represent the output of the steps 1110, 1112, 114, 1116, 1118 and 1120 respectively, of FIG. 11. Note that only the IR nodes generated in the step 916 of FIG. 9 are discussed here. There would be additional IR nodes generated for the first example 810, for example, to create parallel variables. As the output of step 1110, the destination evaluation IR block consists of IR nodes to evaluate the destination expression. Specifically, the destination evaluation IR block generates code to evaluate the destination expression "*(destinationp+2)" by evaluating the subtree of the third parse tree 816 whose root is the "+" node 852.
______________________________________
1. MULTIPLY DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 2
SOURCE2 32
2. OFFSET DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE destinationp
AMOUNT CMC.sub.-- s.sub.-- temp.sub.-- 0
______________________________________
To facilitate explanation, the IR nodes of the destination IR block (as well as the IR nodes of the other IR blocks shown) have been numbered. The source operands of the addition represented by the "+" node 852 are destinationp and 2. Adding a number to a pointer has the effect of offsetting the pointer by that number of elements. Therefore, in the first example 810, the offset is the product of two and the size of the elements pointed to by destinationp. The latter are integers, which are assumed to be represented with 32 bits. The first IR node represents C/Paris code for calculating the amount of the offset. Specifically, the first node represents C/Paris code for multiplying a first source, 2, and a second source, 32, and storing the product in a destination temporary variable CMC.sub.-- s.sub.-- temp.sub.-- 0. The second IR node represents C/Paris code for offsetting a source, destinationp, by the amount stored in CMC.sub.-- s.sub.-- temp.sub.-- 0. The offset value is stored in a temporary variable called CMC.sub.-- s.sub.-- temp.sub.-- 1. As the output of step 1112, the source evaluation IR block consists of IR nodes representative of C/Paris code for evaluating the source expression. Specifically, the source evaluation IR block represents C/Paris code to evaluate the source expression "source1 * source2" by evaluating the subtree of the third parse tree 816 whose root is the "*" node 844. The source evaluation IR block consists of the IR node:
______________________________________
3. MULTIPLY DEST CMC.sub.-- p.sub.-- temp.sub.-- 1
SOURCE1 source1
SOURCE2 source2
______________________________________
The third IR node represents C/Paris code for multiplying a first source, source 1, and a second source, source2 and storing the product in a temporary destination variable CMC.sub.-- p.sub.-- temp.sub.-- 1 . The value of CMC.sub.-- p.sub.-- temp.sub.-- 1 is the source parallel value. As the output of step 1114 of FIG. 11, the left index evaluation IR block consists of IR nodes representative of C/Paris code for evaluating the left index expression. Specifically, the left index evaluation IR block represents C/Paris code to evaluate the left index ".+(1+i)" of the first example 810 by evaluating tile subtree of the third parse tree 816 whose root is the "+" node 824. The left index evaluation IR block consists of the IR nodes:
______________________________________
4. PCOORD DEST CMC.sub.-- p.sub.-- temp.sub.-- 3
AXIS 0
5. ADD DEST CMC.sub.-- s .sub.-- temp.sub.-- 2
SOURCE1 1
SOURCE2 i
6. PROMOTE DEST CMC.sub.-- p.sub.-- temp.sub.-- 4
SOURCE CMC.sub.-- s.sub.-- temp.sub.-- 2
7. ADD DEST CMC.sub.-- p.sub.-- temp.sub.-- 5
SOURCE1 CMC.sub.-- p.sub.-- temp.sub.-- 3
SOURCE2 CMC.sub.-- p.sub.-- temp.sub.-- 4
______________________________________
The fourth IR node represents C/Paris code for evaluating the subtree of the third parse tree 816 whose root is the pcoord node 858. Specifically, it represents C/Paris code for calling the pcoord intrinsic function with a parameter of 0 and storing the result of the pcoord intrinsic function in a temporary parallel value called CMC.sub.-- p.sub.-- temp.sub.-- 2. The fifth IR node represents C/Paris code for evaluating the subtree of the third parse tree 816 whose root is the "+" node 828. The evaluation is performed by adding the first operand, 1 to the second operand, i, and storing the sum in a temporary variable called CMC.sub.-- s.sub.-- temp.sub.-- 2. The sixth and seventh IR nodes generate C/Paris code for evaluating the subtree rooted with the "+" node 824. The C/Paris code represented by the sixth IR node promotes CMC.sub.-- s.sub.-- temp.sub.-- 2 (the sum calculated by the C/Paris code represented by the fifth IR node) to a parallel value and stores the result in a temporary variable called CMC.sub.-- p.sub.-- temp.sub.-- 4. The seventh IR node generates the C/Paris code for adding the promoted sum of "1+i" to the value returned by the pcoord intrinsic function. Specifically, the seventh IR node represents C/Paris code for adding a source CMC.sub.-- p.sub.-- temp.sub.-- 3 (the value returned by the function) to a source CMC.sub.-- p.sub.-- temp.sub.-- 4 (the promoted sum) and storing the result in a temporary variable called CMC.sub.-- p.sub.-- temp.sub.-- 5. The value of CMC.sub.-- p.sub.-- temp.sub.-- 5 is the reduced left index. The reduced left index is called a NEWS address. As the output of step 1116, the NEWS to send IR block consists of IR nodes which represent C/Paris code for translating the NEWS address into a send.sub.-- address. Specifically, the NEWS to send IR block represents C/Paris code for generating the appropriate "send.sub.-- address" from the sum of the value returned by the pcoord intrinsic function and the promoted sum of 1 and i into an address which uniquely identifies the source positions indicated by the NEWS address. The NEWS to send IR block consists of the IR node:
______________________________________
8. DEPOSIT.sub.-- NEWS.sub.-- COORDINATE DEST
CMC.sub.-- p.sub.-- temp.sub.-- 2
SHAPE S
SEND.sub.-- ADDRESS (null)
AXIS 0
COORDINATE CMC.sub.-- p.sub.-- temp.sub.-- 5
______________________________________
The eighth IR node represents C/Paris code for performing the NEWS-to-send translation. The first operand to the IR node is the shape of the destination parallel value, which is "S". The second operand is the send.sub.-- address for all other axes numbered lower than the axis for which the send.sub.-- address is sought. Because the source parallel value of the first example 810 is of rank one, there is only one left index. The second operand is therefore null. The third operand is an identifier of the axis for which the send.sub.-- address is sought. The identifier of the first (or the only) axis of a parallel value is 0. Therefore, the third operand is 0. The fourth operand is the NEWS address for the axis for which the send.sub.-- address is sought. The NEWS address for the only left index of the first example 810 is the value calculated by the seventh IR node and stored in CMC.sub.-- p.sub.-- temp.sub.-- 5. As the output of step 1118, the get IR block consists of an IR node representative of the C/Paris code for getting copies of data from the destination parallel value to the addresses specified by the send.sub.-- address. Specifically, the get IR block represents the C/Paris code for getting copies of the elements of the parallel value "*(destinationp+2)" to positions of the parallel value "source1 * source2" which are specified by the left index "[.+(1 +i)]". The C/Paris code uses the results calculated in the steps 1110, 1112 and 1116, to evaluate the "=" node 819 of the third parse tree 816. As the output of the step 1118, the get IR block consists of the following IR nodes:
______________________________________
9. GET DEST CMC.sub.-- p.sub.-- temp.sub.-- 6
SEND.sub.-- ADDRESS CMC.sub.-- p.sub.-- temp.sub.-- 2
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- 1
COLLISION.sub.-- MODE 1
10. DEREFERENCE DEST CMC.sub.-- s.sub.-- temp.sub.-- ret.sub.--
val.sub.-- 0
SOURCE CMC.sub.-- s.sub.-- temp.sub.-- 1
11. ASSIGN DEST CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.--
0
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- 6
______________________________________
The ninth IR node represents the C/Paris code for carrying out the get. The first operand to the IR node is the send.sub.-- address for the left index which corresponds to the axis on which tile get is to be performed. The send.sub.-- address for the sole left index of the first example 810 was calculated by the eighth IR node. The first operand is therefore CMC.sub.-- p.sub.-- temp.sub.-- 2, the destination of the eighth IR node. The second operand is the source, i.e., the destination parallel value calculated by the third IR node. The second operand is therefore CMC.sub.-- p.sub.-- temp.sub.-- 1, the destination of the eighth IR node. The third operand specifies how get collisions are handled. A get collision occurs when send.sub.-- address two or more send.sub.-- address elements are the same. The collision mode of 1 specifies that collisions are allowed to occur. The destination of the ninth IR node is a temporary variable called CMC.sub.-- p.sub.-- temp.sub.-- 6. The tenth IR node represents C/Paris source code for evaluating the subtree of thee third parse tree 816 whose root is the "*" node 838 so as to dereference the pointer to tile destination parallel value. Specifically, the tenth IR node represents C/Paris code for dereferencing a source, CMC.sub.-- s.sub.-- temp.sub.-- 1 (the result associated with the second IR node), and storing the dereferenced value in a temporary variable called CMC.sub.-- s.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 0. The eleventh IR node represents C/Paris code for assigning the result of the get to the value dereferenced by the C/Paris code represented by the tenth IR node. Specifically, the eleventh IR node assigns the value of CMC.sub.-- p.sub.-- temp.sub.-- 6 to CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 0. Note that CMC.sub.-- p.sub.-- temp.sub.-- 6 is the destination operand of the ninth node, and CMC.sub.-- p.sub.-- temp.sub.-- rat.sub.-- val.sub.-- 0 is the destination operand of the tenth node. As the output of the step 1130, the set-to-temporary IR block consists of the following node.
______________________________________
12. DEREFERENCE DEST CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.--
val.sub.-- 1
SOURCE CMC.sub.-- s.sub.-- temp.sub.-- 1
______________________________________
The twelfth IR node, like the tenth IR node, generates C/Paris code for dereferencing the result associated with the second IR node. Specifically, the twelfth IR node generates C/Paris code for dereferencing CMC.sub.-- s.sub.-- temp.sub.-- 0 and storing the result in a temporary variable called CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 1. The results of the get instruction would thereby be available to any other instruction through CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 1. IR tree generation for the send instruction is similar to that for the above-described get instruction. For example, the IR nodes generated for an example send instruction [.+(1+i)]*(destinationp+2)=source1 * source2 are as follows:
______________________________________
MULTIPLY DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 2
SOURCE2 32
OFFSET DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE destinationp
AMOUNT CMC.sub.-- s.sub.-- temp.sub.-- 0
DEREFERENCE DEST CMC.sub.-- s.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 0
SOURCE CMC.sub.-- s.sub.-- temp.sub.-- 1
PCOORD DEST CMC.sub.-- p.sub.-- temp.sub.-- 2
AXIS 0
ADD DEST CMC.sub.-- s.sub.-- temp.sub.-- 2
SOURCE1 1
SOURCE2 i
PROMOTE DEST CMC.sub.-- p.sub.-- temp.sub.-- 3
SOURCE CMC.sub.-- s.sub.-- temp.sub.-- 2
ADD DEST CMC.sub.-- p.sub.-- temp.sub.-- 4
SOURCE1 CMC.sub.-- p.sub.-- temp.sub.-- 2
SOURCE2 CMC.sub.-- p.sub.-- temp.sub.-- 3
DEPOSIT.sub.-- NEWS.sub.-- COORDINATE DEST CMC.sub.-- p.sub.-- temp.sub.--
1
SHAPE S
SEND.sub.-- ADDRESS (null)
AXIS 0
COORDINATE CMC.sub.-- p.sub.-- temp.sub.-- 4
MULTIPLY DEST CMC.sub.-- p.sub.-- temp.sub.-- 5
SOURCE1 source1
SOURCE2 source2
SEND DEST CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 0
SEND ADDRESS CMC.sub.-- p.sub.-- temp.sub.-- 1
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- 5
COMBINER 0
NOTIFY (null)
GET DEST CMC.sub.-- p.sub.-- temp.sub.-- 6
SEND ADDRESS CMC.sub.-- p.sub.-- temp.sub.-- 1
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- ret.sub.-- val.sub.-- 0
COLLISION.sub.-- MODE 1
______________________________________
5. Parallel Communication Instruction Optimization a. Overview of Efficient Parallel Communication The middle end 312 generates IR nodes for canning out parallel communication instructions with general communication. If the data parallel computer 110 on which the target code will execute is capable of performing grid communication or of emulating grid communication, then the optimization described in FIGS. 12-15 and the accompanying text is performed. This optimization involves replacing the above IR nodes with IR nodes for canning out parallel communication with grid communication when doing so is possible and would result in more efficient target code. In the C/Paris code, the parallel communication instructions are represented by calls to functions in the C* library 160 or the Paris library 162. At run time, these functions invoke the Paris instruction set 164. Two such functions, called news.sub.-- or.sub.-- send and news.sub.-- or.sub.-- get are in the C* library 160. These functions determine whether grid communication is possible and, if so, whether it would be more efficient than general communication for a particular instruction in a particular program. If grid communication is possible and would be more efficient than general communication, the functions carry out the specified parallel communication instruction via grid communication. Otherwise, they carry out the specified instruction via general communication. If the determination as to whether grid communication or general communication should be used can be made at compile time, the associated program will run faster. If it is determined at compile time that grid communication should be used, the C* library 160 functions called "to.sub.-- torus" and "from.sub.-- torus" can be used. These functions perform the specified parallel communication instruction via grid communication with a shift of a specified offset along a specified axis. The Paris instruction set 164 can only be invoked to carry out grid communication with an offset which is a power of two. The functions to.sub.-- torus and from.sub.-- torus therefore carry out the specified communication by decomposing the offsets into one or more power-of-two offsets (called offset elements) and invoking the Paris instructions set 164 for each offset element. A run time system need not be called to calculate the offset elements if they are calculated at compile time. Accordingly, to speed execution of get instructions in which the offset is a constant, the C* compiler 154 calculates the offset elements at compile time and, for each offset element, generates a call to a function in the Paris library 162 called get.sub.-- from.sub.-- power.sub.-- two. This function carries out parallel communication for the specified power-of-two offset in a specified dimension and along a specified axis. This function, as well as other functions in the Paris library 162 are described in greater detail in the above-cited document entitled Paris Reference Manual for Version 6.0 of Paris. Grid communication among nearest neighbor parallel processors 112 whose grid location differs by one is faster than grid communication among other nearest-neighbor parallel processors 112. Therefore, the Paris library 162 includes functions called get.sub.-- from.sub.-- news and send.sub.-- to.sub.-- news. These functions carry out the specified parallel communication instruction with an offset of 1 or -1 in a specified dimension. b. Structure and Method FIG. 12 shows a more detailed block diagram of the structure of the efficient parallel communication module 338 of FIG. 3. The efficient parallel communication module 338 comprises a send/get locator 1210, a left index analyzer 1212, a minimal set determiner 1214, a grid send code generator 1216, a grid get code generator 1218, a shape comparator 1220 and a threshold comparator 1222. FIGS. 13A and 13B show a flowchart of the interaction between the submodules of the efficient parallel communication module 338. The flowchart of FIGS. 13A and 13B also shows the detailed method of the step 312 of FIG. 4. Looking at FIG. 13A, in a step 1308 the send/get locator 1210 determines whether there are any parallel communication instructions in the IR tree 332 which have not yet been processed. If not, then C* compiler 154 flow of control returns to the calling step (414 of FIG. 4), as indicated by the step 1309. If there is an additional parallel communication instruction to be processed, then in step 1314, the left index analyzer 1212 determines whether there are additional left indices in the left index expression of the parallel communication instruction to process. A negative determination indicates that grid communication IR nodes were generated for the parallel communication instruction. There could be a step to delete the associated general communication IR nodes replaced by the grid communication IR nodes. Alternatively, the optimizer 313 could employ conventional optimization techniques to recognize that the results of the general communication instructions are no longer used and delete the associated IR nodes. In a step 1316, C* compiler 154 flow of control returns to the step 1308. If, on the other hand, there are additional left indices to process, then in a step 1318 a temporary variable called "next" is set to point to the next left index of the left index expression. In a step 1320, the left index analyzer 1212 determines whether "next" is of a form which indicates that grid communication is possible. As stated, grid communication is not possible unless, among other conditions, there is a single offset between each of the source positions and the corresponding destination positions. This condition is assured to be met for a left index of the form: [(./pcoord(left.sub.-- index.sub.-- identifier){.+-.scalar.sub.-- int.sub.-- expression}){%% dimof(current.sub.-- shape,left.sub.-- index.sub.-- identifier)}] In the representation above, expressions written immediately to the left and right of a "/ " are alternatives. Expressions within braces are optional. If the left index consisted only of a dot token or a call to the pcoord intrinsic function with a parameter identifying the left index, then the offset would be zero. Grid communication would therefore be possible. Grid communication would also be possible if the left index further include instructions to add to or subtract from the results of the dot token or pcoord intrinsic function a scalar integer expression (represented by ".+-.scalar.sub.-- int.sub.-- expression"). If such instructions were included, the result of the addition or subtraction would specify one offset between each of the source positions and the corresponding destination positions. Grid communication would further be possible if the left index further included instructions to determine the above sum or difference modulo the number of positions in the dimension addressed by the left index currently being processed (represented by "%% dimof(current.sub.-- shape, left.sub.-- index.sub.-- identifier)"}]. The "dimof" function returns the number of positions in the specified dimension of the specified parallel variable (or shape). Because the data parallel computer 110 provides for toroidal communication between the parallel processors 112, the results of the modulo would specify one offset between each of the source positions and the corresponding destination positions. If "next" were not of a form which indicates that grid communication is possible, then the general communication IR nodes are not replaced by grid communication IR nodes. Grid communication IR nodes may have been generated for previous left indices of the parallel communication instruction being processed. There could be a step to delete the grid communication IR nodes. Alternatively, the optimizer 313 could employ conventional optimization techniques to recognize that the results of the general communication instructions are no longer used and delete the grid communication IR nodes. Or, such IR nodes could be avoided if the determination of the step 1320 were made for all left indices before further generating IR nodes for any left index. C* compiler 154 flow of control then returns to the step 1308. If, on the other hand, "next" is of a form which indicates that grid communication is possible, then in a step 1322, the send/get locator 1210 determines whether the parallel communication instruction is a send instruction with an offset whose absolute value is greater than one. If so, then in a step 1324 the grid send IR node generator 1216 generates a call to the news.sub.-- or.sub.-- send function of the C* library 160. The steps 1322 and 1324 are necessary because the Paris library 162 lacks a function for the send instruction equivalent to the get.sub.-- from.sub.-- power.sub.-- two function (discussed above) for the get instruction. As discussed with regard to the step 1316, there could then be a step to delete any grid communication IR nodes which may have been generated for previous left indices. Alternatively, such instructions could be deleted by the submodule of the optimizer 318 which deletes instructions it determines to be nowhere used. C* compiler 154 flow of control then returns to the step 1308. If the determination of step 1322 was negative, then in a step 1326 the left index analyzer 1212 determines whether the offset of "next" is a constant. If the offset is not a constant, then it cannot be determined at compile time whether replacing the general communication instructions with grid communication instructions would cause the parallel communication instruction to execute faster. The replacement may not result in increased performance if the offset could not be decomposed into a small number of power-of-two offsets. The number of power-of-two offsets at which grid communication becomes slower than general communication is hardware dependent. This number is called the threshold. If the threshold is quite high, it may be desirable to bypass the step 1326. If the step 1326 was not bypassed and the offset was not a constant, then the determination of whether or not to carry out the parallel communication operation by general communication or grid communication is postponed until run time. Accordingly, in a step 1327 the send/get locator 1210 determines whether the parallel communication instruction was a send instruction or a get instruction. If it was a send instruction, then C* compiler 154 flow of control goes to the step 1324, which is explained above. If, on the other hand, the instruction was a get instruction, then in a step 1328 the grid get IR node generator 1218 generates a call to the news.sub.-- or.sub.-- get function of the C* library 160 (described above). C* compiler 154 flow of control then returns to the step 1308 which is explained above. But if the offset of "next" is a constant or if the step 1326 was bypassed, then in a step 1330 the shape comparator 1220 determines whether the shapes of the source parallel value and the destination parallel value are equivalent as follows. First, comparison shapes of the source and destination are determined. The comparison shape of the source (if the parallel communication instruction is a send) or the destination (if the parallel communication instruction is a get) is the shape which was current at the point of the program where the parallel communication instruction occurred. The comparison shape of the destination (for a send) or the source (for a get) is the shape which the variable was cast to be or declared to be. Second, the comparison shapes are compared for equivalence. The method for determining shape equivalence is described in more detail in the pending U.S. patent application Ser. No. 07/788,003, filed Nov. 5, 1991, entitled System and Method for Parallel Variable Optimization, now pending. This document is hereby incorporated by reference as if set forth in full below. A high-level description of the method follows. The comparison shapes are deemed equivalent if their names are the same. The comparison shapes are also deemed equivalent if one was assigned to the other in the program. Note that for such an assignment, the two shapes should have the same configuration, that is, the same number of dimensions and the same number of elements in each dimension. If shapes of the source and destination parallel values are deemed not to be equivalent, C* compiler 154 flow of control goes to the step 1327, which is explained above. If the shapes of the source and destination parallel values are deemed equivalent, and if the step 1326 was eliminated, then the grid get IR node generator 518 generates an IR node for a call to the from.sub.-- torus function. The function is part of the C* library 160 and is described above in the section entitled "Overview of Efficient Parallel Communication". If the step 1326 was not bypassed, then in a step 1332 the minimal set determiner 1214 decomposes the offset so as to generate a minimal set. The offset elements of the minimal set are the offsets for the least number of steps necessary to effect the grid communication. Accordingly, the absolute value of each offset element is a power of two, and the sum of the offset elements is the offset of the parallel communication operation. The method by which the minimal set determiner 1214 performs the decomposition is described in more detail in FIGS. 14A, 14B, 15A and 15B and the accompanying text. Next, in a step 1334 on FIG. 13B, the threshold comparator 1222 compares the number of offset elements in the minimal set to the threshold value. As with the step 1326, if the threshold for the data parallel computer 110 on which the parallel communication instruction will be carried out is high, the step 1334 may be bypassed. If the step 1334 is not bypassed and if the number of paths exceeds the threshold, then C* compiler 154 flow of control returns to the step 1308, which is explained above. Otherwise, in a step 1336, the grid IR node get node generator 1218 generates, for each offset element in the minimal set, an IR mode for the appropriate Paris grid communication function. FIG. 16 and the accompanying text describe the method of carrying out the step 1336 in greater detail. C* compiler 154 flow of control then goes to the step 1314, which is explained above. c. Example The efficient parallel communication module 338, as well as other components of the IR tree optimizer 313, optimize the IR nodes of the first example 10 to generate the following optimized IR nodes.
______________________________________
13 LEFT.sub.-- SHIFT DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 32
SOURCE2 1
14 OFFSET DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE destinationp
AMOUNT CMC.sub.-- s.sub.-- temp.sub.-- 0
15 MULTIPLY DEST CMC.sub.-- p.sub.-- temp.sub.-- 1
SOURCE1 source1
SOURCE2 source2
16 ADD DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 1
SOURCE2 i
17 FROM.sub.-- TORUS.sub.-- DIM DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- 1
AXIS 0
DISTANCE CMC.sub.-- s.sub.-- temp.sub.-- 0
______________________________________
Conventional optimization replaces the multiplication of the first IR node with the thirteenth IR node. The latter node specifies a left shift of 32 by 1 bit, the result of which is equivalent to multiplying 32 and 2. Specifically, the thirteenth node left-shifts the binary representation of a first source, 32, by the number of bits specified by a second source, 1, and stores the result in a destination temporary variable CMC.sub.-- s.sub.-- temp.sub.-- 0. The fourteenth IR node is the same as the second (unoptimized) IR node. The fifteenth IR node is the same as the third (unoptimized) IR node. The sixteenth IR node is the same as the fifth (unoptimized) IR node except that the temporary variable CMC.sub.-- s.sub.-- temp.sub.-- 0 is re-used for the destination, as its previously determined value (the product of 32 and 2) is used only for the offset operation of the fourteenth node. The reuse of temporary parallel variables is described in greater detail in the above-referenced U.S. Patent Application entitled "System and Method for Parallel Variable Optimization". In carrying out step 512 (determining whether the get operation of the first example 810 can be effected with grid communication), the IR tree optimizer 512 of FIG. 5 analyzes the left index ".+(i+1)". This left index is of the form [(./pcoord(left.sub.-- index.sub.-- identifier){.+-.scalar.sub.-- int.sub.-- expression}){%% dimof(current.sub.-- shape, left.sub.-- index.sub.-- identifier)}] Accordingly, the get left index analyzer 1212 of FIG. 12 recognizes that grid communication is possible. The efficient parallel communication module 338 which optimized the first example 810 did not include the step 1326 of FIG. 13A. Therefore, the seventeenth IR node is generated to call the from.sub.-- torus function of the Paris library 162. At run time, the function performs the decomposition and carries out the get with grid communication. The first operand to the seventeenth IR node is a pointer to the source parallel value. The pointer is the product calculated by the fifteenth IR node. The first operand is therefore CMC.sub.-- p.sub.-- temp.sub.-- 1, the destination of the fifteenth IR node. The second operand specifies the axis in which the send is to be performed. In the first example 810, the shape of the destination parallel value has only a 0 dimension. Therefore, the second operand is 0. The third operand is the distance between the processors on which the source positions reside and the processors on which the destination positions reside. The third operand is therefore CMC.sub.-- s.sub.-- temp.sub.-- 0, the scalar integer calculated by the sixteenth IR node. The destination operand is CMC.sub.-- s.sub.-- temp.sub.-- 1, the destination parallel value calculated by the fourteenth IR node. The seventeenth IR node replaces the eighth, ninth, tenth and eleventh (unoptimized) IR nodes. Therefore, the IR tree optimizer 313 deletes these nodes. In the first example 810, the results of the get did not occur as source operands in any other operation. Therefore, the IR tree optimizer 313 deletes the twelfth IR node. IR tree optimization node generation for the send instruction is similar to that for the above-described get instruction. For example, the IR nodes of the optimized IR tree generated for the example send instruction [.+(1+i)]*(destinationp+2)=source1 * source2 are as follows:
______________________________________
LEFT.sub.-- SHIFT DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 32
SOURCE2 1
OFFSET DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE destinationp
AMOUNT CMC.sub.-- s.sub.-- temp.sub.-- 0
ADD DEST CMC.sub.-- s.sub.-- temp.sub.-- 0
SOURCE1 1
SOURCE2 i
MULTIPLY DEST CMC.sub.-- p.sub.-- temp.sub.-- 5
SOURCE1 source1
SOURCE2 source2
NEWS.sub.-- OR.sub.-- SEND DEST CMC.sub.-- s.sub.-- temp.sub.-- 1
SOURCE CMC.sub.-- p.sub.-- temp.sub.-- 5
AXES 1
DISTANCES CMC.sub.-- s.sub.-- temp.sub.-- 0
DIMENSIONS 0 0
RUNTIME 1
______________________________________
6. Decomposition of Offsets into Grid Communication Steps FIGS. 14A, 14B, 15A, and 15B show flowcharts for two detailed methods of carrying out step 1332 of FIG. 13A (decomposing an offset into a minimal set of powers-of-two). The flow charts of FIGS. 14A, 14B, 15A and 15B thus also depict the operation of the minimal set determiner 1214 of FIG. 12. The method of FIGS. 14A and 14B is essentially as follows. A set called "minimal.sub.-- set" is generated. The minimal.sub.-- set contains the largest offset elements whose total distance is the length of the offset and whose direction is the same as that of the offset. As explained above, the distance of an offset element must be a power of two. The direction of an offset elements is indicated by its sign (positive or negative). All but the last offset element in the minimal.sub.-- set are selected as "next" and processed as follows. First, a second set (called "set2") is generated. The first offset element of set2 is the offset element greater having the smallest distance greater than the distance of "next" and having the same direction as that of "next". The remaining offset elements of set2 are such that their direction is the opposite of that of "next" and their total distance is the difference between the distance of "next" and the distance of the offset. Second, the number of offset elements in set2 is compared to the number of offset elements in a subset of minimal.sub.-- set consisting of "next" and the offset elements following it in minimal.sub.-- set. If the former number is smaller than the latter, the subset of minimal.sub.-- set is replaced with the offset elements of set2. Looking at FIG. 14A, in a step 1410 the minimal set determiner 1214 determines whether there are more left indices to process. If not, then in a step 1412, C* compiler 154 flow of control returns to the calling routine (the step 1332 of FIG. 13A). Otherwise, in a step 1414 a temporary variable called offset is set to the offset of the next left index. In a step 1416, a temporary variable called minimal.sub.-- set is set to the largest powers-of-two whose sum is the absolute value of offset, arranged in descending order. Note that the offset elements of minimal.sub.-- set are the powers of two represented by the 1-bits of a binary representation of the value of offset. In a step 1418, a temporary variable called index is set to 0. In a step 1420, the minimal set determiner 1214 determines whether (index+1) is less than the number of offset elements in minimal.sub.-- set. The purpose of the step 1418 is to determine whether there are any more offset elements left to process. If not, C* compiler 154 flow of control returns to the step 1410, which is explained above. Otherwise, in a step 1422, a temporary variable called "next" is set to the offset element of minimal set indicated by index. Looking at FIG. 14B, in a step 1424, a temporary variable called overshot is set to the smallest power of two greater than the absolute value of "next". In a step 1426, offset is set to overshot minus offset. In a step 1428, a temporary variable called set2 is set to the largest powers of two whose sum is the value of offset, arranged in descending order. As with step 1416, the offset elements of set2 are the powers of two represented by the 1-bits of a binary representation of the value of offset. In a step 1430, the minimal set determiner 1214 determines whether the number of offset elements in minimal.sub.-- set exceeds the sum of: (1) the number of offset elements in minimal.sub.-- set which precede the offset element indicated by "next", (2) 1, and (3) the number of offset elements in set2. If not, then in a step 1431, offset is set to the absolute value of the sum of the offset elements of minimal.sub.-- set which follow "next". Next, in a step 1432, the index is incremented. C* compiler 154 flow of control then returns to the step 1420 of FIG. 14A, which is discussed above. If, on the other hand, the determination of the step 1430 was "yes", then in a step 1434 the minimal set determiner 1214 determines whether the value of "next" is negative. If not, then in a step 1436, each offset element of set 2 is negated. If the value of "next" is negative, then in a step 1438, the value of overshot is negated. After carrying out the step 1436 or the step 1438, then in a step 1440 the offset element of minimal.sub.-- set indicated by index is replaced with the value of overshot. In a step 1442, the offset elements of minimal.sub.-- set which follow the offset element indicated by index are replaced with the offset elements of set2. C* compiler 154 flow of control then goes to the step 1431, which is explained above. FIGS. 15A and 15B are a flow chart of an alternative method of calculating the minimal set. The alternate method employs a binary tree. The root of the tree contains the offset. The tree is constructed in a breadth-first manner. In the tree, an arc between a parent and its left child (called a left arc) is assigned the greatest power-of-two value which is not greater than the value of the parent. The left child is assigned the value of the parent minus the value assigned to the left arc. If the value of the parent is negative, then the value of the left arc is negated. An arc between a parent and its right child (called a right arc) is assigned the smallest power-of-two value which is greater than the absolute value of the value of the parent. The right child is assigned the value of the parent minus the value of the right arc. If the value of the parent was negative, the value of the right arc is negated. The value assigned to the left and right arcs then are the power-of-two values which bracket the parent connected thereto. The binary tree is expanded downwardly, using each child as a new parent, until a child is assigned the value zero. The minimal set is then determined by reading the values on the arcs in the path from that child to the root. The tree can be stored in a simple array. Looking at FIG. 15A, in a step 1508, the root is set the value of the offset. In a step 1510, the parent is set to point to the root. In step 1512, a left arc is created for the node pointed to by parent, and it receives a value set to the largest power-of-two value that is not greater than the value of the node pointed to by the parent pointer. In step 1514, the value of the node pointed to by the parent pointer is compared to zero. If the value of the node pointed to by the parent pointer is less than zero, then in a step 1516 the value of the left arc is negated. Otherwise, or after the step 1516, in a step 1518 the value of the left child is set to the value of the node pointed to by the parent pointer minus the value of the left are. In a step 1520, the value of the left child is compared to zero. If the value of the left child is zero, then in a step 1522 the minimal set is set to the values on the arcs on the path from the left child to the root. C* compiler 154 flow of control then returns to the calling routine, as indicated by the step 1524. If, on the other hand, the left child is not equal to zero, then in a step 1526 a right are is created for the node pointed to by parent, and it receives a value set to the smallest power-of-two value that is not less than the value of the node pointed to by the parent pointer. In a step 1528, the value of the node pointed to by the parent is compared to zero. If the value of the node pointed to by the parent is less than zero, then in a step 1530 the value of the right arc is negated. Otherwise, or after the step 1530, in a step 1532 the value of the right child is set to the value of the node pointed to by the parent pointer minus the value of the right arc. In a step 1534, the value of the right child is compared to zero. If the value is less than zero, then in a step 1536 the minimal set is set to the values on the arcs on the path from the right child to the root. C* compiler 154 flow of control then returns to the calling routine, as indicated by a step 1538. If, on the other hand, right child is not equal to zero, then in a step 1540 (on FIG. 15B) it is determined whether the last assignment to the parent was either the offset or the right child. If so, then in a step 1542 the parent is set to the left child. Otherwise, in a step 1544 the parent is set to the right child. After the execution of the step 1542 or the step 1544, C* compiler 154 flow of control returns to the step 1512, which is explained above. The purpose of the steps 1540, 1542 and 1544 is to ensure that expansion of the binary tree is breadth-first. Accordingly, evenly other time the step 1540 is executed (beginning with the first time) the binary tree is expanded to the left. Correspondingly, every other time the step 1540 is executed (beginning with the second time) the binary tree is expanded to the right. 7. Operation of News.sub.-- or.sub.-- Send and News.sub.-- or.sub.-- Get Routines FIG. 16 is a flow chart of the method of the news.sub.-- or.sub.-- send function and the news.sub.-- or.sub.-- get function. If a news.sub.-- or.sub.-- send or news.sub.-- or.sub.-- get IR node was generated by the grid send IR node generator 1216 or the grid get IR node generator 1218 of FIG. 12 (see steps 1324 and 1328 of FIG. 13B), then at run time the data parallel computer 110 carries out the news.sub.-- or.sub.-- send or news.sub.-- or.sub.-- get. When running the news.sub.-- or.sub.-- send function, the data parallel computer 110 carries out a send via grid communication if the host computer 146 determines that doing so would be possible and efficient, and by general communication otherwise. When running the news.sub.-- or.sub.-- get function, the data parallel computer 110 performs a get via grid communication if the host computer 146 determines that doing so would be both possible and efficient, and by general communication otherwise. Both of these functions are in the C* library 160. Looking at FIG. 16, in a step 1610, the shape of the source parallel value and the destination parallel value of a parallel communication instruction are compared. If the shapes are not equal then grid communication is not used. Therefore, in a step 1612 the Paris instruction set 164 is invoked to carry out the parallel communication instruction with general communication. Note that news.sub.-- or.sub.-- send is associated with the send instruction and news.sub.-- or.sub.-- get is associated with the get instruction. In a step 1614, the function terminates. If, on the other hand of the source parallel value and destination parallel value are of equivalent shapes, then in a step 1616 the news.sub.-- or.sub.-- send or news.sub.-- or.sub.-- get function determines whether there are more left indices to process. If not, then in a step 1618, the function terminates. Otherwise, in a step 1622 the minimal set determiner 1214 decomposes the offset of the next left index so as to generate a minimal set. The offset elements of the minimal set are the offsets for the least number of grid communication steps necessary to effect the grid communication. The operation of the minimal set determiner 1214 is explained in greater detail in FIGS. 14A, 14B, 15A and 15B and the accompanying text. In a step 1624, the number of offset elements in the minimal set is compared to the threshold. If the number of offset elements exceeds the threshold, then C* compiler 154 flow of control goes to the step 1612, which is explained above. As with the step 1326, the step 1624 could be bypassed if the hardware environment had a high threshold. If the step 1624 were bypassed or if the number of offset elements did not exceed the threshold, then in a step 1626, for each offset element in the minimal set, the news.sub.-- or.sub.-- send or news.sub.-- or.sub.-- get function invokes the Paris instruction set 164 to carry out grid communication. The step 1626 is explained in greater detail in FIG. 17 and the accompanying text. C* compiler 154 flow of control then returns to the step 1616, which is explained above. 8. Generation of Paris Calls for Minimal Set FIG. 17 is a flow chart of the method of generating calls to the appropriate functions of the Paris library 162 to carry out grid communication with the offsets specified by the minimal set. Accordingly, the flow chart of FIG. 17 illustrates the detailed method of the step 1336 of FIG. 13B. The flow chart also describes the operation of the grid send IR node generator 1216 and the grid get IR node generator 1218. Note that method of the flow chart of FIG. 17 is similar to the detailed method of the step 1626 of FIG. 16, except that the latter is performed at run time rather than at compile time. Looking at FIG. 17, in a step 1708 a temporary variable called temp.sub.-- source is set to the source parallel value of the parallel communication instruction. Next, in a step 1710 the left index expression is analyzed to determine whether there are more left indices to process. If not, the parallel communication instruction has been completely carried out. Accordingly, in a step 1711, the destination is set to temp.sub.-- source. Next, in a step 1712, C* compiler 154 flow of control returns to the calling function (either the step 1336 or the step 1726). Otherwise, in a step 1713 one of the left indices which has not been processed is selected as the current left index. Next, in a step 1714 a temporary variable called temp.sub.-- source is set to the source parallel value of the parallel communication instruction. In a step 1716, it is determined whether any offset elements of the minimal set associated with the current left index have not yet been processed. If not, the parallel communication instruction has been completely carried out for the current left index. Accordingly, C* compiler 154 flow of control goes to the step 1710, which is explained above. Otherwise, in a step 1718 a send (if the parallel communication instruction is a send instruction) or a get (if the parallel communication instruction is a get instruction) is carried out from the temp.sub.-- source parallel value to a temporary variable called temp.sub.-- destination with the next offset in the minimal set. The Paris instruction set 164 would be invoked to carry out the grid send or grid get. In a step 1720, temp.sub.-- source is set to the value of temp.sub.-- destination. C* compiler 154 flow of control then returns to the step 1716 (explained above) to carry out grid communication as indicated by the remaining offsets in the minimal set. 9. C/Paris Code Generation a. Overview Once the IR tree 332 has been optimized by the IR tree optimizer 313 (as indicated by the steps 414 and 415 of FIG. 4), then the back end 414 generates C/Paris code from the IR tree 332 (as indicated by the step 416 of FIG. 4). C/Paris code comprises standard C statements and calls to functions of the C*/Library 160 and the Paris Library 162. In general, the back end 414 generates one C/Paris statement for each IR node of the IR tree 332. b. Examples The C/Paris code for the first example 810 is as follows. From the twelfth IR node, the back end 414 generates the C/Paris statement: CMC.sub.-- s.sub.-- temp.sub.-- 0=32<<1; This C statement specifies left-shifting a binary representation of 32 by 1 bit and assigning the result to the temporary variable CMC.sub.-- s.sub.-- temp.sub.-- 0. From the thirteenth IR node, the back end 414 generates the C statement: CMC.sub.-- s.sub.-- temp.sub.-- 1=CM.sub.-- add.sub.-- offset.sub.-- to.sub.-- field.sub.-- id (destinationp, CMC.sub.-- s.sub.-- temp.sub.-- 0); The Paris function "CM.sub.-- add.sub.-- offset.sub.-- to.sub.-- field.sub.-- id" offsets its first parameter (destinationp) by the amount indicated by its second parameter (CMC.sub.-- s.sub.-- temp.sub.-- 0). The offset pointer returned by the function is assigned to the temporary variable CMC.sub.-- s.sub.-- temp.sub.-- 1. The function, as well as the other functions of the Paris library 162, are described in more detail in the above-cited Paris Reference Manual for Version 6.0 of Paris. From the fourteenth IR node, the back end 314 generates the C/Paris statement: CMC.sub.-- s.sub.-- temp.sub.-- 0=1+i; This statement specifies adding 1 and i and assigning the sum to the temporary variable "CMC.sub.-- s.sub.-- temp.sub.-- 0". From the fifteenth IR node, the HLL code generator 604 generates the C/Paris statement: CM.sub.-- s.sub.-- multiply.sub.-- 3.sub.-- 1L (CMC.sub.-- p.sub.-- temp.sub.-- 5, source1, source2, 32); The Paris function "CM.sub.-- s.sub.-- multiply.sub.-- 3.sub.-- 1L" computes the product of the parallel values specified by the second and third parameters and assigns the result to the first parameter. The fourth parameter specifies the length of the elements of the second and third parameters. Finally, the back end 314 translates the sixteenth IR node to the C/Paris statement: CMC.sub.-- from.sub.-- torus.sub.-- dim(CMC.sub.-- s.sub.-- temp.sub.-- 1, CMC.sub.-- p.sub.-- temp.sub.-- 1, 0, CMC.sub.-- s.sub.-- temp.sub.-- 0, 32); The first four parameters of the function correspond to the operands of the seventeenth IR node. The fifth parameter is the length of the elements of the source parallel variable. These elements are 32 bit integers. The function CMC.sub.-- from.sub.-- torus.sub.-- dim is the "from.sub.-- torus" function described above. The function is in the C* library 160. C/Paris code generation for the send instruction is similar to that for the above-described get instruction. For example, C/Paris code generated for the example send instruction [.+(1+i)]*(destinationp+2)=source1 * source2; is as follows:
______________________________________
CMC.sub.-- s.sub.-- temp.sub.-- 0 = 32 << 1;
CMC.sub.-- s.sub.-- temp.sub.-- 1 = CM.sub.-- add.sub.-- offset.sub.--
to.sub.-- field.sub.-- id (destinationp,
CMC.sub.-- s.sub.-- temp.sub.-- 0);
CMC.sub.-- s.sub.-- temp.sub.-- 0 = 1 + i;
CM.sub.-- s.sub.-- multiply.sub.-- 3.sub.-- 1L (CMC.sub.-- p.sub.--
temp.sub.-- 5, source1, source2, 32);
CMC.sub.-- news.sub.-- or.sub.-- send(CMC.sub.-- s.sub.-- temp.sub.-- 1,
CMC.sub.-- p.sub.-- temp.sub.-- 5, 32, 1,
CMC.sub.-- s.sub.-- temp.sub.-- 0, 0, 0);
______________________________________
Four parallel communication examples further illustrate C/Paris code generation and efficient parallel communication. C* source code for the four examples follows.
______________________________________
shape [100][200][300]5;
f( )
int:s i, j;
int:physical k;
int a;
with(s) {
/* EXAMPLE 1: generates instructions for get via grid
communication because this is a get instruction, offsets are
constant, and shape of source and destination parallel
values is same. */
i = [.+7][.+1][.+13]j;
/* EXAMPLE 2: generates a call to a function to perform
general or grid send at run time, whichever is determined
at run time to be more efficient. Determination is made at
run time because this is a send instruction. */
[.+7][.+1][.+13]i = j;
/* EXAMPLE 3: for the 0 axis, generates a call to a
function to perform general or grid get at run time,
whichever is determined at run time to be more efficient.
Determination is made at run time because offset of 0 axis
is not constant. For the 1 and 2 axis, generates instructions
for get via grid communication because this is a get
instruction, offsets are constant, and shape of source and
destination parallel values is same. */
i = [.+a][.+1][.+13]j;
/* EXAMPLE 4: generates instructions for get via general
communication because the shapes of source and
destination parallel values are different */
[.+7][.+1][.+13]j = k;
}
}
______________________________________
From the above C* source code, the C* compiler 154 generates the following C/Paris code for the initialization and allocation necessary for the four parallel communication instructions.
______________________________________
#include <.sub.-- CMC.sub.-- types.h>
#include <.sub.-- CMC.sub.-- defs.h>
static char CMC.sub.-- version[ ] = "C* Version 6.0.2 (60) for sun4";
CMC.sub.-- Shape.sub.-- t s = CMC.sub.-- no.sub.-- vp.sub.-- set;
int f( );
int f( )
CMC.sub.-- Shape.sub.-- t CMC.sub.-- entry.sub.-- shape =
((CMC.sub.-- call.sub.-- start.sub.-- trap &&
CMC.sub.-- start.sub.-- trap( )),CM.sub.-- current.sub.-- up.sub.--
set);
CMC.sub.-- Pvar.sub.-- t i = CM.sub.-- allocate.sub.-- stack.sub.--
field.sub.-- vp.sub.-- set(64, s);
CMC.sub.-- Pvar.sub.-- t j = CM.sub.-- add.sub.-- offset.sub.-- to.sub.--
field.sub.-- id(i, 32);
CMC.sub.-- Pvar.sub.-- t k =
CM.sub.-- allocate.sub.-- stack.sub.-- field.sub.-- vp.sub.-- set(32,
physical);
int a;
CM.sub.-- set.sub.-- vp.sub.-- set(s);
{
CMC.sub.-- Pvar.sub.-- t CMC.sub.-- p.sub.-- temp.sub.-- 11 =
CM.sub.-- allocate.sub.-- stack.sub.-- field.sub.-- vp.sub.-- set(32,
CM.sub.-- current.sub.-- vp.sub.-- set;
}
}
______________________________________
i. C/Paris Code for Example 1 C/Paris code output of the C* Compiler for axis 0 of Example 1 comprises the following two calls to functions of the Paris library 162. The back end 314 generates time calls from the IR nodes generated by the grid get IR node generator 1218 of FIG. 12 in carrying out the step 1718 of FIG. 17. (Note that although all of the C code corresponding to a generated for processing each particular axis is shown together, the actual C* compiler 154 may generate the lines of C/Paris code in a different order.)
______________________________________
CM.sub.-- get.sub.-- from.sub.-- power.sub.-- two.sub.-- always.sub.--
1L(CMC.sub.-- p.sub.-- temp.sub.-- 11, j,
0, 3, 0, 32);
CM.sub.-- get.sub.-- from.sub.-- news.sub.-- always.sub.-- 1L(CMC.sub.--
p.sub.-- temp.sub.-- 11,
CMC.sub.-- p.temp.sub.-- 11, 0, 1, 32);
______________________________________
The first function called above, CM.sub.-- get.sub.-- from.sub.-- power.sub.-- two.sub.-- always.sub.-- 1L, is a form of the get.sub.-- from.sub.-- power.sub.-- two function described above. At run time, the first function instructs each destination parallel processor 112 to retrieve the element of a source parallel variable on a source parallel processor 112 separated from the destination parallel processor 112 by a specified power-of-two distance and a specified direction along a specified axis. Each destination parallel processor 112 places the retrieved value in the position of a destination parallel variable which is stored on it. The first argument to the first function (CMC.sub.-- p.sub.-- temp.sub.-- 11) is the destination parallel variable. The second argument (j) is the source parallel variable. The third argument of the function specifies the axis (0) along which the shifting is to be performed. The fourth argument (3) is log.sub.2 (distance from each destination position to the corresponding source position). The fifth argument (0) indicates that the direction from each source position to the corresponding destination position is upward. The sixth argument (32) specifies the length of the source and destination parallel variables. At run time, the function performs a get operation via grid communication from j to a compiler-generated temporary variable called CMC.sub.-- p.sub.-- temp.sub.-- 11 with an offset of 8. (A positive offset indicates the upward direction and a negative offset indicates the downward direction.) Note that the source parallel value of an initial grid communication instruction is the source parallel variable of the associated parallel communication instruction. The destination parallel value of a final grid communication step is the destination parallel variable of the associated parallel communication instruction. Otherwise, a compiler-generated temporary parallel variable is the source and/or destination parallel value. The second function called above, CM.sub.-- get.sub.-- from.sub.-- news.sub.-- 1L.sub.-- always, is a form of the get.sub.-- from.sub.-- news function described above. The function performs a get in the same manner as CM.sub.-- get.sub.-- from.sub.-- .sub.-- power.sub.-- two.sub.-- always.sub.-- 1L, except the offset must be 1 or -1. The fields of the first, second and third arguments of CM.sub.-- get.sub.-- from.sub.-- news.sub.-- always.sub.-- 1L are the same as those for the first, second and third arguments of CM.sub.-- get.sub.-- from.sub.-- power.sub.-- two.sub.-- always.sub.-- 1L. Because the second function is neither an initial nor a final grid communication step, CMC.sub.-- p.sub.-- temp.sub.-- 11 is both the source and destination of CM.sub.-- get.sub.-- from.sub.-- news.sub.-- always.sub.-- 1L. The axis is again 0. There is no field for the distance, as it must be 1. The fourth argument (0) indicates that the direction is downward. The fifth argument is the same as the sixth argument of CM.sub.-- get.sub.-- from.sub.-- power.sub.-- two.sub.-- always.sub.-- 1L. At run time, the second function performs a get operation via grid communication from CMC.sub.-- p.sub.-- temp.sub.-- 11 to CMC.sub.-- p.sub.-- temp.sub.-- 11 with an offset of -1. Note that both of the above-functions are "always" forms of the associated | ||||||
