System for parallel processing that compiles a filed sequence of instructions within an iteration space5535393Abstract An improved parallel processing apparatus and method executes an iterative sequence of instructions by arranging the sequence into subtasks and allocating those subtasks to processors. This division and allocation is conducted in such a manner as to minimize data contention among the processors and to maximize the locality of data to them. The improved apparatus and method have application to a variety of multiprocessor systems, including those which are massively parallel. Claims In view of the foregoing, what we claim is: Description BACKGROUND OF THE INVENTION
______________________________________
C*KSR* TILE(<tiling-args>)
C*KSR* END TILE
______________________________________
2.1 The tiling-args The tiling-args which preprocessor 60a can put into the tiling directives have the general form of:
______________________________________
<tiling-args>:==<tile-index-list>
[,<tiling-param>. . .]
<tiling-param>:==<param-keyword> = <param-value>
______________________________________
The tiling-params which are specified in the tiling directive give information about attributes of the tiled-loop. For example, a `local=t` tiling-arg in the tiling directive means "this tiled-loop contains a variable "t" which should be local to each process". The tiling-params which preprocessor 60a generates are shown in the following table. These are referred to hereinafter as the primary set of parameters. Primary Set of Parameters
______________________________________
Syntax Example
______________________________________
order = <dep-list> order=k
lastvalue = (var-list>
smallest
local = <var-list> tmp or (t1, t2)
reduction = <var-list>
sum
______________________________________
The primary set of tiling-params is a subset of the full list of tiling-params which the compiler 60b can accept, as discussed below. However, these tiling parameters contain information which can affect the correctness of the program, and it will be seen below how this affects the tiling. 2.2 Tiling Directive-Summary The syntax of the preprocessor 60a output tiling directive is:
______________________________________
C*KSR* TILE( <index>. . .
[,order = <dep-list>]
[,lastvalue=<variable-list>]
[,local=<variable-list)]
[,reduction=<variable-list>]
[,<extra-params>])
. . .
C*KSR* END TILE
______________________________________
The TILE directive must occur before the loops (i,e., the do statements) whose indices are included in the <tile-index-list>. The loop-indices in the <tile-index-list> are in the same order they appear in the loopnest; left-to-right corresponds to outer-to-inner in the loopnest. (This is a convention for readability.) The tiling-params which preprocessor 60a can create is a subset (sometimes called the primary set) of the tiling-params which the compiler 60b can understand. The primary set of tiling-params is: order, lastvalue, local, reduction. Preprocessor 60a does not know the full list of the tiling-params; it relies on the syntax definitions for its parsing. Tiling parameters which are specified in the TILE input directive will be passed "as is" to the TILE output directive. (That is, if those tiling directives do not belong to the primary set. If they do, it will be flagged as an error and the loop will not be tiled.)
__________________________________________________________________________
Formal syntax definition:
__________________________________________________________________________
<tiled-loop>
is <tile-begin> <Fortran-do-loop> <tile-end>
<tile-begin>
is C*KSR* TILE ( <tiling-args>)
<tile-end> is C*KSR* TILE END
<tiling-args>
is <tile-index-list> [, <tiling-param-list>]
<tile-index>
is <loop-index-variable-name>
<tiling-param>
is ORDER = <order-listing>
or LASTVALUE
= <var-or-invelt-listing>
or REDUCTION
= <var-or-invelt-listing>
or PRIVATE = <var-listing>
or <extra-keyword>
= <easy-to-parse-string>
<order-listing>
is <order>
or ( <order-list>)
<order> is [-] <loop-index-variable-name>
<var-or-invelt-listing>
is <var-or-invelt>
or ( <var-or-invelt-list>)
<var-or-invelt>
is <a variable name or array element name;
in the case of an array element name, all
the subscript expressions must be
invariant with respect to the tiled loop
nest>
<var-listing>
is <var>
or ( <var-list>)
<var> is <a variable name>
<extra-keyword>
is <a Fortran identifier>
__________________________________________________________________________
3. Automatic Tiling This section concentrates on the preprocessor 60a output tiling directives from the functional point of view. The discussion (unless otherwise explicitly stated) assumes three things: that preprocessor 60a is doing fully-automatic tiling, that preprocessor 60a optimization directives (AUTOTILE. ROUNDOFF) are set to allow the maximum parallelization, and that there are no Assertions. The principles for the tiling are to tile from the outermost-index inbound. And to tile as much as possible (i.e, as many indices as possible). 3.1 Overview A given loopnest cannot always be tiled in all the dimensions. However, there can be more than one possibility for correct tiling. This section discusses which tiling possibility preprocessor 60a will choose. (How the user intervenes in choosing the indices to tile will be discussed in the next section.) The preprocessor 60a performs dependence analysis on the iterative sequence. The principles of the data dependence analysis may be understood by reference to "Data Dependence and Its Application to Parallel Processing," Michael Wolfe and Utpal Banerjee, International Journal of Parallel Programming, Vol. 16, No. 2, April 1988; "Optimizing Supercompilers for Supercomputers," Michael Wolfe, Ph.D. Thesis, Dept. of Comp. Sci., Report No. 82-1009, Univ. of Illinois, Urbana, Ill., October 1982; and "Advanced Compiler Optimization for Supercomputers," David Padua and Michael Wolfe, Communications of the ACM, Vol. 29, No. 12, December 1986. It will be noted that data dependence is carried by loops. And that in the context of preprocessor 60a, dependence is between tiles. Several tiling obstacles can prevent tiling, for example, a cycle in the dependence. Since a dependence is carried by a loop, it prevents tiling of that loop, while other loop(s) in the same loopnest can still be tiled. Moreover, some statements are not tilable. These can include a goto out of the loopnest and a subroutine call. This tiling obstacle affects the whole loopnest which encloses the non-tilable statement(s). Another obstacle is a loop that is imperfectly nested. This occurs where there are statements (other than DO's) between the loops in the loopnest. Imperfect nesting introduces a restriction, as if there is a "wall" where the imperfect nesting occurs. In this case the tiling can take place either "above" or "under" that wall, but not on both sides. Further is where the bound(s) of a loop depend on the index of an outer loop. This creates a nonrectangular iteration space, and implies a restriction that those two loops are mutually exclusive for tiling. It will be noted that this restriction can be eased for special cases, such as triangular loops. Based on its analysis of the loopnest (which--among other things--finds out all the tiling obstacles), preprocessor 60a tiles the loop while avoiding the tiling obstacles. In so doing, it produces a loop table which shows the tiling-obstacles and the tiling decisions which are based on them. The final decision whether or not it is worthwhile to actually execute a tiled-loop in parallel is taken by the compiler (or at runtime). Preprocessor 60a can tile one-dimensional ("1D") loops with dependence, as well as loops with a small amount of work, etc. The main reason is that while preprocessor 60a looks at one loop at a time, more global considerations such as memory distribution may influence the tiling strategy. The compiler can "remove" the TILE directive to eliminate any runtime overhead. Reordering (loop interchange), if any, takes place after the tiling, and only inside the tile. 3.2 Choosing the indices for tiling The main role of preprocessor 60a is to insert tiling directives into the code:
______________________________________
C*KSR* TILE ( <tile-index-list> [,<tiling-param>. . .] )
______________________________________
Choosing the tiling-index-list is the main decision. The other tiling-params are determined accordingly. For the sake of this discussion, assume that preprocessor 60a creates the loop table in two steps. First, it collects the information about the loopnest, tiling obstacles, etc. Then, it takes the decision about which indices to tile (note that the loops in the loop table are ordered in the same order as the loops in the loopnest, outermost first). So, after preprocessor 60a's analysis, and before any decisions are taken about tiling, the loop table is as shown in FIG. 4A. There, the "tilable?" field indicates whether there is a tiling obstacle which prevents this particular loop from being tiled, regardless of whether other loops are tiled or not. This occurs when the loop carries a cycle in dependence, or the loop body contains a non-tilable statement, etc. The "restriction" field notes which other loops in the loopnest might be affected by tiling this loop. This occurs, e.g., when the loop is imperfectly nested, or non-rectangular. As previously mentioned, the point at which imperfectly nesting occurs may be thought of as a "wall." The wall can be "attached" either to the previous loop or to the following loop. It can be arbitrarily assumed that it is attached to the previous loop. The obstacle field contains descriptive information about the tiling obstacle if any. Now, all there is left to be done is to fill in the tiling-decision field, based upon the information in the tilable? and restrictions fields. Preprocessor 60a tiles the loop from the outside inbounds, so it can be viewed as if it starts from the first row in the loop table and moves down, tiling as much as possible, while taking care to respect any restrictions. The loop table can be used to describe the concepts of restriction handling. Whenever it is decided to tile a loop, it is marked as tiled in the tiling-decision field. Then a look is taken at its restriction field: If there is an "IMPERFECT" indication, the preprocessor 60a goes ahead and marks the tiling-decision fields of all the rows below as not-tiled; if there is an <idx> (or more than one), the preprocessor 60a marks the tiling-decision field of the correspondent loop(s) as not-tiled. Note that the preprocessor 60a always needs to go "downwards" only. After preprocessor 60a tiles the first loop in a loopnest, rows further down the loop table may already have a tiling decision entry. This results from a restriction imposed by a previously tiled loop. In this case, preprocessor 60a skips that loop row when it comes to it, and moves on the next. Conceptually, this is the way in which preprocessor 60a chooses the indices which will comprise the <tile-index-list> for the TILE directive. Following that, the other <tiling-param> are determined, and the process of tiling is complete. The examples in the rest of this section demonstrate this behavior. For each example, the tiled program is provided, with the loop table being shown in the accompanying drawing. 3.3 Examples Example 1. Inspired by the Linpack benchmark:
______________________________________
do k = 1,n-1
do j = k+1, n
do i = 1, n-k
a(k+i,j) = a(k+i,j) + t * a(k+i,k)
enddo
enddo
enddo
______________________________________
As shown in FIG. 4B, the k-loop cannot be tiled, since it carries a cycle in dependence. Thus, the restriction entries for k did not apply to the tiling decision for j and i. Preprocessor 60a tiles this loop as follows:
______________________________________
do k = 1, n
C*KSR* TILE( J , I)
DO 2 J=k+1,n-1
DO 2 I=1,n-k
A(K+I,J) = A(K+I,J) + T * A(K+I,K)
2 CONTINUE
C*KSR* END TILE
enddo
______________________________________
Example 2. Matrix multiply
______________________________________
do i = 1,n
do j = 1,m
c(i,j) = 0
do k = 1, l
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
______________________________________
As reflected in FIG. 4C, the restriction on the j-loop caused the tiling to "Stop" at that point. Preprocessor 60a will tile this loop as follows:
______________________________________
C*KSR* TILE (I,J)
do i = 1,n
do j = 1,m
c(i,j) = 0
do k = 1, l
c(i,j) = c(i,j) + a(i,k) * b(k,j)
enddo
enddo
enddo
C*KSR* END TILE
______________________________________
Example 3. Inspired by the Legendre Transform:
______________________________________
do 400 l = 1, nlev
do 300 k = 1, nwaves
ip = nmp(k)
do 200 j = 1, nlats
do 100 i = 1, nnp(k)
sd(l,ip+i)=
sd(l,ip+i)+fsdl(l,k,j)*pnmd(ip+i)
sq(l,ip+i)=
sq(l,ip+i)+fsql(l,k,j)*pnmd(ip+i)
100 continue
200 continue
300 continue
400 continue
______________________________________
In order to make this example work it is necessary to put in an assertion (not shown here) to remove assumed dependence. Also, in this case preprocessor 60a uses forward substitution technique, so that the k-loop and the j-loop can be made perfectly nested. Preprocessor 60a therefore tiles the program as if the loop was the following.
______________________________________
do 400 l = 1, nlev
do 300 k = 1, nwaves
do 200 j = 1, nlats
do 100 i = 1, nnp(k)
sd(l,nmp(k)+i)=
sd(l,nmp(k)+i)+fsdl(l,k,j)*pnmd(nmp(k)+i)
sq(l,nmp(k)+i)=
sq(l,nmp(k)+i)+fsql(l,k,j)*pnmd(nmp(k)+i)
100 continue
200 continue
300 continue
400 continue
______________________________________
As shown in FIG. 4D, when preprocessor 60a decides to tile the k-loop, the restriction on the i-loop enforces a not-tiled decision for i.
______________________________________
C*KSR* TILE( L,K,J )
do 400 l = 1, nlev
do 300 k = 1, nwaves
do 200 j = 1, nlats
do 100 i = 1, nnp(k)
sd(l,nmp(k)+i)=sd(l,nmp(k)+i)+
fsdl(l,k,j)*pnmd(nmp(k)+i)
sq(l,nmp(k)+i)=sq(l,nmp(k)+i)+
fsql(l,k,j)*pnmd(nmp(k)+i)
100 continue
200 continue
300 continue
400 continue
C*KSR* END TILE
______________________________________
4. Semi-Automatic Tiling This section describes the semi-automatic method for tiling, which allow the user to partially override the tiling decisions as done by preprocessor 60a. 4.1 Overview In the general case there is more then one possibility to choose the indices for tiling. Preprocessor 60a chooses one of those possibilities, through the automatic tiling mechanism. Semi-automatic tiling allows the user to intervene in the process of choosing the indices for tiling, by specifying explicitly which indices he wants to be tiled. Using semi-automatic tiling, the user gains additional control over the tiling, while keeping the same guarantee of correctness as with automatic tiling. This is done by using the following preprocessor 60a input directive:
______________________________________
C*KSR* TILE (<tile-index-list> [, <tiling-param>
. . .])
______________________________________
The <tiling-param> can be any parameter which is not one of the primary set of tiling-parameters. The reason for that is that, as mentioned before, the tiling parameters in the primary set (order, lastvalue, local, reduction) can affect the correctness of the program. Preprocessor 60a transforms the input directive into the following statement:
______________________________________
c*KSR* TILE (<tile-index-list>
[, <tiling-param>. . .])
c*KSR* END TILE
______________________________________
Where <tiling-param> contains all the tiling parameters which were specified in the C*KSR*TILE directive, and probably additional tiling parameters from the primary set. And where <tile-index-list> is the same as the one which the user specified. If the user specified a combination which is incorrect (according to preprocessor 60a criteria) preprocessor 60a will issue as error. 4.2 Example Referring again to the "Inspired by the Legendre Transform" example above, by using forward substitution technique, the loop is tiled in 3D. However, the user could tile it in 2D by putting the following line before the loopnest:
______________________________________
C*KSR* TILE (l,k)
______________________________________
This instructs preprocessor 60a to tile in those indices only. Since it is a legal possibility (as can be seen from the loop table), preprocessor 60a will do so without generating an error message.
______________________________________
C*KSR* TILE (1,k)
do 400 l = 1, nlev
do 300 k = 1, nwaves
ip = nimp(k)
do 200 j = 1, nlats
do 100 i = 1, nnp(k)
sd(l,ip+i) = sd(l,ip+i) +
fsdl(l,k,j)*pnmd(ip+i)
sq(l,ip+i) = sq(l,ip+i) +
fsql(l,k,j)*pnmd(ip+i)
100 continue
200 continue
300 continue
400 continue
______________________________________
Preprocessor 60a tiles it as follows:
______________________________________
C*KSR* TILE( L,K )
do 400 l = 1, nlev
do 300 k = 1, nwaves
do 200 i = 1, nlats
do 100 i = 1, nnp(k)
sd(.omega.,nmp(k)+i) = sd(.omega.,nmp(k)+i) +
fsdl(.omega.,k,j)*pnmd(nmp(k)+i)
sq(.omega.,nmp(k)+i) = sq(.omega.,nmp(k)+i) +
fsql(.omega.,k,j)*pnmd(nmp(k)+i)
100 continue
200 continue
300 continue
400 continue
C*KSR* END TILE
______________________________________
5. Related Issues The above sections focus on the tiling aspect of preprocessor 60a operation. Below, is a brief discussion of other aspects of operation of preprocessor 60a which are not directly related to tiling but which may effect the results of the tiled program. Distribution is performed when it can help the tiling. For example, to tile part of the loop when there are I/O statements in it. In some cases code transformation needs to take place in order to tile a program (for example, in the presence of reduction, or when the last-value is needed). Some of those transformation require to know the bounds of a tile--a runtime value, which is available when a tile is being executed. In most cases the transformation is done by preprocessor 60a. However, if for some reason users (or preprocessor 60a) do this kind of transformations, they might need to know the runtime value of the bounds of the tile. This can be obtained by the use of intrinsic function. Inner loop indices of a serial loop inside an outer tiled loop are treated by the compiler as locals. For example
______________________________________
C*KSR* TILE( L,K )
do 400 l = 1, nlev
do 300 k = 1, nwaves
do 200 j = 1, nlats
do 100 i = 1, nnp(k)
sd(l,nmp(k)+i)=sd(l,nmp(k)+i)+
fsdl(l,k,j)*pnmd(nmp(k)+i)
sq(l,nmp(k)+i)=sq(l,nmp(k)+i)+
fsql(l,k,j)*pnmd(nmp(k)+i)
100 continue
200 continue
300 continue
400 continue
C*KSR* END TILE
______________________________________
The compiler treats this loop as if there is an implicit local=(i,j). The Runtime Library 1. Overview The following sections describe the operation of the Runtime Library 66. 2. The programming model Runtime Library 66 is language independent. It can be called from a variety of languages, e.g. Fortran 77 or 90, C or the like. However, the following sections discuss it with respect to its use from Fortran programs. The three parallel constructs which Runtime Library 66 handles are tiling, parallel regions, and parallel sections. The tiling construct allows the user to execute Fortran do-loops in parallel. The parallel sections construct enables the user to execute different code segments of a program in parallel. The parallel regions construct allows the user to have a single code segment of a program run multiple times simultaneously. All parallel constructs may be nested. However, the Runtime Library 66 may run any parallel construct serially if sufficient resources are not available. The rest of this section contains a short description of the parallel constructs which are supported by Runtime Library 66. The following sections will discuss each one in detail. 2.1 Tiling Tiling of a Fortran loopnest is a partitioning of the iteration space into rectangular parallelipiped chunks called tiles. Hence, a tile is a collection of iterations. The group of tiles which construct a loopnest is called tile-family. The tiles are the basic entities which can be executed in parallel. Numerous processors can execute the same loopnest, each one of them working on a separate tile simultaneously. Example:
______________________________________
C*KSR* TILE ( i, j)
do 10 i = 1,n
do 10 j = 1,m
a(i, j) = 0.0
10 continue
C*KSR* END TILE
______________________________________
In this case, the loopnest is tiled in the two indices i and j. It is possible to tile only part of the loop indices, e.g.--in the above example the following tiling is also possible:
______________________________________
C*KSR* TILE ( i )
do 10 i = 1,n
do 10 i = 1,m
a(i,j) = 0.0
10 continue
C*KSR* END TILE
______________________________________
The tiling model has two qualities which are important to Runtime Library 66. First, flexibility in terms of work/overhead ratio. The Runtime Library 66 it provides a general way to handle granularity of parallelism ranging from one iteration to any number of iterations. Second, convenience in handling dependency: The Runtime Library 66 provides a simple way to define a partial order (tiles dependency), and a way to exploit parallelism in the presence of dependency. 2.2 Affinity Regions The affinity region mechanism applies to the tiling parallel construct. It provides a method for the user to convey optimization information to Runtime Library 66. An affinity region is a collection of tile families which Runtime Library 66 attempts to execute in a fashion so as to avoid data contention and movement. Runtime Library 66 keeps some information about the entire set of tile families, and uses that to distribute tiles to processors so that processors will execute on the same data from tile family to tile family. To declare an affinity region, the user must enclose the desired code within the AFFINITY REGION and END AFFINITY REGION directives. The directives must not interrupt a tile family. If declared within a parallel section, the affinity region must be within one section block. The declaration must be within a single subroutine or main program. These are parameters which affect efficiency of execution rather than correctness. Affinity region requires global decision making, and this is the way for the user to specify them. If the user specified the same parameters in a TILE directive embedded within an AFFINITY REGION, the parameters in the AFFINITY REGION override the ones in the TILE directive. Affinity regions can be nested. Example:
______________________________________
C*KSR* AFFINITY REGION ( i,j, STRATEGY = MOD,
NUMTHREADS = 8)
do k
C*KSR* TILE ( i,j)
do i
do j
. . . .
enddo
enddo
C*KSR* END TILE ( i,j,)
enddo
C*KSR* END AFFINITY REGION
______________________________________
2.3 Team Operators Parallel constructs are executed in Runtime Library 66 by groups of pthreads. In the default mode, these pthread groups are invisible to the user. However, Runtime Library 66 does implement an interface to these pthread groups for the user who wants a greater degree of control of his program. The functions that manage pthread groups are called "team" operators. The interface is described in detail at the end of this section. 2.3.1 Definition of a team Each pthread group, or team, consists of one or more pthreads, where one pthread is designated a "leader". Each team member has a member id unique within the team, starting at 0 and ascending, with no gaps in the sequence. The team leader's member id will be 0. 2.3.2 Default Team Usage Runtime Library 66 will create, manage, and disband teams automatically without direction from the user. However, the Runtime Library 66 interface does allow the user to explicitly specify team creation, dispersion, and usage. If the user does not specify team usage, Runtime Library 66 follows the general practice of creating a new thread team for every new parallel construct. The thread team is disbanded at the end of the construct. An exception is made for TILE constructs that are lexically enclosed within an AFFINITY REGION directive; all such TILE constructs are executed by the same thread team. 2.3.3 Team Ids Team IDs are unique throughout the program. 2.3.4 Team Creation The pthread that runs across an ipr.sub.-- create.sub.-- team call executes the call and becomes the team leader. Pthreads may be members of several teams. 2.3.5 Restrictions in Use of Teams A team may not be used in parallel--it can only execute one construct at a time. However, if constructs are nested, a pthread may be a member of several teams, and may execute multiple constructs. A parallel construct may only be executed by a team where the pthread that encounters the construct is a member and is the leader of the team. The motivation for this restriction is a fundamental implementation issue. The pthread that encounters the construct is the only pthread that has the context to execute the serial code before and after the parallel construct. It could be possible for Runtime Library 66 to allow a pthread to call a team that it is not a member of to execute the construct, but the original pthread will be forced to idle during the parallel execution. 3. Interfaces 3.1 Runtime Library 66/User Interface Users pass input to Runtime Library 66 through run time parameters, program directives or subroutine calls. The run time parameters enable the user to control the resources and calculations done by Runtime Library 66, allowing her to tune for performance. The program directives allow the user to indicate opportunities for parallelism. Program directives may be addressed to the ksr compiler or preprocessor 60a or both. Subroutine calls are used to explicitly control Runtime Library 66 thread group (team) management. 3.1.1 Program directives As noted above, program directives are in the form of Fortran comments. When a program directive is present, the compiler 60b generates calls to the Runtime Library 66 runtime library to cause the parallel execution of the loopnest. The parallel section, parallel region, and set directives are generated only by the user and understood only by the compiler. Affinity region directives are generated by the user or the compiler, and understood only by the compiler. Tiling directives are generated by the user and/or the compiler 60b. To tile a Fortran program, the user can either put in the tiling directives by hand, or rely on the preprocessor 60a to do so. The preprocessor 60a takes a Fortran program as an input, and create the transformed Fortran program which has the tiling directives in it. It is possible to use a mixture of manual and automatic tiling; hence, the preprocessor 60a can take a partially tiled program, retain the loopnests which are already tiled alone, and tile the other loops. The output of the preprocessor 60a is a legal Fortran program; With the fully automatic mode, the user invokes the compilation system 60, which in turn invokes the preprocessor 60a. It will be appreciated that the Runtime Library 66 itself is not aware of the difference between automatic and semi automatic tiling. These different methods produce identical input from Runtime Library 66's point of view. 3.1.2 Run Time Parameters The runtime environment parameters, which are set forth in FIG. 6 are defined using Unix environment variables. These parameters can also be set using the SET directive. In order to achieve parallel execution, the code within a parallel construct is transformed into a special kind of subroutine. The task-subroutine(s) resembles a nested subroutine in the manner of Pascal. It will be appreciated that this is not an extension to programming language itself, the task-subroutines are created in the internal data structures of the the compilation system 60 only. FIG. 7 illustrates the transformation of a tile into a task-subroutine. In the drawing the original program (11.f) is denoted as block 70a. The tiled program (11.cmp) is denoted as block 70b. That tile program is internally transformed into ("as if") the code shown in block 70c. By doing this transformation, the tiled loopnest turned into a subroutine. The arguments to this subroutine are the bounds of the tile. In the example shown in FIG. 7,
______________________________________
do 10 i=1,n
do 10 j=1,n
______________________________________
was transformed to
______________________________________
do 10 i=i1,i2
do 10 j=j1,j2
______________________________________
and the bound become the arguments to the task-subroutine. For example, if a tile with a 16.times.16 tile-size is used, one thread will issue a call to task foo.sub.-- $1 (32, 47, 16, 31). This will cause the execution of
______________________________________
do 10 i=32, 47
do 10 j=16, 31
a(i,j) = 0.0
10 continue
______________________________________
Hence, this thread executes 16 iterations in the i dimension, and 16 iterations in the j dimension. Runtime Library 66 will invoke the calls to task foo.sub.-- $1 from the different threads with the appropriate arguments such that all the iterations will be executed. The parallelism is exercised by having many threads calling the task-subroutine different arguments (i.e., bounds). The existence of the parallel constructs triggers a call to "execute" routine in the Runtime Library 66, and the compiler passes the name of the task-subroutine as an argument. The arguments of these executes routines contain all the information about the construct which is needed in order to execute it in parallel. Some of this information comes from the program directive itself (i.e., which indices to tile, dependency information etc.); some information comes from the source program (i.e. bounds and stride of the loopnest); some of the information is generated by the compilation system 60 (i.e., ssb, codesize--as discussed below). There is also some data which is needed to interface between the code inside the task-subroutine and outside it (pointer to the task-subroutine, frame pointer, flag to support last value). This embodiment provides a simple way for the tile to be executed as an independent entity (by being a subroutine) and at the same time recognize the variables of its parent routine using an existing compiler mechanism (by being a nested subroutine). In this particular example, it recognizes the array a. 3.2 Runtime Library/Operating System interface Runtime Library 66 parallelism is implemented with the OSF implementation of pthreads. The interface between the OS and Runtime Library 66 and pthreads and Runtime System 66 is not discussed in detail here, but there are some basic assumptions about the world in which Runtime Library 66 lives which are needed in order to establish the framework. Runtime Library 66 uses variable number of threads during the life of the program. A single thread is initiated at startup, and becomes the program leader thread. This thread is responsible for executing all serial portions of the program. Each parallel construct is executed by a team of threads called a "thread group". Each thread group has one thread that is designated a group leader, while all other members are group slaves. Runtime Library 66 delegates much of the load balancing between threads to the operating system scheduler. In some cases Runtime Library 66 assumes that a thread is associated with a processor, and that this binding remains. This is an important assumption used by Runtime Library 66 in the modulo tiling strategy where work is partitioned so that a cell will reference data it already owns. Runtime Library 66 may pass information to the OS scheduler to help it make more informed decisions about load balancing. 4. Tiling Tiling, as stated above, is a method to execute a loopnest in parallel. This section will explain the semantics of the Tiling directive and illustrate the way a tiled loop is executed. 4.1 Tiling Directive--Semantics Not every loop can be tiled in a simple way. Some loops can't be tiled at all, and some loops can be tiled only with special care to ensure correct execution. "Correct execution" in this context means the same result as by running the same program serially. The syntax of the tiling directive enables specifications of tiling parameters which will provide the additional point of connection of the bracket correct execution. These are order, lastvalue, local and reduction. In addition, there are other tiling parameters which do not affect the correctness of the program, but do affect the performance. 4.2 Tiling with Order Tiling Parameter The order tiling parameter specifies a partial order for execution of tiles, which is derived from the data dependency within the tiled loopnest. The order tiling parameter deserves some special attention in this section because it can be confusing, and often not intuitive, to determine the dependency and thus the correct order. In addition, it is one of the tiling-parameters which can influence the correctness of the program execution. The fact that dependency--and thus the execution order tiling directive--is not easy to determine is not worrisome, since it is typically detected automatically by the preprocessor 60a. If the user chooses not to use the preprocessor 60a he or she can specify it, and than it becomes his or her responsibility. When a loopnest is tiled with order tiling parameter, Runtime Library 66 will try to achieve parallel execution of that loop while ensuring the correct order of execution. In some cases obeying the order will cause serial execution. However, in some cases a loop which is tiled with order can run in parallel. When a loopnest is executed in parallel, the iterations will not necessarily be executed in the same order as by serial execution of the same loopnest. In some cases it doesn't matter, while in other cases a data-dependency implies a partial order between the iterations. This, in turn, implies a partial order between tiles to guarantee correct execution. The way to handle it is to specify a order tiling directive. This will cause Runtime Library 66 to do the necessary synchronization between the execution of tiles to ensure the correct execution. Example:
______________________________________
do 10 i = 2,n-1
do 10 j = 2,m-1
a(i,j) = a( i-1 , j+1 ) + a ( i+1 , j-1)
10 continue
______________________________________
This loopnest can be tiled in both dimensions in the following way:
______________________________________
C*KSR* TILE ( i, j, ORDER = -J, I )
do 10 i = 2,n-1
do 10 j = 2,m-1
a(i,j) = a(i-1,j+1) + a(i+1,j-1)
10 continue
C*KSR* END TILE
______________________________________
This defines the following partial order between the tiles: execution of a tile can start when the tile before in the I-th direction completed, and the tile after in the J-th direction completed. In the diagram presented FIG. 8 the tile marked by "x" can be executed when the tiles marked by "d" completed. This will typically cause Runtime Library 66 to choose the wave-front tiling strategy, which enables parallel execution in the presence of order in two dimensions. 4.3 The order tiling parameter and Data Dependency--by example As noted above, the ORDER=-J, I is typically inserted by the preprocessor 60a. This section explains by way of example the relations between data dependency and tiling with order. Referring to the original program:
______________________________________
do 10 i = 1,n
do 10 i = 1,m
a(i,j) = a(i-1, j+1) + a (i+1, j-1)
10 continue
______________________________________
First, to define the order in which iterations must be executed, one must look at the loop body. Assuming the position of a given iteration is marked by `x`, and the position of the iteration(s) upon which it depends is marked by "*", the resulting diagram is: ##STR1## In the original loop j is the inner index, hence the one which moves faster; so when `x` is executed, it must be the case that the iterations on its upper-right is already done (this is indicated by a `+`), and the iteration on its lower-left is not-done (this is indicated by a `-`). The result is as follows: ##STR2## In other words, it is safe to execute an iteration in position x, if and only if the iteration on its upper-right is done. This can be reduced to dependency between two iterations, as follows: ##STR3## The next step is to figure out the partial-order between tiles. Inside the tile the original order is preserved. In order to examine what happens between the tiles, the "stencil" in (I), above, can be moved around the borders of an imaginary tile, as follows: ##STR4## The equivalent of (II) will be as follows, with capital letters used to denote tiles: ##STR5## The D in the upper-right is redundant: the bottom line implies that a tile x must wait for the tile to its right before it can start. So, if the tile in the lower-left waits for the tile in its upper-left, it is enough to ensure that the tile on the upper right will be done already. Hence, the lower-left tile (which is marked by x) must wait for the tiles "above" and "to its right" to be done, and by recursion the correct order is defined. Hence, the result is as follows: ##STR6## Which is expressed in the tiling directive in the following way:
______________________________________
C*KSR* TILE ( I,J,ORDER=( -J , I ))
______________________________________
4.4 Tiling with local, lastvalue, reduction In addition to the indices and dependency parameters, there are three other tiling parameters which can affect the correctness of the program. Typically, those tiling parameters will be created by the preprocessor 60a. They are: LOCAL--declaration of local variables needed by the new lexical subroutine. This is handled by the compiler 60b, and is not passed on to Runtime Library 66. LASTVALUE--indicates whether the last value of the loop indices must be preserved after the loop execution. Runtime Library 66 must handle this, because the parallelization of the loop affects the execution order of the iterations. Runtime Library 66 calculates the last value by checking the bounds of each tile executed. When the tile containing the highest bounds of the iteration space is executed, the last value is passed by Runtime Library 66 to the compilation system 60. REDUCTION--declares that a reduction must be handled on a variable within the tile. Reduction is handled by the compilation system 60, and needs some support from Runtime Library 66. 4.5 Other tiling parameters There are tiling parameters which enables the user to intervene with Runtime Library 66's decisions, and influence efficiency decisions. They are supplied only by the user and never by the compilation system 60 and are referred to as "extra" parameters in the preceding sections. Those parameters are listed below. TILESIZE--this is a user supplied vector for the tile size of the following tile family. This vector is only valid for that tile family, and does not apply to any subsequent loopnests. Possible values are n (where n is a numerical value greater than 0), x (where x is a variable), or "*" (a symbol indicating that the tile should take the entire iteration space in that dimension. The vector must supply values for all tiled indices. Syntax follows the general compilation system 60 tiling parameter syntax. STRATEGY--a user directive on what tiling strategy to use on the following tile family. This value is only valid for that tile family. Possible values are GRAB or MOD. Syntax follows the general the compilation system 60 tiling parameter syntax. 4.6 Execution Of a Tiled Loopnest When a new tile-family is about to start execution, Runtime Library 66 decides on a work-plan. This decision is based upon the factors including affinity, dependency, data-locality. The work-plan is a collection of decisions which will determine the parallel execution of the tile-family: allocating threads, partitioning of the iteration space, and choosing a tiling strategy. Choosing the right work-plan, and--in particular--choosing the right strategy, has a major effect on performance. In principle, the work-plan is chosen when a new tile-family starts. If this tile-family belongs to an affinity region, the work-plan is based upon a "template" of the work-plan of the affinity region; this will be discussed later. 4.7 Allocation of Threads On tiling, Runtime Library 66 considers the amount of resources available, and for each tile family, uses n threads, where n is less or equal to the number of processors available to this program. The default will be to use the maximum number of processors available; if the tile family is structured so that it is not worth using the maximum number, a smaller number of threads will be chosen. This algorithm is used regardless of nesting. There are some difference in the allocation of threads according to the tiling strategy which is used (tiling strategy is described below): When using the GRAB strategy, Runtime Library 66 lets the scheduler handle all thread-processor bindings, load balancing, affinity, etc. When using the MODULO and WAVEFRONT strategy, Runtime Library 66 would like to assume that the thread.fwdarw.processor binding is constant. Runtime Library 66 constructs and assigns tiles accordingly. This binding assumption would make it useful to let the scheduler know that Runtime Library 66 would like higher thread.fwdarw.processor stability on these kinds of threads. These rules are also followed for nested parallel structures. 4.8 Tile size and shape A tile is defined in number of iteration space, hence, the tile-size is defined in terms of the iteration space. The tile-size vector specifies the number of iterations in each dimension of the tile-family. For example, a loop
______________________________________
do i = 1, 100
do j = 1, 200
______________________________________
which is divided into tiles may have a tile-size vector of (16,32), hence--the tile-size is 16.times.32=512 iteration. Note that the tile-size need not fit directly into the iteration space. Runtime Library 66 will "trim" the edges of tiles which overflow outside the iteration space. The tile-size is determined once at the beginning of the execution of a tile-family, and remains constant during the execution of that tile-family. However, the same tile-family can be executed more than once in the program, and the tile-size can be different each time--due to, for example, different bounds of the loops. Tile shapes must be chosen with two objectives in mind: maximizing parallelism and making good use of the allcache memory system. The following discussion weaves together considerations of dependency and subpage access to achieve the two goals. A tile is a rectangular n-dimensional parallelipiped with a dimension corresponding to each of the dimensions of the iteration space. The tile shape question is this--how long should the tile be in each dimension? The basic idea is to "stretch" the tiles in the direction of array references and to "stretch" the tiles in the direction of dependency. The first point will avoid contention between two or more s threads for the same subpage. The second point will minimize synchronization. Tiles will be multiplies of subpages, or two subpages. The final decision about tile-size is a compromise between contradicting considerations: on the one hand, we want to have big enough tiles, to justify the unavoidable overhead of starting each tile. On the other hand, we want to have many tiles in order to optimize the load balance. After the shape has been decided, we determine the actual size by looking at the amount of work to be done, the number of available processors, etc. If the tiles are too small, we "stretch" them. 4.9 Tiling Strategy The tiling strategy is the method used to divide the work among the pthreads so that all the tiles which comprise the tile-family will be executed correctly and efficiently. Like tile-size, the tiling strategy is determined at the beginning of execution of a new tile-family. Runtime Library 66 uses a self-scheduling mechanism, which means that after an initial setup done by the leader, each thread can find its chunk of work by itself. Hence, the strategy is expressed in terms of what a thread must do to find out what it needs to do next. There are two fundamental principles regarding the Runtime Library 66 strategies motivated by a desire for design elegance and low runtime overhead. The first is exactness--the strategy defines exactly how the next tile is selected in each situation. The idea is to leave as little calculation as possible to the point when a thread needs to get the next tile, and therefore minimize overhead in the runtime processing to get the next tile. Once a strategy is chosen, choosing a new tile should be a very fast operation. The second principle is progression--the strategies are structured so execution starts from a known point in the iteration space and proceeds in a known direction. The motivation is to avoid the need for complicated data-structures that record and remember which part of the iteration space is has been covered. Runtime Library 66 considers the following factors when deciding upon a tiling strategy: 1) Existence of data dependencies. Data dependencies create ordering requirements between the tiles, which necessitates synchronization between tiles. In the extreme case, data dependency may cause a tile family to execute serially, because all the tiles of the tile family are in the same ordering relationship. In other cases, some degree of parallelization is available in the tile family because each tile is independent of some number of other tiles. 2) Specification of strategy by the user. The user can specify a strategy by setting the PLSTRATEGY environment variable, using the SET directive, or passing a strategy value as a parameter to the TILE or AFFINITY REGION directives 3) Specification of tile size by user. If the user specifies that a dimension with ordering requirements should be tiled, the tiling strategy may be required to handle ordering. Runtime Library 66's tiling strategy decision table is as follows:
______________________________________
(x) = don't care
User
Spec'ed
tile
Number size cuts
User of User n
Spec'ed
Indices Spec'ed ordered
Chosen
Strategy
w/Order tile size
indices
strategy
______________________________________
FALSE 0 x x MODULO/SLICE
FALSE 1 FALSE x MODULO/SLICE
FALSE 1 TRUE 0 MODULO/SLICE
FALSE 1 TRUE 1 WAVEFRONT
FALSE >=2 FALSE x WAVEFRONT
FALSE >=2 TRUE 0 MODULO/SLICE:q
FALSE >=2 TRUE 1-2 WAVEFRONT
FALSE >=2 TRUE >2 Error reported
TRUE 0 x x User specified
strategy
TRUE 1 FALSE x User specified
strategy
TRUE 1 TRUE 0 User specified
strategy
TRUE 1 TRUE 1 Error if user
strategy!=
WAVEFRONT
TRUE >=2 FALSE x Error if user
strategy!=
WAVEFRONT
TRUE >=2 TRUE 0 User specified
strategy
TRUE >=2 TRUE 1-2 Error if user
strategy!=
WAVEFRONT
TRUE >=2 TRUE >2 Error reported
______________________________________
Note that the fact that a tile family is tiled and that tiles are distributed to multiple threads does not mandate parallelism. A tile family that has ordering requirements may be tiled but still execute serially, due to the synchronization required by the ordering information. This situation may be optimal if the lack of parallelization is overcome by the advantage of maintaining data affinity. The following is a description of Runtime Library 66 strategies. SLICE strategy This strategy simply divides the tile family iteration space into n tiles where n is equal to the number of pthreads participating in the construct and assigns a tile to each thread. This is the default strategy for tile families not enclosed within an affinity region and is designed to minimize tiling overhead and the possibility of data contention at tile boundaries. MODULO strategy This strategy distributes tiles evenly throughout the iteration space to the thread group. Assume that threads and tiles are numbered starting from 0. The tiles which will be executed by a given thread are those such that
______________________________________
tile-number MODULO number-of-allocated-
threads = thread-id
______________________________________
When expressed in terms of a self-scheduling strategy, it means that a thread whose thread-id is P will execute the following tiles: first tile is tile whose number is same as the thread-id next tile is previous tile+(number of participating threads) This strategy is a semi-static one. It is dynamic in terms of the number of processors which are available when the execution of the tile-family started, but cannot adjust to a change of the availability of processors during the execution of the tile-family and to an unbalanced load. The major advantages of this strategy are that no synchronization between tiles is required. Also, the set iteration space.fwdarw.thread mapping is designed to handle data affinity, especially when used with affinity regions. Further, the "modulo" distribution system of the iteration.fwdarw.thread mapping, is designed to optimize load balancing within affinity regions, where each tile family within the affinity region may cover a different part of the iteration space. Because of this distribution scheme, the percentage of work allocated to each thread is not dependent on the iteration space used by a single tile family. This is the default strategy for tile families enclosed within affinity regions. WAVEFRONT strategy This strategy is designed to execute tile families with data dependencies correctly and with maximum parallelization. With this strategy, tiles are executed in a "wavefront" pattern on a two dimensional plane. Runtime Library 66 chooses two indices out of the list of tiled indices that have ordering requirements to form the 2D wavefront plane of the iteration space. One index is designated the "column" index while the other is designated the "subtile" index. The columns of the 2D plane are allocated in a modulo fashion to the members of the executing threads team. Each column is made up of a number of tiles. Each tile has a adjacent, dominant neighboring tile that it is dependent upon, and cannot begin execution until that neighbor is finished. A tile's dominant neighbor will be in the column to the right or left, depending on the ordering requirements of the column index. The dominant tile will be on the same subtile index value. Execution of the tile family begins from one of the four corners of the 2D plane, depending on the ordering of the column and subtile index. This strategy also attempts to distribute tiles evenly throughout the iteration space to the executing thread team, and is compatible with the use of the affinity region directive. Runtime Library 66 handles an iteration space with ordering requirements that has more or less than 2 index dimensions in a number of ways. If the iteration space has only one index, and it has an ordering requirement, the work must be done serially, and Runtime Library 66 attempts to create one tile. If the user forces tiling by specifying tile size, Runtime Library 66 executes the tile family in the 2D wavefront pattern, where the subtile index is 1. If the iteration space has more than two dimensions, but only two indices have ordering requirements, Runtime Library 66 processes the tile family as a series of 2D planes, where the wavefront strategy can be conducted on each plane independently. If there are more than two dimensions, and more than two indices that have ordering requirements, Runtime Library 66 will not tile the additional ordered indices, and will create "chunkier" tiles to execute in the 2D wavefront strategy. User specified tile sizes that require tiling in more than two ordered indices are refused. GRAB strategy This strategy issues tiles on a first-come, first-serve basis to the thread group. Each thread must obtain a common lock to access a tile for execution. This strategy is designed to load balance between the threads of the thread group. Note that this strategy does not consider data affinity considerations when assigning tiles to processors, and that it would not be a good strategy to choose in conjunction with affinity regions. 4.10 Relative Advantages of the Strategies The semi-static MODULO tiling-strategy has two main advantages: it is easy to maintain data-affinity; and it minimizes synchronization overhead. On the other hand, tile to thread assignments are static and will not adjust to unbalanced load during execution time. The SLICE tiling strategy is also semi static and has the most minimal tiling overhead, but makes no attempt to maintain data affinity across tile families. The GRAB tiling strategy, is dynamic and will keep each thread busy as much as possible, but will probably cause data migration to occur more frequently. In addition to affinity loss problems is the additional synchronization overhead from the locking required to access a new tile. The WAVEFRONT tiling strategy is required when tile families have ordering requirements caused by data dependencies. Data affinity is maintained, as with the modulo strategy, but the tile-tile synchronization required creates additional overhead. 4.11 Work-plan--Examples The following are few examples of the work-plan chosen by Runtime Library 66 to execute tiled-loops. Details of why a particular work-plan was chosen are note given; rather, this attempts to give the flavor of the behavior of various work-plans. EXAMPLE 1 A 2D iteration space is tiled in 2D. The number of available processors is smaller than the number of columns. There is no dependency. The work-plan is depicted in FIG. 9A. There,
______________________________________
number of processors:
N
tile-size: a whole column
strategy: modulo
______________________________________
EXAMPLE 2 The same as above, only this time there is a dependency in both directions: the strategy used is the modulo strategy with dependency (wave front). Note that this strategy is data-affinity efficient when used in the same affinity group with the normal modulo strategy which is shown in the previous example. The work-plan is depicted in FIG. 9B. There,
______________________________________
number of processors:
N
tile-size: chunks of column
strategy: wavefront
______________________________________
EXAMPLE 3 Data affinity is not an issue, load balance is important. There is no dependency. The ssb point in both i and j index. The work-plan is depicted in FIG. 9C. There,
______________________________________
number of processor: N
tile-size: columns
strategy: GRAB
______________________________________
Note that in this example the arrows shows how the tiles are numbered. This is not to be confused with order of tiles: the numbering is just a way to keep track of what has been executed. The way to number the tiles is arbitrary, and does nor reflect any particular order in which the tiles need to be executed. 5. Affinity Region Runtime Library 66 tries to allocate work to threads in a way which will minimize contention and data movement. The creation of affinity region is a way to coordinate the tiling work-plan decisions made for a group of tile families. The following assumptions are made to support this goal: 1. Data movement is expensive. 2. The tile families within the affinity region reference the same data. 3. The tile families within the affinity region tend to have the same data space @iteration space mapping. 4. The data space is aligned on subpage boundaries, and tiles can be constructed that will avoid data contention. In earlier sections, the work-plan decision was described as a two-part decision, with a tile.fwdarw.thread mapping component and a tile size component. When affinity region are declared, these decisions are made across the whole set of tile families in the affinity region. The tile.fwdarw.thread mapping is maintained across tile families so that the same thread (and hopefully, same processor) works on each tile, maintaining data locality. The affinity region directive is only effective with some tiling strategies. For example, the grab tiling strategy does not allow the maintenance of a tile.fwdarw.thread mapping. The tile size decision is also made by considering the iteration space of all the tile families in the affinity region, rather than the iteration space of a single tile family. As a result, all tile families share the same tile size, making it possible to maintain a tile@thread mapping across tile families. The tile work-plan decision may reflect factors that affect only some of the tile families, but that are extended to all of the tile families. For example, a data dependency may exist on a particular index in some of the tile families. The effect the dependency has on a tiling strategy applies to all tile families in the affinity region. Note that the creation of affinity region is purely an efficiency issue. 5.1 Recognizing an Affinity Region Affinity regions may be declared by the user, or identified by the compilation system 60. 5.2 Common Index Set It must be possible for all the loopnests in an affinity region to be tiled on the same set of indices. This is the "common index set" rule. This is necessary because the affinity oriented strategies need an iteration.fwdarw.processor mapping function which determines which processor executes the tile. If the number of tiled indices varies loopnest to loopnest, this mapping will fail. This does not mean that every tile family have identical tiled indices, just that there is an intersection of indices among the different loop nests. 6. Performance issues 6.1 Overhead There is overhead in the following aspects of operation: (1) affinity region (create the template); (2) start of tile family (create the MCB); (3) choose the next tile (by each thread). 6.2 Runtime Library 66 Decision Making Runtime Library 66 is designed with efficiency as the major consideration. One important concept is the propagation of decisions to the earliest possible time. Obviously, it is more efficient to take a decision at compile time rather than at runtime. Consequently, it is more efficient to take a decision when a tile-family starts execution, rather than at the beginning of execution of each tile. Runtime Library 66's method is to take a decision as soon as possible, based upon all the information available. Once the decision has been taken, Runtime Library 66 forgets the reasons. The goal is that by the time a particular thread needs to find the next tile to execute, it will be a very simple operation, typically involving a couple of simple comparisons and additions. At runtime, when Runtime Library 66 starts to execute a tile-family, all the information about this loopnest is already known: whether or not there is a dependency between the tiles, how many processors are available, the size of the loop (bounds) etc. Based upon that, Runtime Library 66 can decide upon a tiling strategy. There are many factors which influence the choice of tiling strategies. Some of them are known at compile time, e.g., dependency, or the amount of work in the loop. Others can be known at compile time--e.g., the number of available processors. Some of the information is available at runtime in the general case, but in practice it is very often either known at compile time (or known to the user who can provide it via directives). The system has some decision-points, where decisions can be taken. These are at compile time, by compilation system 60. At runtime, by the Runtime Library 66, upon starting a new tile-family (.sub.-- pr.sub.-- execute.sub.-- tiles), and when looking for next tile to do (.sub.-- pr.sub.-- tile.sub.-- next). The following is a list of factors which may influence Runtime Library 66's decisions. The order of these factors, as presented, is essentially random. size of loop (bounds) static amount of work in one iteration dynamic amount of work in one iteration dependency between tiles mapping of iterations to data number of procs data affinity (ssb) history (affinity region) resource (processors) availability (load balance) [important array] 7. Runtime Library 66 Architecture 7.1 Block diagram FIG. 10 is | ||||||
