Background memory allocation for multi-dimensional signal processing5742814Abstract Data storage and transfer cost is responsible for a large amount of the VLSI system realization cost in terms of area and power consumption for real-time multi-dimensional signal processing applications. Applications or this type are data-dominated because they handle a large amount of indexed data which are produced and consumed in the context of nested loops. This important application domain includes the majority of speech, video, image, and graphic processing (multi-media in general) and end-user telecom applications. The present invention relates to the automated allocation of the background memory units, necessary to store the large multi-dimensional signals. In order to handle both procedural and nonprocedural specification, the novel memory allocation methodology is based on an optimization process driven by data-flow analysis. This steering mechanism allows more exploration freedom than the more restricted scheduling-based investigation in the existent synthesis systems. Moreover, by means of an original polyhedral model of data-flow analysis, the novel allocation methodology can accurately deal with complex specifications, containing large multi-dimensional signals. The class of specifications handled by this polyhedral model covers a larger range than the conventional ones, i.e. the entire class of affine representations. Employing estimated silicon area or power consumption costs yielded by recent models for on-chip memories, the novel allocation methodology produces one, or optionally, several distributed multi-port memory architecture(s) with fully-determined characteristics, complying with a given clock cycle budget for memory operations. Claims What is claimed: Description BACKGROUND OF THE INVENTION
__________________________________________________________________________
void ConstructInclusionGraph(collection of LBL's - the index spaces of
signal domains) {
initialize inclusion graph with the def/opd index spaces as nodes ;
while new nodes can still be appended to the graph
for each pair of LBL's (Lbl.sub.1,Lbl.sub.2) in the collection
such that (Lbl.sub.1 Lbl.sub.2) && (Lbl.sub.2 Lbl.sub.1) {
compute Lbl = Lbl.sub.1 .andgate. Lbl.sub.2 ;
if (Lbl .noteq. .0.)
(1) if (Lbl == Lbl.sub.1) AddInclusion(Lbl.sub.1,Lbl.sub.2) ;
else if (Lbl == Lbl.sub.2) AddInclusion(Lbl.sub.2,Lbl.sub.1) ;
(2) else if there exists already lbl .ident. Lbl
if (lbl Lbl.sub.1) AddInclusion(lbl, Lbl.sub.1) ;
if (lbl Lbl.sub.2) AddInclusion(lbl, Lbl.sub.2) ;
(3) else add Lbl to the collection ;
set Lbl .OR right. Lbl.sub.1 ; set Lbl .OR right. Lbl.sub.2 ;
}
__________________________________________________________________________
At the beginning, the inclusion graph of an indexed signal contains only the index spaces of the corresponding array references as nodes. Every pair of LBL's--between which no inclusion relation is known so far (i.e., between which there is currently no path in the graph)--is intersected. If the intersection produces a non-empty Lbl, there are basically three possibilities: (1) the resulting Lbl is one of the intersection operands: in this case, inclusion arcs between the corresponding nodes must be added to the graph; (2) an equivalent linearly hounded lattice exists already in the collection: in this case, arcs must be added only if necessary between the nodes corresponding to the equivalent LBL and to the operands; (Two linearly bounded lattices of the same indexed signal are equivalent if they represent the same set of indices. E.g., {x=i+j.vertline.0.ltoreq.i.ltoreq.2, 0.ltoreq.j.ltoreq.2} and {x=i.vertline.0.ltoreq.i.ltoreq.4) are equivalent. Testing LBL's equivalence can be done employing LBL intersection and size determination.) (3) the resulting Lbl is a new element of the collection: a new node is appended to the graph, along with two arcs towards the operands of intersection. The construction of the inclusion graph ends when no more elements--nodes or arcs--can be created. The procedure AddInclusion creates new inclusion relations (arcs) between groups of signals, but deletes the resulting transitive arcs: keeping a strict hierarchy for the LBL's is essential for the partitioning phase. The intersection ".andgate." of two linearly bounded lattices is also a linearly bounded lattice, which is computed as described subsequently. Example With reference to FIG. 12, the partitioning algorithm is exemplified for the simple Silage code in FIG. 10. Initially, the inclusion graph of signal A contains only the nodes labeled from a to i, corresponding to the LBL's representing the array references extracted from the behavioral specification: A›j!›1!, A›j!›i+1!, A›j!›i!, A›j!›n+i+1!(j<i), A›j!›n+i!(j<i), A›j!›n+i+1!(j.ltoreq.i), A›j!›n+i!(j.gtoreq.i), A›i!›n+i!, A›i!›1n!. The first execution of the while loop in the algorithm adds to the graph the new vertices j, . . . , r along with the sequence of arcs: (j, b), (j, c), (k, b), (k, d), (l, b), (l, h), (m d), (m, g), (n, e), (n, f), (o, e), (o, i), (p, f), (p, g), (q, g), (q, h), (r, g), (r, i). Also the inclusion arcs (d, h), and (a, c) are added. The second execution of the while loop creates the new arcs (m, q) and (k, l), eliminating (m, g) and (k, b) at the same time, as they become transitive. The latter is done in the procedure AddInclusion, which creates new inclusion relations, deleting the transitive arcs. E.g., node l represents the LBL ##EQU3## resulting from the intersection of the linearly bounded lattices corresponding to nodes h and b: ##EQU4## Afterwards, the basic sets of a signal are derived from the inclusion graph with a simple bottom-up technique. If a node has no components, a new basic set--equal to the corresponding linearly bounded lattice--is introduced; otherwise, if all its components have already been partitioned, the union of the component partitions and (potentially) a new basic set will constitute the current node partitioning. In the latter case, the new basic set appears only if there is a difference between the size of the node and the total size of its components. The computation of LBL sizes is described in the sequel. The efficiency of this operation and of the intersection procedure are crucial for the practical time complexity of the whole algorithm. Determination of the basic sets of signals (24)
______________________________________
void CreateBasicSets(inclusion graph) {
for every LBL (node in the inclusion graph) having no arc incident to
it
create a new basic set P, equal to that LBL ;
while there are still non-partitioned LBL's {
select a non-partitioned LBL (node in the inclusion graph) such that
all arcs incident to it emerge from partitioned LBL's (LBL.sub.i
=.orgate..sub.j P.sub.ij) ;
if (size(LBL) > size(.orgate..sub.i LBL.sub.i) )
create a new basic set P equal to LBL - .orgate..sub.i LBL.sub.i ;
partition the LBL into (.orgate..sub.i .orgate..sub.j P.sub.ij ; P) ;
else // size(LBL) = size(.orgate..sub.i LBL.sub.i)
partition the LBL into (.orgate..sub.i .orgate..sub.j P.sub.ij) ;
}
______________________________________
Example (cont'd) The LBL's in the inclusion graph are partitioned in the order: first a, j, k, m, p, n, o, r creating the basic sets A0.div.A7; afterwards, c, d, e, f, i, l, q, b, g, h. When the LBL's l and q are processed, two more basic sets are created, as size(l)>size(k) and size(q)>size(m). The final partitioning for our example is indicated in FIG. 12, and represented in FIG. 13. Each domain has an index space which consists of one (e.g., A›j!›0!A0) or several basic sets (e.g., A›j!›i!A0.orgate.A1). Basic sets cannot be always represented as linearly bounded lattices (as it is the case in this illustrative example). In general, they can be decomposed, or they can be expressed as differences of affine mappings of polytopes. E.g., signal A from the Silage code in FIG. 8 has two basic sets; one of them--A1--cannot be represented as a linearly bounded lattice (see FIG. 18b). Although the novel concept of basic sets relies on linearly bounded lattices, the latter concept cannot replace the former one. Intersection of linearly bounded lattices This section describes thoroughly the procedure for intersecting two linearly bounded lattices (see F. Balasa, ibidem). As already mentioned, the efficiency of this operation together with the computation of an LBL size are essential for the practical time complexity of the whole partitioning algorithm. Let {x=T.sub.1 i.sub.1 +u.sub.1 .vertline.A.sub.1 i.sub.1 .gtoreq.b.sub.1 }, {x=T.sub.2 i.sub.2 +u.sub.2 .vertline.A.sub.2 i.sub.2 .gtoreq.b.sub.2 } be two LBL's derived from the same indexed signal, where T.sub.1 and T.sub.2 have obviously the same number of rows--the signal dimension. Intersecting the two linearly bounded lattices means, first of all, solving a linear Diophantine system (that is, finding the integer solutions of a system of linear equations with integer coefficients) T.sub.1 i.sub.1 -T.sub.2 i.sub.2 =u.sub.2 -u.sub.1 having the elements of i.sub.1 and i.sub.2 as unknowns. If the system has no solution, the intersection is empty. Otherwise, let ##EQU5## be the solution of the Diophantine system. If the set of coalesced constraints A.sub.1 V.sub.1 .multidot.i.gtoreq.b.sub.1 -A.sub.1 v.sub.1(2) A.sub.2 V.sub.2 .multidot.i.gtoreq.b.sub.2 -A.sub.2 v.sub.2 has at least one integer solution, than the intersection is a new linearly bounded lattice defined by {x=T.multidot.i+u.vertline.A.multidot.i.gtoreq.b}, where ##EQU6## An undesirable side-effect of intersection is the rapid size increase of the polytope description for the resulting linearly bounded lattice, due to the coalescing of the two constraint sets (2). Therefore, minimizing the set of constraints proved to be a necessity in order to restrict the computational effort of, e.g., containing the lattice points of the resulting LBL's. However, the practical technique employed for minimizing the set of linear constraints is based on the double description of polyhedra. Computing the size of integer polytopes The polytope size means, in this context, the number of distinct n-dimensional points having integer coordinates (lattice points) inside a polytope: Card {i.epsilon.Z.sup.n .vertline.A.multidot.i.gtoreq.b}. The LBL size (or image size of a polytope) means the number of distinct m-dimensional points of integer coordinates belonging to the respective lattice (see also (1)): Card{x.epsilon.Z.sup.m .vertline.x=T.multidot.i+u, A.multidot.i.gtoreq.b, i.epsilon.Z.sup.n } The LBL size means, in fact, the number of scalars addressed by an array reference. The importance of a correct and efficient method for the computation of index space sizes is manifold: (1) this is needed during the partitioning of the indexed signals, and (2) this allows to determine the size of the basic sets, as well as the number of dependencies between these groups of signals. The latter essential information can be exploited in order to estimate the storage requirements for nonprocedural behavioral specifications, as explained intuitively earlier. Most of the memory-related research work assumes that all loops in the specifications have constant boundaries. That is, in the case of an affine function where only the free term a.sub.0 is nonzero. If this is not the case, approximations by taking the extreme values of the boundaries should yield fairly good results (see, e.g., I. Verbauwhede, ibidem). According to such an assumption, the number of scalars addressed by an array reference is approximated by the size of the hypercube encompassing the iterator space of that array reference (instead of computing the size of the index space). This approximation is good only if, in addition, the loops have unitary steps if there are no iterator-dependent conditions in the array scope, and if the affine mapping Ti+u in (1) is injective. Example ##EQU7## As iterator j takes values between 0 and 23, the hypercube encompassing the iterator space is ›8,15!.times.›0,23!, and it contains 192 points. The size of the LBL corresponding to the signal domain A›i+j! is however 16 (therefore, 12 times less than the approximation). The error is due both to the incorrect evaluation of the iterator space size--which contains 72 points rather than 192, and to the fact that the affine mapping t(i, j)=i+j is not injective (e.g., A›12! is addressed by several pairs of iterators (i,j): (8,4), (9,3) , (10,2)). It must be mentioned also that loops having increment steps different from 1 (i:m . . . M . . . Step):: . . . A›f(i)!. . . can be easily "normalized" with the affine transformation i=i'.multidot.Step+m, thus being equivalent to ##EQU8## For instance, the nest of loop in the example above is equivalent to ##EQU9## Concluding, for the general affine case, the approximation mentioned above may be very crude, being improper to use for memory estimation or allocation as it can lead to an exaggerated number of storage locations. Correct solutions to the problem of index space size determination will be illustrated in the present invention. It is assumed along this section that all loops have been "normalized" in a preprocessing phase (as loop boundaries containing the floor .left brkt-bot..right brkt-bot. and ceiling .left brkt-top..right brkt-top. functions create no special problem: the routines of our system are designed to take into account only the points with integer coordinates). It must be emphasized also that enumerative techniques can always be applied to compute the number of scalars in an array reference. These approaches are obviously simple and extremely efficient for array references with "small" iterator spaces. In image and video processing applications most of the array references are characterized by huge iterator spaces: an enumerative technique, although very simple, can be too computationally expensive for such applications (see F. Balasa, "Background memory allocation in multi-dimensional signal processing," Ph.D. thesis, IMEC, K. U. Leuven, Belgium, Nov. 1995, which is incorporated by reference). As an accurate and general solution for computing the size of an integer n-dimensional polytope was needed, a novel technique based on the Fourier-Motzkin elimination (see, e.g., G. B. Dantzig, B. C. Eaves, "Fourier-Motzkin elimination and its dual," J. of Combinatorial Theory (A), Vol. 14, pp. 288-297, 1973) has been developed. Recently, the Fourier-Motzkin elimination has been employed for testing the data-dependence between array references (see V. Pugh, A practical algorithm foe exact array (dependence analysis," Comm. of the ACM, Vol. 35, No. 8, Aug. 1992). This is equivalent to the problem of checking the emptiness of an integer polytope. The routine for counting the lattice points inside a given polytope is described below. In the sequel, the columns of a matrix are denoted by subscripted vectors: e.g., A=›a.sub.1 a.sub.2 . . . !; the number of columns is denoted by A.nCol. The main idea of the algorithm is the following: the number of points of integer coordinates inside the given n-dimensional polytope, and having as first coordinate z, is equal to the number of points inside the polytope a.sub.2 z.sub.2 +a.sub.3 z.sub.3 + . . . .gtoreq.b-a.sub.1 z.sub.1 (which has one dimension less). The required result is obtained by accumulating this number over the entire discrete range of z.sub.1 (determined by the Fourier-Motzkin elimination).
__________________________________________________________________________
int CountPolytopePoints(A,b) {
// the given polytope is Az.gtoreq.b
if (A.nCol==1) return Range(A,b);
// handles the trivial case az.gtoreq.b
if (FourierMotzkinElim(A,b) .ltoreq. 0) return error.sub.-- code;
// special cases, when Az.gtoreq.b is an unbounded polyhedron or an
empty set, are detected;
// otherwise, the range of z.sub.1 - the first element of z - i.e.
›z.sub.1.sup.min,z.sub.1.sup.max ! is determined
N = 0 ;
for ( int z.sub.1 = .left brkt-top.z.sub.1.sup.min .right brkt-top. ;
z.sub.1 .ltoreq. .left brkt-bot.z.sub.1.sup.max .right brkt-bot. ;
z.sub.1 ++)
N += CountPolytopePoints( ›a.sub.2 a.sub.3 . . . !, b - a.sub.1 z.sub.1);
return N ;
}
__________________________________________________________________________
The routine Range, handling the case when A is a vector (az.gtoreq.b), is called also from FourierMotzkinElim. It checks whether the range of z is unbounded (in which case it returns the error.sub.-- code=-1), or whether the range of z is empty (in which case it it returns the error.sub.-- code=0). Otherwise, it returns the number of integers in the range of z, that is ##EQU10## The worst-case time complexity of the algorithm is exponential. In order to obtain an acceptable efficiency of the algorithm, several enhancing techniques--presented in the sequel--are employed in our current implementation. 1. The algorithm CountPolytopePoints(A, b) is not applied directly on the iterator polytope A.multidot.i.gtoreq.b. It must be noticed that linear inequalities are derived from the constraints on loop boundaries and control-flow conditions. In practice, as e.g. some loop boundaries are often constant, the system of inequalities may have a block decomposition of the form: ##EQU11## after an eventual reordering of the iterators. Therefore, the routine CountPolytopePoints is usually applied on reduced constraint matrices Ak.multidot.i.gtoreq.b.sub.k, k=1 . . . s, the results being multiplied at the end. In particular, the computation of the iterator space size for an array reference inside a nest of loops with constant boundaries is reduced to a simple multiplication of the iterator ranges. 12. As shown in G. B. Dantzig, B. C. Eaves, "Fourier-Motzkin elimination and its dual," J. of Combinatorial Theory (A), Vol. 14, pp. 288-297, 1973, the elimination of a variable from the system of inequalities generates a new system of pq+r inequalities where p, q, r are the numbers of positive, respectively negative and zero coefficients of x.sub.n in the initial system. The order of variable elimination is important for keeping the number of inequalities as low as possible. As a consequence, the variable selected to be eliminated next is one for which the expression pq+r is minimum. The only difference in the algorithm CountPolytopePoints (A,b) is that the Fourier-Motzkin elimination will yield the range of some element z.sub.k of z (rather than the range of z.sub.1 --the first element), and the modifications are straightforward. 3. By numbering distinctly the initial inequalities and the newly-generated ones during the pairwise elimination process, the set of initial inequalities from which a certain inequality is derived can be easily determined. If after the elimination of t variables, an inequality is derived from t+2 initial inequalities or more, it can be proven that the inequality is redundant. The condition above is only sufficient and therefore, not all the redundant inequalities are detected and eliminated with this simple technique. However, eliminating al l the redundant inequalities after each variable elimination proves to create more overhead than speed benefit when it is embedded ill the Fourier-Motzkin elimination, as shown in F. Balasa, "Background memory allocation in multi-dimensional signal processing," Ph.D. thesis, IMEC, K. U. Leuven, Belgium, Nov. 1995, which is incorporated by reference. Computing the size of linearly bounded lattices When the affine function t:Z.sup.n .fwdarw.N.sup.m, defined by t(i)=Ti+u is injective (that is, t(i.sub.1).noteq.t(i.sub.2) implies i.sub.1 .noteq.i.sub.2), the number of m-dimensional points with integer coordinates belonging to the image of a polytope is equal to the number of n-dimensional points with integer coordinates inside the polytope (the LBL size is equal to the polytope size). Two questions arise: (1) how can it be decided whether function t is injective or not, and (2) what is to be done if the injectivity condition is not fulfilled. In this context, it has to be emphasized that generating all lattice points in the polytope and collecting their images in a set is very inefficient for "large" polytopes (see F. Balasa, "Background memory allocation in multi-dimensional signal processing," Ph.D. thesis, IMEC, K. U. Leuven, Belgium, Nov. 1995, which is incorporated by reference). Different from existent works (see, e.g., Van Swaaij, ibidem) handling only examples where matrix T is invertible (and, therefore, mapping t is injective), the present invention consistently handles all the possible cases: also the case when T is nonsquare, or when matrix T is square and singular (and, therefore, mapping t is not injective). In order to compute the size of a linearly bounded lattice independent of whether mapping t is injective or not, the reduced Hermite normal form of matrix T is employed (see, e.g., M. Minoux, "Mathematical Programming--Theory and Algorithms," J. Wiley & Sons, 1986). A matrix is unimodular when it is square, it has integer coefficients and its determinant is .+-.1. For any matrix T.epsilon.Z.sup.m.times.n there exists a unimodular matrix S.epsilon.Z.sup.n.times.n such that ##EQU12## where H.sub.11 is a nonsingular lower-triangular matrix, and P is a row permutation. Denoting rank H.sub.11 =r, and S.sup.-1 i=j, the following situations may be encountered: (1) r=n. The affine mapping ##EQU13## is injective, as x.sub.1 =x.sub.2 implies j.sub.1 =j.sub.2 (as H.sub.11 is lower-triangular), hence i.sub.1 =i.sub.2. (2) r<n. Then x.sub.1 =x.sub.2 implies that only, the first r components of j.sub.1 and j.sub.2 are resp. equal. This means that all vectors j satisfying AS.multidot.j.gtoreq.b, and having the same prefix j.sub.1 . . . j.sub.r, contribute to the image set with one single (distinct value because they are mapped to the same point. A prefix j.sub.1 . . . j.sub.r is called valid if there exist vectors j satisfying AS.multidot.j.gtoreq.b, and having that prefix. Consequently, the affine image size of the polytope is the number of all valid prefixes j.sub.1 . . . j.sub.r. If the number of valid prefixes is equal to the polytope size, the affine mapping is injective (see the examples in the sequel). Therefore, the complete algorithm for counting the image size of the polytope Az.gtoreq.b becomes the following:
__________________________________________________________________________
int CountImagePolytope(Lbl) { // Lbl = {x = T .multidot. i + u
.vertline. A .multidot. i .gtoreq. b }
compute unimodular matrix S such that P .multidot. T .multidot. S =
with H.sub.11 lower
triangular, and P a permutation matrix:
let r = rank H.sub.11 ;
if (r=T.nCol) return CountPolytopePoints(A.b);
// case (1): mapping t is injective
else return CountPrefixes(AS,b,r);
// case (2)
int CountPrefixes(A,b,r) {// Generates potential prefixes of length r for
the vectors
// in the polytope Az.gtoreq.b. Only valid prefixes contribute to the
image size
if (r = 0) // If the whole prefix has been generated, it is checked
whether
if (nonEmptyPolytope(A,b)) return 1 ; else return 0;// the prefix is
valid or not
FourierMotzkinElim(A,b) ;
// Az.gtoreq.b is assumed to be a non-empty bounded polyhedron
// FourierMotzkinElim returns the range of z.sub.1 - the first element
of z
N = 0 ;
// for every possible first component of the prefix (of length r)
// the rest of the prefix (of length r - 1) is generated
for ( int z.sub.1 = .left brkt-top.z.sub.1.sup.min .right brkt-top. ;
z.sub.1 .ltoreq. .left brkt-bot.z.sub.1.sup.max .right brkt-bot. ;
z.sub.1 ++)
N += CountPrefixes( ›a.sub.2 a.sub.3 . . . !, b - a.sub.1 z.sub.1, r -
1);
return N ;
}
bool nonEmptyPolytope(A,b) {
// Returns true (.noteq.0) if the integral polyhedron Az .gtoreq. b is
not empty
if (A.nCol==1) return Range(A,b); // handles az.gtoreq.b
if ( FourierMotzkinElim(A,b) .ltoreq. 0 ) return error.sub.-- code;
// FourierMotzkinElim returns the range of z.sub.1 - the first element
of z
for ( int z.sub.1 = .left brkt-top.z.sub.1.sup.min .right brkt-top. ;
z.sub.1 .ltoreq. .left brkt-bot.z.sub.1.sup.max .right brkt-bot. ;
z.sub.1 ++)
if ( nonEmptyPolytope( ›a.sub.2 a.sub.3 . . . !, b - a.sub.1 z.sub.1))
return 1;
return 0 ;
}
__________________________________________________________________________
Two illustrative examples are briefly discussed in the sequel. EXAMPLE 1 ##EQU14## As there is a unimodular matrix ##EQU15## such that ##EQU16## it results that rank H.sub.11 =2=T.nCol, hence the mapping is injective. Therefore, the LBL size is equal to size(Ai.gtoreq.b)=256.times.256=65536. The size of the polytope can be computed either with the routine CountPolytopePoints(A,b) or, more efficiently in this case, by taking into account that both iterators have constant bounds with range 256. EXAMPLE 2 ##EQU17## With the unimodular matrix ##EQU18## A r<n (2<3), the prefixes (j.sub.1 j.sub.2) of all vectors j satisfying AS.multidot.j.gtoreq.b, that is .ident.7.gtoreq.j.sub.1 -j.sub.3 .gtoreq.0, 7.gtoreq.j.sub.2 -j.sub.3 .gtoreq.0, 7.gtoreq.j.sub.3 .gtoreq.0}, have to be checked for validity. Eliminating j.sub.3 from the previous set of inequalities, the prefixes (j.sub.1 j.sub.2) result to have the ranges j.sub.1 .epsilon.›0,14! and j.sub.2 .epsilon.›max{0,j.sub.1 -7}, min {14,j.sub.1 +7}!. There are 169 pairs (j.sub.1 j.sub.2) having these ranges. All of them prove to be valid prefixes: for instance, the prefix (1 6) is valid as there are two vectors--›1 6 0!.sup.T and ›1 6 1!.sup.T --satisfying the system of inequalities AS.multidot.j.gtoreq.b. It follows that the size of the index space of M›i+k!›j+k! is 169. As this value is inferior to the size of the iterator space--which is 512--it follows that the affine mapping t(i, j, k)=T.multidot.›i j k!.sup.T is not injective. Indeed, this can be easily seen noticing that the triplets (i, j, k) equal to (0,1,1) and (1,2,0) are mapped to the same point (1,2) in the index space. Index space determination for array references Let {x=T.multidot.i+u.vertline.A.multidot.i.gtoreq.b} be the LBL of a given array reference in a program. If matrix T is square and nonsingular, the index space is included into an image polytope which can be easily determined: noticing that the iterator vector i=T.sup.-1 .multidot.(x-u) must represent a point inside the iterator polytope A.multidot.i.gtoreq.b, the image polytope results to be A.multidot.T.sup.-1 .multidot.(x-u).gtoreq.b. Even in this case when matrix T is invertible, not all the points in the image polytope belong to the index space. For instance, if (i:0 . . . 9):: . . . A›5*i! . . . the image polytope of the array reference is 0.ltoreq.x.ltoreq.45. But the indices of A take only 10 from these 46 values. A more direct method for computing the LBL size (see F. Balasa, ibidem) is based on the effective construction of the index space of an array reference (see FIG. 14), even when matrix T is singular, or non-square (i.e., the dimension of the index space differs from the dimension of the iterator space). As mentioned earlier, for any matrix T.epsilon.Z.sup.m.times.n with rank T=r, and assuming the first r rows of T are linearly independent, there exists a unimodular matrix S.epsilon.Z.sup.n.times.n such that ##EQU19## where H.sub.11 .epsilon.Z.sup.r.times.r is a lower-triangular matrix of rank r, and H.sub.21 .epsilon.Z.sup.(m-r).times.r. The block matrix is called the reduced Hermite form of matrix T. Let ##EQU20## where j.sub.1, j.sub.2 are r-, respectively (n-r)-, dimensional vectors. Then ##EQU21## Denoting ##EQU22## (where x.sub.1, u.sub.1 are r-dimensional vectors), it follows that x.sub.1 =H.sub.11 j.sub.1 +u.sub.1 .multidot.As H.sub.11 is nonsingular (being lower-triangular of rank r), j.sub.1 can be obtained explicitly: j.sub.1 =H.sub.11.sup.-1 (x.sub.1 -u.sub.1) The iterator vector i results with a simple substitution: ##EQU23## where S.sub.1 and S.sub.2 are the submatrices of S containing the first r, respectively the last n-r, columns of S. As the iterator vector must represent a point inside the iterator polytope A.multidot.i.gtoreq.b, it follows that: AS.sub.1 H.sub.11.sup.-1 x.sub.1 +AS.sub.2 j.sub.2 .gtoreq.b+AS.sub.1 H.sub.11.sup.-1 u.sub.1 (4) As the rows of matrix H.sub.11 are r linearly independent r-dimensional vectors, each row of H.sub.21 is a linear combination of the rows of H.sub.11. Then from (3), it results that there exists a matrix C.epsilon..sup.(m-r).times.r such that x.sub.2 -u.sub.2 =C.multidot.(x.sub.1 -u.sub.1) (5) The coefficients of matrix C are determined by backward substitutions from the equations: ##EQU24## The system of inequalities (4) and the system of equations (5) characterize the index space of the given array reference. If n>r the image polytope of the index space can be obtained by taking the projection of the n-dimensional polytope (4) on the r-dimensional subspace defined by the first r coordinates. In practical point of view, the projection is performed eliminating with the Fourier-Motzkin technique the n-r variables of j.sub.2. This image polytope is usually not dense (it contains "holes"--i.e., points in the image polytope but not in the index space). as not all its points represent valid prefixes of length r in the polytope (4). Even if n=r (therefore, 110 projection is needed), the image polytope is dense if and only if matrix H.sub.11, is unimodular, as it can be seen from j.sub.1 =H.sub.11.sup.-1 (x.sub.1 -u.sub.1). Indeed, assuming that .vertline.det H.sub.11 .vertline..noteq.1, and taking into account that the elements of j.sub.1 must be integers, it follows (multiplying and dividing H.sub.11.sup.-1 (x.sub.1 -u.sub.1) by .vertline.det H.sub.11 .vertline.) that the points x inside the LBL must satisfy the supplementary constraints .vertline.det H.sub.11 .vertline. .vertline.h.sub.i.sup.T (x.sub.1 -u.sub.1) .A-inverted.=1, r (6) where h.sub.i.sup.T are the rows of the matrix with integer coefficients .vertline.det H.sub.11 .vertline.H.sub.11.sup.-1, and a.vertline.b means. "a divides b". If H.sub.11 is unimodular, the constraints (6) are obsolete. In conclusion, the size of the LBL can be computed, taking into account (4), with the routine CountPrefixes (›AS.sub.1 H.sub.11.sup.-1 AS.sub.2 !, b+AS.sub.1 H.sub.11.sup.-1 u.sub.1, r) described earlier, with a slight modification: if H.sub.11 is not unimodular, a valid prefix must satisfy also the constraints (6). The two previous examples will exemplify again this method. EXAMPLE 1 (revisited) As m=3 and r=n=2, we have x.sub.1 =›x y!.sup.T and x.sub.2 =›z!. Taking also into account the results obtained already earlier, ##EQU25## Condition (4) yields 3320.gtoreq.-x+3y.gtoreq.5 3316.gtoreq.5x-2y.gtoreq.1 As the only row in H.sub.21 is two times the first row in H.sub.11, the image polytope is completed with the equality (5): z-3=2(x-1). As det H.sub.11 =13, the image polytope is not dense. Condition (6) yields 13.vertline.4(x-1)+((y-2) (as 13.vertline.13(x-1) is always satisfied). The number of points in the image polytope, satisfying also the divisibility condition above, are 65536--the size of the LBL (which is also here the size of the iterator space). EXAMPLE 2 (revisited) As m=r=2 and n=3, we denote the (n-r)-dimensional vector j.sub.2 =›.lambda.!. As ##EQU26## (thus unimodular), and H.sub.21 does not exist (as m=r), with ##EQU27## only condition (4) yields ##EQU28## The number of points ›x y!.sup.T in the image polytope is 169--the valid prefixes of length 2. When matrix H.sub.11 is not unimodular, the CPU times can be significantly higher than those yielded by the routine CountImagePolytope. The reason is that the method of index space determination operates on the image polytope (which contains "holes"), while CountImagePolytope operates on the iterator polytope--which can be significantly "smaller" when H.sub.11 is not unimodular: the "holes" in the image polytope can occupy a large space in the image polytope. Therefore, the former method can never be better. The polyhedral data-flow graph (29) After the analytical partitioning of indexed signals (22), a polyhedral data-flow graph (DFG) (29) with exact dependence relations, is constructed (26) (for instance, the DFG from FIG. 15 will result from the Silage example in FIG. 10). However, unlike the classic case of data-flow analysis, the nodes in the graph do not correspond to individual variables/signals, but to groups of signals (covered by the basic sets derived in (24)), and the arcs correspond to the dependence relations between these groups. For instance, the nodes A0, . . . , A9 in the graph from FIG. 15 correspond to the 10 basic sets of signal A in FIG. 13. The nodes in the data-flow graph are weighted with the size of their corresponding basic sets (computed as described above), and the arcs between nodes are weighted with the exact number of dependences between the basic sets corresponding to the nodes. Based on the polyhedral graphs, a data-flow analysis is done in order to provide accurate estimations of storage requirements for RMSP algorithms (see F. Balasa, F. Catthoor, H. De Man, "Background memory area estimation for multi-dimensional signal processing systems," IEEE Trans. on VLSI Systems, Vol. 3, No. 2. pp. 157-172, June 1995) when a partial computation ordering is imposed. Evaluation of dependence relations between the basic sets of signals (25) As the nodes in the DFG and their weights--the basic sets of signals and their sizes--are known from (24), the construction of the graph is completed by determining the arcs between the nodes--the dependence relations between the basic sets (in particular, array references), and their weights--the number of dependences. Suppose for the moment, without decrease in generality, that two basic sets are represented as LBL's: S.sub.1 ={x=T.sub.1 .lambda.+u.sub.1 .vertline.A.sub.1 .lambda..gtoreq.b.sub.1 }, S.sub.2 ={x=T.sub.2.lambda. +u.sub.2 .vertline.A.sub.2.mu. .gtoreq.b.sub.2 } and the two basic sets belong respectively to the index spaces of a definition domain and of an operand domain within the same instruction scope: D.sub.1 ={x=C.sub.1 I.sub.1 +d.sub.1 .vertline.AI.sub.1 .gtoreq.b}, D.sub.2 ={x=C.sub.2 I.sub.2 +d.sub.2 .vertline.AI.sub.2 .gtoreq.b} Solving the linear Diophantine system in (I.sub.1, .lambda.) as variables: C.sub.1 I.sub.1 +d.sub.1 =T.sub.1 .lambda.+u.sub.1 and substituting the solution which always exists as basic set S.sub.1 is included in D.sub.1) in the sets of constraints A.sub.1 .lambda..gtoreq.b.sub.1 and AI.sub.1 .gtoreq.b, the expression of the iterator vector corresponding to the basic set S.sub.1 is obtained: {I.sub.1 =T.sub.1 .alpha.+u.sub.1 .vertline.A.sub.1 .alpha..gtoreq.b.sub.1 } The expression of the iterator vector corresponding to the basic set S.sub.2 is obtained similarly: {I.sub.2 =T.sub.2 .beta.+u.sub.2 .vertline.A.sub.2 .beta..gtoreq.b.sub.2 } There is a dependence relation between S.sub.1 and S.sub.2 if there is at least one iterator vector corresponding to both of them. The number of iterator vectors yields, in this case, the number of dependences. The problem is solved by intersecting the two linearly bounded lattices above. If the intersection is empty, there is no dependence relation between S.sub.1 and S.sub.2. Otherwise, the size of the intersection represents the number of dependences. Example In the Silage code of FIG. 10, basic set A8 belongs to the index space D8 of the definition domain A›j!›n+i+1!(j.gtoreq.i) (node g in FIG. 12), and basic set A9 belongs to the index space D9 of the operand domain A›j!›n+i!(j.gtoreq.i) (node h in FIG. 12). Employing a non-matrix notation, the linearly bounded lattices of the basic sets and of the index spaces are (see FIG. 13): A8={x=.lambda..sub.1, y=.lambda..sub.2 +n.vertline.n-1.gtoreq..lambda..sub.1, .lambda..sub.2 .gtoreq.1, .lambda..sub.1 -.lambda..sub.2 .gtoreq.1} A9={x=.mu., y=n.vertline.n-1.gtoreq..mu..gtoreq.1} D8={x=j, y=i+n+1.vertline.n-1.gtoreq.j.gtoreq.0, n-1.gtoreq.i.gtoreq.0, j.gtoreq.i} D9={x=j, y=i+n.vertline.n-1.gtoreq.j.gtoreq.0, n-1.gtoreq.i.gtoreq.0, j.gtoreq.i} The set of iterators corresponding to A8 in the index space D8 is described by {i.sub.1 =.alpha.', j.sub.1 =.alpha.".vertline.n-1.gtoreq..alpha.", .alpha.'.gtoreq.0, .alpha."-.alpha.'.gtoreq.2}. The set of iterators corresponding to A9 in the index space D9 is: {i.sub.2 =0, j.sub.2 =.beta..vertline.n-1.gtoreq..beta..gtoreq.1}. The intersection of the two LBL's is represented as: {i=0, j=.gamma..vertline.n-1.gtoreq..gamma..gtoreq.2}. Hence, the number of dependences between A9 and A8 is n-2, i.e. the size of the intersection. Therefore, the arc between the nodes A9 and A8 in FIG. 15 has the weight 4 (as n=6). It has been assumed so far that a basic set was represented by a single LBL. This does not always happen as a basic set may result from the routine CreateBasicSets (in 24) to be a difference of the form LBL-(.orgate..sub.i LBL.sub.i) (e.g., the basic set A1 in FIG. 18b). In this latter case, the basic set is decomposed into a collection of mutually disjoint LBL's. The formal approach for this is discussed in the sequel. Let a basic set be represented as the difference LBL-(.orgate..sub.i LBL.sub.i) or, equivalently, LBL.andgate. (.andgate..sub.i LBL.sub.i). The problem is reduced to decompose the complement LBL.sub.i of a given linearly bounded lattice LBL.sub.i ={x=T.sub.z +u.vertline.a.sub.1 z.gtoreq.b.sub.1, . . . , a.sub.r z.gtoreq.b.sub.r } into a collection of mutually disjoint LBL's. From the set identity C.sub.1 .andgate. . . . .andgate.C.sub.n .tbd.C.sub.1 .orgate. . . . .orgate.C.sub.n =C.sub.1 .orgate.(C.sub.1 .andgate.C.sub.2).andgate. . . . .andgate.(C.sub.1 .orgate. . . . .orgate.C.sub.n-1 .orgate.C.sub.n) C.sub.1, C.sub.1 .andgate.C.sub.2, . . . , C.sub.1 .andgate. . . . .andgate.C.sub.n-1 .andgate.C.sub.n are clearly disjoint sets, it follows that LBL.sub.i =.orgate..sub.j-1.sup.ri LBL.sub.ij, where the lattices LBL.sub.ij ={x=T.sub.z +u.vertline.a.sub.1 z.gtoreq.b.sub.1, . . . , a.sub.j-1 z.gtoreq.b.sub.j-1, -a,z.gtoreq.-b.sub.j +1} are mutually disjoint. The last inequality -a.sub.j z.gtoreq.-b.sub.j +1 is equivalent to a.sub.j z<b.sub.j. It is also assumed, without loss of generality, that the mappings of LBL and LBL.sub.i are identical (as LBL.sub.i is included in LBL). In conclusion, the given basic set can be represented as .orgate..sub.{j1,j2, . . . } (LBL.andgate..andgate..sub.iLBL.sub.iji)--a union of at most .pi..sub.i r.sub.i disjoint linearly bounded lattices. The complexity of the decomposition method is proportional to this product. Example The algorithm CreateBasicSets (in 24) applied to the motion detection kernel in FIG. 8 yields a basic set A1 (see FIG. 18b) which is the difference between two linearly bounded lattices: A1={x=i, y=j.vertline.M+m.gtoreq.i.gtoreq.0, N+n.gtoreq.j.gtoreq.0}-A0 A0={x=i, y=j.vertline.M.gtoreq.i.gtoreq.m, N.gtoreq.j.gtoreq.n} After the removal of redundant inequalities, the decomposition of A1 into disjoint LBL's results to be A1=A1.alpha..orgate.A1b.orgate.A1c.orgate.A1d, where: A1a={x=i, y=j.vertline.m-1.gtoreq.i.gtoreq.0, N+n.gtoreq.j.gtoreq.0} A1b={x=i, y=j.vertline.M+m.gtoreq.i.gtoreq.M+1, N+n.gtoreq.j.gtoreq.0} A1c={x=i, y=j.vertline.M.gtoreq.i.gtoreq.m, n-1.gtoreq.j.gtoreq.0} A1d={x=i, y=j.vertline.M.gtoreq.i.gtoreq.m, N+n.gtoreq.j.gtoreq.N+1} The data-flow graph of the motion detection kernel, where now all the nodes represent linearly bounded lattices, is showing in FIG. 16. Note that different from the partitioning of the indexed signals into basic sets, the decomposition into disjoint linearly bounded lattices is not unique. Another possible decomposition is, for instance: A1a'={x=i, y=j.vertline.M+m.gtoreq.i.gtoreq.0, n-1.gtoreq.j.gtoreq.0} A1b'={x=i, y=j.vertline.M+m.gtoreq.i.gtoreq.0, N+n.gtoreq.j.gtoreq.N+1} A1c'={x=i, y=j.vertline.m-1.gtoreq.i.gtoreq.0, N.gtoreq.j.gtoreq.n} A1d'={x=i, y=j.vertline.M+m.gtoreq.i.gtoreq.M+1, N.gtoreq.j.gtoreq.n} Different decomposition solutions can be obtained modifying the order of LBL's in the intersection .andgate..sub.i LBL.sub.i, or interchanging the inequalities in each LBL.sub.i. Also the union of two LBL's can be decomposed with a similar technique in a collection of disjoint LBL's, as LBL.sub.1 .orgate.LBL.sub.2 =LBL.sub.1 .orgate.(LBL.sub.2 -LBL.sub.1). More generally, given two basic sets represented as unions of mutually disjoint linearly bounded lattices--.orgate..sub.i S.sub.1.sup.i and, respectively, .orgate..sub.j S.sub.2.sup.j --the number of dependences between the two basic sets result to be .SIGMA..sub.i,j nr.sub.-- dependences (S.sub.1.sup.i, S.sub.2.sup.j). The number of dependences between each pair of LBL's is computed as shown earlier. The data-flow graph of the example in FIG. 10 is shown in FIG. 15 . The nodes are labeled with the signal name, basic set number, and size; the arcs are labeled with the number of dependences. OUT is a dummy node, necessary for handling delayed signals. Construction of the polyhedral data-flow graph (26) RMSP algorithms describe the processing of streams of data samples. The source code of these algorithms can be imagined as surrounded by an implicit loop having time as iterator. Consequently, each signal in the algorithm has an implicit extra dimension corresponding to the time axis. RMSP algorithms contain usually delayed signals i.e., signals produced or inputs in previous data-sample processings, which are consumed during the current sample processing. The delay operator "@" refers relatively to the past samples. The delayed signals must be kept "alive" during several time iterations, i.e. they must be stored in the background memory during several data-sample processings. In order to handle the delays in an RMSP algorithm, an appealing solution because of its simplicity would be to add an explicit extra dimension--corresponding to the time axis--to all the array references in the behavioral specification. The code transformation for the motion detection kernel (FIG. 8) is presented in FIG. 17 . Choosing a maximum value for the number of data-samples, the equivalent code could be processed in the same way as presented so far. However, this "simple" approach presents a major shortcoming: introducing one extra dimension increases substantially the computational effort for realistic RMSP applications. This effect is especially noticeable in applications with "deep" nested loops, as the motion detection kernel, where the explicit introduction of the time loop (see FIG. 17) causes an increase of the processing time from 37 seconds to 56 seconds. In order to simulate the effect of the delayed signals in terms of memory requirements, a more effective method for handling delays has been proposed (see F. Balasa, F. Catthoor, H. De Man, "Background memory area estimation for multi-dimensional signal processing systems," IEEE Trans. on VLSI Systems, Vol. 3, No. 2. p. 157-172, June 1995). First, the delayed operand domains take part in the partitioninig process as any other signal domain. Afterwards, the construction of the data-flow graph needs the following preprocessing step: create a dummy node OUT; for each basic set b let D(b) be its highest delay value (when belonging to an operand domain); create D(b) copies of the basic set node, each one labeled from I to D(b); for every copy labeled 1, . . . , D(b)-1 create a dependence from the copy node to OUT; for every basic set b belonging to an output signal, or having a max delay D(b)>0 create a dependence arc from its corresponding node to OUT; The basic idea is to create bogus dependence relations towards an "output" node from the groups of signals produced in the current sample processing and consumed in a future one, or from the groups produced in the past and still necessary in the future. E.g., if there is an operand sig@3, the signal sig produced two sample processing ago, will be consumed only in the next sample processing. These groups of signals must be kept "alive" during the entire current sample processing. In practice, all copies of a basic set having a unique dependence relation (only towards the node OUT) are treated as a single item, so handling signals with high delay values does not cause computational problems. This modification of the data-flow graph allows to take into account the effect of delays, "translating" the basic sets which affect the memory requirements from previous data-sample processings into the current one. When the delay values are constant, the extension of all domains with one extra, dimension--corresponding to the implicit time loop--is avoided hence reducing the computational effort. Although our model allows the handling of signals with non-constant delay values (affine functions of loop iterators), the necessity of all explicit time dimension cannot be avoided in this case. However, it could be possible to find out by solving an ILP) the maximum delay values inside the scope of those signals, and to replace the affine delays by constant values. The shortcoming of such a strategy is a possible excessive storage requirement, due to unnecessary basic set copies and false dependences introduced in the data-flow graph. FIG. 18a shows the data-flow graph for the Silage code in FIG. 8. The basic sets corresponding to delayed signals are labeled with "@delay.sub.-- value". Granularity levels (27) The method described so far operates on groups of signals obtained from a coarse-grain partitioning--using the index spaces of operand and definition domains directly derived from the loop organization provided in the initial code. A data-flow analysis operating at the basic set level can find out high-quality partial computational orderings in terms of storage requirements. Reducing the size of the signal groups, more detailed partial orderings can be imposed on the more complex data-flow graph, resulting obviously in lower storage requirements. In order to decrease the groups of signals, the index spaces can be decomposed before starting the partitioning process, removing gradually the restrictions imposed iva the loop organization provided in the initial ode. This domain decomposition is carried out by slicing the n-dimensional polytopes A.multidot.i.gtoreq.b into sets of (n-1)-dimensional polytopes. We name this an increase of granularity level, and the outcome is a finer-grain partitioning. The linearly bounded lattices are then newly derived and partitioned corresponding to that granularity level. The slicing operation is carried out with a Fourier-Motzkin technique, and it can be done along any coordinate in the iterator space. FIG. 19a illustrates the index space decomposition for signal A in the example from FIG. 10 due to the slicing of the iterator space according to the first loop iterator. Similarly, FIG. 19b shows the index space decomposition due to the slicing of the iterator space according to the second loop iterator. For the illustrative example in FIG. 10, other obvious decompositions of the signal domains can be also obtained by slicing the index space (rather than the iterator space) along the coordinates x and y. These decompositions work here because all the indices are of the form iterator+const. For the general affine case such decompositions are more difficult to achieve analytically. It is assumed that the decomposition of the signal domains is done in the nesting order of loops (as in FIG. 19a for the illustrative example). This is equivalent to the preservation of the loop hierarchy given in the source code. However, choosing other decomposition orders within the model signifies performing loop permutations--which are always feasible, due to the single-assignment property of Silage and to the irrelevance of the statement order in the code. FIG. 20 shows the data-flow graph obtained after the first loop level expansion: the nodes corresponding to signal A are the basic sets represented graphically in FIG. 19a. The slicing process can be continued until the domains are eventually decomposed into scalars. The flattened data-flow analysis can therefore be obtained as a particular case. FIG. 21 shows the data-flow graph obtained after expanding two loop levels (therefore, reaching the scalar level for the illustrative example). There are two contrasting objectives when selecting the granularity level. On one hand, the number of basic sets is rapidly increasing, which is an undesirable effect due to the growth of the computational effort. Moreover, also the complexity of the controller in the architecture to be realized will grow exponentially. On the other hand, the memory size will gradually get closer to an absolute lower bound, given that any loop can be potentially unrolled. In practice, descending to the scalar level ought to be avoided. Instead, for each application, the granularity should be gradually increased until a good trade-off is obtained. Estimation of storage and port requirements (30) For solving the memory size estimation problem (or (estimation of storage requirements (31)) in the nonprocedural case (see FIG. 3, and for the context see FIG. 1), data-flow analysis has been for the first time employed as the main exploration strategy in F. Balasa, F. Catthoor, H. De Man, "Exact evaluation of memory area for multi-dimensional signal processing systems," Proc. IEEE Int. Conf Comp.-Aided Design, pp. 669-672, Santa Clara Calif., Nov. 1993. The approach was thoroughly described in F. Balasa, F. Catthoor, H. De Man, "Background memory area estimation for multi-dimensional signal processing systems," IEEE Trans. on VLSI Systems, Vol. 3, No. 2, pp. 157-172, June 1995. In a more recent work, data dependence information provided by the Omega test (see V. Pugh, A practical algorithm for exact array dependence analysis," Comm. of the ACM, Vol. 35, No. 8, Aug. 1992) has also been applied in memory size estimation (see I. Verbauwhede, C. Scheers, J. M. Rabaey, "Memory estimation for high level synthesis," Proc. 31th Design Automation Conference, pp. 143-148, San Diego Calif., June 1994). Although the method is very appealing in terms of speed, the assumptions regarding loop hierarchy and control flow--e.g., a nest of loops is executed in the sequence as specified in the source code, and a sequence of loops is executed in the order specified in the source code--only partly remove the procedural constraints. Relying on the concepts of memory bounds in nonprocedural programs and in-place mapping of signals, the assessment of the minimal memory size of a RMSP algorithm is carried out by means of a heuristic traversal of the polyhedral data-flow graph (29)--created during data-flow analysis (20). The estimation of the port requirements (36) is achieved by a partial ordering of the read/write operations within a given cycle budget (38). The evaluation of the bandwidth is done taking into account also the partial computation order (34) derived from the polyhedral DFG (20) structure and traversal. Estimation of storage requirements (31) The background memory estimation approach--presented in the sequel--relies on the polyhedral model of data-flow analysis. After partitioning the signals from the given RMSP algorithm (22), and after accumulating all the possible dependence information at the level of basic sets, the subsequent step is to obtain an accurate evaluation of the minimal memory size (locations) compatible with the resulting data-flow graph. Even the simpler problem of finding the minimum number of memory locations necessary to compute a directed acyclic graph has been proven to be an NP-complete problem. Structurally, the polyhedral DFG's (29) can be more complex: they may contain cycles, as 2 groups of signals may contain 2 subsets with opposite dependences to the other group. For instance, node A3 in Fig. refDfgO represents a groups of signals included both in the operand A›i!›n+i! (line (3) in Fig. refexample) and in the definition domain A›j!›n+i+1! (line (4)); as node alpha0 represents the signals alpha›i!--operand in line (5) and definition in line (3)--there are two dependences of opposite direction between nodes A3 and alpha0 in the DFG. Due to the inherent complexity of the problem, the assessment of the minimal memory size is achieved heuristically by means of a traversal of the DFG (29). It must be emphasized that the goal of this approach is to introduce only a partial operation ordering--necessary to reduce the storage requirements--while a proper scheduling implies a total ordering, which is unnecessary as already motivated in section "Context of the proposed methodology". The DFG traversal provides a (data-flow which is equivalent to a certain reorganization of the code. The procedural execution of this functionally equivalent code entails a low (eventually minimum) number of storage locations for computing the respective data-flow graph. Applying the same approach to the DFG for a higher granularity level (27), the resulting data-flow corresponds to a certain code reorganization when the nests of loops are (implicitly|) unrolled until a depth equal to the granularity level. As the granularity is increasing, the storage requirement to compute the corresponding graph is decreasing, but the number of supplementary scheduling constraints is also increasing, as it will be shown in section "Example of storage assessment." Memory bounds in nonprocedural languages Example Assume we start from the illustrative RMSP Silage code in FIG. 22, where signal A is not consumed elsewhere, but all scalars B›i!›j! are necessary in the subsequent code. The problem is to find out the minimum number of memory locations necessary to carry out the computation. It must be emphasized from the very beginning that the formulation of the problem is ambiguous, as Silage is a nonprocedural language. Assume for the moment a procedural interpretation of the code in FIG. 22 . Then, it can be easily verified that no more than 101 locations are really necessary. Indeed, at each execution of the i-loop, B›i!›0! can be stored in the location occupied previously by A›i!, as this scalar is no longer used. Consequently, the execution of each of the first 9 i-loops will increase the necessary memory with 9 locations. The last i-loop does not require any extra storage location, as every B›9!›j! can overwrite A›9+j!. Two essential implicit assumptions result whenever the code is interpreted procedurally: (1) the instructions are executed sequentially, and (2) the order of execution is directly derived from the iterations in the code. As Silage is a nonprocedural language, the assumptions mentioned above are no longer valid. First, parallel computations may occur whenever the possibility exists. For instance, in the example in FIG. 22, all scalars B›i!›j! can be computed in parallel, and the minimum storage requirement becomes 100 locations, as signal B can overwrite signal A . (However, if a perfect parallelism cannot be carried out--in the sense that some scalars B›i!›j! may be computed and have to be stored before the reading of the A operands is completely finished, then a more conservative answer--i.e., 120 locations--should be given.) But they can be also partitioned in groups of, e.g., 5, and each group computed in parallel. In this latter case, the minimum number of locations depends both on the partitioning and on the group ordering. Second, even in the case of sequential execution, there are 100| possible orders of computation. A challenging reformulation of the problem is: "Assuming the computations are carried out sequentially, what is the absolute minimum storage requirement, for all the possible computation orders?" Or, in other words, "what is the lower-bound of the minimum storage?" Another way of reformulating the problem is the following: "Assuming the computations are carried out sequentially, what is the minimum number of storage locations, independent of any computation order?" Or, in other words, "what is the upper-bound of the minimum storage?" It is clear now that the first formulation of the problem can imply different interpretations, and therefore, different answers. The solutions for the sequential case will be discussed in the sequel. Lower-bound of the minimum storage It must be noticed that the computation of signal 13 necessitates at least 100 locations--as there are 100 scalars B›i!›j! to be stored. As shown before, the procedural interpretation of the code implies a minimum of 101 locations. Is there any computation order for the example in FIG. 22 leading to 100 locations? The answer is affirmative: there are several ways of computing sequentially the scalars B›i!›j! employing no more than 100 locations. One possible ordering is displayed in FIG. 23. When the number of scalars to be produced is larger than the number of dying (i.e., consumed for the last time) scalars (as in the example in FIG. 22), the minimum storage situation is met when each assignment consumes for the last time no more than one scalar. Only in this situation can the memory occupied by the dying operand domains be completely overwritten by the new definition domain. In FIG. 23, for instance, each of the 9 assignments in the third nest of loops consume for the last time one scalar: A›1!, respectively A›3!, . . . , A›17!. The last assignment in the second nest of loops consumes for the last time A›19! (when B›9!›8! is produced). The scalars A›! having even indices are consumed for the last time in the first nest of loops. As each assignment contains at most one operand having an even index, it is not possible that two scalars A›! are simultaneously consumed for the last time. When the number of scalars to be produced is smaller or equal to the number of dying scalars, the definition domain can be stored in-place (can overwrite the memory occupied by dying operand domains) if the computation order is such that each assignment consumes for the last time at least one scalar. Unfortunately, there is no known algorithm of reasonable complexity able to detect the situations described above, and to generate the corresponding procedural code. Furthermore, it is not known how to determine the reachable lower-bound of the minimum storage. As Silage is a single-assignment languages each program can be flattened in a directed acyclic graph (the nodes being the scalars, and the arcs--computation dependences). As mentioned already, the problem of finding the minimum number of storage locations for the computation of a directed acyclic graph is NP-complete. Upper-bound of the minimum storage It must be noticed that the computation of signal B necessitates no more than 120 locations (for storing simultaneously both signals A and B). However, this upper-bound is too conservative. Indeed, as there are 20 scalars A›! consumed for the last time, and each assignment has two operands, there are at least 10 assignments (and at most 20) consuming scalars for the last time. The signals B›i!›j! produced during these assignments call be stored in-place, overwriting the dying scalars A›!. In conclusion, no more than 90 scalars B›i!›j! need to be stored in locations different from those occupied by A›!, and therefore, no sequential ordering needs more than 20+90=110 memory locations. Is there any computation order for the example in FIG. 22 leading to 110 locations? The answer is affirmative: there are in fact several ways of computing sequentially the scalars B›i!›j! which have to employ 110 locations. One possible ordering is displayed in FIG. 24. As it can he easily noticed, the 10 assignments in the second nest of loops consume for the last time two scalars A each: (A›0!, A›1!) , respectively (A›2!, A›3!), a.s.o. Consequently, the scalars B›i!›i! can overwrite the scalars A›!. The 90 scalars B›i!›j! (i.noteq.j)--produced in the first nest of loops--have to be stored in different memory locations. Therefore, this computation order requires 110 locations. However, in general, it is not possible to determine automatically the best upper-bound, and to generate the corresponding procedural code. In order to preserve as much freedom of decision as possible for the subsequent high-level synthesis phases--operation scheduling, data-path allocation--the memory estimation and allocation model for nonprocedural languages employs storage upper-bounds. Subsequent decisions concerning the computation ordering of the code must entail readjustments of the allocation solution. Deriving good upper-hounds is crucial in order to prevent an important overallocation of memory. High-level in-place mapping In the example from FIG. 22 it was assumed that signal A is consumed for the last time while computing signal B. In large programs, this situation has to he detected. Moreover, it usually happens that only parts of indexed signals are consumed for the last time. The polyhedral data-flow analysis (20) allows the determination of which parts of an M-D signal are consumed for the last time when a signal definition domain is produced. More specifically, the analytical partitioning of indexed signals (22), and the detection of dependences between partitions (basic sets) provide the solution. This information, together with the partition sizes and the number of dependence between them, allows the detection of the possibility of storing signals in-place independent of the computation order, and at the same time, to derive tight upper-bounds for the storage currently occupied when the groups of signals are produced. The problem of in-place mapping refers to the possibility that signals in a RMSP algorithm share the same memory locations during algorithm execution. This problem may be approached from two viewpoints: (1) high-level in-place mapping--referring to the memory sharing clue to the data-flow, hence independent of any detailed operation ordering; (2) low-level in-place mapping--which refers to the same issue but dependent on the full operation ordering. Moreover, this form of in-place must be carried out separately for each memory module. The low-level in-place depends, therefore, on the distributed memory architecture and on the signal-to-memory assignment. If the latter problem can be tackled with a symbolic or scalar-oriented technique, the former is more difficult as it requires the computation of tight upper-bounds on the (usually huge) set of valid operation orderings. The polyhedral data-flow graphs (29) contain sufficient embedded information in order to derive tight memory upper-bounds, independent of the production order--which will be decided later. Several cases of upper-bound determination will be considered in the sequel. These cases represent essential situations encountered when computing a data-flow graph. The more complex ones can be reduced to, or can be solved by simple extensions of the cases below. A. Suppose a DFG contains two basic sets of n, respectively N, scalars each, and there are d dependences between these basic sets (FIG. 25a). Suppose also that no other basic set in the DFG depends on basic set n,--already produced. Analyzing the variation of the memory size when basic set N is produced, a conservative observation is that the occupied memory could have a relative growth of MemoryIncrease=N locations (see FIG. 26a). After the production of the N scalars (no matter in which order, as the code is nonprocedural), the n scalars consumed for the last time are no longer necessary, and the corresponding storage can be freed: MemoryDecrease=n. Taking a closer look, two situations may occur: (1) if d.gtoreq.n, only d-n scalars from the basic set N need new storage locations (and, eventually, the N-d scalars that, in fact do not depend on the basic set n) (see FIG. 26b). (2) if d<n, it follows that n-d scalars in the basic set n are no longer necessary and they can be eliminated immediately--before the production of the N scalars. After this memory decrease, a situation as described previously in (1) is obtained (as the number of dependences is now equal to the number of scalars in n). At this moment, a relative memory increase of N-d is expected (see FIG. 26c). In both cases, the relative increase of the occupied storage is MemoryIncrease=max{N-n, 0}, instead of N--as we would be tempted to answer at a quick glance. Example In the DFG from FIG. 15, the computation of the basic set A0 (A›j!›0!, j=0,5), which depends on the basic set in0 (the input scalar in), implies a relative increase of the occupied memory of max {6-1, 0}=5 locations. Indeed, the last scalar A›j!›0!, whichever it is, can overwrite the input scalar in (see line (1) in the Silage code from FIG. 10). B. Suppose the basic set N depends on two basic sets containing n.sub.1, respectively n.sub.2 scalars, the number of dependences being d.sub.1 and, respectively, d.sub.2. There are two possibilities: (1) the definition domain covering the basic set N is produced while consuming a single operand domain. In this case, the scalars in the basic sets n.sub.1 and n.sub.2 are consumed sequentially (see FIG. 25b), and the relative increase of the memory is MemoryIncrease=(d.sub.1 -n.sub.1)+(d.sub.2-n.sub.2)+(N-d.sub.1 -d.sub.2)=N-n.sub.1 -n.sub.2 (2) the computation of the definition domain relies on two operand domains. In this case, the scalars in the basic sets n.sub.1 and n.sub.2 are consumed simultaneously (see FIG. 25c); therefore d.sub.1 =d.sub.2 def/=d, and the growth is upper-bounded by: MemoryIncrease=min{d-n.sub.1, d-n.sub.2 }+(N-d)=N-max{n.sub.1, n.sub.2 } Example ##EQU29## The DFG is of the form displayed in FIG. 25c, where n.sub.1 =1, n.sub.2 =m (corresponding to A›0!, respectively A›j!), N=nm (corresponding to B›i!›j!), and d.sub.1 =d.sub.2 =nm. It can be easily noticed that, independent of any computation order of B›i!›j!, MemoryIncrease=(n-1)m. But also N-max{n.sub.1, n.sub.2 }=nm-max{1, m}=(n-1)m. It must be emphasized that N-max {n.sub.1, n.sub.2 } is an upper bound which might not be necessarily reached: Example ##EQU30## requires a relative memory increase of (n-1)(m-1)=nm-(n+m-1), independent of any computation order. But n+m-1>max{n,m} if n,m>1. The situation described here can be extended without any difficulty to the case when the basic set N depends on several basic sets n.sub.i, consumed for the last time. A special case for the situation presented in FIG. 25c is displayed in FIG. 25d--where the two basic sets n.sub.1 and n.sub.2 coincide. The upper-bound of the relative memory growth is ##EQU31## Example (i: 0 . . . 10)::B›i!=A›i!+A›10-i!; The DFG is of the form displayed in FIG. 25d--where N=n=d=11. As it can be easily noticed, the upper bound of the memory increase is ##EQU32## and it is effectively reached when, for instance, the code is executed procedurally. C. More difficult situations appear when nodes in the DFG have self-dependence arcs. Such cases are due to recurrence relations between the elements (scalars) of the same indexed signal, as shown in FIG. 27a , where the basic set A3 represents the index space A3={x=i.vertline.8.gtoreq.i.gtoreq.3}. Analyzing the memory variation while computing the DFG in FIG. 27a, it is clear that the basic sets A0, A1, A2 have to be computed first (in any order), and the required storage so far is 3 locations. The production of A3 is carried out with one operand domain. There are 3 dependences emerging from A3 towards other basic sets, which means that at most 3 scalars from A3 are needed to proceed the computation of the graph, the others being necessary for the self-computation of A3. This implies that, in the worst case, 2 extra locations are needed by the production of A3: indeed, the first scalar of A3 cain overwrite one of the previous basic sets, say A0; the next two scalars of A3 could require new locations; but afterwards, the next scalars can overwrite either A1, A2, or the older elements of A3. In conclusion, the upper-bound of the relative storage increase is 2 locations (for the whole DFG being 5 locations). It must he emphasized that this is the best estimation we can get from the DFG in FIG. 27a. At this granularity level (27), the insufficient information concerning the computation of A3--the internal dependences of A3, as well as the dependences between its components and the other basic sets--do not allow a more precise estimation of the memory requirement. Increasing the granularity level, the DFG expanded one loop level (FIG. 27b) offers sufficient information to derive one of the 6 best computation orders: A›0!=in; (i: 3 . . . 9 . . . 3)::A›i!=A›i-3!; out›0!=A›9!; and so on, which requires only 2 locations--the absolute minimum value, for any computation order. It has to be remarked that, for the simpler recurrence relation A›i!=A›i-1!+ . . . , the minimal storage requirement of one location is detected even on the granularity level 0 (see FIG. 28). Multiple recurrence relations between elements of the same indexed signal can create multiple self-dependence arcs (see FIG. 29a). With a similar reasoning as above, the upper-bound of the relative memory increase is 1 location for the production of the basic set A36 (the scalars A›3!, A›4!, A›5!, A›6!), and respectively 1 location for the production of A78 (the scalars A›7!, A›8!). For the whole DFG in FIG. 29a , the upper-bound of the storage requirement is 5 locations. Going one granularity level deeper (see FIG. 29b), it can be noticed that 3 locations are enough for any valid computation ordering. All the cases discussed allow to compute the local memory upper-bounds in data-flow graphs by applying simple formulas. The more complex cases can be solved similarly: for instance, the computation of a domain of AN scalars with k nonoverlapping operands (rather than 2 in FIG. 25c) requires at most N-max{n.sub.1, . . . n.sub.k } locations, where n.sub.i denotes the number of scalars of the i-th operand. Data-flow graph traversal The DFG traversal attempts to find the best possible ordering in which the basic sets of signals should be produced such that the memory size is kept as low as possible. This operation carries out, at the same time, an estimation of the storage requirements (31) for the given RMSP algorithm. The DFG traversal takes into account the mapping in-place (at high level) of the basic sets of signals, relying on memory upper-bounds--computed as explained earlier. It is assumed that a basic set can be produced only after the production of all basic sets having dependences to it. It is assumed, also, that, basic sets of signals are produced sequentially. However, it is possible to extend this model in order to allow also parallel production of basic sets. The difficulty does not consist in the determination of memory upper-bounds for parallel computation. The main problem is to determine beforehand which basic sets should be computed in parallel--as such a decision entails a potential increase of the data-path (unknown at this design stage). A global solution is 1. to extract the (largely hidden) parallelism from the initial source code in the form of do.sub.-- all loops; afterwards, 2. to determin | ||||||
