Optimization

Method and apparatus for size optimization of storage units

6078745

Abstract

The present invention provides a method and an apparatus for reducing the storage size required for temporary data by storage order optimization. Advantageously, the execution order optimization and the storage order optimization may be treated independently. The storage size optimization is preferably performed by determining an optimum intra-array and/or inter-array storage order based on a geometrical model. The geometrical model provides a representation of the address space occupied by an array as a function of time and allows the calculation of the window size of the occupied address/time domain of the array. Where calculations would be time-consuming, these may be shortened by making simplifying assumptions, e.g. calculation of upper and lower bounds of the window size of the occupied address/time domain of an array rather than an exact calculation. Further, heuristical simplifications are described to reduce run-times for the optimization process.


Claims

What is claimed is:

1. A method for optimizing before run-time the size of a storage unit for storing temporary data, comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometric model, wherein the intra-structure optimizing also includes selecting an at least piecewise affine storage order for the first data structure.

2. The method of claim 1, wherein each of said data structures occupy abstract addresses, and the abstract addresses occupied by a data structure at each time over a period of time is called the occupied address/time domain of that data structure and calculation of the window size of the first data structure includes calculating the maximum distance in address space between two abstract addresses in the occupied address/time domain of the first data structure.

3. The method of claim 2, wherein the occupied address/time domain of the first data structure includes a plurality of value based flow dependencies and the calculating the window size of that data structure includes calculating the window size of each pair of dependencies and selecting the window size for the first data structure from the pair of dependencies which generates the maximum window size.

4. The method of claim 1, further comprising:

selecting the intra-structure storage order for the first data structure which provides the minimum window size of the relevant data structure, and wherein the data structures has a plurality of dimensions and each dimension may have a storage order in at least one direction and the selecting the minimum window size includes determining upper and lower bounds of the window size for the relevant data structure when the position in the storage order and the storage order direction of a limited number of the dimensions are fixed.

5. The method of claim 4, wherein the determining includes:

calculating the upper and lower bounds of the window size for each direction/dimension combination generated by fixing one combination of one of the plurality of dimensions and one direction in turn;

discarding direction/dimension combinations which cannot lead to an optimum solution;

for each of the selected dimension/direction combinations calculating the upper and lower bounds of the window size for each of the direction/dimension combinations generated by fixing one further combination of one of the plurality of dimensions and one direction in turn;

discarding direction/dimension combinations which cannot lead to an optimum solution; and

repeating the above until all dimensions and directions are fixed.

6. The method of claim 5, wherein the discarding processes include discarding all direction/dimension combinations except the one which gives the smallest upper bound.

7. A method for optimizing before run-time the size of a storage unit for storing temporary data, comprising:

loading into a compiler a definition of at least a first data structure and a second data structure and execution commands, at least one of the execution commands requiring access to at least one of the first and second data structures;

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including inter-structure optimizing which optimizes an inter-structure storage order between the first and second data structures, the inter-structure optimizing being based on a geometric model;

intra-structure optimizing which independently optimizes an intra-structure storage order within each of the first and second data structures to form an abstract address equation for each data structure the intra-structure optimizing being based on a geometric model:

using the abstract address equations in the inter-structure optimizing.

8. The method of claim 7, wherein the inter-structure optimizing includes placement optimizing the placing of data structures in the storage unit to minimize the maximum distance in address space between two addresses being occupied at the same time but not necessarily in the same data structure.

9. The method of claim 8, wherein the abstract addresses occupied by a data structure at the same time over a period of time is called the occupied address/time domain of that data structure and the placement optimizing includes rearranging the occupied address/time domains within the address space without collisions, calculating a common window size for each arrangement of occupied address/time domains and selecting the arrangement with the minimum common window size.

10. The method of claim 9, further comprising calculating rectangular bounds for each occupied address/time domain and rearranging, the rectangular bounds within the address space to determine the minimum common window of the rectangular bounds.

11. The method of claim 9, wherein the folded occupied address/time domain is used instead of the occupied address/time domain of each data structure.

12. A compiler, comprising:

means for loading a definition of at least a first data structure and a second data structure and execution commands, at least one of the execution commands requiring access to at least one of the first and second data structures;

means for reducing the storage size of the temporary data required for the execution of the commands with a substantially given execution order, the reducing means including means for optimizing an inter-structure storage order between the first and second data structures, the inter-structure optimizing means being adapted to carry out the optimization based on a geometrical model, wherein the inter-structure optimizing means includes means for selecting the intra-structure storage order for at least one data structure which provides the minimum window size of the relevant data structure, and the data structures have a plurality of dimensions and each dimension may have a storage order in at least one direction and the selecting means includes means for determining upper and lower bounds of the window size for the relevant data structure and for fixing the position in the storage order and the storage order direction of a limited number of the dimensions.

13. The compiler of claim 12, wherein the inter-stricture optimizing means includes means for the placing of data structures in the storage unit to minimize the maximum distance between two addresses in address space being occupied at the same time but not necessarily in the same data structure.

14. The compiler of claim 13, wherein the abstract addresses occupied by a data structure at the same time over a period of time is called the occupied address/time domain of that data structure and the placement means includes means for rearranging the occupied address/time domains within the address space without collisions, means for calculating a common window size for each arrangement of occupied address/time domains and means for selecting the arrangement with the minimum common window size.

15. The compiler of claim 12, further comprising means for independently optimizing an intra-structure storage order within each of the first and second data structures based on a geometrical model to form an abstract address equation for each data structure.

16. A compiler, comprising:

means for loading execution commands and a definition of at least one data structure, at least one of the execution commands requiring access to at least one data structure; and

means for reducing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the reducing means including intra-structure optimizing means for optimizing an intra-structure storage order at least within the one data structure, the intra-structure optimizing means including means for calculating a window size for the one data structure, the intra-structure optimizing being based on a geometric model, wherein the at least one data structure occupies abstract addresses, and the abstract addresses occupied by the at least one data structure at each time over a period of time is called the occupied address/time domain of the data structure and the calculation means is adapted to calculate the maximum distance in address space between two abstract addresses in the occupied address/time domain.

17. The compiler of claim 16, wherein the occupied address/time domain of a data structure includes a plurality of value based flow dependencies and the calculating means is adapted to calculate the window size of each pair of dependencies and to select the window size for the data structure from the pair of dependencies which generates the maximum window size.

18. The compiler of claim 16, wherein the data structure has a plurality of dimensions and each dimension may have a storage order in at least one direction, the calculating means including means for fixing the position in the storage order and the storage order direction of a limited number of the dimensions and for then determining upper and lower bounds of the window size for the data structure.

19. A program storage device storing instructions that when executed by a computer perform the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the intra-structure optimizing also includes selecting an at least piecewise affine storage order for the first data structure.

20. A program storage device storing instructions that when executed by a computer perform the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the intra-structure optimizing which independently optimizes an intra-structure storage order within each of the first and second data structures forms an abstract address equation for each data structure, the intra-structure optimizing being based on a geometrical model; and

using the abstract address equations in inter-structure optimizing.

21. A program storage device storing instructions that when executed by a computer perform the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure;

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model; and

selecting an intra-structure storage order for at least one data structure which provides the minimum window size of the first data structure, and the at least one data structure has a plurality of dimensions and each dimension may have a storage order in at least one direction and the selecting includes determining upper and lower bounds of the window size for the at least one data structure and for fixing the position in the storage order and the storage order direction of a limited number of the dimensions.

22. A program storage device storing instructions that when executed by a computer perform the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the abstract addresses occupied by the first data structure at each time over a period of time is called the occupied address/time domain of the first data structure and the calculation means is adapted to calculate the maximum distance in address space between two abstract addresses in the occupied address/time domain.

23. An executable computer program, wherein the size of a storage unit for storing temporary data of the computer program was optimized before run-time by the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure used by computer program, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the intra-structure optimizing also includes selecting an at least piecewise affine storage order for the first data structure.

24. An executable computer program, wherein the size of a storage unit for storing temporary data of the computer program was optimized before run-time by the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure used by computer program, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the inter-structure optimizing means includes means for selecting the intra-structure storage order for at least one data structure which provides the minimum window size of the relevant data structure, and the data structures have a plurality of dimensions and each dimension may have a storage order in at least one direction and the selecting means includes means for determining upper and lower bounds of the window size for the relevant data structure and for fixing the position in the storage order and the storage order direction of a limited number of the dimensions.

25. An executable computer program, wherein the size of a storage unit for storing temporary data of the computer program was optimized before run-time by the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure used by computer program, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometrical model, wherein the first data structure occupies abstract addresses, and the abstract addresses occupied by the first data structure at each time over a period of time is called the occupied address/time domain of the data structure and the calculation means is adapted to calculate the maximum distance in address space between two abstract addresses in the occupied address/time domain.

26. A method for optimizing before run-time the size of a storage unit for storing temporary data, comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including intra-structure optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing being based on a geometric model.

27. A method for optimizing before run-time the size of a storage unit for storing temporary data, comprising:

loading into a compiler a definition of at least a first data structure and a second data structure and execution commands, at least one of the execution commands requiring access to at least one of the first and second data structures; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the storage size optimizing including inter-structure optimizing which optimizes an inter-structure storage order between the first and second data structures, the inter-structure optimizing being based on a geometric model.

28. A compiler, comprising:

means for loading a definition of at least a first data structure and a second data structure and execution commands, at least one of the execution commands requiring access to at least one of the first and second data structures; and

means for reducing the storage size of the temporary data required for the execution of the commands with a substantially given execution order, the reducing means including means for optimizing an inter-structure storage order between the first and second data structures, the inter-structure optimizing means being adapted to carry out the optimization based on a geometrical model.

29. A compiler, comprising:

means for loading execution commands and a definition of at least one data structure, at least one of the execution commands requiring access to at least one data structure; and

means for reducing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the reducing means including intra-structure optimizing means for optimizing an intra-structure storage order at least within the one data structure, the intra-structure optimizing means including means for calculating a window size for the one data structure, the intra-structure optimizing being based on a geometric model.

30. A program storage device storing instructions that when executed by a computer perform the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the optimizing step including an intra-structure optimizing step for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing step being based on a geometrical model.

31. An executable computer program, wherein the size of a storage unit for storing temporary data of the computer program was optimized before run-time by the method comprising:

loading into a compiler execution commands and a definition of at least a first data structure used by computer program, at least one of the execution commands requiring access to the at least first data structure; and

optimizing the storage size of the temporary data in the storage unit required for the execution of the commands with a substantially given execution order, the optimizing step including an intra-structure optimizing step for optimizing an intra-structure storage order at least within the first data structure and the optimizing including calculating a window size for the first data structure, the intra-structure optimizing step being based on a geometrical model.


Description

The present invention relates a method and apparatus (e.g. a compiler) which may find application for the reduction of storage size required for temporary data during the execution of a set of commands, e.g. a computer program. The present invention is particularly suitable for the reduction of storage size for multimedia applications and other applications which process data in the form of multi-dimensional arrays.

TECHNICAL BACKGROUND

Due to the enormous progress that has been made during the last few decades in the area of integrated circuits (ICs), the complexity of systems implemented in silicon has been increasing drastically. This has lead to system-on-a-chip designs where all the components previously combined on a board are now integrated on a single die. The design of these systems often requires trade-offs to be made between the cost factors such as: chip area/count, power consumption, design time, and execution speed. The vast majority of recent signal processing systems that have to be implemented in silicon, including multimedia systems, require huge amounts of data to be processed, transferred and stored temporarily. As a result, the largest contributions to the area and power cost factors originate from data storage and transfers. Data-intensive signal processing applications can be subdivided into two classes: most multimedia applications (including video and medical imaging,) and network communications protocols (e.g. ATM network protocols). On application specific integrated circuits (ASICs), typically more than half of the silicon area is being occupied by storage units and related hardware such as address generation logic. Moreover, most of the power consumption in these systems is directly related to data accesses and transfers, both for custom hardware and for processors. Therefore, there is a need to improve data storage and transfer management and to reduce the chip area/count and power consumption.

Unfortunately, any effective optimization would require relatively aggressive global transformations of the system specifications and, due to the high complexity of modern systems, such transformations are often impossible to perform manually in an acceptable time. Many designers are not even aware that such optimizations might affect the area and power cost considerably. Commercially available CAD tools for system synthesis currently offer little or no support for global system optimizations. They usually support system specification and simulation, but lack support for global design exploration and certainly for automated global optimizations. They include many of the currently well-known scalar optimization techniques (e.g. register allocation and assignment) but these are not suited for dealing with large amounts of multi-dimensional data. Also standard software compilers are limited mainly to local (scalar) optimizations.

For instance, storage size requirements for multi-dimensional data have received only little attention from the compiler communities because they were (initially) not seen as a cost (nor a problem). The storage order of data was originally even not seen as a optimization parameter, e.g. it was simply defined by the programming language (e.g. row-major in C or column-major in Fortran), and the optimization efforts were concentrated on obtaining the highest possible degree of parallelism. If a bad storage order is chosen, many "holes" in the memories may be formed, i.e. locations that cannot be used at certain moments in time, resulting in increased storage size requirements. Unfortunately, manual optimization of the storage requirements is very tedious and error-prone for real life applications, because it involves complex bookkeeping. Techniques to help designers take better decisions, or even to automate this difficult task, are therefore certainly desirable.

It is an object of the present invention to provide a method and an apparatus for reducing the storage space required for temporary data when executing a program, particularly to reduce the storage size of multi-dimensional data arrays as these have the largest impact on the storage cost.

Another object of the present invention is to find a storage order for each (part of an) array such that the overall required size (number of locations) of the memories is minimal.

Another object of the present invention is to find an optimal layout of the arrays in the memories such that the reuse of memory locations is maximal.

SUMMARY OF THE INVENTION

The present invention may provide a method for optimizing before run-time the size of a storage unit for storing temporary data, comprising the step of: loading into a compiling means execution commands and a definition of at least: a first data structure, at least one of the execution commands requiring access to said at least first data structure; and an optimizing step for reducing the storage size of the temporary data in said storage unit required for the execution of the commands with a substantially given execution order, said optimizing step including an intra-structure optimizing step for optimizing an intra-structure storage order at least within said first data structure and said optimizing step also including calculating a window size for said first data structure, said intra-structure optimizing step being based on a geometrical model.

The present invention may also provide a method for optimizing before run-time the size of a storage unit for storing temporary data, comprising the steps of: loading into a compiler means a definition of at least a first and a second data structure and execution commands, at least one of said execution commands requiring access to at least one of said first and second data structures; and an optimizing step for reducing the storage size of the temporary data in said storage unit required for the execution of the commands with a substantially given execution order, said optimizing step including an inter-structure optimizing step of optimizing an inter-structure storage order between said first and second data structures, said inter-structure optimizing step being based on a geometrical model.

The present invention also includes a compiling apparatus comprising: means for loading execution commands and a definition of at least a first data structure, at least one of the execution commands requiring access to said at least first data structure; and means for reducing the storage size of temporary data required for the execution of the commands with a substantially given execution order, said reducing means including intra-structure optimizing means for optimizing an intra-structure storage order at least within said first data structure, said intra-structure optimizing means including means for calculating a window size for said first data structure based on a geometrical model.

The present invention also includes a compiling apparatus comprising: means for loading a definition of at least a first and a second data structure and execution commands, at least one of said execution commands requiring access to at least one of said first and second data structures; and means for reducing the storage size of the temporary data required for the execution of the commands with a substantially given execution order, said reducing means including means for optimizing an inter-structure, storage order between said first and second data structures, said inter-structure optimizing means being adapted to carry out the optimization based on a geometrical model.

In accordance with the present invention the objects of the invention may be solved by decomposing the problem into two sub-problems, which can be solved independently or together: intra-array storage order optimization and inter-array storage order optimization. During storage order optimization, the detailed layout of the large data structures in the memories is being decided upon, i.e. the storage order of the data. Storage order in accordance with the present invention relates to how data is allocated to real or virtual memory space before run-time, e.g. at compile time. Given the memory configuration and data-to-memory assignment, this task tries to find a placement of the multi-dimensional data in the memories at compile time in such a way that memory locations are reused as much as possible during the execution of the signal processing application. The goal of this task is to minimize the required sizes of the memories. This directly has impact on the chip area occupation and indirectly also on the power consumption. This task is extremely difficult to perform manually due to the complex code transformations and bookkeeping that are required. Automation of this task is therefore crucial to keep the design time acceptable. The present invention particularly targets the class of applications in which the placement of data is handled at compile time. For some network applications the placement of data is handled at run-time, which requires different techniques. Due to the dominant influence of the data storage and transfers on the overall system cost, the present invention is intended to be applied very early in the design process, i.e. before the synthesis of address generation hardware, datapaths, and controllers.

By tackling the data storage and transfer related issues first, one may impose additional constraints on the remaining synthesis tasks, which may lead to suboptimal solutions for these tasks. However, the dominance of the data storage and transfers on the overall system cost is usually overwhelming, such that suboptimal solutions for the remaining tasks have only a secondary effect on the overall cost. Moreover, application studies have shown that near-optimal solutions for the datapaths can usually still be found by applying the appropriate transformations. Given the fact that the storage order optimization task is located near the end of the optimization script, where the execution order and the data-to-memory assignments have been substantially fixed, its input contains information about what data are stored in what memories and when these data enter and leave the memories.

The dependent claims define individual further embodiments of the present invention. The present invention will now be described with reference to the following drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1a is a schematic global view of a target architecture for use with the present invention.

FIG. 1b is a particular target architecture for use with the present invention.

FIG. 2 is a schematic representation of the flow from system requirements to final implementation of which the present invention may form a part.

FIG. 3 is an example of domains and corresponding flow dependency according to the present invention.

FIG. 4 is an example of memory occupation interval according to the present invention.

FIG. 5 is an example of a BOAT-domain according to the present invention.

FIG. 6 is an example of BOAT- and OAT-domains according to the present invention.

FIG. 7 shows further examples of BOAT-domains according to the present invention.

FIG. 8 shows OAT- and COAT-domains relating to FIG. 7.

FIG. 9 is an example of memory occupation according to the present invention.

FIG. 10 shows examples of abstract OAT-domains of five arrays.

FIG. 11a shows static and FIG. 11b shows dynamic allocation strategies for the arrays shown in FIG. 10 according to the present invention.

FIG. 12a shows static windowed and FIG. 12b shows dynamic windowed allocation strategies according to the present invention.

FIG. 13 shows dynamic allocation with a common window according to the present invention.

FIG. 14 shows the possible storage orders of a 2.times.3 array.

FIG. 15 shows the address reference window size calculation according to the present invention for three dependencies.

FIG. 16 shows a further address reference window calculation according to the present invention.

FIG. 17 shows folding of an OAT-domain by means of a modulo operation according to the present invention.

FIG. 18 shows the address reference window calculation for a non-manifest BOAT-domain according to the present invention.

FIG. 19 shows the calculation of distance components for two arrays according to the present invention.

FIG. 20 shows an intra-array storage search tree for a three-dimensional array according to the present invention.

FIG. 21 shows a folded OAT-domain according to the present invention.

FIG. 22 shows the folded OAT-domain of FIG. 21 after a projection of a dimension according to the present invention.

FIG. 23 shows the OAT-domain of FIG. 21 for a sub-optimal storage order.

FIG. 24 shows optimization run-times in accordance with search strategies of the present invention.

FIG. 25 shows two compatibility scenarios according to the present invention.

FIG. 26 shows mergability according to the present invention.

FIG. 27 shows a folding conflict according to the present invention.

FIG. 28 shows distances between two OAT-domains according to the present invention.

FIGS. 29a and b shows a distance calculation according to the present invention.

FIG. 30 shows a valid intermediate solution not satisfying a distance calculation according to the present invention.

FIG. 31 shows an embodiment of the placement strategy according to the present invention.

FIGS. 32a and b show examples of allowed and non-allowed array splits according to the present invention.

FIG. 33 shows optimization run-times for an overall optimization strategy according to the present invention.

FIG. 34 shows optimized memory layouts according to the present invention.

FIG. 35 shows a common structure of ILP problems according to the present invention.

FIG. 36 shows additional ILP problem structures.

FIG. 37 shows the concept of invariant and carrying loops according to the present invention.

FIG. 38 shows a common variant loop for two dependencies according to the present invention.

FIG. 39 shows syntax tree analysis according to the present invention.

FIG. 40 shows syntax tree analysis with carrying loop according to the present invention.

FIG. 41 shows possible solutions for a pair of carried dependencies according to the present invention.

FIGS. 42a and b show locking of iterator instances according to the present invention.

FIG. 43 shows an infeasible solution to an ILP problem.

FIG. 44 shows abortion of window calculations according to the present invention.

FIG. 45 shows avoiding exact ILP solutions through upper bound estimations according to the present invention.

FIG. 46 shows sorting of dependency combinations according to the present invention.

FIGS. 47-49 show partial distance calculations according to the present invention.

FIG. 50 shows experimental results of the speed-up heuristics according to the present invention.

FIG. 51 shows the effect of window calculation decomposition technique on optimization run-times for intra-array storage order according to the present invention.

    __________________________________________________________________________
    DEFINITION OF SYMBOLS USED
    __________________________________________________________________________
    A.sub.m.sup.abstr ( ) abstract address equation for array variable m
    A.sub.m.sup.abstr ( ) partial abstract address equation for array
    variable m
    A.sub.m.sup.real ( ) real address equation for array variable m
    C.sub.m.sup.addr ( ) storage order constraint (vector) function of
    D.sub.m.sup.var
    C.sub.ikm.sup.def ( ) constraint (vector) function of D.sub.ikm.sup.def
    C.sub.i.sup.iter ( ) contraint (vector) function of D.sub.i.sup.iter
    C.sub.jlm.sup.oper ( ) constraint (vector) function of D.sub.jlm.sup.oper
    C.sub.jlm.sup.rtime ( ) execution order constraint (vector) function of
    D.sub.jlm.sup.oper
    C.sub.i.sup.time ( ) execution order constraint (vector) function of
    D.sub.i.sup.iter
    C.sub.m.sup.var ( ) constraint (vector) function of D.sub.m.sup.var
    C.sub.ikm.sup.wtime ( ) execution order constraint (vector) function of
    D.sub.ikm.sup.def
    D.sub.m.sup.addr storage address domain of D.sub.m.sup.var
    D.sub.ijklm.sup.BOAT binary occupied address/time domain for definition k
    of statement I writing array
    variable m and operand l of statement j reading the same array variable
    D.sup.COAT collective occupied address/time domain
    D.sub.ikm.sup.def definition domain of definition k of statement i for
    array variable m
    D.sub.i.sup.iter iteration domain of statement i
    D.sub.m.sup.OAT occupied address/time domain of array variable m
    D.sub.jlm.sup.oper operand domain of operand l of statement j for array
    variable m
    D.sub.jlm.sup.rtime read execution time domain of D.sub.jlm.sup.oper
    D.sub.m.sup.var variable domain of array variable m
    D.sub.ikm.sup.wtime execution time domain of D.sub.ikm.sup.def
    F.sub.ikm.sup.def ( ) unknown constraint (vector) function of
    D.sub.ikm.sup.def
    F.sub.i.sup.iter ( ) unknown constraint (vector) function of
    D.sub.i.sup.iter
    F.sub.jlm.sup.oper ( ) unknown constraint (vector) function of
    D.sub.jlm.sup.oper
    F.sub.m.sup.var ( ) unknown constraint (vector) function of
    D.sub.m.sup.var
    M.sub.ikm.sup.def definition mapping between D.sub.i.sup.iter and
    D.sub.ikm.sup.def
    M.sub.ijklm.sup.flow dependency relation for definition k of statement i
    writing array variable m and
    operand l of statement j reading the same array variable
    M.sub.jlm.sup.oper operand mapping between D.sub.i.sup.iter and
    D.sub.jlm.sup.oper
    O.sub.m.sup.addr ( ) storage order function of D.sub.m.sup.var
    O.sub.jlm.sup.rtime ( ) read execution order offset function of
    D.sub.jlm.sup.oper
    O.sub.i.sup.time ( ) execution order function of D.sub.i.sup.ter
    O.sub.ikm.sup.wtime ( ) write execution order offset function of
    D.sub.ikm.sup.def
    __________________________________________________________________________


DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

The present invention will be described with reference to certain embodiments and examples and with reference to certain schematic drawings but the invention is not limited thereto but only by the claims. The present invention will be described for convenience with reference to multimedia applications but is not limited thereto. The present invention may find application in any storage reduction application, particularly those in which multi-dimensional arrays of data are processed. In particular the present invention may be independent from or included within a global optimization scheme which may include one or more of: removal of algorithmic bottle-necks, global data life-time reduction, exploitation of memory hierarchy through data reuse, storage cycle budget Distribution, memory allocation and assignment, e.g. number and types, formal verification and address hardware generation. In the following reference is made to certain "strategies", each of these strategies is an independent embodiment of the present invention.

Most real-life multimedia applications deal with large amounts of multidimensional arrays. The search space for all the possible storage orders of ill these arrays becomes very large. First of all the number of possible storage orders for an array quickly explodes when the number of dimensions increases. For single-assignment code, which reveals the true dimensionality of the data, this number can be as high as 10 in practice. Secondly, the presence of multiple arrays enables reuse of storage locations between different arrays, which increases the complexity of the optimization task. Finding a good storage order is crucial though, as the required storage sizes may vary with orders of magnitude depending on the storage order.

The target architecture in accordance with the present invention can be very heterogeneous and imposes few constraints except that a global system pipeline should be present between the memories and the datapaths. A simplified global view is shown in FIG. 1a. The main components in the target architecture are the datapaths 2, the memories 4, the address generation hardware 3, and the global controller 1. The main focus of the present invention lies on data storage and transfers, regions 4 and 6 in FIG. 1a. The datapath 2 may be composed of (a mixture of) several types of execution units, e.g. custom hardware and embedded DSP or RISC cores, i.e. the application may be implemented completely in hardware or in a combination of hardware and software. Also the memories 4 in the architecture useable with the present invention can be of different natures. For instance, ROMs, several kinds of RAMs, FIFOs, and pointer-addressed memories can be present in various combinations.

The global controller 1 and the address generation hardware 3 may be implemented in custom hardware, but the invention is not limited thereto. For instance if an embedded DSP core is integrated in the datapath 2, it is possible that it has one or more integrated programmable address generators that can be used to control the memories.

The different main components in the architectures in accordance with the present invention are not limited to those shown in FIG. 1a, various components may be intermixed in the final hardware implementation. For instance, a DSP core may have an integrated local RAM or, as mentioned above, integrated address generators. Moreover, it is not necessary that the whole system is integrated on a single chip. For instance, large background memories may be implemented on separate chips. An example of a SIMD-type target architecture is shown schematically in FIG. 1b which may be used for implementing block-oriented video algorithms. The architecture contains a few shared memory banks 7 and several processing elements 8 with local memories 9. Each of the PE's 8 has access to the local memories 9 of its left and right neighbors. After deciding on the detailed execution order and data distribution, we exactly know which data is stored in each memory 7 or 9 and when this data is accessed. Given this information the storage order optimization in accordance with the present invention can be applied. The present invention may also be applied to MIMD-type architectures with shared memories provided that the memory accesses can be synchronized and that their relative order is known before run-time.

The vast majority of multimedia applications may be described by a "program" in a certain language. This program specifies what operations have to be executed on what data and many times also how they have to be executed, e.g. in what order operations have to be executed or in what order data have to be stored. In the latter case the language is called procedural. In case the "how" is not specified, the language is called applicative. However, when the application is described in a procedural language, the "how" is usually not really part of the specification. It is mostly present only as a matter of convenience, e.g. because procedural programs can easily be executed. Determining how this program will eventually be executed in the final (hardware or software) implementation, is precisely the job of the designer (or the design tool). Usually he has the freedom to ignore, or at least modify, the specification of the "how" in the original description, as long as the implementation meets the functional and timing constraints. Moreover, it is usually even allowed to change the (internal) "what" of the program as long as the (external) I/O-behavior is not modified.

Nevertheless, the "what" is the most important part of the specification. It consists of two major components: the data and the operations on the one hand and the dependencies between them on the other hand. Data and operations are usually heavily modified during earlier steps in the design flow. For instance, the algorithms used to implement a certain functionality may be heavily changed during the initial design stages. A (simplified) flow from system requirements to final implementation can be represented as in FIG. 2.

The system design process usually starts with the algorithm design and exploration of different algorithm alternatives. The result of this phase is an initial algorithmic specification of the system. In case this specification is given as a procedural program, data-flow analysis allows us to extract the "real" specification, i.e. the data and operations and the dependencies between them, at least in theory. In case the original specification is applicative, this step can be skipped as it is already the "real" specification. Next we can proceed with the data transfer and storage exploration, and impose a(n) (new) order on the execution of operations and the storage of data. This is the input for other system design tasks, such as traditional high-level synthesis.

Naturally, in case the original specification was procedural, it is the intention that the newly imposed order leads to a better implementation than the original one would have. The data-flow analysis step is crucial to reach this goal, because it reveals the freedom that is present in the initial specification.

Given the important role of the data, the operations, the dependencies and the order in this flow, it is important to model them accurately. Multimedia applications not only typically consist of thousands or even millions of operations and data elements, but also exhibit a large degree of regularity (for instance, one can recognize large groups of operations and data that are treated in a similar way).

In the parallel compiler and regular array synthesis communities the main interest lies in the accurate modeling of the execution of program statements and the dependencies between these executions in order to be able to exploit the potential parallelism. This has lead to the definition of the concept of an iteration domain (P. Feautrier, "Dataflow analysis of array and scalar references" International Journal of Parallel Programming, 20(1) pp 23-53, 1991) also called iteration space (C. D. Polychronopoulos, "Compiler optimizations for enhancing parallelism and their impact on architecture design", IEEE Trans. on Computers, 37(8) pp 991-1004, August 1988; W. Li and K. Pingali, "A singular loop transformation framework based on non-singular matrices", Proc. of the Fifth Annual Workshop on Language and Compilers for Parallelism, New Haven, August 1992), index set (Weijia Shang, Matthew T. O'Keefe, Jose A. B. Fortes, "Generalized cycle shrinking" in Algorithms and Parallel VLSI Architectures II, pp 132-144. Elsevier Science, 1992), node space (M. van Swaaij, F. Franssen, F. Catthoor, and H. De Man "Automating high-level control flow transformations for DSP memory management" Proc. of the IEEE Workshop on VLSI Signal Processing, Napa Valley, Calif., October 1992), computation domain (A. Darte, T. Risset, and Y. Robert "Loop nest scheduling and transformations", Advances in Parallel Computing 6, pages 309-332. North Holland, Amsterdam environments and tools for parallel scientific computing edition, 1993), or index space (Christian Lengauer "Loop parallelization in the polytope model", Proc. of the Fourth International Conference on Concurrency Theory (CONCUR93), Hildesheim, Germany, August 1993). An iteration domain is a mathematical description of the executions of a statement in a program. It is a geometrical domain in which each point with integral coordinates represents exactly one execution of the statement. Each statement has its own iteration domain, and the set of points in this domain represents all the executions of this statement. The execution of a specific statement as referred to as an operation. The different mathematical dimensions of such a domain generally correspond to the different iterators of the loops surrounding the corresponding statement (but this is not strictly required). The domain itself is described by means of a set of constraints corresponding to the boundaries of these iterators. In general, the loop boundaries are not constant. Moreover, the program may contain conditional statements. The iteration domain model can easily incorporate these properties though, provided that the loop boundaries and conditions are functions of the iterators of the surrounding loops.

Another important concept is that of auxiliary dimensions, which are also called wildcards. Sometimes it is not possible to represent an iteration domain only by constraints on its dimensions. In those cases, it may be necessary to introduce extra dimensions in order to be able to correctly describe the iteration domain. This is for instance the case if the loop iterators are not incremented by 1 after each execution of the loop.

In general the auxiliary dimensions of a domain are not unique. They only act as a mathematical aid for describing specific constraints. Any combination of auxiliary dimensions and constraints that results in the same orthogonal projection onto the real dimensions is therefore equivalent from a modeling point of view. Of course, from a manipulation or transformation point of view, this is not true. One representation can be more efficient than another one for certain purposes. For instance, certain techniques can only operate efficiently on dense domains or domains that are described as mappings of dense domains.

Sometimes the reasons for the introduction of auxiliary dimensions are less straightforward. Modulo operations in array index expressions may be handled in an efficient way through the introduction of extra (auxiliary) dimensions. Mainly for technical reasons, the iteration domain models being used in parallel compilers and regular array synthesis were usually restricted to be described only by affine constraints. By doing so, many of the well-known results from linear algebra can be used for analysis and optimization of programs. The class of programs that can be handle,d by these restricted models is quite large. However, there is no fundamental reason why the models should not be capable of representing arbitrary functions instead of only affine functions. Of course there have been some practical reasons, i.e. if non-affine functions are present, the techniques for analyzing and optimizing the corresponding programs may have to be extended and may become computationally more complex.

In accordance with the present invention, modeling is done not only of the executions of the statements of a program (by means of iteration domains), but also of the accesses to the program variables, and especially of the array variables. Accordingly, definition domains, operand domains and variable domains (definition domains and operand domains have also been called operation space and operand space respectively) are required. In the remainder of this text, reference is made to program variables simply as variables, i.e. the "program" prefix will not be used but is implied. There should be no confusion with mathematical variables, which are used in the mathematical descriptions of the different domains.

Typically the variables being written or read during the execution of a program are grouped into sets of similar variables, which are called arrays or more generally: data structures. These arrays are arranged as multi-dimensional structures, in which each individual variable can be addressed by a unique set of indices. These multi-dimensional structures are suited to be modeled by geometrical domains. A variable domain is a mathematical description of an array of variables. Each point with integer coordinates in this domain corresponds to exactly one variable in the array.

During the execution of a program, not every statement accesses each array of variables completely. Typically, the executions of a statement only access part of one or a few arrays. The definition and operand domains of a statement describe which variables are being accessed during all possible executions of that statement. Each point with integer coordinates in these domains corresponds to exactly one variable that is being written (in case of a definition domain) or read (in case of an operand domain). Possibly, a variable is written or read multiple times during the execution of a program, even by the same statement.

The relations between the executions of the statements and the variables that are being written or read, are represented by means of mathematical mappings between the dimensions of the iteration domains and the dimensions of the definition or operand domains. These mappings will be referred to as the definition mappings and operand mappings respectively. Note that in general, the definition or operand domains may require the use of extra auxiliary dimensions, next to the dimensions of the corresponding iteration domain. We refer to these mappings as the definition mappings and the operand mappings respectively.

The present invention includes the extension of the above affine models to include non-affine and even non-manifest control and data flow. Manifest means that both the control flow and the data flow (i.e. data values) of the program are known at compile time. Programs can be described by means of iteration, variable, definition, and operand domains/mappings, provided that the program is affine and manifest. In general, these domains and mappings have the following form: ##EQU1##

In these equations, C.sub.i.sup.iter (), C.sub.m.sup.var (), C.sub.ikm.sup.def (), and C.sub.jlm.sup.oper () represent sets of constraint functions of D.sub.i.sup.iter, D.sub.m.sup.var, D.sub.ikm.sup.def, and D.sub.jlm.sup.oper respectively. Note that in practice, many of the inequalities described by these constraint functions can be reduced to equalities, but from a modeling point of view this makes no difference. Therefore, and in order to keep the notation a simple as possible, we continue to use the inequalities rotation. Also note that the main dimensions of the iteration domains can be seen as auxiliary dimensions of the corresponding definition and operand domains. In many cases, these dimensions can be eliminated, as they are existentially quantified (just as any other auxiliary dimension).

The definition and operand mappings will not be referred to explicitly anymore, unless there is a good reason for it, because the information about these mappings can be readily extracted from the (non-simplified) definition and operand domains descriptions. For instance, the mappings from the iteration domains to the operand/definition domains indicate what array elements are accessed by what operations and vice versa. The (simplified) domain descriptions alone do not contain the information about these relations.

These mathematical models allow description of all practical programs that are manifest, i.e. non-recursive programs that contain for-loops with manifest boundaries, manifest conditions and array accesses with manifest index expressions, even if these are not affine. Note that in the worst case one may have to rely on a complete enumeration of the points in the domains, but even an enumeration fits within the models.

For modeling a special class of non-affine manifest index expressions, namely piece-wise linear indexing caused by modulo operations in index expressions. In "Modeling piece-wise linear and data dependent signal indexing for multi-dimensional signal processing", F. Franssen, M. van Swaaij, F. Catthoor, and H. De Man, Proceedings of the 6th ACM/IEEE International Workshop on high-level synthesis, pp. 245-255, Dana Point, Calif., November 1992 it is indicated how a loop (nest) can be rewritten to get rid of the modulo operation. A better alternative is to introduce auxiliary dimensions in the descriptions of the definition and/or operand domains. This method has the advantage that the loops can be left in their original form, and that it can be applied to each operand or definition domain separately. A potential disadvantage of this technique is that the resulting domain descriptions are no longer "dense". They can be converted to dense descriptions using the techniques presented in the above article or in "Transformation of nested loops with modulo indexing to affine recurrences", F. Balasa, F. Franssen, F. Catthoor, and Hugo De Man. Parallel Processing Letters, 4(3), pp. 271-280, December 1994.

The main problem left is the modeling of non-manifest programming constructs. The behavior of many algorithms depends on data values that are not known until the algorithm is executed (e.g. input data). Consequently, the programs implementing these algorithms are not manifest, i.e. the data and/or control flow may be different each time the program is executed. The models presented above are not capable of describing this kind of programs (or at least not very accurately) and therefore require further extensions.

One of the best known extensions is the use of symbolic constants. In scientific computing the problem of optimizing programs for which some of the structure parameters are not yet known occurs frequently. It is for instance possible to do data flow analysis in the presence of unknown structure parameters by treating them as symbolic constants. This concept of symbolic constants and the corresponding analysis techniques can be extended to include any value that is not known at compile time. These unknowns are sometimes called dynamically defined symbolic constants, or hidden variables or placeholder variables. The main difference between the traditional symbolic constants and the more general hidden variables is that hidden variables need not have a constant value during the execution of the program. But even though the actual values are not known at compile time, the rate at which these values changes, is usually known. Since the information about this rate can be crucial in order to perform optimizations, programs containing hidden variables may still be described as accurately as possible, i.e. the rate at which they change may be described. Techniques for dealing with symbolic constants can also be used for dealing with data dependent variables, provided that they remain constant during the execution of the program. In general, it is not required that these values are integral (e.g. input may be a floating point variable). Of course, some constraints may be imposed by the context of the program, and in that case we can add them to the domain descriptions, but this is not required by the model. In the following we assume that dimensions that have not been constrained can take any real value.

In cases with data dependent variables that vary with each execution of a loop, we cannot represent the variable by means of an unknown variable that remains constant during the execution of the program. Therefore, we must explicitly indicate that the value of this unknown variable may be different for each value of the iteration. In other words, the value of the unknown variable is a function of the number of the iteration, although the function is not known at compile time. The functions used to model these unknown variables are also known as uninterpreted function symbols.

The unknowns which are constant during the execution of a program can be seen as a degenerate case of varying unknowns, i.e. varying unknowns that do not vary. In other words, they can be modeled by means of unknown functions without input arguments. In this way, we can unify the modeling techniques for symbolic constants, constant unknowns, and varying unknowns. So we can rewrite the expressions in a uniform way as follows: ##EQU2## Note that F.sub.1/2.sup.iter () are functions without input arguments, i.e. constant functions.

In general, we can describe the geometrical domains associated with programming constructs by means of three kinds of dimensions:

1. Main dimensions: these are the real dimensions of the domains. All other dimensions should be eliminated in order to obtain the real domain consisting of points with integer coordinates (although this may be possible only at runtime). Each of these points then corresponds to exactly one program entity (e.g. an operation, a variable, . . . ). Each main dimension generally corresponds to an iterator in the program (although this iterator may not be explicitly present) or a dimension of a (multi-dimensional) variable. For a given semantical interpretation and a given programming construct, the main dimensions of a domain are unique, i.e. all domain descriptions corresponding to a certain programming construct should be mathematically equivalent. The shape of the iteration domain of a statement for instance, is unique for a given loop structure surrounding the statement and a given semantical interpretation of the dimensions. However, through transformations the shapes of domains can be modified, but then the corresponding programming constructs are assumed to be transformed also. In the following, vectors of main dimensions in formal equations are represented by bold letters, e.g. i.

2. Auxiliary dimensions (also called wildcard dimensions): these dimensions are not really part of the domains, but are used to be able to express more complex constraints on the domains. In general, auxiliary dimensions cannot be associated with any variables present in the program. An exception are the auxiliary dimensions that correspond to main dimensions of other domains (e.g. the dimensions of the iteration domains also appear in the descriptions of the definition and operand domains, unless they have been eliminated). Auxiliary dimensions are nothing but a mathematical aid, and are existentially quantified and therefore not unique. Only the result obtained after their elimination matters. In the following, vectors of auxiliary dimensions in formal equations are represented by Greek letters, e.g. .alpha. (except when they correspond to the main dimensions of other domains, in that case we leave them in bold to highlight this correspondence).

3. Hidden dimensions: these dimensions are also not really part of the domains, but are used to model non-manifest behavior. Just like auxiliary dimensions, hidden dimensions are not unique either, and have to be eliminated in order to obtain the real domains. The difference is that in general this elimination cannot be done at compile time. Generally, hidden dimensions correspond to symbolic constants or data-dependent variables in the program. Hidden dimensions are always expressed as functions of main dimensions and are also existentially quantified. In the following, vectors of hidden dimensions in formal equations are represented by italic letters, e.g. r.

So in general we can represent the iteration, variable, definition and operand domains as follows: ##EQU3##

The dimensions corresponding to the components of vectors d and o of D.sub.ikm.sup.def and D.sub.jlm.sup.oper are the same as the dimensions of vector s of D.sub.m.sup.var, since D.sub.ikm.sup.def and D.sub.jlm.sup.oper are always sub-domains of D.sub.m.sup.var (at least for programs where no arrays are ever accessed outside their boundaries, which is an obvious constraint for practical realizations).

Some of the domain descriptions share names of mathematical variables representing the dimensions. This sharing only suggests that the descriptions can be constructed in a certain way (e.g. a definition domain description can be constructed by combining and iteration domain description and a definition mapping description). From a mathematical point of view, identically named mathematical variables in independent domain/mapping descriptions are unrelated.

Unlike the main and auxiliary dimensions, we do not require the hidden dimensions to be integral. As hidden dimensions may correspond to data variables of the program, hidden dimensions may take any value the corresponding data variables can take, even non-integral ones. Depending on the context, an integrality condition may be added (e.g. when the corresponding data variables have an integral type).

In practice, strict inequalities (<and>) originating from program statements such as A[i]>0 can always be converted to non-strict inequalities because the precision of data types in a program is always limited, even for pseudo-real data types. For instance, if A[i]>0 would result in an inequality p>0, then this inequality be rewritten as p-.epsilon..gtoreq.0 in which .epsilon. is a small enough positive value (depending on the precision of the data type of A). For integral types, .epsilon.=1. Therefore, we can avoid all strict inequalities, which results in a simpler notation.

From a mathematical point of view, the distinction between constraint functions and functions for hidden dimensions is somewhat artificial. For instance, C.sub.l.sup.iter (i,.alpha.,p).gtoreq.0 p=F.sub.i.sup.iter (i) can be rewritten as C.sub.l.sup.iter (I,.alpha., p).gtoreq.0 p-F.sub.i.sup.iter (i).gtoreq.0 F.sub.i.sup.iter (i)-p.gtoreq.0 or C.sub.l.sup.iter (i,.alpha.,p)*.gtoreq.0 in which C.sub.l.sup.iter ()* is a combination of the three (vector) functions. Therefore, in the following, we assume that these functions have been combined into C.sub.l.sup.iter ()* but the * will be dropped. This results in the following general model: ##EQU4##

The primary models described above (iteration, variable, definition, and operand domains) are only used to mathematically describe a given program and they do not contain any information about design decisions such as execution order and/or storage order. These models are not even complete, i.e. without additional information, one cannot restore the original program or sometimes not even an equivalent program from the mathematical description. Given only domains description, one cannot reconstruct the original program, since the notion of execution order is not present in these domains. Moreover, from the domain descriptions, one cannot even derive in what direction the loops have to be executed, since the domain descriptions are independent of the execution order of the loops.

For manifest single-assignment programs, the domain descriptions are sufficient to reconstruct an equivalent program, because in that case one can (at least in theory) find all data-dependencies between the operations, and consequently a valid order. However, if we want to perform optimizations, one of the things we have to do is to decide on an execution order of the statements. Another important decision we have to make is the decision on the storage order, i.e. the layout of the arrays in the memory/memories. We can express these orders by assigning an "execution date" to every operation and a "storage address" to every variable, The meaning of the terms "execution date" and "storage address" is context dependent. Sometimes they are defined in terms of some absolute unit such as a clock cycle or a memory location, but in many cases, the absolute numbers are not relevant, i.e. only the relative order of operations or storage locations is important. For instance, an execution date can be expressed in terms of a number of operations that have been executed before that date, even though the operations may have different time durations in the final implementation of the program. Also, a storage location may not correspond to one physical memory location, but possibly to several ones. We can make use of these observations to simplify certain optimization problems. Nevertheless, when comparing execution dates or storage addresses, we should make sure that their respective scales and offsets are taken into account.

As we are dealing with large sets of operations and large sets of variables, it would be infeasible to assign an execution date to each individual operation and a storage address to each individual variable. Not only would the memory optimization problem become intractable for large applications, but also the implementation cost (controller and address generation overhead) would be prohibitive. Therefore, we have to assign execution dates and storage addresses to groups of operations and variables respectively in order to maintain some regularity.

One of the basic requirements that the storage order and the execution order have to fulfill, is that each variable can have only one storage address and each operation can be executed at only one execution date (provided that variables and operations can be seen as atomic units, which is true in this context). This requirement is compatible with one of the properties of a mathematical function: a function evaluates to only one value for each distinct set of input arguments (although different argument sets may result in the same value). Therefore, it is a straightforward choice to use (deterministic) functions to describe execution and storage order. As arguments of these functions, we can use the coordinates of the operations/variables in the corresponding domains, since these coordinates are unique with respect to all other operations/variables in the same set. So we can describe an execution or storage order with exactly one function per set of operations/variables.

Although execution order and storage order have been treated in a very similar way, there are some differences that have to be taken into account during memory optimizations. First of all, for the purpose of memory related optimizations, it is usually sufficient to know the relative execution order of operations and/or memory accesses (except when dealing with timing constraints for instance, when absolute timing functions should be used). This may allow us (under certain conditions) to use simpler functions for describing the relative execution order than the ones describing the absolute execution order. By doing so, the mathematical complexity of the optimization problems may be drastically reduced.

For storage addresses, the situation is somewhat different as one is usually interested in the memory size required for a set of variables. In order to accurately relate memory sizes to distances between storage addresses, it is important to keep the relation between storage addresses and memory addresses as simple as possible, i.e. in practice this relation should be linear (or piece-wise linear, e.g. because one logical storage address range may be divided over several physical memories or because modulo addressing is being used). Therefore, one has less freedom to choose an appropriate ordering function than in the case of the execution order. Using an ordering function with a non-linear relation to the physical memory addresses could result in serious estimation errors.

Another important difference between execution order and storage order is the fact that, in practice, a storage location can be treated as an atomical unit for many purposes (even though it may correspond to several memory locations), while on the other hand operations may be decomposed into several "sub-operations" with different (but related) execution dates and durations. For instance, on a programmable processor, the execution of a statement may take several clock cycles. Moreover, depending on the memory type, the fetching of operands and storing of results may occur in different clock cycles than the actual datapath operation, e.g. operands may be fetched in one cycle, the datapath operation may be executed in the next cycle and the results may be stored during yet another clock cycle. It may even be possible that the memory accesses of different operations are interleaved in time, e.g. due to bandwidth considerations.

So, a simple time order model associating an execution date with every operation may not be sufficient to model practical applications accurately. However:

In practice, if the memory accesses corresponding to an operation do not coincide with the actual execution of the operation, the time offset for the accesses with respect to the operations is always the same for operations corresponding to the same statement. For instance, operands may always be fetched one clock cycle before the operation and the results may always be stored one clock cycle after the operation. Moreover, it makes no sense to fully decouple the memory accesses from the corresponding operations, i.e. memory accesses always have to be "synchronized" with their corresponding data operations. For instance, for a statement being executed in a loop, it generally makes no sense to perform all read accesses first, all data operations after that, and finally all write accesses. Doing so would implicitly require a (potentially large) extra buffer for storing the operands and/or the definition. In case one really wants to model such a buffering strategy, it should be done explicitly, i.e. by explicitly modeling the transfers to and from the buffers (by means of extra iteration, operand, and/or definition domains).

Note that this does not mean that every storage location always should be modeled explicitly. It is for instance possible that an operand is fetched a few cycles before it is actually needed by an operation, such that it has to be stored temporarily in a register. But as long as the memory accesses are not decoupled from the corresponding data operations, the required number of extra storage locations remains limited (usually at most a few locations, such that they can reside in foreground memory). In that case these (few) extra storage locations usually can be neglected (compared to the large background storage), certainly for memory-intensive applications, and need not to be modeled explicitly (in most contexts).

Based on this reasoning, it seems natural to express the time order of the accesses as a function of the time order of the corresponding operations. For instance, if a set of operations is executed at time 2*(i+j*10), then the corresponding write accesses could occur at time (2*(i+j*10))+1, i.e. one time unit later than the actual operations. In practice, the time offset between the accesses and the data operations is constant, or can at least assumed to be constant. On modern processors it is possible that this time offset is not always constant or is even unpredictable (e.g. because of run-time operation scheduling and out-of-order execution), but at least the processor has to make sure that the relative order is a valid one. So, if we specify a valid relative order in the program (based on constant offsets), the processor is allowed to change this order as long as the I/O behavior of the program is not altered.

This timing model is fully compatible with a model where the time offsets for read and write accesses are measured from the start of the body of loops. In such a model, it is implicitly assumed that the corresponding data operations are executed at fixed time offsets from the start of the loop bodies, such that the read and write accesses remain synchronized with the data operations.

For the domain models presented above, it would be difficult to define timing functions on stand-alone definition or operand domains (which correspond to memory accesses), since in general the mappings between the iteration domains and definition or operand domains can be non-injective, such that one point in a definition or operand domain may correspond to several memory accesses. It would then be very difficult to distinguish between accesses represented by the same point. Therefore, it is easier to use indirect timing functions, such that the timing functions of the memory accesses are defined in terms of the timing functions of the corresponding operations. For operations there is no such ambiguity problem, since each point in an iteration domain corresponds to one operation and vice versa, and consequently we can associate a unique execution date with each memory access.

In case one is only interested in the relative execution order of operations and/or memory accesses, one can assume that each operation and/or access takes exactly one time unit to complete, which is very often even true in practice. By doing so, two operations or accesses either completely overlap in time or don't overlap at all, i.e. operations or accesses cannot partially overlap in time. This can make the mathematical modeling and optimization somewhat simpler. In case it is not possible to use such a simplified model, one has to rely on a more detailed one. In this text we assume that we can make this simplification though. Based on the reasoning given above, we can define the following ordering functions: O.sub.i.sup.time (): the time order function of D.sub.i.sup.iter. Evaluating this function for a point in D.sub.i.sup.iter results in the "execution date" of the corresponding operation. The meaning of the term "execution date" is context dependent, as stated before, i.e. it may refer to an absolute execution date (e.g. in terms of a number of clock cycles) or to a relative execution date (e.g. in terms of a number of executions).

O.sub.ikm.sup.wtime (): the time offset function of D.sub.ikm.sup.def. The resulting offset is an offset relative to O.sub.i.sup.time (). A write access corresponding to a point in D.sub.ikm.sup.def occurs at this offset in time relative to the execution of the corresponding operation. In practice, this offset is always constant (or can at least assumed to be constant). Again, the same remarks apply with respect to the time unit used.

O.sub.jlm.sup.rtime (): the time offset function of D.sub.jlm.sup.oper. The resulting offset is an offset relative to

O.sub.j.sup.time (). A read access corresponding to a point in D.sub.jlm.sup.oper occurs at this offset in time relative to the execution of the corresponding operation. Again, in practice, this offset is always constant (or can assumed to be constant).

O.sub.m.sup.addr (): the storage order function of D.sub.m.sup.var. Evaluating this function for a point in D.sub.m.sup.var results in the storage address at which the corresponding variable is stored. In general, there is a (piece-wise) linear relation between storage addresses and absolute memory addresses (although the storage order function may be non-linear).

In general, each of these functions may have (extra) hidden or auxiliary dimensions as arguments and may be accompanied by extra constraints on these extra dimensions (e.g. a symbolic constant), represented by C.sub.m.sup.addr ().gtoreq.0, C.sub.jlm.sup.rtime ().gtoreq.0, C.sub.ikm.sup.wtime ().gtoreq.0. In practice, C.sub.jlm.sup.rtime () and C.sub.ikm.sup.wtime () are generally not present, and therefore we do not mention them explicitly any more in the following in order to keep our notations a bit simpler. We could extend these simple models to take into account operations with non-unit durations for example but, for practical purposes, these simple models are usually more than sufficient.

The domain and order models presented above are sufficient to be able to perform data-flow analysis of the program. An important concept in data-flow analysis is that of a data dependency. In general a data dependency denotes a kind of precedence constraint between operations. The basic type of dependency is the value-based flow dependency. A value-based flow dependency between two operations denotes that the first one produces a data value that is being consumed by the second one, so the first one has to be executed before the second one.

Other types of dependencies (e.g. memory-based flow dependencies, output dependencies, anti-dependencies, etc.) also correspond to precedence constraints, but these are purely storage related, i.e. they are due to the sharing of storage locations between different data values. This kind of dependencies is only important for the analysis of procedural non-single-assignment code. Eventually, the goal of dependency analysis is to find all the value-based flow dependencies, as they are the only "real" dependencies. Given the value-based flow dependencies, it is (in theory) possible to convert any procedural non-single-assignment code to single-assignment form, where the only precedence constraints left are the value-based flow-dependencies.

For reasons of simplicity, we assume from now on that the code that we are analyzing or optimizing has been converted to single-assignment form. The theory and techniques presented in the following are however also applicable and extendible to non-single-assignment code, provided that the value-based flow dependencies are known. We also use the term flow dependency or simply dependency to refer to a value-based flow dependency in the following.

Just like we do not describe individual operations or individual variables, we also do not describe individual dependencies between operations either. Instead we describe groups of dependencies between groups of operations. A simple example is shown next.

    ______________________________________
            int A[5][15];
            for (i = 0; i < 5; ++i)
            {
              for (j = 0; j < 10; ++j)
    S1: A[i][j] = f1(...);
            ...
              for (k = 5; k < 15; ++k)
    S2: ... = f2(A[i][k]);
    ______________________________________


The domain descriptions for this example are the following: ##EQU5##

A graphical representation of these domains is shown in FIG. 3. One can see that some of the elements of array A that are being produced during the executions of statement S1 are also being consumed during the executions of statement S2. Consequently there exists a flow dependency between these two statements. First we can find the elements of the array that contribute to the dependency by simply intersecting the definition and operand domain. The points in the intersection correspond to the array elements that are being produced by S1 and consumed by S2. Given these elements, we can find out which operations (i.e. executions of the statements) correspond to them by applying the inverse definition or operand mapping.

We refer to the result of the inverse mappings as the definition footprint and operand footprint respectively. Note that in general the definition and operand mappings may be non-injective, but this poses no problems in this general model as we impose no restriction on the nature of mappings. Non-injectivity may complicate the analysis of these dependencies though.

The most general way to describe value-based dependencies is by means of dependency relations, which are mappings from one iteration domain to another one. For our example we obtain the following dependency relation, denoted by M.sub.1211A.sup.flow : ##EQU6## When we apply this mapping to D.sub.1.sup.iter, we obtain the operand footprint. Similarly, when we apply the inverse mapping to D.sub.2.sup.iter, we obtain the definition footprint. Based on this example, we can extend our general model to include value-based flow dependency relations. A dependency due to an overlap between a definition domain D.sub.ikm.sup.def, belonging to an iteration domain D.sub.i.sup.iter, and an operand domain D.sub.jlm.sup.oper, belonging to an iteration domain D.sub.j.sup.iter, is represented by the following dependency relation: ##EQU7##

Of course the definition and operand domains causing a dependency always belong to the same array. In general there may be several dependencies present between two statements, if there are multiple overlapping definition and operand domains.

As stated before, the primary domain models do not contain all information about the programs being modeled. Especially the execution and storage order are missing in these models. If one wants to perform storage optimization, this information about order is crucial. However, the execution and storage order are exactly the subject of optimization, i.e. the result of an optimization is a(n) (new) execution and/or storage order. Hence we must be able to decide whether such orders are valid or not, before we can try to optimize them.

Therefore, we have to derive some necessary and sufficient conditions that the execution and storage order must satisfy. For reasons mentioned earlier, we restrict ourselves to single-assignment programs. The result can be extended to non-single-assignment programs (provided that accurate data-flow analysis is possible), but this would not give extra insights and it would only obscure things. If a less accurate (conservative) data-flow analysis would be used for non-single-assignment programs, the memory occupation models described in the following would still be usable in the sense that they would be conservative too, i.e. they would describe a worst-case memory occupation.

We can say that a memory location is occupied as long as it contains a data value that may still be needed by the program being executed or by its environment. During the execution of a program, a certain memory location may contain several distinct values during disjoint time intervals. It must be clear that an execution and storage order are invalid if they result in two distinct values occupying the same memory location at the same time. But before we can check this, we must first know when a memory location is occupied. The following example shows how we can find this out:

    ______________________________________
            int A[15];
              for (i = 0; i < 10; ++i)
    S1: A[i] =...;
            ...
              for (j = 5; j < 15; ++j)
    S2: ... = f(A[j]);
    ______________________________________


The corresponding iteration, definition and operand domains have been graphically represented at the top of FIG. 4. For this simple example, one can easily see that there are some elements of A being produced during the executions of statement S1 and also being consumed during the executions of statement S2. This results in partially overlapping definition and operand domains, as indicated at the top of FIG. 4. Consequently, there exists a (value-based) flow-dependency between these two statements.

For each array element that is written during the execution of the first statement and read during the execution of the second one, we can easily find out when it is written, when it is read and where it is stored, provided that we know the execution order of both statements and the storage order of the array. This is indicated for one of the common array elements in FIG. 4. So, in an address/time diagram, shown at the bottom of the figure, we can see which is the location being occupied and when it is occupied. Note that in practice, some values may be read multiple times, so the corresponding memory location is occupied until it is read for the last time.

From the above example we can see that in principle, given the primary domain descriptions and ordering functions, we can derive the occupation intervals for each memory location. However, just as we don't want to describe individual operations and individual variables, but groups of operations and groups of variables instead, we also don't want to describe the memory occupation for each individual location either. If possible, we should be able to come up with closed mathematical expressions describing the memory occupation of groups of memory locations.

As indicated above, it is possible to derive the memory occupation interval and address for each variable that is being written during the execution of one statement and that is being read during the execution of another statement, i.e. each variable that contributes to a (value-based) flow-dependency between two statements. Not surprisingly, given two statements, one of which writes elements of an array and one of which reads elements of the same array, it is possible to come up with a closed expression describing the memory occupation for all values being written by the first statement and being read by the second one.

First of all, given two such statements, we can easily find the mathematical description for the commonly accessed array elements and the addresses they occupy, provided that we know the storage order for that array, by taking the intersection of the corresponding definition and operand domains, and applying the storage order to the intersection: ##EQU8## This expression describes all of the addresses that are (potentially) being occupied during some time due to a flow-dependency between two statements. Note that in theory, the constraint s.di-elect cons.D.sub.m.sup.var should be redundant, at least for valid programs, since the definition and operand domains should always be subsets of the variable domains (otherwise an array is being read or written outside its boundaries). In practice however, this extra constraint may contain some extra information. For instance, if the reads or sprites are non-manifest, and no other information is known except that they should not access the array outside its boundaries, then it is not necessary to add this information explicitly to all the definition or operand domains. Instead, it can be added only to the variable domain. In order to be complete, we also need to know when these addresses are occupied. From the execution order, we can derive when the addresses are written and when they are read: ##EQU9##

We are now ready to combine these 3 expressions, as we know that each address in equation 3.8 is possibly (as it may not be manifest) occupied from the corresponding time in equation 3.9 till at least the corresponding time in equation 3.10. This results in the following expression: ##EQU10## This expression is the description of a two-dimensional geometrical domain, which we call a binary occupied address/time domain (BOAT-domain, binary, because it is always associated with a (value-based) flow-dependency between two statements (possibly identical)). Each point with integer coordinates in this domain represents an occupied address/time tuple, i.e. an address that is (possibly) being occupied at that time. For a given execution and storage order, this equation contains all information available at compile time about the (potential) memory occupation due to a flow dependency between two statements. Comparison with equation 3.7 reveals that the mathematical constraints present in a dependency relation are also present in equation 3.11. In fact a BOAT-domain is nothing more than a mapping of a dependency on an address/time space. Note that the addresses contained in the BOAT-domain are discrete, whereas time is continuous (e.g. when an address is occupied from time t=1 till t=2, it is also occupied at time t=1.3763, but an address with a non-integral value is meaningless in practice). For practical purposes, time can also be considered to be discrete, as we assume that nothing interesting can happen at non-integral points in time. Note that in the special case, where all constraints and ordering functions are manifest and affine, and all constraints are convex, the resulting BOAT-domain is a linearly bounded lattice (LBL).

An example shows how non-manifest behavior can be incorporated in the BOAT-domain descriptions:

    ______________________________________
            int A[5][5], B[5];
            for (i = 0; i < 5; ++i)
              for (j = 0; j < 5; ++j)
    S1: A[i][j] = f(...);
            for (k = 0; k < 5; ++k)
              for (l = 0; l < 5; ++l)
            if (B[k] >= 0)   /*non-manifest*/
    S2: ... = g(A[k][l]);
    ______________________________________


We can extract the domain descriptions in a straightforward way: ##EQU11##

Note that F.sub.2.sup.iter () is an unknown function, which is used to model the non-manifest behavior of the program: we are not sure whether any of the elements of A is ever read. If we assume that this program is executed sequentially and that each of the statements S1 and S2 can be executed in one clock cycle and we assume a row major storage order for the array A, the graphical representation of the BOAT-domain is shown in FIG. 5. Due to the condition in the program, it is not known until run-time whether any of the elements of A is ever read and whether the corresponding memory locations are therefore ever occupied. We do know however that there are some relations between some of the occupied address/time tuples: if for a given execution of the k-loop, the condition B[k]>=0 evaluates to true, then all of the corresponding executions of statement S2 are executed. Consequently, if one of the corresponding addresses is read, all of them are read. In other words, the addresses can be divided into groups of addresses whose run-time occupation is related. These groups are indicated by the differently shaded regions in the figure. This information is explicitly present in the description of the BOAT-domain due to the presence of F.sub.2.sup.iter (). This can be very valuable in a memory optimization context. Ignoring it and taking worst-case assumptions can prohibit certain optimizations.

We have shown that it is possible to accurately model the memory occupation for a set of variables being written by one statement and being read by another one, in other words, for each variable contributing to a value-based flow-dependency. However, we are not only interested in the memory occupation due to flow-dependencies, but rather in the memory occupation of a complete set of variables, i.e. a complete array. It is assumed that the storage order is common for the complete array. If required, an array can always be divided first into sub-arrays (e.g. basic sets) each having their own storage order.

We can easily find the complete memory occupation of an array if we know the memory occupation due to each value-based flow dependency related to that array. The memory occupation of the array is then simply the union of the memory occupation due to each of the flow dependencies. Consequently, we can also model the memory occupation of an array by a geometrical domain, which we call the occupied address/time domain (OAT-domain) of that array. This domain is simply the union of the BOAT-domains of all value-based flow-dependencies related to that array: ##EQU12## In general, the different BOAT-domains of an array are not disjoint. For instance, if an array element is written by one statement and read by two other statements, then this element contributes to two flow-dependencies and the corresponding BOAT-domains have certain address/time tuples in common. An example of the memory occupation for a complete array is given next.

    ______________________________________
            int A[4][4];
            for (i = 0; i < 4; ++i)
              for (j = 0; j < 4; ++j)
    S1:         if(j <=i) A[i][j] = f1(...);
            for (k = 0; k < 4; ++k)
              for (l = 0; l < 4; ++l)
    S2:         if(l > k) A[k][l] = f2(...);
            for (m = 0; m < 4; ++m)
               for (n = 0; n < 4; ++n)
    S3:           if(j >=m)... = g1(A[3-m][n]);
            for (o = 0; o < 4; ++o)
               for (p = 0; p < 4; ++p)
    S4:           if(p <=o)... = g2(A[3-o][p]);
    ______________________________________


This program contains two statements in which elements of A are written and two statements in which elements are read. Consequently, four value-based flow-dependencies can be present (one from each writing statement to each reading statement), and in fact they are present here. This means that we can describe four BOAT-domains.

The variable, iteration, definition and operand domains are the following: ##EQU13## Again, we assume a procedural execution order in which each statement takes 1 time unit to execute, and a column-major storage order for A: ##EQU14## This leads us to the following BOAT-domain descriptions, after simplification: ##EQU15##

A graphical representation of these BOAT- and OAT-domains is shown in FIG. 6. Note that there is an overlap between D.sub.1311A.sup.BOAT and D.sub.1411A.sup.BOAT at addresses 3 and 6, and an overlap between D.sub.2311A.sup.BOAT and D.sub.2411A.sup.BOAT at addresses 9 and 12. This is because these addresses are being read more than once. The dotted triangles represent the read and write accesses during the different loops. Note that write (or read) operations of the same statement may contribute to different BOAT-domains in different ways (e.g. not all writes of the same statement must contribute to the same BOAT-domain).

Now that we know how to model the memory occupation for be complete array, we can easily derive an expression for the occupation of a memory itself. It is again simply the union of the memory occupation of all arrays assigned to that memory. The result is again a geometrical domain, which we call the collective occupied address/time domain (COAT-domain) for that memory: ##EQU16## One of the preconditions for an execution and storage order to be valid, is that none of the OAT-domains of different arrays assigned to the same memory overlap, because this would mean that 2 arrays are using the same memory location at the same time, which is incorrect.

An example of a COAT-domain for a non-manifest program and a memory containing three arrays is given next.

    ______________________________________
            int A[10], B[5], C[10];
            for (i = 0; i < 10; ++i)
    S1: C[i] = f1(. . .);
            for (j = 0; j < 5; ++j)
    S2: A[j] = f2(. . .);
            for (k = 5; k < 10; ++k)
    S3: if (C[k-5] > 0)
    S4:       A[k] = f3(. . .);
              else
    S5:       B[k-5] ` f4(. . .);
              for (l = 0; l < 10; ++l)
    S6: if (1 >= 5 && C[l]<= 0)
    S7:       . . . = f5(B[l-5]);
              else
    S8:       . . . = f6(A[l]);
    ______________________________________


The variable, iteration, definition and operand domains for this example are the following: ##EQU17## We assume the following execution and storage orders (assuming that arrays A, B and C all have the same element size): ##EQU18## From this, we can derive the following BOAT-domains (after simplification): ##EQU19## The corresponding graphical representations can be found in FIG. 7. The resulting OAT-domains and COAT-domain can be found in FIG. 8. Note that the OAT-domains of arrays A and B seem to overlap graphically. This is caused by a virtual overlap between D.sub.4811A.sup.BOAT and D.sub.5711B.sup.BOAT. One can verify that this overlap is non-existent in reality: ##EQU20## So, the memory occupation of arrays A and B is partly conditional, and the conditions for A and B are complementary. In practice, this means that some of the elements of arrays A and B can share storage locations, simply because they can never occupy those locations at the same time. If we would not have modeled this exclusive condition accurately, we could not have been sure that this storage order was valid, and we would have to make sure that the address ranges of arrays A and B were disjoint. This would require extra storage locations. So this example already clearly illustrates how a more accurate modeling of non-manifest programs may result in better solutions, compared to the ones obtained through traditional worst-case modeling techniques.

As stated before, the execution and storage order are the subject of optimization. We therefore have to be able to decide whether a given order is valid. We say that a combination of storage and execution order for a program is valid if it satisfies the following constraints:

1. No value of any variable (e.g. array element) is ever read from a memory location before it has been written to that location.

2. No memory location might (could be non-manifest) ever be overwritten when it might till contain a data value of a variable that might still have to be read.

In the following we derive a set of necessary mathematical conditions that the storage and/or execution order have to satisfy. Together, these conditions form a sufficient set for having a valid storage and execution order. Again we limit ourselves to the single-assignment case. Extensions for the non-single-assignment case are possible but require an extended data-flow analysis.

The first validity constraint only affects the execution order. Given the primary domain models, we can easily express this constraint mathematically: ##EQU21##

This can be read as: "If a data value is written to memory during the execution of one statement and it is read during the execution of another statement, then the write access corresponding to the first operation should be executed before the read access corresponding to the second operation."

A closer look at this equation reveals that it corresponds to the precedence constraint due to a flow dependency, which is described by the dependency relation in equation 3.7. In fact equation 3.14 states that an operation that is the source of a flow dependency should be executed before the operation that is the sink.

Alternatively, equation 3.14 can be written as: ##EQU22## i.e. for a valid execution order, each of these sets should be empty.

Note that O.sub.ikm.sup.wtime (), O.sub.i.sup.time (), O.sub.jlm.sup.rtime (), and O.sub.j.sup.time () are in general unknown functions, i.e. they are the result of the optimization process. Also note that it does not matter whether we use absolute or relative timing functions, as the condition only states something about the relative order. This condition is necessary to obtain a valid execution order, but it is not sufficient. In order to have a sufficient set of conditions, we must also take into account the storage order. The second validity constraint can be expressed as follows: ##EQU23## This expression can be read as: "If a first (program) variable is written by a statement S.sub.i1 and read by a statement S.sub.j1, and a second variable is written by a statement S.sub.i2, and these variables are stored at the same memory address, and they either have different indices in the same array or belong to different arrays, then the write access corresponding to S.sub.i2 should occur before the write access corresponding to S.sub.i1 or after the read access corresponding to S.sub.j1."

Together with the conditions imposed by equation 3.14, these conditions make sure that both validity constraints are satisfied and therefore form a sufficient set. Note that due to the presence of the logical or in the last line of the if part of equation 3.16, we can distinguish 2 slightly different sets of conditions: conditions that are related to the intra-array order (i.e. between elements of the same arrays) and conditions that are related to the inter-array order (i.e. between elements of different arrays).

Nevertheless, it must be clear that this expression is potentially hard to evaluate, due to the possibly large number of (mathematical) variables, and especially due to the presence of inequalities and logical or's, which can easily lead to an explosion of the number of scalar (in)equalities to be checked. In practice however, it is usually possible to simplify equation 3.16 by enforcing more stringent but simpler constraints, without sacrificing too much optimality. For instance, if one makes sure that the address ranges of two arrays are not overlapping, then these conditions are always satisfied for each pair of variables, one of which belongs to the first array and one of which belongs to the second array.

It is also important to note that equation 3.16 is again strongly related to the presence of value-based flow-dependencies, as we can recognize the constraints that are also present in equation 3.7. In other words, equation 3.16 only has to be evaluated in the presence of flow-dependencies. So, performing a dependency analysis first may already drastically reduce the number of conditions to be checked, although even then this number equals the number of flow-dependencies times the number of statements writing to array elements present in the program (assuming that each statement writes to only one array, but the extension to tuple writes is straightforward). This also indicates that the constraints can be reformulated for the non-single-assignment case, provided that an accurate data-flow analysis is available. We can again rewrite equation 3.16 as follows: ##EQU24##

In other words, for a valid execution and storage order, each of these sets should be empty. Let us again have a look at some examples.

    ______________________________________
              int A[10];
              for (i = 0; i < 5; ++i)
    S1:       A[i] = f1(. . .);
              for (j = 0; j < 5; ++j)
    S2:       . . . = f2(A[j]);
              for (k = 0; k < 5; ++k)
    S3:       A[k+5] = f3(. . .);
              for (l = 0; l < 5; ++l)
    S4:       . . . = f4(A[l+5]);
    ______________________________________


The corresponding domains are the following: ##EQU25## One can easily verify that the following execution order satisfies equation 3.14: ##EQU26##

Let us assume the following storage order: ##EQU27##

We can now check whether this order, combined with the execution order above, is valid. But first of all, we should do some dependency analysis in order to find the non-empty value-based flow-dependencies. This reduces the number of conditions to be checked. In this case, dependency analysis tells us that there are only two data dependencies: one from S1 to S2 and one from S3 to S4. Also, for this program fragment there are 2 statements writing elements to an array: S1 and S3. So, there are 4 possible combinations of flow dependencies and writing statements that have to be checked:

(S1.fwdarw.S2, S1), (S1.fwdarw.S2, S3), (S3.fwdarw.S4, S1) and (S3.fwdarw.S4, S3). Consequently, we have 4 sets that have to be checked for emptiness: ##EQU28##

After simplification we get the following constraints: ##EQU29##

One can easily verify that each of these sets is indeed empty: for the first one, the condition i.sub.1 mod 5=i.sub.2 mod 5 i.sub.1 .noteq.i.sub.2 can never be satisfied within the given ranges of i.sub.1 and i.sub.2. This means that none of the variables contributing to the first flow-dependency is stored at the same location, such that there cannot be a conflict. For the second one, the condition t=k+10 i.ltoreq.t.ltoreq.i+5 can never be satisfied for the given ranges of i and k, which means that the memory locations being occupied due to the presence of the first flow-dependency are no longer occupied when the memory writes due to statement S3 occur. For the third and fourth equations, similar reasonings hold.

A slightly different example, this time involving 2 different arrays, is given next.

    ______________________________________
              int A[10];
              for (i = 0; i < 5; ++i)
    S1:       A[i] = f1(. . .);
              for (j = 0; j < 5; ++j)
    S2:       B[j] = f2(A[j]);
              for (k = 0; k < 5; ++k)
    S3:       . . . = f3(B[k]);
    ______________________________________


We can readily extract the domain descriptions: ##EQU30## and verify that the following execution order satisfies equation 3.14: ##EQU31##

Let us assume the following storage order: ##EQU32##

Again, we have to check for the emptiness of all sets of the kind of equation 3.17.

Dependency analysis reveals 2 non-empty value-based flow-dependencies, such that we have to check only 4 sets: ##EQU33##

Note that this time, we have 2 slightly different kinds of conditions. The first one and the third one are again related to intra-array storage and execution order, while the second one and the fourth one are related to the order for two different arrays. We can immediately see that the first and the third set are empty due to the conditions s.sub.a1 =s.sub.a2 s.sub.a1 .noteq.s.sub.a2 and s.sub.b1 =s.sub.b2 s.sub.b1 .noteq.s.sub.b2 respectively. After simplification, the second equation becomes the following:

{t.vertline..E-backward.s.sub.a .epsilon.s.t.0.ltoreq.s.sub.a .ltoreq.4 t=2s.sub.a 6 s.sub.a .ltoreq.t.ltoreq.2s.sub.a +5}=.phi.

in which the condition t=2s.sub.a +6 t.ltoreq.2s.sub.a +5 can never be satisfied, so this set is also empty, which means that there is no conflict in time between the memory occupation of the different arrays. A similar reasoning holds for the fourth set. A graphical representation can be found in FIG. 9.

Now we are ready to proceed with the development of a pragmatic optimization strategy in accordance with the present invention. The goal of this strategy is to reuse memory locations as much as possible and hence reduce the storage size requirements. This means that several data entities can be stored at the same locations (at different times). The problem is sometimes referred to as in-place mapping. In the broadest definition of the present invention, two of the main optimization parameters are the execution order of the operations and the storage order of the data variables. However, for practical applications it is infeasible to find the optimal execution date for every single operation and the optimal storage location for every single variable, because there are typically millions of operations and variables present in a multimedia application. Therefore we try to find the optimal execution and storage order for groups of operations and variables respectively. The geometrical modeling techniques for dealing with groups of operations and variables are described above. These models allow us to assign an execution order function to each statement (which corresponds to a group of operations) and a storage order function to each array variable (which corresponds to a group of variables).

Unfortunately, even when we group operations and variables, finding the optimal solution for a given cost function, satisfying all boundary constraints, is still infeasible for real-life applications. One of the main reasons is that the validity constraints expressed by equation 3.14 and equation 3.16 are very difficult to take into account during an optimization process.

The present invention prefers a more pragmatic approach, in which we introduce several simplifications, and which allows us to come up with a good solution, but not necessarily the best one. This is also the reason why the optimization of the execution order and the optimization of the storage order are separated as much as possible in accordance with the present invention: first the execution order is optimized, mainly through loop transformations and memory hierarchy exploitation, and afterwards, among other things, the storage order optimization problem is tackled (through the equivalent of data transformations). This execution and storage order have a direct impact on the main two cost factors, namely the area occupation by memories and power consumption by data transfers. The storage order has a much larger effect on the area requirements than on the power requirements. By choosing a good storage order, we can increase the reuse of storage locations for different variables, and hence reduce the storage size requirements, which results in a smaller area occupation by the memories. Indirectly, the power budget due to memory transfers may also be decreased because of the lower capacitive load of smaller memories. This is also why the decisions on memory hierarchy, memory allocation and array-to-memory assignment are taken during separate preceding steps in accordance with the present invention. For these reasons the storage order optimization phase focuses on storage size reduction.

The storage order optimization techniques in accordance with the present invention can be either used on a stand-alone base, or in combination with other optimization tasks, such as execution order optimization. In the latter case, it is important that the storage order optimization step is executed as the last step, for several reasons:

1. If the execution order has not been fixed, one has to make some worst case assumptions during the storage order optimization, which may lead to (highly) suboptimal results.

2. In contrast, as long as the storage order has not been fixed, then is a lot of freedom that can be exploited during the other optimization steps because the storage order can be virtually ignored. When the storage order has been fixed on the other hand, the constraints on the other tasks would become far too complex. For instance, checking whether a loop transformation is valid would be very difficult in that case (it would require taking into account equation 3.16).

3. The effect on loop and memory hierarchy transformations on the optimal storage order (and hence the required storage sizes) is quite large. For instance, loop transformations can enable more opportunities for storage location reuse by influencing the life-times of the data. The effect of the storage order on other cost factors, such as parallelism, performance, and/or power consumption, is much smaller. Therefore it is more natural to tackle the other problems first, before the storage order optimization. Still it is important to be able to make an estimate of the storage requirements during the early stages of global optimization. For this purpose, some of the techniques in accordance with the present invention can be used as a basis for quick estimators. Based on the above reasoning we can now present the assumptions that we start from:

1. The data-flow of the program has been completely fixed and cannot be changed anymore (e.g. data-flow transformations cannot be applied anymore).

2. The execution order of the program has been mostly fixed. For instance, the code may have been parallelized in advance. In case there is still some freedom left, we do not try to exploit it. In case the program is to be implemented on a parallel architecture with several processors, we assume that there is sufficient synchronization between the different processors. In that case we can accurately model the relative execution order of the memory accesses by different processors.

3. The memory architecture has been fixed already, except for the sizes of the memories. For instance, memory hierarchy decisions have been taken already.

4. The order of the memory accesses is deterministic for a given application code and a given set of input data. This rules out the presence of hardware-controlled caches, because we usually do not know the exact caching algorithm implemented in hardware and therefore we cannot accurately predict the order of the memory accesses in that case.

5. Each (part of a) data array in the program has already been assigned to one of those memories, i.e. we assume that data-distribution decisions have already been taken (array-to-memory assignment). In other words, we assume that we know what data are stored in what memory (but not how they are stored).

6. The storage order of the arrays in the memories has not been fixed yet and can be freely chosen.

7. Finally, an assumption of an entirely different nature, is that the program to be optimized is given in single-assignment form. In this way the freedom present in the program is maximally exposed and our analysis is simplified. This assumption should not be seen as a restriction because data-flow analysis allows us to convert non-single assignment programs to single-assignment ones.

An important consequence of the assumption that the execution order has been fixed is the fact that we can treat each memory separately. Because the required size of a memory depends only on the storage order of the arrays or parts of arrays stored in it on the one hand, and the storage orders of arrays stored in different memories are independent on the other hand, there is no interference between memories. This allows a significant complexity reduction for complex memory organizations such as employed in large multimedia applications.

The techniques in accordance with the present invention can be extended to take into account a partially unfixed execution order, although the opportunities for storage location reuse will decrease when the uncertainty on the execution order increases. In case the remaining freedom on the execution order would have to be exploited also during the storage optimization step, the problem would become much more complex because the execution order may affect the lifetimes of different arrays in different memories, and therefore the memories would no longer be independent.

The present invention is preferably limited to synchronous architectures, i.e. those for which the order of memory accesses is only determined by the application and the input data. This even includes architectures in which software-controlled data- caches are present. In the presence of hardware-controlled caches accessed by multiple processors however, the relative order of memory accesses may be unpredictable. Moreover, the organization of the data in the memories may then affect the caching effectiveness, so in that case it is possible that the storage order optimization techniques in accordance with the present invention interfere with other steps (such as cache performance optimization).

Without loss of generality, we concentrate on the size reduction of only one (shared) memory. In general, multiple memories can be present, but our techniques can be applied to each memory separately, as there is no interference between the optimizations for different memories for a given data distribution and execution order. The pragmatic solution strategy in accordance with the present invention is based on the fact that we can identify two independent components in the storage order of arrays:

the intra-array storage order, which refers to the internal organization of an array in memory (e.g. row-major or column-major layout);

the inter-array storage order, which refers to the relative position of different arrays in memory (e.g. the offsets, possible interleaving, . . . ).

In that sense we can see the mapping of array elements on memory locations as a two-phase mapping: first there is a mapping from the (multi-dimensional) variable domains to a one-dimensional abstract address space for each array, and then there is a mapping from the abstract address spaces onto a joint real address space for all arrays assigned to the same memory. This "real" address space can still be virtual. For instance, an extra offset may be added during a later mapping stage. Above we made a distinction between storage addresses and memory addresses. In practice, the relation between Ihese two is always linear, so we do not continue to make this distinction anymore.

In accordance with the present invention a two-phase optimization strategy may be applied whereby each phase may be applied independently of the other. In one, e.g. first phase, we try to find an optimal intra-array storage order for each array separately. This storage order then translates into a partially fixed address equation, which we refer to as the abstract address equation, denoted by the function A.sup.abstr (). In the other, e.g. second phase, we look for an optimal inter-array storage order, resulting in a fully fixed address equation for each array. This equation is referred to as the real address equation, denoted by the function A.sup.real (). In the following we outline a few strategies that can be used during the inter-array mapping phase, i.e. the phase that transforms the A.sup.abstr () of each array into an A.sup.real (). Then we select the most promising strategy, and derive from this how we can steer the intra-array mapping phase.

Given the abstract address equations of each array, which represent a partial storage order, and the execution order of the program, we can already describe the occupation of the abstract addresses, i.e. we can describe (binary) occupied address/time domains (for the abstract addresses) by means of equations 3.11 and 3.12, in which we substitute the storage order function O.sub.x.sup.addr () by A.sub.x.sup.abstr (). In general, the "shape" of these domains is not known, as we have only implicit descriptions. It is however possible to analyze these descriptions and to extract certain properties (such as the width or the height), which allow us to approximate the shapes.

We will now illustrate five alternative strategies for the inter-array mapping phase by means of an example. In FIG. 10 the occupied address/time domains are shown for five arrays of which the intra-array storage order (i.e. the abstract address function A.sub.x.sup.abstr ()) is assumed to be known. We can then obtain a real address function A.sub.x.sup.real () in the following strategies each of which is an embodiment of the invention:

1. The simplest way to allocate memory for these arrays is to assign. a certain address range to each of them in such a way that the different address ranges do not overlap. The size of the address range allocated for an array then corresponds to the "height" of its occupied address/time domain and this typically equals the size of the array. This is depicted in FIG. 11a. The difference between the abstract and the real address equation is then simply a constant offset C.sub.x as indicated at the bottom of the figure. The choice of the C.sub.x values is straightforward, i.e. the arrays are simply stacked in the memory and the address ranges are permanently assigned to the different arrays. This strategy is commonly referred to as static allocation. Note that this approach does not result in any memory refuse at all, but that it can be implemented very easily (one only needs to know the size of the arrays). Moreover, the intra-array storage order (which influences the shape of the occupied address/time domains) has no effect on the required memory size assuming that for each array a range equal to its size is allocated.

2. A potentially better strategy is illustrated in FIG. 11b. Here the address range assigned to an array is allocated only during the time that the array is in use. The shapes of the occupied address/time domains may be approximated by rectangular bounding boxes. This allows sharing of certain address ranges by more than one array. We refer to this strategy as dynamic allocation, but it is important to note that this allocation can be performed at compile time, in contrast to, for instance, heap-allocation in a programming language such as C. In general, a dynamic strategy requires less memory than a static one. Unfortunately, it requires life-time analysis of the arrays, and the placement of the arrays in memory is no longer straightforward, as the required memory size depends on it. The relation between the abstract and real address equations is similar to that of the first alternative, and also in this case the intra-array storage order has no effect on the required memory size (under the same assumptions as mentioned above).

3. T