Method and system for memory management optimization6952821Abstract A system and method of automatically configuring memory in a data processing system, including the steps of: receiving source code containing a loop nest, wherein the loop nest includes data arrays with affine indexes; optimizing source code by relocating elements from a first array in memory to a second array in memory; and executing the optimized source code. Claims 1. A method for automatically configuring memory in a data processing system, including steps of: Description FIELD OF THE INVENTION
A linear index function is of the form: (i) a loop index variable (such as i or j in the example above); (ii) a constant times a linear function; or (iii) the sum of two linear functions resulting in a change of scale. Thus, a linear function is indicative of spacing between elements to be accessed. An affine function is a constant plus a linear function providing a linear translation. In this exemplary loop nest, there are three data arrays referenced (array A, array B and array C). Arrays A and C are two-dimensional arrays addressed by respective ordered pairs of index values and array B is one-dimensional addressed by a single index value. For each program reference there is an array index function used to specify the element referenced. Referring to the example loop nest above, the single index of array B (i.e., the value i+j+1), both indices of array C, and the first index of array A are affine functions, i.e., include a linear function and added fixed or constant offset value. Since each array reference has one or more affine index functions, all three array references are "partially affine." With index variables i and j having values stepping between 0 and 99, a total of 10,000 access will be made to elements of arrays A, B, and C. In particular, array elements in the range of A [0] [0] through A [198] [9801]; B [1] through B [199] and C [0] [0] through C [99] [99]. If all elements of A are considered within the specified range, assuming a row-major ordering (i.e., in the form A [row] [column], the last dimension, termed the column, varies the fastest so that consecutive entries of a row are stored adjacent one-another), a total of 199×9801=1,940,796 elements would be contained within the range, while only 10,000 accesses would be performed. Thus, at a minimum 99.48% of the array elements stored are not needed by the loops. On the other hand, all elements within the 200 element range limit of B are accessed. However, in the case of C, only diagonal elements are accessed representing 100÷(100×100) or 1% of the total number of elements of the array. An embodiment of the present invention provides a mapping from the relatively large address range or space for at least those elements accessed to a SPM space in which unused references are minimized or eliminated so as to not require storage allocation in SPM. For each partially affine array in main memory, the compiler may allocate one or more areas in the SPM (called a SPM host array) to hold the part of the array referenced in the loop nest. The SPM host arrays are preferably considerably smaller than the original complete array stored in main memory as each stores only the elements "accessed" by a particular loop nest. The compiler may allocate the host array to hold the accessed elements in the original array. An element in an array is accessed if the element is accessed (i.e., read from or written to) during any iteration of the loop nest. The compiler may relocate part of each partially affine array into the SPM prior to starting the loop nest. The relocated elements are a subset of the elements of the original array including at least the array elements accessed by the loop nest. Each accessed element in the loop nest in the original array is replaced with a reference to an element of the array in SPM. It is useful to require that there is an affine relationship between the index of an original array element and the index of the corresponding SPM host array element, and the methods of at least one embodiment of the present invention enforce this restriction. Automatically managing SPM provides a number of benefits over traditional methods of managing SPMs. Automatically managing SPM enables the compiler to make the most efficient use of a target machine's SPM. By managing the SPM, the compiler may achieve higher performance on a given problem by packing into the SPM only the relevant data (e.g., accessed elements) of each array referenced; space is not allocated to elements not accessed. This allows the compiler to pack the relevant parts of a larger set of arrays into the SPM. It also may allow the compiler to keep data in the SPM for a longer period, because there is less contention for space in the SPM. Additionally, automatically managing SPM enables a computer designer to create an embedded computer with the smallest possible SPM. By using less SPM, the total cost of the embedded computer is reduced. P FIG. 1 depicts a detailed view of computer 100, including main memory 101, SPM 103, secondary storage device 104, Central Processing Unit (CPU) 106, video display 107 and input/output device 108 connected by system bus. Main memory 101 stores compiler 102. Compiler 102 may operate on source code 105 stored in secondary storage device 104 to produce executable object code. For simplicity it is assumed that related functions including, for example, linking and loading, are performed as necessary to provide an executable object module. Thus, memory allocation, including SPM, may be performed by a combination of software and other systems including, but not limited to, a compiler, linker, loader, specialized utility, etc. An example of a compiler suitable for use with methods and systems consistent with an embodiment of the present invention is any generic C compiler when adapted for use on a target machine, such as many off-the-shelf Digital Signal Processors (DSPs), that have SPM, as modified to the extent necessary to cooperate with other systems to implement SPM functionality according to an embodiment of the present invention. An example of an architecture synthesis program suitable for use with the methods and systems consistent with an embodiment of the present invention is the HP Labs PICO system. The compiler provides functionality according to the present invention so as to automatically generate appropriate object code (including object code file(s) and/or object module(s) and load modules) enabling intelligent use of the SPM. SPM 103 may be used to temporarily store certain elements from data arrays located in main memory 101. SPM 103 may also provide the stored elements directly to the CPU without interfacing with main memory 101, thereby improving performance and reducing data latency. Access to SPM 103 is enhanced by providing a direct interface between it and CPU 106 to support high speed data access and storage. Secondary storage device 104 may contain source code 105. Source code 105 may be a program written in a high level language, such as the C programming language. Source code 105 contains loop nests, and data arrays are accessed (read and written) within (i.e., in response to instructions included within) those loop nests. Secondary storage device 104 may further store object modules including library modules and executable modules or portions thereof, particularly when a paging mode of operation is used for execution of a large program. FIGS. 2A and 2B depict a more detailed view of a data map depicting locations of data structures stored in main memory 101 and SPM 103. Main memory 101 may contain original array 201 including all elements to be processed by CPU 106. Original array 201 includes both accessed and unaccessed elements. SPM 103 may contain a host array 202, ideally, including a subset of elements from original array 201 to be processed by CPU, such as only those elements in original array 201 that are actually predicted to be accessed by a certain loop nest (i.e., assuming normal entry to and exit from the loop nest). According to one embodiment of the invention, a function provides a mapping from the address space used to access elements of original array 201 to an address space storing, in host array 202, a subset of elements, including elements of original array 201 to be accessed while excluding or minimizing the number of unused elements. As explained, elements in host array 202 may be processed more rapidly than those elements in original array 201, since CPU 106 may directly access SPM 103 without interfacing with main memory 101 or contending for access to shared system bus facilities. The host array in SPM 103 is made as small as possible by selecting a suitable choice of the affine mapping from the original array indices to the host array indices. Elements may also be moved from original array 201 to host array 202, such that the compiler may use (i.e., produce an object module referring to) elements from SPM 103 instead of main memory 101. After CPU 106 processes the elements in any given loop nesting, the elements may be moved back to main memory 101. The compiler generates the appropriate code as part of, for example, the object module, to perform the appropriate data transfers and moves between main memory 101 and SPM 103. FIG. 3 depicts a flow chart of the steps performed by compiler 102 when compiling source code 105. The process is initiated, for example, by compiler 102 receiving source code 105 from secondary storage device 104 (step 301). The source code may contain loop nests containing instructions to access data arrays, the data arrays addressed with affine indexes. Once received, compiler 102 may optimize source code 105 by including code for relocating elements from original array 201 to host array 202 in SPM 103 (step 302). Step 302 is further described below with reference to FIGS. 5 and 6. In general, step 302 includes identification of loop nests and/or portions of loop nests, generating a function mapping an address space of the identified elements to an address space of the SPM, replacing references to the original address to the address space of the SPM, and generating code appropriate to effect the transfer of data between main memory and SPM (i.e., memory management overhead functions). Once these code changes have been made, compiler 102 may compile the source code to create object code (step 303). In one implementation, the compiler parses the original source code and creates an Intermediate Representation (IR) of the program. IR is a standard compiler term, referring to the data structure that represents the program inside the compiler. The compiler optimizes the program as a transformation of the intermediate representation (step 303). It then regenerates an optimized program in C or in another high-level language (step 304A). That optimized program may be compiled by a conventional compiler (step 305A) to create an executable object program. The benefits of the optimization are obtained through the creation of the optimized source code program. In this role a method according to one embodiment of the invention may be incorporated into or as part of a compiler preprocessor. In another embodiment of the invention, the compiler parses the original source code and creates an IR of the program. As part of creation of the IR, the compiler optimizes the code (step 303) as a transformation of the IR. It then proceeds to directly compile the optimized IR into an executable object program (step 304B). Using either method, the resultant optimized object code may then be executed at step 306, the method terminating at step 307. The benefits of the optimization are obtained by virtue of the fact that the generated object program includes the optimized code. In this role, the method may be part of the high-level optimization processing phase of a compiler. In yet another embodiment of the invention, an object code program may be analyzed to find loop nests containing partially affine references. The memory optimizations described are then applied to transform the initial object code program into an optimized object code program. In this role, a method according to an embodiment of the invention is performed as part of a compiler postprocessor, or of the low-level optimization, a phase of compiler operations. FIG. 4 depicts an alternative embodiment consistent with the present invention when determining the smallest amount of SPM needed for an embedded computer system and determining the most efficient use of the SPM. First, compiler 102 may receive source code program 105 and computer system performance parameters, such as the total number of clock cycles allowed to perform some or all of the loop nests in the program (step 401). Once received, compiler 102 may optimize source code 105 by relocating elements from original array 201 to host array 202 in SPM 103 (step 402). Step 402 is further described below with reference to FIGS. 5 and 6. Once optimized, compiler 102 may create a set of hardware specifications that indicate the smallest amount of SPM needed to compile source code 105 based on the optimization (step 403). FIG. 5 depicts a flow chart of the steps performed when optimizing source code to use SPM 103 according to one embodiment of the present invention. Providing a mapping between address spaces may be based on a matrix representation of the index functions and an understanding of affine transformations. First, for each loop nest (or subset of loop nests) in source code 105 (step 501), compiler 102 identifies sets of Uniformly Generated Partially Affine (UGPA) array references (step 502). An array reference (i.e., index) is partially affine if one or more of the array index functions in the reference is affine (a constant plus a linear function). Two partially affine references are uniformly generated if the linear parts (i.e., the multiplier term) of their affine references are identical. For example, array references F[2i] and array F[2i+1] are uniformly generated, affine references since the linear part, "2i", is the same in both references to F. A set of partially affine references is UGPA if each pair of references in the set is uniformly generated. The linear part of a partially affine reference can be expressed as an integer matrix. If there are "n" loop index variables and there are "m" affine index functions in a reference, then the matrix that expresses this reference has m rows and n columns. A lattice is the set of integer linear combinations of a set of vectors. The lattice generated by a matrix is the set of integer linear combinations of the columns of this matrix. If X is a matrix and {x_1, x_2, . . . , x_n} are the columns of X, then the lattice generated by X is the set of vectors of the form {a_1 x_1+ . . . +a_n x_n} where the parameters {a_1, . . . , a_n} are allowed to be arbitrary integers. Two uniformly generated references, having the same linear part but which may have different constant parts are said to be "alias equivalent" iff (if and only if) the difference between their constant parts is in the lattice generated by the integer matrix of their linear part. For example, the two references F[2i] and F[2i+2] are uniformly generated affine references with a common linear part (2i) whose matrix (i.e., multiplier term) is (2). The lattice generated by this matrix consists of all of the multiples of (2), that is, all the even integers. The two references have constant parts 0 and 2, and the difference between their constant parts, 0-2=-2, is an even integer, so it is in the lattice generated by the matrix of their linear part. Therefore these two references are alias equivalent. On the other hand, the reference F[2i+1] is uniformly generated with these other two references, but it is not alias equivalent to them. The property of alias equivalence is transitive: if reference R1 is alias equivalent to reference R2 and R2 is alias equivalent to reference R3, then it must be the case that R1 is alias equivalent to R3. For example, below is an exemplary loop nest in source code 105 containing arrays A and B:
In the above example loop nest consisting of three loops incrementing, in turn, k, j and i, there are two identified sets of UGPA array references. Set 1 has one distinct reference to the three dimensional array A: A[i][2*j][k+3] and Set 2 has the three distinct references to the two dimensional array B: B[2*i+4*k][i+2*j], B[2*i+4*k+1][i+2*j+0] and B[2*i+4*k+2][i+2*j+3]. Once the sets of UGPA array references are identified, for each UGPA set, compiler 102 partitions the references into a set of subsets of references. Within each subset, all of the references are alias equivalent. Once the alias-equivalent subsets of UGPA array references are identified, for each Alias-Equivalent Uniformly Generated Partially Affine Subset (AEUGPAS) (step 503), compiler 102 determines a new linear index function (step 505) and for each reference in the subset the compiler determines if there is sufficient reuse in the set (step 504) under examination to warrant inclusion. While unused references may clearly be excluded, references used only once or some small number of times may also be excluded so as to avoid incurring overhead costs exceeding benefits achieved by including the reference. These overhead costs may include, for example, processing time required to transfer data to SPM and, possibly, write-back data into main memory. In addition to accounting for fixed and variable overhead, step 504 may take into account the size of available SPM so as to include at least the most frequently accessed references or by implementing some other criteria to maximize system performance. As explained, each array reference in the AEUGPAS has the same linear part of its index function and a different constant part of its index function. To store the references from original array 201 into host array 202, a new linear and constant part of the index function may be determined. That is, the partially affine references to original array 201 are replaced with re-indexed references to the host array 202 in which the linear and constant parts of the affine index functions are replaced with new linear and constant parts. This will be a correct program transformation if each element of original array 201 that is referenced in the loop nest has exactly one corresponding position in host array 202, given by the new indexing functions. That is, if two iterations of the loop nest reference the same element of original array 201, then each iteration must reference the same element of host array 202, and no two different elements of original array 201 referenced in the loop nest may share or map to the same position in host array 202. After determining a new linear index function and constant vector, elements in the original array may be relocated to the host array. The new linear index function and constant vector provide a map to the SPM. Host array 202 is preferably substantially smaller than original array 201 and, therefore, requires less space in SPM 103. The test provided by step 503 is further described below with reference to FIG. 6. It should be noted that the compiler may decide (i.e., implement a threshold function) not to use an SPM host array when the benefit from doing so would be outweighed by the required overhead, e.g., the cost of moving data between the SPM host array and the main memory array. For example, SPM may be inefficient where there is little or no reuse of the data once it is migrated to the SPM array, such as when overhead requirements offset any speed increase provided by having data in SPM. After the new linear index function and constant vectors for host array 202 in SPM 103 are determined, the shape of host array 202 may be determined (step 506). The shape of host array 202 is typically not the same as, and is usually much smaller than, the shape of the original array 201. That is, the locus of all elements stored in the original array is typically defined by a regular polyhedron shaped space, while the geometry of elements accessed may be irregular and, in any case, define a smaller address space. Mapping from the original array address space to the SPM address space may result in an irregular SPM address space better fitted to the data to be accessed. Thus, the compiler replaces references in the AEUGPAS with reindexed references to a virtual host array, which does not yet have a determined shape, using the new linear index function in place of the original linear index function and the new constant vectors in place of the original constant vectors in each of the affine index functions. To determine the shape of the host array, compiler 102 calculates the size of the smallest rectangle containing all of the entries in said virtual host array that are referenced by a particular set of loops comprising a loop nest using the new indexing function just installed, the associated loop nest being implemented by source code 105. Note that a "loop nest" may include one or more nested loops which, in turn, may be part of a larger loop nest. More formally, loopnest=loop contain within loop or loopnest contained within loop. The section of the number of loops to be processed according to an embodiment of the invention may be based on several factors including the size of available SPM and the amount of SPM required to store the remapped data for a particular number of inner-most loops comprising a loopnest. Compiler 102 may use the shape of this smallest possible rectangle to define the boundaries of host array 202 so that all of the accessed elements original array 201 are included in host array 202. Compiler 102 inserts a declaration of host array 202 into the optimized source code program once it has determined the shape of host array 202. The declaration reserves space for host array in SPM and provides for generation of the appropriate symbol tables. Once the shape of host array 202 is determined (step 506), compiler 102 may replace references identified in each set of UGPA references in original array 201 from source code 105 with references to host array 202 (step 507). Based on the new linear index function and constant vector (determined in step 505), all occurrences of the array references in original array 201 may be replaced. Each reference in the original array is replaced with a reference to the host array. Finally, code to move elements from original array 201 to host array 202 array is inserted into source code 105 (step 508). For example, if an original array in source code 105 has 80,000 elements and only 800 elements are actually accessed, references to the 800 elements may be relocated to references to a host array having 800 elements in SPM 103. To do so, compiler 102 inserts code into source code 105 to move the accessed elements from main memory 101 to SPM 103. This source code may consist of a loop nest that loops over the elements of the SPM host array. For each element of the SPM host array there is a calculation of the location of the corresponding element in the original array. The data is moved by as assignment statement from the original array element whose location has been calculated to the host array element. A method for calculating the original array index corresponding to a given host array index is specified in more detail below. Compiler 102 also recasts certain elements (using a reference to those elements) from original array 201 to host array 202. Thus, if any changes have been made to the elements while in host array 202 (e.g., writes), compiler 103 may generate code to move the changed elements back to original array 201 at program execution time. FIG. 6 is a flow chart of steps performed to determine a new linear index function and constant vector according to one embodiment of the present invention. First, compiler 102 extracts an integer matrix (e.g., "F") and a set of integer constant vectors (e.g., {f_1, f_2 f_3}) from the set of UGPA sets (step 601). For example using the above example referred to in FIG. 5, the integer matrix for Set 1 (A [i] [2*j] [k+3]) may be: ##EQU1## representing the coefficients of index and the constant vector for Set 1 may be: ##EQU2## representing the constant term. The integer matrix for Set 2 may be: ##EQU3## representing the coefficients of index terms i, j, and k and the constant vectors for Set 2 may be: ##EQU4## Once the integer matrix and constant vectors are extracted for all UGPA sets, the compiler partitions each UGPA set into AEUGPA subsets. For example using the above example referred to in FIG. 5, Set 2 may be partitioned into one AEUGPAS consisting of the references with constant vectors f_1 and f_3, and a second AEUGPA subset consisting of only reference f_2. Every integer matrix has a generalized hermite factorization. The generalized hermite factorization may be described as follows: If F is an m X n integer matrix of rank r, then F=HV, where A unimodular matrix is an integer matrix with determinant of either 1 or -1. Methods to compute the generalized hermite factors of a given integer matrix are well known in the field of integer matrices as described by, for example, George Havas, Bohdan S. Majewski, and Keith R. Mathhews, "Extended GCD and Hermite Normal Form Algorithms Via Lattice Basis Reduction," Experimental Mathematics, v. 7 (1998), pp. 125-136. Once the integer matrix and constant vectors are extracted for all UGPA sets, compiler 102 determines the generalized hermite factorization of the integer matrix (step 602). For example using the above example referred to in FIG. 5, the generalized hermite factorization for Set 1 may be: F=H V where H is equal to F and V is the identity matrix. The generalized hermite factorization for Set 2 may be: where ##EQU5## and V consist of the first two rows of the unimodular matrix ##EQU6## The length of a vector can be measured using various methods including, for example, a measure of the length of a vector provided by its 1-norm, which is the sum of the absolute values of the elements of the given vector. For a set of vectors, we may measure the 1-norm of each vector in the set and take the product of these as a measure of the size of the set of vectors. If a single affine reference with the linear part F occurs in a loop nest, and the loops of the nest have iteration counts given by the vector z, then the size of the smallest rectangle that contains all of the array elements referenced in this loop nest is well approximated by the product of the 1-norms of the rows of the matrix obtained by multiplying each column of F by the corresponding element of z. The present invention provides means to reduce this estimate of the size of an SPM host array by choosing a new linear indexing function for which the estimate is smaller than it is with the original linear indexing function whose matrix is F. The matrix of the new linear indexing function that is used to reference elements of the host array is G V, where V is the hermite factor of F determined above, and G is an r X r unimodular matrix determined as follows. At the iteration of a loop nest having an iteration index of vector I, a typical reference in an equivalence class accesses the array element whose original array index is F I+f. This iteration in the optimized loop nest will instead access the SPM host array element whose array index is G V I. The integer linear combinations of a set of vectors is called a lattice. These vectors determine the lattice. If a set of vectors is linearly independent then they are called a basis for the lattice they determine. If X is any matrix and G is any unimodular matrix, the rows of X and the rows of G X determine the same lattice. If the rows of X are linearly independent, then the rows of X and of G X are bases of this lattice. While they are both bases of the same lattice, the rows of X and of G X may have different 1-norms, and the product of the 1-norms of the rows of X and of G X can differ. In step 603, compiler 102 extracts the number of iterations of each of the loops in the loop nest, and creates an integer vector z whose elements are these iteration count numbers. Compiler 102 next creates a diagonal matrix D(z) whose diagonal entries are these same iteration numbers. Compiler 102 then forms a new matrix V D(z), which is the same as matrix V except that each column is multiplied by the corresponding element of z, the iteration counts. In the next part of step 603, compiler 102 computes a 1-norm reduced basis for the lattice spanned by the rows of V D(z). To do so, compiler 102 computes a unimodular r X r matrix G, so that the rows of G V D(z) have a smaller product of 1-norms than the rows of V D(z). Using the example described with reference to FIG. 5, the rows of the matrix V D(z) are a best-possible basis for the lattice that these rows span. Therefore, the matrix G would be the identity matrix. Regarding Set 2, the 1-norm lattice basis reduction may be computed as follows: the vector z of iteration counts (i.e., for i=0 to 4; j=0 to 9; and k=0 to 14) is computed and is equal to the vector (5,10,15). The matrix V D(z) is equal to: ##EQU7## The compiler computes a new basis for the lattice generated by the rows. The new basis consists of the rows of G V D(z); for the example of EXAMPLE 5 the unimodular matrix G is ##EQU8## and the new linear indexing function has matrix G V which is the matrix ##EQU9## The new linear part of the affine index function for this UGPA set is now determined. It is the linear function from loop indices to array indices given by G V I, where I is a vector of loop indices. The next task for the compiler is to determine new constant vectors for use in reindexed references to host array 202. Compiler 102 computes a left inverse of the hermite factor H of linear index matrix F of the UGPA set (step 604). To compute a left inverse of a matrix X, the columns of X must be linearly independent. The hermite factor H of any matrix has linearly independent columns, so it has left inverses. Given a matrix X with linearly independent columns, a left inverse of matrix X is a matrix Y, such that Y X is the identity matrix. A left inverse of an integer matrix may have rational (fractional) elements. For example, any matrix (a b) for which a+b=1 is a left inverse of the matrix ##EQU10## We denote the left inverse computed in this step by left_inv(H). Compiler 102 then selects one reference from each AEUGPAS as the dominant reference in the subset. For this reference, the new constant vector is zero. For each of the remaining references, the new constant vector is G left_inv(H) (f-f—0) where f is the original constant vector of the remaining reference and f—0 is the original constant vector of the dominant reference. Note that the left inverse is useful only if there is at least one equivalence class with two or more references. If there is no such equivalence class, there is no need to compute the left inverse. For example using the above example referred to in FIG. 5, Set 1 does not require a left inverse since there is only one dominant reference in its class. Regarding Set 2, the three references fall into two equivalence classes under the alias equivalence relation, as explained above. The left inverse is: ##EQU11## The first subset of alias-equivalent references in Set 2 consists of the references with offset vectors f_1 and f_3 and the second such subset in Set 2 consists of the reference with offset f_2 . Note that f_1-f_3 is in the lattice generated by the columns of F and f_2-f_1 is not. Since at least one equivalence class has two or more members, compiler 102 will need the left inverse to compute new constant vectors. Next in step 605, compiler 102 may determine new offset vectors. Compiler 102 may also determine a new constant vector to relocate the references from original array 201 to host array 202. For each array reference to be replaced, there is a constant vector. Compiler 102 uses "f" computed by Step 601, "G" from Step 603 and left inverse (H) from Step 604. To determine the new offset vectors to be used in replacing the references (in Step 506), compiler 102 subtracts the dominant vector from the offset vectors of the references to be replaced, multiplies the results by a left inverse of H (as determined in Step 604) and multiplies the results by G. This creates the new offset vectors to be used in the replacing references. For example using the above example referred to in FIG. 5, the new offset in Set 1 is the zero vector. Referring to Set 2, the first equivalence class of references the reference with constant vector f_1 is the dominant reference. The "projected" offset for the other reference is given by G left_inv(H) (f_3-f_1), which is the new offset vector ##EQU12## The other two new offset vectors are zero, because their references are the dominant references in their classes. In step 606, new vectors are used to replace the references. The vector that replaces f1 may be G left_inv(H)(H) (f_1-f), where f is the constant vector of this chosen dominant reference. For example using the above example referred to in FIG. 5, in Set 1 G and V are both equal to a 3×3 identity matrix. The loop bounds vector is z=[5, 10, 15]. The set of projected offsets consists of the zero vector. The shape of the box that contains this one vector in 3-space is o=[1, 1, 1]. The shape of host array 202 may be the vector abs(G V) (z-1)+o, which is SPM_shape=(z-1)+o=[5, 10, 15]. Note that the host array has 750 elements and the original array has 2000 elements. Compiler 102 may replace the reference to A[i][2*j][k-1] with a reference to SPM_A[i][j][k], and insert a loop to load SPM_A[i][j][k] from A[i][2*j][k-1]before the loop nest. Referring to Set 2 , the matrix abs(G V) is equal to: ##EQU13## and the vector abs(G V) (z-1) is equal to ##EQU14## For the first class, the offsets are the zero vector and the vector G u—3 above, and the shape of the box that contains these is ##EQU15## Compiler 102 allocates a host array 202 (e.g., SPM_B_1) of shape ##EQU16## for this class. For the other class o=[1 1] a second host array (e.g., SPM_B_2) of shape ##EQU17## is needed. These two host arrays may have 650 and 552 elements, respectively and the original array has 2100. Compiler 102 replaces the first two references to B in the loop nest (both with offset zero) by references to SPM_B_1 [i+2*j][j+k]. Compiler 102 also replaces the fourth reference, having an offset f_3, with a reference to SPM_B—1 [i+2*j+3][j+k+1]. The third reference, which is the one and only reference in the second equivalence class, is replaced with a reference to SPM_B—2 [i+2*j][j+k]. Compiler 102 also inserts two loops to load SPM_B_1 and SPM_B_2 and one loop to store SPM_B_1 after the loop nest. FIGS. 7A-7C show the shape of host array 202 and the location of data to be accessed therein after completing the various steps in FIG. 6. FIG. 7A indicates the shape of original array 201, wherein data to be accessed is represented by a plus signs ("+"). FIG. 7B indicates the shape of host array 202 after performing Step 602, wherein data to be accessed is represented by asterisks ("*"). FIG. 7C indicates the shape of host array 202 after performing 1-norm step 603 wherein data to be accessed is represented by oh's or circles ("o"). In each case, it can be seen that re-indexing concentrates the data to be used into a smaller array, while data not to be used is not included or minimized in the target SPM host array. As mentioned above, the optimized source code must include inserted code that moves data from its location in an original array to the corresponding location in the SPM host array, and back if necessary. In these inserted loop nests, it is necessary to calculate the array index of the original array element that holds the same datum as the SPM host array element whose host array index is b. This mapping can be realized with the aid of the mathematical formula where H is the hermite factor of F and inv(G) is the inverse of G, which is also an integer, unimodular matrix. A part of the process is to select the loop nests to which to apply this technology. Not all loop nests will benefit from this optimization, for two primary reasons. First, when there is no reuse of the array data, but rather each array element references is only used once (or a small number of times less than some threshold value), then there is nothing to be gained by moving the data from main memory to SPM, because the cost of moving the data out of main memory (overhead) would be equal (or greater than) to the cost of accessing the data from main memory when it is used in the loop nest. This may occur when the compiler considers only the innermost loop or loops in a larger loop nest. A second situation wherein there may be little benefit to optimization includes cases wherein, the size of the SPM host arrays needed to hold the elements of even one of the arrays accessed by the loop nest exceeds the available SPM space. This might happen if the compiler considers only the outermost loops in the loop nests found in the program, wherein the number of elements to be accessed is too large to be accommodated in SPM. When applied to a machine with a fixed size SPM, a given loop nest may need more space than provided by the computer platform. For each nest examined, the compiler can decide to pack into the SPM as many of the host arrays as will fit, choosing the set that gives the greatest performance benefit if there are several alternative subsets of host arrays that would fit. The choice of a most beneficial set of host arrays that fits can be made by solving a standard combinatorial optimization problem known as a "knapsack" problem. One problem in re-indexing is selection of the loop nests to be optimized. One method of loop selection requires that the compiler perform a preliminary tiling transformation on the source code program. A tiling transformation is one in which the compiler explicitly creates loop nests that are known to have reuse of data from relatively small parts of certain source code arrays. Therefore, in this situation, the compiler would already know that such a loop nest is a good candidate for applying the optimizations described above. A second method, particularly well suited to situation wherein there is no prior identification of specific loops to which the present optimization method can be successfully or optimally applied, involves having the compiler start with innermost loops, and evaluate the performance improvement and SPM space requirement for using the present optimization. The compiler then considers each outer loop in each nest in turn and evaluates the SPM space required and the benefit achieved for each level of nesting. In general, both the benefit and the SPM space requirement will grow as one considers loops farther out in the loop nest. The compiler may then choose to apply the optimization at the most beneficial level of nesting in each of the program's loop nests. While one embodiment of the invention is directed to managing operations and improving performance of SPM, the invention is equally applicable to other environments. For example, many conventional computers have no SPM. Instead, such a conventional computer may be limited to a main memory and a single- or multi-level cache. The cache holds recently referenced data. When a memory word is located or stored (referenced) by the processor, the cache line that includes the referenced word is moved into the cache, displacing the cache line that previously occupied the cache space (a physical cache line) used. A cache line typically consists of 64 or 128 consecutive bytes of the main memory address space, but it may contain more. Thus, it will typically contain from one to four 32- or 64-bit words, i.e., usually it will contain several words. A loop nest that contains partially affine references may, in some cases, make poor use of the cache for two reasons. First, if the set of array elements that are references by the loop nest are not found in consecutive locations, but are sparsely distributed, then when a cache line is moved into the cache, some, or even most, of the words in the cache line will contain array elements that are not referenced in the loop nest. This wastes the valuable resource of space in the cache. Second, if the set of array elements referenced covers a large range of memory addresses, then several of these referenced words may be mapped by the cache associativity algorithm (an algorithm built into the cache hardware that determines the set of cache locations that are allowed to store a given main memory cache line) into the same physical line of the cache, conflicting for this space, so that not all of them can be held in the cache. This may occur while other physical cache lines remain unused because none of the referenced array elements is mapped by the associativity algorithm into one of these unused physical cache lines. A program can avoid these undesirable effects by the use of a small auxiliary array, known as a cache mirror. The program can move the referenced data from an original array into a small cache mirror array prior to executing a loop nest that refers repeatedly to this data. The referenced data are densely packed into the cache mirror array, and this dense packing prevents the two undesirable effects discussed above. A compiler whose target machine is a conventional computer can optimize the performance of a program by using the method presented above, substituting a cache mirror array for the SPM-resident host array. It can use precisely the same techniques for determining the size of the cache mirror array (as was used to determine the size of the SPM host array) and for determining the new index function used to reference the data in the cache mirror array. The program would also generate the same code to move the data from the original array to the cache mirror array. As previously explained, establishment of a cache mirror in main memory may require definition of a separate array space that is smaller than the original array. For example, with reference to FIG. 8A, an original array 801 includes 100 elements. Assuming that a particular loop nest accesses only 37 (or fewer) of these elements, a cache mirror array 802 as shown in FIG. 8B including these 37 elements is also defined in main memory. Because the cache memory management system will automatically copy these 37 elements into cache memory, and since the elements are located at contiguous addresses in main memory, the copies in cache are also optimized, resulting in enhanced cache performance.
|
Same subclass Same class Consider this |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
