Aggregate structure identification and its application to program analysis6279149Abstract An efficient program analysis method is provided for lazily decomposing aggregates (such as records and arrays) into simpler components based on the access patterns specific to a given program. This process allows us both to identify implicit aggregate structure not evident from declarative information in the program, and to simplify the representation of declared aggregates when references are made only to a subset of their components. The method can be exploited to yield: (i) a fast type analysis method applicable to program maintenance applications (such as date usage inference for the Year 2000 problem); and (ii) an efficient method for atomization of aggregates. More specifically, aggregate atomization decomposes all of the data that can be manipulated by the program into a set of disjoint atoms such that each data reference can be modeled as one or more references to atoms without loss of semantic information. Aggregate atomization can be used to adapt program analyses and representations designed for scalar data to aggregate data. In particular, atomization can be used to build more precise versions of program representations such as SSA form or PDGs. Such representations can in turn yield more accurate results for problems such as program slicing. Our techniques are especially useful in weakly-typed languages such as Cobol (where a variable need not be declared as an aggregate to store an aggregate value) and in languages where references to statically-defined sub-ranges of data such as arrays or strings are allowed. Claims We claim: Description BACKGROUND OF THE INVENTION
DataRef ::= ProgVars .vertline.
DataRef[Int.sub.+ :Int.sub.+ ]
.vertline.
DataRef\Int.sub.+
Stmt ::= DataRef .rarw. DataRef
This grammar describes the parts of a traditional imperative language necessary to describe our algorithm. A statement d.sub.1.rarw.d.sub.2.epsilon. Stmt represents an assignment which copies the contents of data reference d.sub.2 into d.sub.1. For any program P, let Stmts(P) denote the set of such copy statements contained in the program. A data reference d .epsilon. DataRef is a reference to some sequence of abstract locations (bytes) and takes one of the following forms: a program variable x .epsilon. ProgVars (the length of which will be denoted by .vertline.x.vertline.) a sub-range d[i : j] of locations i through j of some data reference d a single, statically indeterminate element of a data reference representing an array of n elements, denoted d.backslash.n Sub-ranges are used to represent a sequence of locations at a statically-determined position in a data reference. For example, if d refers to a record, then d[i : j] can be used represent a reference to a field of the record, and if d identifies an array, then d[i : j] can be used to represent a reference to the locations comprising a specific element of the array. The notation d.backslash.n is intended to suggest that if we break up the sequence of locations that d denotes into n sub-sequences of equal lengths, then d.backslash.n denotes one of these n different sub-sequences. We now define the set of all locations as: Loc={<x, i>.vertline.x .epsilon. E ProgVars, 1.ltoreq.i .ltoreq..vertline.xl} Different elements of ProgVars thus represent disjoint sets of locations. (These are the "top level" variables, also called as level 01 variables in Cobol.) For convenience, we will denote location <x; i>as simply x[i]. As explained above, we may not know the precise sequence of locations referred to by a data-reference d in the general case (e.g., due to a reference to an array element at a statically-indeterminate index). Hence, we treat a data-reference d as a reference to one of a set of sequences of locations, and we will denote this set by D[d]. Formally, we define D by: ##EQU1## where .sigma.[i] indicates the i-th element of sequence .sigma., and .vertline..sigma..vertline. denotes the length of a sequence .sigma.. Further, we represent a sequence by enumerating its elements separated by a dot (.). Thus, a. b. c. denotes the sequence consisting of three elements a, b, and c. Note that all elements of D[d] have the same length, which we will denote .vertline.d.vertline.. For example, let x, y .epsilon. ProgVars. Then, x[3 : 5] denotes the singleton set {x[3]. x[4]. x[5]}. A more interesting example is ((y[1 : 10])/2)[2 : 3]. Here, y[1 : 10]/2 is a referen arbitrary element of a 2-element array; the array as a whole occupies the first 10 locations of y. The sub-range [2..3] is then selected from the array element. Hence, the set of locations referred to consists of {y[2]. y[3], y[7]. y[8]}. In other words, ((y)[1 : 10])/2)[2 : 3] is a reference yo either locations y[2] and y[3] or locations y[7] and y[8]. Sometimes we may be interested only in the set of locations denoted by a data reference. For any sequence .sigma., let set(.sigma.) denote the set {.sigma.[i].vertline. 1 .ltoreq.i .ltoreq..vertline..sigma..vertline.} of elements in the sequence. For any data reference d, define Locs[d] to be .orgate.{ set(.sigma.) .vertline..sigma..epsilon. D[.sigma.]}. Locs[d] identifies the set of locations possibly denoted by a data-reference d. For example, consider the data-reference d=((y[1 : 10])/2)[2 : 3] (from the previous example). Locs[d] is, in this case, the set { y[2], y[3], y[7], y[8]}. We will now define an abstract semantics D.sup..gradient. (d.sub.1.rarw.d.sub.2) for the assignment statement d.sub.1.rarw.d.sub.2, which simply consists of the set of all pairs of locations (l.sub.1 ; l.sub.2), written symbolically as l.sub.1.rarw.l.sub.2, such that the assignment statement might copy the contents of location l.sub.2 to location l.sub.1. This semantics is defined as follows: D.sup..gradient. (d.sub.1.rarw.d.sub.2)={.sigma..sub.1 (i).rarw..sigma..sub.2 (i).vertline..sigma..sub.1.epsilon.D[d.sub.1 ], .sigma..sub.2.epsilon.D[d.sub.2 ], 1.ltoreq.i.ltoreq.min (.vertline..sigma..sub.1.vertline., .vertline..sigma..sub.2.vertline.).} In the rest of this document, we will assume that for every statement d.sub.1.rarw.d.sub.2, .vertline.d.sub.1.vertline.=.vertline.d.sub.2.vertline.. As an example, consider the statement x[1 : 2].rarw.y[3 : 4]. This assignment statement, when executed, will copy the contents of location y[3] to location x[1] and the contents of location y[4] to location x[2]. The abstract semantics (defined above) of the statement x[1:2].rarw.y[3:4] captures this information and is given by the set {x[1].rarw.y[3], x[2].rarw.y[4]}. The abstract semantics for assignments that occur in a given program P .epsilon. Pgm can be used to define an equivalence relation on locations that will be useful in the sequel (e.g., as the basis for inferring equivalent types). To this end, we first define: E=.orgate.{D.sup..gradient. (d.sub.1.rarw.d.sub.2).vertline.d.sub.1.rarw.d.sub.2.epsilon.Stmts(P)} Now, let .ident..sub.p denote the smallest equivalence relation containing the set of pairs of locations E (i.e., the equivalence closure of E). We will omit the subscript P if no confusion is likely. The equivalence relation defined above captures how information may flow between the different variables in a program. As an example, consider a program consisting of the two statements x[1:2].rarw.y[3:4] and z[5:6].rarw.x[1:2]. As explained above, the abstract semantics of the first statement is the set: {x[1].rarw.y[3], x[2].rarw.y[4]}. Similarly, the abstract the second statement is the set {z[5].rarw.x[1], z[6].rarw.x[2]}. The set E is the union of these two sets and is given by {x[1].rarw.y[3], x[2].rarw.y[4], z[5].rarw.x[1], z[6].rarw.x[2]}. The equivalence relation defined above, in this case, partitions these locations into two equivalence classes (sets), namely {y[3], x[1], z[5]} and {y[4], x[2], z[6]}. This reflects the fact that, during program execution, information may flow between locations y[3], x[1], and z[5]. Similarly, information may flow between location y[4], x[2], and z[6]. However, information WELL NOT flow from a location, say y[3], in one equivalence class to a location, say z[6], in another equivalence class. The Equivalence DAG Data Structure In this section we focus on the problem of computing the equivalence relation .ident..sub.p, given a program P. The goal of this section is to give the reader an understanding of the essential algorithmic contributions of this invention, primarily through examples. We will describe extensions and applications of this method in subsequent sections. Rather than generate an explicit representation of the equivalence relation .ident..sub.p, we will actually generate a more compact representation of the equivalence relation that can be used to answer queries about whether two locations are equivalent or not. We will also refer to a statement d.sub.1.rarw.d.sub.2 as an equivalence constraint d.sub.1.congruent.d.sub.2 for notational convenience. We start with a simple version of the problem, where every constraint has the form x.congruent.y, where x, y .epsilon. ProgVars. In this case, rather than partitioning Loc into the equivalence classes, we can partition ProgVars into equivalence classes. Then, <x, i>.ident..sub.p <y,j> iff variables x and y are in the same equivalence class and i=j. The partitioning of ProgVars can be done as follows: initially place every program variable x .epsilon. ProgVars in an equivalence class by itself, and then process the equivalence constraints one by one; a constraint x.congruent.y is processed by merging the equivalence classes to which x and y belong into one equivalence class. Preferably, this is accomplished utilizing the well-known union-find data-structure as described in Cormen, T., Leiserson, C., and Rivest, R, "Introduction to Algorithms," MIT Press, Cambridge, Mass., 1990; and Tarjan, R. E., "Data Structures and Network Algorithms," Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1983, herein incorporated by reference in their entirety. Now, consider a version of the problem where every constraint is of the form x[i:j].congruent.y[k: l], where x, y .epsilon. ProgVars. There are two aspects to the original solution that we would like to preserve when we address this generalized problem. The first aspect is that the method processes every constraint in P exactly once, instead of using an iterative (e.g., transitive-closure-like) method. The second aspect is that we would like to identify "ranges" of locations that are equivalent to each other and partition them into equivalence classes. This can represent .ident..sub.p more compactly than a partition of the set of all locations into equivalence classes. We now illustrate through an example how we can achieve these goals. Assume that W, X, Y, Z .epsilon. ProgVars, and that .vertline.W.vertline.=6, .vertline.X.vertline.=12, .vertline.y.vertline.=8, and .vertline.Z.vertline.=12. Assume that P consist of three equivalence constraints, X[5..8].congruent.Y[1..4], Z[1..6].congruent.W[1..6], and X[3..12].congruent.Z[1..210]. We start off by placing every variable in an equivalence class by itself. We process the first constraint X[5..8].congruent.Y[1..4] as follows. We break up the range X[1..12] into three sub-ranges X[1..4], X[5..8], and X[9..12] and place each in an equivalence class by itself. We refer to this as adding breakpoints. We similarly break up range Y[1..8] into two sub-ranges Y[1..4] and Y[5..8], placing each in an equivalence class by itself. We then merge the equivalence classes to which X[5..8] and Y[1..4] belong into one. Given this kind of a partition of every program-variable into a collection of sub-ranges, every location belongs to a unique sub-range of the partition. We can map every location l into a pair (e.sub.1, .sigma..sub.1) where e.sub.1 is the equivalence class of the unique sub-range containing l and .sigma..sub.1 is the offset of the location within that sub-range. Further, locations l.sub.1 and l.sub.2 are equivalent (l.sub.1.ident.if - 12) iff the e.sub.11 =e.sub.12 and .sigma..sub.11 =.sigma..sub.12. For example, location X[6] will be mapped to (ec(X[5..8]), 2) where ec(r) denotes the equivalence class containing sub-range r. Similarly, location Y[2] will be mapped to (ec(Y[1..4]), 2). Since ec(X[5..8])=ec(Y[1..4]), these two locations are equivalent. Let us briefly revisit the step where we "break up" a range, say X[1..12], into a sequence of sub-ranges, say X[1..4], X[5..8], and X[9..12]. It turns out to be convenient to keep both the original range and the new sub-ranges around, and to capture the refinement relation between these into a treelike representation (rather than, for instance, replacing the original range by the new sub-ranges). FIG. 6(a) and FIG. 6(b) illustrate the representations of the refinement of X and Y for the above example. Each rectangle in the figure, which we will refer to as a node, denotes an equivalence class of sub-ranges, and the number inside indicates the length of each sub-range contained in the equivalence class. Further, we will refer to a node as an internal node if it represents a range that has been broken up into smaller subranges and as a leaf node otherwise. FIG. 6(c) indicates the representation after X[5..8] and Y [1..4] have been merged. The next constraint (Z[1..6]; W[1..6]) is processed just like the first constraint, as illustrated by FIGS. 6(d) and 6(e). In the general case, processing a constraint d.sub.1.congruent.d.sub.2 consists of the following steps: (i) first, adding breakpoints to the representation before the starting-location and after the ending-location of both d.sub.1 and d.sub.1 ; (ii) the sub-ranges d.sub.1 and d.sub.2 can then be represented by a sequence of nodes, say .sigma.=[s.sub.1, . . . , s.sub.k ] and .sigma..sub.2 =[t.sub.1, . . . , t.sub.m ] respectively; (iii) these two sequences are made equivalent to each other as follows: if s.sub.1 and t.sub.1 denote ranges of the same length, we simply merge the two into one equivalence class and proceed with the remaining elements of the sequence. If the two denote ranges of different lengths, we then break up the bigger range, say s.sub.1, into two sub-ranges, s.sub.1 and s.sub.1 ", such that s.sub.1 has the same length as t.sub.1. We then merge s.sub.1 with t.sub.1, and continue on, making the sequences [s.sub.1 ", s.sub.2 . . . , Sk] and [t.sub.2, . . . t.sub.m ] equivalent. The third constraint X[3..12].congruent.Z[1..10] illustrates the more general scenario described above. After adding the necessary breakpoints, the range Z[1..10] is represented by the sequence [s.sub.1, s.sub.2 ] (see FIG. 6(f)), while the range X[3..12] is represented by the sequence [t.sub.1, t.sub.2, t.sub.3 ]. s.sub.1 is longer than t.sub.1, and is broken up into sub-ranges s.sub.1 and s.sub.1 ", as shown in FIG. 6(g). After this, things are straightforward: t.sub.1 is merged with s.sub.1, t.sub.2 is merged with s.sub.1 ", and t.sub.3 is merged with s.sub.2. FIG. 6(h) shows the resulting representation. Clearly, given a location l, we can walk down the DAG (directed acyclic graph) as shown in FIG. 6(h), from the appropriate root to a leaf (see below) e.sub.1 to map the location to a pair (e.sub.1, .sigma..sub.1) l.sub.1.ident.l.sub.2 iff (e.sub.11, .sigma..sub.11)=(e.sub.12, .sigma..sub.2). We call the representation generated by this method an Equivalence DAG. In the above description of the method, we assumed that the nodes in the sequences .sigma..sub.1 and .sigma..sub.2 were leaf nodes. Even if that were true when the processing of the two sequences begins, when we get around to processing elements s.sub.i and t.sub.j, the processing of the earlier elements of the sequences could have had the effect of adding breakpoints to either s.sub.i or t.sub.j or both, converting them into internal nodes. However, this is not a problem. We simply replace any internal node encountered in the sequences by its two children, and continue on. A simple example that illustrates this behavior is the single constraint A[1::12].congruent.A [5::16]. In the general version of the problem, a constraint d.sub.1.congruent.d.sub.2 may consist of arbitrary data-references (see FIG. 5). For example, consider the following where P, Q, R .epsilon. ProgVars with .vertline.P.vertline.=20, .vertline.Q.vertline.=10, and .vertline.R.vertline.=2. Assume we have two constraints P[1..10].congruent.Q[1..10] and (P[1..20])/10.congruent.R[1..2]. The first constraint is processed as described above, producing the representation illustrated in FIG. 7(a). The second constraint is processed as described above, producing the representation illustrated in FIG. 7(b). The nodes labeled u and v represent arrays consisting of 5 elements each of size 2. A detailed description of the preferred methodology for constructing the Equivalence DAG for a program P is illustrated in the pseudo-code of FIGS. 11, 12 and 13, which we will explain in detail later on. Type Analysis In this section, the method for determining an equivalence relation described above is extended to address a generalization of the Y2K type analysis problem discussed earlier. This type analysis method can be used, for instance, to answer questions such as the following. Consider the example depicted in FIG. 6. Assume the programmer wants to modify the representation of a field of the variable W, say W[1 : 2]. What other variables are likely to be affected by this change, requiring corresponding modifications to their own representation? We now present a precise formulation of the type analysis problem. Let L denote some semi-lattice with a join operator. We may think of the elements of L as denoting "abstract types." An example is the lattice of subsets of the set {fyearRelated, notYearRelated } used for year usage inference in the example of FIG. 3. Another simple lattice consists of only two elements Affected and NotAffected with Affected.gtoreq.NotAffected. This lattice is useful to answer questions about variables affected by some change, of the kind illustrated in the previous paragraph. A function .pi. from Loc to L represents a "typing" for the set of locations. We say that .pi. is valid with respect to a program P if .pi.(l.sub.1)=.pi.(l.sub.2) for all l.sub.1.ident..sub.p l.sub.2. Now consider a constraint of the form d.gtoreq.c, where d .epsilon. DataRef and c .epsilon.L. We say that .pi. satisfies this constraint iff: .pi.(l).gtoreq.c for every l.epsilon.Locs[d] (In other words, the constraint d.gtoreq.c says that every location denoted by data-reference d must have a type greater than or equal to c.) Given two typing functions .pi..sub.1 and .pi..sub.2, we say that .pi..sub.1.gtoreq..pi..sub.2 iff .pi..sub.1 (l)>.pi..sub.2 (l) for all l .epsilon. Loc. Given a program P and a set C of constraints of the form d.gtoreq.c, where d .epsilon. DataRef and c .epsilon. L, we are interested in computing the least typing valid with respect to P that satisfies every constraint in C. For example, consider a constraint of the form d.gtoreq.c where d is interpreted as representing a set of locations. The meaning of the constraint is that the type of every location in the set represented by d is at least c. In other words, every location in d may store a value of type c. Of course, it may store values of other types also, and this is what the "at least" part implies. Mathematically, this is captured by the inequality condition. The idea behind the definition of a valid typing is that if two locations l.sub.1 and l.sub.2 are equivalent (that is, in the same equivalence class), then there is a potential information flow between the two locations. Hence, we require the two locations to have the same type. Hence, we are interested in identifying a valid typing that satisfies all the constraints provided by the user. Further, we want the "least" valid typing satisfying this condition: in other words, the solution should say that a location l has a type t iff this is implied by the program and the constraints provided by the user. We now illustrate how we can efficiently compute the desired solution, using the data structure and method presented earlier. We first process the equivalence constraints induced by program P to produce the Equivalence DAG. We then associate every leaf of the resulting DAG with a value from L, which is initially the least element of L. We then process every constraint d.gtoreq.c in C as follows. The data-reference d can be mapped onto a sequence of leaves in the Equivalence DAG (possibly after adding new breakpoints). We update the value associated with each of these leaves to the join of their current value and c. The example in FIG. 8 illustrates this. Assume that we start with the Equivalence DAG shown in FIG. 6(h), and process the constraint W[1 : 2].gtoreq. Affected. We are using the lattice L consisting of only two elements Affected and NotAffected described above. W[1 : 2] corresponds to the single leaf u (shown as a bold rectangle in FIG. 8), whose value is then updated to Affected. The resulting DAG can be viewed as a compact representation of the desired solution. In particular, it can be interpreted as a function .pi. from locations to L: the value associated with any location l is obtained by traversing the DAG to map location l to a leaf e.sub.1 in the DAG (as explained earlier), whose value yields .pi.(l). In particular, the DAG in FIG. 8 maps locations X[3], X[4], Z[1], Z[2], W[1], and W[2] to type Affected and every other location to type NotAffe This indicates that the locations X[3:4] and Z[1:2] are potentially affected by the proposed change to the representation of W[1:2]. Term Unification For any set X, .GAMMA..sub.x denote set of terms defmed by: .GAMMA..sub.x ::=X.vertline..GAMMA..sub.x.sym..GAMMA..sub.x.vertline..GAMMA..sub.x xInt.sub.+ where Int.sub.+ denotes the set of positive integers. Sometimes, we will write a term .tau.x j as .tau..sup.j for notational convenience. Let .alpha.=.orgate..sub.i>0 V.sub.i denote a set of variables. A variable belonging to V.sub.i is said to have a length i. We will use the notation x: i to indicate that a variable x has a length i. Consider the set of terms .GAMMA..sub.v. Note that .GAMMA..sub.v denotes a set of symbolic expressions constructed from symbolic variables and integers, and two operators,.sym. (sequence concatenation) and .sym.x (sequence exponentiation, which is the repeated concatenation of a sequence with itself as many times as indicated). Thus, (a.sym.b), (a.sym.b)x5, and ((.sym.b)x5).sym.((c.sym.d)x10) are examples of such symbolic expressions. Consider an Equivalence DAG. If we label the nodes (especially the leaf nodes) of the DAG with variable names, then we may interpret the "tree" rooted at any node in the Equivalence DAG as a term belonging to .GAMMA..sub.v : internal nodes obtained by breaking up a "range" into two sub-ranges are viewed as representing the operator .sym.; nodes such as u and v of FIG. 7, representing arrays, are viewed as representing the operator x. In other words, T1 .sym. T2 denotes a record-like aggregate consisting of two subcomponents T1 and T2, while T3 .sym. i denotes an array aggregate consisting of i elements where T3 denotes a typical element of the array. For example, the tree pointed to by P in FIG. 7(a) can be represented by the term u.sym.v. If we want to indicate the lengths of the nodes, we can precisely represent this tree by the term (u: 10).sym.(v: 10). For every program variable P let us use a term variable x.sub.p to denote the node in the Equivalence DAG that P points to. The whole Equivalence DAG can be thought of as a function that maps these term variables to terms belonging to .GAMMA..sub.v. (Such a function is also called a substitution.) For example, the Equivalence DAG in FIG. 7(b) represents a function {x.sub.Q.fwdarw.x.sub.R.sup.5 ;x.sub.P.fwdarw.(x.sub.R.sup.5).sym.(x.sub.R.sup.5), x.sub.R.fwdarw.x.sub.R }. Note that this gives us an alternative textual representation for Equivalence DAGs. Our algorithm for constructing the Equivalence DAG can be thought of as unifying terms belonging to a term language .GAMMA..sub.x with the following distinction: unlike in standard unification, we do not treat the operators .sym. and x in this term language as free or uninterpreted operators; instead, we are interested in unification with respect to a specific interpretation of these operators. In particular, every term is interpreted as a sequence, with .sym. interpreted as sequence concatenation and x as sequence exponentiation. The unification of two terms produces a substitution causing the two terms to become equivalent. The concept of unification is explained in detail in Knight, K. "Unification: A multidisciplinary survey," ACM Computing Surveys 21, 1 (1989), pages 93-124, herein incorporated by reference in its entirety. A complete description of our algorithm for constructing the Equivalence DAG and for performing type analysis, in SML-like pseudo-code, appears in FIGS. 11, 12, and 13. The method is based on the fast union-find data structure. We assume that an implementation of a union-fmd data structure is available with the signature shown in lines [1] to [6] of FIG. 11. The data structure may be implemented as described in Tarjan, R. E., "Data Structures and Network Algorithms," Society for Industrial and Applied Mathematics, Philadelphia, Pa., 1983, herein incorporated by reference in its entirety. The function newvar creates a new element not equivalent to any other element (i.e., belonging to an equivalence class all by itself). The function union merges two equivalence classes into one, while the function equivalent indicates if two elements are equivalent or not. In addition, every equivalence class has a value associated with it, whose type is the parameter '.alpha. of the parametrized type '.alpha. eqClass. The value associated with an equivalence class can be retrieved and modified by the two functions findval and setval respectively. The functions newvar and union also take a parameter specifying the value to be associated with the newly created/merged equivalence class. We also assume the existence of a semi-lattice L (of types) with a join operator .orgate. and a least element .perp.. This semi-lattice is determined by the type analysis problem to be solved. The type term-var, defmed in line [11], represents a set of term variables which are partitioned into a collection of equivalence classes (preferably using the union-find data structure). Every equivalence class has an associated value, which has the type term-value, defined in line [10]. For the sake of readability in the pseudo-code, we use pair(x,y) to represent the term x.sym.y and array(i,x) to represent the term xxi. Lines [12] through [16] of FIG. 11 define a function that returns the length .vertline.x.vertline. of a term x. Though .vertline.x.vertline. is presented as a function, the length information for every term may be cached rather than computed multiple times. The function new defined in lines [17] through [20] of FIG. 11 is a convenience function. It creates a new variable, in an equivalence class by itself, with a given value by calling the function newvar, unless the value specified is an array consisting of a single element, in which case it simply returns the variable corresponding to the array element. The function split, defined in lines [21] to [30], adds a breakpoint to the representation of variable x just after position n. The test in line [31 ] ensures that n has a meaningful value. The actual operation is dependent on the type of the current value of variable x. If it is an atom, then the atom is split into (a pair consisting of) two atoms of suitable lengths (line [24]). If it already a pair, then the split operation is performed on either the left or the right subterm, depending on position n (line [25]). If it is an array, then the array is broken into two arrays of suitable length, and the split operation is recursively invoked so as to perform it in the left or right subarray, as appropriate (lines [26]-[29]). The function refine, defined in lines [31]-[38], converts a term into a list of (two) smaller subterms, by adding a breakpoint if necessary. The function update, defined in lines [39]-[43], updates the value associated with a variable x such that every location contained in x has an associated type greater than or equal to the specified value c. The function unify, defined in lines [44]-[57] of FIG. 12, unifies two terms. Depending on the types of the terms, this function calls one of the functions unify-atom, or unify-list, or unify-array. The function unify-atom, defined in line [58], is used to unify two terms if at least one of them, say the first term, is an atom. In this case, the unification is done by invoking the function update on the first term to ensure that its value is greater than or equal to the constant value associated with the second (atomic) term. Then the value of the first term represents the result of the unification. The function unify-list, defined in lines [59]-[66], is used to unify two lists of terms. If the first elements of both lists have the same length, then those two terms are unified, and the remaining elements of the two lists are unified with each other recursively. If the first elements of the two lists do not have the same length, then the bigger term is broken up into two smaller terms, and the list unification is repeated recursively. The function unify-arrays, defined in lines [67]-[77], is used to unify two arrays. We determine the least-common-multiple 1 cm of the lengths of the two array elements e1 and e2. We repeatedly concatenate element e1 with itself to generate a term x1 of length 1 cm. We similarly concatenate element e2 with itself as many times as necessary to generate a term x2 of length 1 cm. We unify e1 and e2, and return an array whose element is e1. Function convert, defined in lines [79]-[99] shows how a data-reference can be represented by a term-variable after some unification operations. Function processStatement, defined in line [100], shows how every copy assignment statement in the program can be processed by converting the left and right data-references into term-variables and by unifying them. Function processConstraint, defined in line [101], shows how a type inequality constraint can be processed by invoking the update function. Function typeAnalyze, defined in lines [102]-[108], shows how type analysis is done. Given a program and a set of type inequality constraints, type analysis is performed by invoking function processStatement on every assignment statement in the program and by invoking function processConstraint for every inequality constraint. If we want to construct the Equivalence DAG without performing type analysis, we can simply omit the parts of the code related to the semi-lattice L. Atomization In this section, we explain our atomization algorithm. Consider a language defined as follows, where Atoms is some set of atomic variables:
DataRef.sub.Atomic ::= Atoms / Int.sub.+
Stmt.sub.Atomic ::= DataRef.sub.Atomic .rarw. DataRef.sub.Atomic
CompStmt.sub.Atomic ::= Stmt.sub.Atomic .vertline.
StmtSeq ( CompStmt.sub.Atomic, CompStmt.sub.Atomic)
.vertline.
IfThenElse (CompStmt.sub.Atomic, CompStmt.sub.Atomic)
This language allows for atomic variables and arrays of atomic variables, but nothing else. In particular, it does not allow any record-like aggregate that may consist of two or more atomic variables. The goal is to transform the given program into an equivalent program in this language. In particular, our method will show how every statement in the original program, possibly with references to aggregates, can be transformed into a collection of atomic statements, structured into a complex atomic statement belonging to CompSt.sub.Atomic defined by the above grammar. A complex atomic statement is either a single atomic statement, or the sequential composition StmtSeq(S1, S2) of two complex atomic statements, which are executed one after another, or is a conditional statement IfThenElse(S1, S2) which executes one of S1 and S2, depending on some unknown condition, which we omit from the representation. We will now describe our atomization algorithm. The first step is to construct the Equivalence DAG by processing every statement in the program (as explained earlier). Note that every program variable corresponds to a node in the Equivalence DAG. We construct a tree representing the atomization of a variable by "flattening" the DAG rooted at the node corresponding to that program variable. The leaves of this atomization tree are the atoms. More formally, recall that the Equivalence DAG can be interpreted as a function .phi. mapping program variable to a term in .GAMMA..sub.v. We consider every occurrence of a variable in .phi.(x) to be a distinct atom. Let .phi..sub.r be the function obtained by renaming the variable occurrences in the set of terms {.phi.(x.sub.p).vertline.P .epsilon. ProgVars} such that no variable occurs twice. For example, if ProgVars={A, B}, and .phi.={x.sub.A.fwdarw.w.sym.y.sym.w, x.sub.B.fwdarw.y.sym.z }, then a suitable renaming is .phi..sub.r ={x.sub.A.fwdarw.x.sub.1.sym.x.sub.2.sym.x.sub.3, x.sub.B.fwdarw.x.sub.4.sym.x.sub.5 }. Here, .phi. represents the Equivalence DAG and .phi..sub.r represents the set of atomization trees obtained by "flattening" the Equivalence DAG. The variables occurring in the set of terms {.phi..sub.r (x.sub.p).vertline.P .epsilon. ProgVars} (that is, the leaves of the atomization trees) give us the set of atoms. In the above example, the set of atoms is given by {x.sub.1, x.sub.2, x.sub.3, x.sub.4, x.sub.5 }. Essentially, the Equivalence DAG combines two pieces of information: the first is a forest, the Atomization Forest, consisting of a tree for every program variable; the second is an equivalence relation on the leaves of this forest, which are referred to as atoms. The mapping .phi..sub.r represents only the Atomization Forest, giving us the "aggregate structure" that was inferred for every program variable. Unlike A, it omits the type equivalence that was inferred between the "atoms". For example, consider the example in FIG. 7(b). Here, .phi.(x.sub.p) is (x.sub.R :2).sup.5.sym.(x.sub.R :2).sup.5. By renaming the variables, we get .phi..sub.r (x.sub.P) is (x.sub.1 :2).sup.5.sym.(x.sub.2 :2).sup.5. This tells us that the aggregate variable P is a record consisting of two arrays, each consisting of 5 elements of size 2 each. Given the atomization tree for every variable, the next step is to translate statements in the program to replace references to aggregate variables by references to atoms only. As an example, assume that our method identifies an atomization tree (.alpha..sub.1 :2).sym. (.alpha..sub.2 :4) for a variable X and an atomization tree (b.sub.1 :2).sym.(b.sub.2 :4) for a variable Y. Then, an assignment statement X.rarw.Y copies b.sub.1 into .alpha..sub.1, and b.sub.2 into .alpha..sub.2. Our algorithm will transform statement X.rarw.Y into the statement StmtSeq(.alpha..sub.1.rarw.b.sub.1, .alpha..sub.2.rarw.b.sub.2). The whole program is transformed by translating every statement in the program in a similar fashion. Our algorithm also replaces any reference to a variable in any other context (other than a copy assignment statement) by a reference to a set of atomic references. This is very easy to do, since we just need to use the set of atoms (leaves) of the atomization tree of the variable. The general case is somewhat more complicated than explained above, since it may involve arbitrary data references instead of simple variables. A description of our complete algorithm, in pseudo-code form, appears in FIGS. 14 and 15. We will now explain the various functions defined in these figures. The type AtomicTree defmed in lines [111] and [112] represents the atomization trees explained above. Function normalize, defined in lines [113]-[116], is used to simplify some of our auxiliary procedures. In particular, it re-associates all pair2 nodes in an atomic-tree to make pair2 right-associative; that is, it converts all subtrees of the form pair2(pair2(x1,x2),x3) into the form pair2(x1, pair2(x2,x3)). The functions tail, head, subrange, and breakup, defined in lines [117] through [135], assume that their inputs have been normalized by function normalize. Note that every atomic-tree x covers a set of locations numbered 1 through .vertline.x.vertline.. Function tail(t,u) returns the subtree oft that covers locations numbered 1 through .vertline.x.vertline.. Function head(t,u) returns the subtree oft that covers locations numbered 1 through u. Function subrange(t, u) returns the subtree of t that covers locations numbered f through u. (These functions assume that the given tree has a subtree covering the specified locations.) Function breakup(t,s) returns a set of subtrees of t, each covering s locations. In particular, it returns a subtree covering locations 1 through s, a subtree covering locations s+1 through 2s, a subtree covering locations 2s+1 through 3s, and so on. Functions atomizeVar and mkAtomicTree, defmed in lines [136]-[141], are used to construct the atomization tree for a variable by "flattening" the DAG (term) associated with that variable. This function also computes the multiplicity of every leaf (atom) of the constructed tree, for subsequent use, where we define the multiplicity .mu.(.alpha.) of an atom .alpha. to be the product of the "cardinalities" of the arrays containing the atom. A (top level) program variable corresponds to an atomization tree. In the general case, an arbitrary data reference corresponds to a set of atomic trees. Function trees of, defined in lines [143]-[145], converts a data-reference into a collection of atomic-trees that cover the data-reference. Function atomizeTree, defined in lines [146]-[148], converts an atomic-tree into a sequence of atomic-references. This is basically the sequence of leaves of the given atomic-tree. Function atomizeProgram, defined in lines [155]-[170], shows how atomization is done for a program. All assignment statements and data-references in the program are processed, as shown in lines [156]-[161], to determine how variables are used in the program. Then, the atomization can be determined for every variable as shown in lines [162]-[164]. This can then be used to convert every assignment statement in the program into a collection of assignments of atomic-references (lines [165]-[167]) and to convert every other data-reference in the program into a reference to a set of atomic-references (lines [168]-[170]). Other Applications and Embodiments The methods described earlier can be used to perform various program analyses and program slicing more efficiently as follows. We can define an equivalence relation on the statements of the program by making two statements equivalent if they refer to atomic variables that are equivalent (as indicated by the Equivalence DAG data structure). If a program analysis is to be performed to determine program properties that hold true at a particular statement S1 in a program P, we can construct a smaller program Q consisting only of statements in the original program P that are equivalent to the statement S1 and perform the desired program analysis on the smaller program Q. Similarly, if we want to compute a slice with respect to some statement S1 in a program P, we can construct a smaller program Q consisting only of statements in the original program P that are equivalent to the statement S1 and perform the slicing operation on the smaller program Q. Though our description of the preferred embodiment of our invention uses the union-find data structure, it is also possible to implement our ideas without using the union-fmd data structure. In we do not place nodes in the Equivalence DAG into equivalence classes, the resulting data structure will be a forest, which corresponds to the Atomization Forest described earlier. In this case, processing the statements in the program, as outlined earlier, will potentially modify this forest. It is necessary to repeat the processing of the statements in the program until the processing causes no more modifications to the Atomization Forest. While the invention has been described in connection with specific embodiments, it will be understood that those with skill in the art may develop variations of the disclosed embodiments without departing from the spirit and scope of the following claims.
|
Same subclass Same class Consider this |
||||||||||
