Including graph or tree representation (e.g., abstract syntax tree or AST)

Method and apparatus for analyzing computer code using weakest precondition

6029002

Abstract

An analyzer for maintaining and analyzing source code is disclosed. The analyzer includes a software translator for converting conventional source code into an intermediate language, slicing capability based upon weakest precondition determination, dual direction flow analysis and incorporation of a computational model to facilitate iterative code.


Claims

What is claimed is:

1. A digital computer system comprising:

translating means for translating a source code file into an intermediate language (IL) file, wherein said IL file includes a plurality of IL statements;

a memory unit, coupled to said translating means, for holding said source code file and IL file, each of said plurality of IL statements being maintained in a data structure providing a connectivity to neighboring IL statements;

a user interface, coupled to said translating means, for selecting a starting point at a first IL statement and an ending point at a second IL statement thereby defining a portion of said IL file; and

analysis means, coupled to said user interface, for analyzing said portion of said IL file by traversing said portion in forward executable order pursuant to said connectivity to identify a plurality of paths, and by subsequently traversing said plurality of paths in reverse executable order pursuant to said connectivity to generate a plurality of weakest preconditions for said plurality of paths, wherein said plurality of weakest preconditions are combined to form a single weakest precondition of said portion.

2. The digital computer system of claim 1 further comprising:

means for providing a slicing predicate comprising one or more variables; and

means for slicing said plurality of IL statements to identify and remove a subset of one or more statements having no effect on said slicing predicate as determined from one or more weakest preconditions associated with said subset.

3. The digital computer system of claim 2 further comprising a means for simplifying said weakest precondition of said portion.

4. The digital computer system of claim 3 wherein said each of said IL statements includes a first data field containing a line number from said source code file and a second data field containing a source block number.

5. The digital computer system of claim 2 wherein said means for slicing removes all IL statements disposed between a third IL statement and a fourth IL statement when weakest preconditions of said third and fourth statements are the same.

6. The digital computer system of claim 2 wherein said means for slicing removes a particular IL statement having an associated weakest precondition and an associated postcondition that are the same.

7. The digital computer system of claim 1 wherein said translating means includes a filter for canceling select lines of code from the source code file prior to being translated into said IL file.

8. The digital computer system of claim 1 further comprising a format generator, said generator being capable of transforming a segment of said IL file into a Unity computational model.

9. The digital computer system of claim 1 further comprising a monitor for displaying code undergoing analysis, said monitor including a first display window for displaying a select portion of source code and a second display window for displaying select IL statements corresponding to said select portion of source code.

10. In an appropriately programmed digital computer, a method of analyzing source code comprising the steps of:

inputting a source code file into said computer, said source code file containing a plurality of source statements;

translating said plurality of source statements into a plurality of intermediate-language (IL) statements disposed within an IL file, said IL file being held in memory in said computer;

constructing a plurality of data structures from said plurality of IL statements, each data structure containing at least one of said IL statements and providing connectivity to neighboring IL statements;

selecting a starting point and an ending point within said IL file to identify a set of select IL statements undergoing analysis;

providing a slicing predicate comprising one or more variables;

determining a weakest precondition of each of said select IL statements while traversing said select IL statements in reverse executable order pursuant to said connectivity; and

slicing said set of select IL statements to identify and remove a subset of one or more statements having no effect on said slicing predicate as determined from one or more weakest preconditions associated with said subset.

11. The method of claim 10 wherein said slicing step comprises the step of slicing all select IL statements disposed between a first select IL statement having a first weakest precondition and a second select IL statement having a second weakest precondition when said first and second weakest preconditions are equal.

12. The method of claim 10 wherein said slicing step comprises the step of slicing a select IL statement having an associated weakest precondition and an associated postcondition that are equal.

13. The method of claim 10 further comprising the steps of:

selecting a direction to traverse said sliced set of IL statements; and

performing a forward analysis of said sliced set of IL statements when the direction is forward and performing a backward analysis when the direction is backward.

14. The method of claim 13 wherein said backward analysis comprises the step of substituting a select IL statement with a skip statement when said select IL statement is an alternation statement and fails to assign a new value to a variable contained in an associated postcondition.

15. The method of claim 14 wherein said backward analysis further comprises the steps of:

supplying an invariance for evaluating an iterating select IL statement;

simplifying a predicate generated from weakest precondition analysis, wherein said invariance functions as a postcondition for said analysis; and

determining whether said invariance is true.

16. The method of claim 15 wherein the step of simplifying said predicate comprises accessing a library of lemmas.

17. The method of claim 14 further comprising the steps of:

isolating an iterating segment of code; and

converting said iterating segment into a Unity computational model.

18. The method of claim 17 wherein said forward analysis comprises the step of identifying all statements on an execution path disposed within said sliced set of IL statements.

19. The method of claim 10 further comprising the step of displaying concurrently on a computer monitor screen a portion of said plurality of source statements in a first window, a portion of said plurality of IL statements in a second, window and a portion of a flow graph corresponding to one or more statements displayed in said first and second windows.

20. In an appropriately programmed digital computer, a method for analyzing source code comprising the steps of:

inputting a source code file into said computer, said source code file containing a plurality of source statements;

translating said plurality of source statements into a plurality of intermediate-language (IL) statements disposed within an IL file, said IL file being held in memory in said computer;

constructing a plurality of data structures from said plurality of IL statements, each data structure containing at least one of said IL statements and providing connectivity to neighboring IL statements;

partitioning said plurality of IL statements into a plurality of segments, wherein each segment includes at least one IL statement;

traversing said plurality of IL statements in reverse executable order pursuant to said connectivity; and

generating a plurality of weakest preconditions for said plurality of segments, wherein said plurality of weakest preconditions may be logically combined to form a single weakest precondition representing multiple segments.

21. The method of claim 20 further comprising the step of equating a source-code variable with a free variable; and wherein said step of generating a plurality of weakest preconditions is based upon an initial postcondition of said source-code variable being equated with said free variable, said generating step thereby extracting information from each of said IL statements to collectively form the meaning of said source code variable.

22. The method of claim 20 further comprising the step of providing a defect predicate that represents an observed defect in said source code file; and wherein said step of generating a plurality of weakest preconditions is based upon an initial postcondition of said defect predicate, said generating step thereby providing information related to said observed defect.


Description

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

PAPER APPENDIX

A Paper Appendix of a first embodiment of the invention in the form of a source code listing comprising 275 sheets is included in the provisional application identified below. A paper Appendix of a portion of a second embodiment of the invention in the form of a source code listing comprising 254 sheets is included herewith as Appendix A. Specifications for external and internal "intermediate languages" are included herewith as Appendices B and C, respectively. A specification for weakest precondition calculation in the second embodiment of the invention is included herewith as Appendix D.

RELATED ARTICLES

Material related to the following discussion is contained in Hart, J. M., "Experience with logical code analysis in software reuse and reengineering," in AIAA Computing in Aerospace 10, pp. 549-558 (San Antonio, Tex.) (Mar. 28-30, 1995) and Pizzarello, A., "Formal Techniques For Understanding Programs," (Originally presented at the Eighth International Software Quality Week, San Francisco, Calif., May 30-Jun. 2, 1995), both of which are hereby incorporated by reference in their entirety for all purposes.

BACKGROUND OF THE INVENTION

This application claims priority from a commonly-owned, co-pending provisional application Ser. No. 60/002,134 filed Oct. 31, 1995, entitled "Code Analyzer" (attorney docket no. 1052-30-137).

This invention relates to software maintenance.

More particularly, the invention relates to the systematic identification and verification of software characteristics.

The use of formal methods (i.e., software development techniques emphasizing the logical properties of a program rather than its operational behavior) for supporting the design of new software is recognized as desirable by some segments of the software development community. See, D. Craigen, et al., Formal Methods Reality Check: Industrial Usage, IEEE Transactions on Software Engineering (February 1995), and J. P. Bowen, et al., Seven More Myths of Formal Methods, IEEE Software (July 1995). This is particularly true where stringent demands for reliability and/or safety encourage the investment of an additional verification effort. See, G. Bernst, et al., Experience with Black-Box Testing from Formal Specifications, AQuIS93, Venice, Italy (1993). The suggested advantages of the use of formal methods in designing new software are, however, disputed by some as being mostly anecdotal and not supported by precise measurements. See, N. Fenton, How Effective are Software Engineering Methods?, AQuIS93, Venice, Italy (1993). Furthermore, the use of formal methods for supporting new developments does not take into consideration a very large portion of existing software work, i.e., software maintenance.

The use of formal methods in software maintenance has been described in recent publications, i.e., M. P. Ward, Abstracting a Specification from Code, Journal of Software Maintenance, Research and Practice, Vol. 5, p. 101-122 and M. P. Ward, et al., Formal Methods for Legacy Systems, Journal of Software Maintenance, Research and Practice, Vol. 7, p. 203-219 (1995). However, the emphasis to date has been in creating formal specifications that comprehensively describe the meaning of a particular piece of software. Once the full meaning of the software is obtained, the code will be restructured, thereby facilitating any additional maintenance work.

Such comprehensive treatment of software for purposes of maintenance is undesirable and at times simply not feasible. This is especially true when the software to be maintained consists of thousands of lines of code and a relatively narrow problem has been identified as requiring attention. To be forced to analyze an entire piece of software for the purpose of correcting only a small portion of code is inherently inefficient and expensive.

In response to this problem, techniques have been developed to "slice" large pieces of software into smaller units pursuant to a data flow technique. This method associates portions of a piece of software through data dependency. Accordingly, while data-related portions of code may be identified and collected, such association tends to be overinclusive; identifying portions of code that may reference a variable but do not affect the value of this variable.

Finally, software maintenance cannot be limited exclusively to sequentially executed programs. In practice, many sequential programs interact with each other via permanent files (databases) or in some cases through the sharing of storage (such as is the case for Cobol programs using the linkage section features). These types of non-sequential interactions are often the cause of many problems, including data contamination and deadlocks.

Accordingly, a new method or tool for analyzing software is required that avoids the inefficiencies and limitations described above.

SUMMARY OF THE INVENTION

The present invention provides apparatus and method for maintaining and analyzing source code. The method and apparatus are not limited to any particular type of source code and offer a robust array of features for enabling efficient and effective code analysis.

According to one aspect of the invention, a method of analyzing source code includes the steps of inputting a source code file into a computer; translating the source code file into an intermediate language (IL) file; selecting a starting point at a first IL statement, wherein a second IL statement is disposed after the first IL statement; selecting a direction to traverse the first and second IL statements; and traversing the first and second IL statements in the selected direction. According to another aspect, the method includes a step of slicing the IL file based upon a weakest precondition criteria. According to yet another aspect, the method includes the step of converting portions of the IL file into a UNITY computational model to facilitate handling of iterating code.

A further understanding of the nature and advantages of the invention may be realized by reference to the remaining portions of the specification and the drawings. In the drawings, like reference numerals indicate identical or functionally similar elements.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the computer architecture used to support a preferred embodiment of the present invention,

FIGS. 2 and 4 are functional block diagrams of code analyzers employing embodiments of the present invention;

FIGS. 3 and 5a are functional block diagrams of translators used in the analyzers of FIGS. 2 and 4, respectively;

FIG. 5b is a schematic drawing of an Abstract Syntax Tree;

FIGS. 6-7a, 7b are flow charts illustrating application of the UNITY computational model;

FIGS. 8a-8c illustrate conversion of a program into the UNITY model;

FIG. 9 is a schematic of a dependency graph;

FIG. 10 is a flow chart illustrating the segmentation of iterating code;

FIG. 11 is a flow chart for analysis of source code by the analyzer of FIG. 2;

FIGS. 12a-12d illustrate linked list data structures;

FIG. 13a is a flow chart of forward analysis performed by the analyzer of FIG. 2;

FIGS. 13b-13i illustrate exemplary source code and forward analysis output;

FIGS. 14a and 14b are flow charts of weakest precondition (wp) computation carried out in backward analysis by the analyzer of FIG. 2;

FIG. 14c illustrates an exemplary statement;

FIG. 15 is a flow chart of wp computation for various statement types by the analyzer of FIG. 2;

FIG. 16 is a flow chart for invariance verification;

FIGS. 17a-17g illustrate predicate simplification by the analyzer of FIG. 2;

FIGS. 18a-18b are flow charts of analysis of source code by the analyzer of FIG. 4;

FIGS. 19a and 19b illustrate a basic block data structure;

FIGS. 20a-20c provide a source code function displayed in C programming language, IL and flow graph configuration;

FIG. 21 illustrates a computer display holding the information in FIGS. 20a-20c;

FIGS. 22a-24b are flow charts for semantic slicing;

FIGS. 25 and 26 are block diagrams of program variable lists;

FIG. 27 is a flow chart of textual slicing;

FIGS. 28a-28b are flow charts of slicing based upon transitive closure;

FIGS. 29a-29b are flow charts of forward analysis performed by the analyzer of FIG. 4;

FIG. 30 is a flow chart of wp computation carried out in backward analysis by the analyzer of FIG. 4;

FIG. 31 is a flow chart of wp computation for various statement types by the analyzer of FIG. 4;

FIGS. 32a-32c are tree data structures illustrating wp calculation;

FIGS. 33a-33b illustrate predicate simplification by the analyzer of FIG. 4;

FIGS. 34a-34d illustrate tree structures undergoing simplification;

FIG. 35 is a flow chart for identification of variables in statements;

FIG. 36 is a block diagram of a predicate calculus engine;

FIG. 37 is a flow chart for simplification operation using a library of lemmas;

FIG. 38 is a block diagram illustrating two-program access of a single file;

FIG. 39 is a block diagram of I/O and JCL analyzers;

FIGS. 40a-40i illustrate application of the UNITY model for verification of invariance for repeatedly executed code;

FIGS. 41a-41b illustrate verification of invariance for the analyzer of FIG. 4;

FIGS. 42a-42c illustrate exemplary backward analysis output;

FIGS. 43a-43e illustrate exemplary predicate calculus rules.

DESCRIPTION OF SPECIFIC EMBODIMENTS

Contents

I. APPARATUS

II. OPERATION

A. System Overview

B. Translation of Source Code

C. UNITY

D. Analysis Sequence: Link Lists

1. Statement Link List

2. Flow Analysis

3. wp computation

4. Simplification

E. Analysis Sequence: Flow Graph

1. Flow Graph Generation

2. Slicing

3. Flow Analysis

4. Simplification

5. WP Calculation Enhancements

F. Library of Lemmas

G. Logical Analysis of Code

1. Analysis of Non-Iterating Programs

2. Verification of Invariance for Repeatedly Executing Code

3. Verification of Invariants for overlapping Code Execution (multitasking)

H. I/O Commands Analysis and JCL Analyzer

I. APPARATUS

The code analyzer described herein is implemented in the "C" programming language and is operational on personal computers running Microsoft Windows 3.1 (or greater) operating system and UNIX.TM.-based engineering work stations such as SUN4, IBM RS6000, and HP 700 series. The code analyzer is designed to be system independent and requires minimal change to rehost on additional work station platforms. It will be apparent to those of skill in the art that the analyzer and method described herein could be readily applied in any one of a variety of programming languages and on any one of a variety of work stations without departing from the scope of the invention. The above-described language and hardware are merely illustrative.

A block diagram of a basic work station capable of supporting a preferred embodiment of the present invention is shown in FIG. 1. A user interface 130 includes a keyboard and/or mouse for inputting various instructions and data to a central processing unit ("CPU") 110. A disk memory 150 holds a variety of files (i.e., source language file, intermediate language file, and SYM file, each of which is described below), as well as code-analyzer executable binary code. A computer monitor 120 provides a means for displaying output data. A system memory 140 is fast access memory used by the CPU during processing. CPU 110 interconnects monitor 120, disk memory 150, system memory 140, and user interface 130. During start up, CPU 110 Accesses disk 150 and loads necessary files in code to system memory 140 for access during analysis. Processed data may be output to disk memory 150 and/or monitor 120.

II. OPERATION

A. System Overview

FIGS. 2 and 4 disclose two embodiments of the present invention. The embodiment in FIG. 2 has been implemented in the "C" programming language while the embodiment in FIG. 4 is currently under development. Only translator 410 of the embodiment in FIG. 4 has been reduced to source code to date.

Referring to FIG. 2, a code analyzer 200 includes a source language file 210 which is input to a translator 220. The translator converts the source language file into an intermediate language ("IL") file 230 and generates a .SYM file 240 containing variable declarations within a conventional symbol table. The intermediate language used to create IL file 230 is based upon Dijkstra's Guarded Commands ("DGC"), which typically recognize only integer and boolean variables. However, in this embodiment, the DGC-based IL file 230 is capable of handling a variety of variable types through the use of SYM file 240. (DGC is described in greater detail in E. W. Dijkstra, A Discipline of Programming (Prentice Hall 1976) (hereinafter "Dijkstra"), and E. Cohen, Programming in the 1990's (Springer-Verlag 1990), both of which are hereby incorporated by reference in their entirety for all purposes.)

After translation, IL file 230 and .SYM file 240 are forwarded to code analysis engine 250, which is controlled by commands entered through a user interface 130. As shown in FIG. 2, user interface 130 is coupled to code analysis engine 250 through a C Language Integrated Production System ("CLIPS") 260. CLIPS is an expert system tool originally developed by the software technology branch at NASA's Lyndon B. Johnson Space Center and available from COSMIC at The University of Georgia, 382 E. Broad Street, Athens, Ga. 30602-4272. A detailed description of CLIPS and its operation within analyzer 200 is provided in commonly owned co-pending patent application Ser. No. 08/550,869 (attorney docket no. 5773-3 (P1)) and entitled Process and Tool for Scalable Automated Data Field Replacement, which is hereby incorporated by reference in its entirety for all purposes.

Prior to forwarding IL file 230 to engine 250, an operator may manually format certain portions of the translated source language pursuant to the "UNITY computational model." As a result of such formatting, code analysis engine 250 need not make any distinction between sequential code and (1) iterating code in a sequential program, or (2) the repeated nondeterministic execution of interacting sequential programs in the system. Application of the UNITY computational model is discussed below.

Code analysis engine 250 processes the translated code pursuant to a forward analysis and/or a backward analysis. In the forward analysis, engine 250 follows the execution path of a designated code fragment--tracing data flow from a selected starting point to ending point. The collected information may be displayed to the operator via monitor 120 or simply stored in memory 150 (FIG. 1). The operator may extract different portions of the stored code, including all statements disposed on an execution path between the starting and ending points, statements related to specific variables or conditions necessary to reach an ending point.

The general purpose of forward analysis is to help the operator obtain an informal understanding of the source code and thereby facilitate a more intelligent backward analysis.

Backward analysis is the process of determining the least restrictive condition that must hold before the execution of a program to guarantee its termination in a state identified by an operator; i.e., the determination of a "weakest precondition" based upon a particular "postcondition."

If the postcondition is always satisfied regardless of the precondition, the precondition is considered "TRUE." Alternatively, if the postcondition is never satisfied by any precondition, the precondition is considered "FALSE." If the precondition is neither TRUE nor FALSE, then it is a specific numerical and/or logical relationship determined by engine 250. In each case, the weakest precondition ("wp") value is output to the operator via monitor 120. As discussed below, this value may be used by the operator in analyzing a program to determine (1) why a defect occurs or (2) what a program does.

To perform backward analysis more efficiently, engine 250 may slice the subject code, as discussed below, to reduce the number of statements under consideration. Such slicing may be performed using conventional criteria such as textual matching and transitive closure.

FIG. 4 illustrates a second embodiment of the present invention in the form of a code analyzer 400, which is currently under development. As shown in FIG. 4, code analyzer 400 includes source language file 210 which is fed into a translator 410. Translator 410 outputs an IL file 420 containing variable declarations within a symbol table and source code that has been converted into a DGC-based language. Like analyzer 200, the DGC-based language in analyzer 400 is capable of handling a number of variable types.

Code analyzer 400 includes an automated UNITY format generator 430 which formats certain portions of a translated source language in accordance with the manual operation mentioned above. After translating and formatting, IL file 420 is forwarded to a code analysis engine 440.

Engine 440 is controlled by commands entered through user interface 130 and a conventional graphical user interface 450. Interface 450 interacts with engine 440 through the use of a "tree-writing" language known as "Sorcerer," derived from the Purdue Compiler Construction Tool Set (PCCTS). (As discussed below, analyzer 400 utilizes data structures known as "Abstract Syntax Trees" (AST) and therefore requires a language capable of manipulating such structures.) Sorcerer has been described in published articles. Source code, examples and a draft of an upcoming book on PCCTS may be found at ftp://ftp.parr-research.com/pub/pccts. Sorcerer is Free (i.e., available in the public domain) and supported by the original authors.

Like analyzer 200, analyzer 400 performs forward and backward analysis on designated portions of source code, and may slice code based upon textual matching and transitive closure. However, engine 440 may also slice code using weakest precondition analysis as a criteria. Such slicing, called "semantic slicing," enables engine 440 to more specifically identify the code statements of interest for a specific set of variables--resulting in a more efficient forward and backward analysis. The details of this operation are provided below.

B. Translation of Source Code

1. Translator 220

Referring to FIG. 3, translator 220 includes a filter 320 coupled to a scanner 330 which is, in turn, coupled to a parser and code generator 340. As shown in FIGS. 2 and 3, the output of translator 220 is intermediate language file 230 and SYM file 240.

In operation, source language file 210 is loaded into filter 320. In response to instructions received from a human operator of code analyzer 200 (i.e., the "code-analyzer operator" or the "operator"), filter 320 cancels select lines of code from the loaded source language file through textual matching (an operation analogous to word searching in a word processor program). In brief, filter 320 textually searches file 210 with the terms provided by the operator and discards any matching lines of code. For example, file 210 typically includes lines with directives or commands specific for the machine in which it was compiled (i.e., compiler or operating system directives). Filter 320 is used to remove these lines. The output of filter 320 is translatable (but not necessarily compilable and executable) source code that will be translated into intermediate language.

Scanner 330 may be any commercially available scanner or lexical analyzer such as "FLEX" (available from the Free Software Foundation, Cambridge, Massachusetts) and LEX (available from AT&T). Referring to FIG. 3, the scanner is "set-up" for a particular language to be translated (i.e., cobol C, C++, PL/I, Pascal, Fortran, etc.) by state definition rules 350. These are "regular expressions" that may be named for subsequent use. An exemplary set of rules 350 for use with FLEX in translating Cobol is provided in file "lexer.l", which is in the Appendix of the aforementioned provisional application from which priority is claimed (the "Provisional Appendix"). The scanner also must receive token definitions 360. These definitions are a way to define key words that parser 340 must understand for later use. An exemplary set of token definitions is provided in file "verb.sub.-- table.c," which is also in the Provisional Appendix. The use of state definition rules and token definitions in connection with a scanner like FLEX is well known to those having ordinary skill in the art.

Parser and code generator 340 ("parser/generator 340") may also be any commercially available parser such as LEX and YACC (which are part of the UNIX operating system package available from a variety of developers such as AT&T, Sun Microsystems, Inc., International Business Machines, and Novelle), BISON (available from the Free Software Foundation, Cambridge, Mass.) and the ELI compiler construction system, described in Gray, et al., "ELI: A Complete Flexible Compiler Construction System," Communications of the ACM 35, pp. 121-131 (February 1992). Translator 220 has run successfully using BISON as parser/generator 340.

Parser/generator 340 receives the token definitions from scanner 330 and generates IL file 230 and a symbol table in SYM file 240. Such construction requires the syntax of the source language held in source language file 210. The parser is set up by introducing the Backus-Naur Form ("BNF") expressions for the language syntax. (BNF is a universally accepted notation for describing language syntax well-known to those having ordinary skill in the art.) An exemplary set of BNF expressions is provided in file "parser.y," included in the Provisional Appendix.

An exemplary symbol table held in SYM file 240 for a Cobol implementation is described in the co-pending application identified above. An exemplary table is also provided in file "symbol table," included in the Provisional Appendix. The generation of a symbol table is a normal compiler operation well known to those having ordinary skill in the art.

In addition to generating a symbol table, parser/generator 340 completes the translation of source language file 210 into IL file 230. Referring to FIG. 3, such translation is directed by semantic rules 380 which are provided to parser and code generator 340. These rules are written in the "C" programming language and are specific to the type of source code held in source language file 210 and the desired object language (i.e., the intermediate language used by the code analyzer of the present invention). An exemplary set of semantic rules for cobol source code using BISON as parser/generator 340 is provided in file "parser.y," which may be found in the Provisional Appendix.

Semantic rules 380 may use functions that are defined only once. In this manner, the amount of work required for parser and code generator setup is reduced since the functions are used for many different applications of translation. The use of BNF and semantic rules for a commercially available parser and code generator is well known to those having ordinary skill in the art.

The intermediate language in IL file 230 is essentially an implementation of DGC, with sufficient additional features to easily permit the representation of real programs implemented in real languages. In the present embodiment, IL statements include four fields separated by white space. A typical IL statement is provided in equation (1) below. The three fields present in equation (1) and a complete list of IL statement types for analyzer 200 are described in Tables 1 and 2, respectively.

23 0.0.1.2 assign x=y+3 (1)

                  TABLE 1
    ______________________________________
    Fields of Typical IL Statement
    ______________________________________
    23 0.0.1.2 assign x = y + 3 (1)
    23:       Line number within source file 210.
    0.0.1.2:  The source block. A block identifier uniquely
              identifying a block of code in the source file defining
              a procedure, such that "if" and "do" statements
              define blocks within which there may be sub-blocks.
              In analyzer 200 this number identifies a node con-
              taining a single IL statement in the form of a text
              string. In analyzer 400, this number identifies a
              basic block containing at least one IL statement in the
              form of a tree data structure.
    assign x = Y + 3
              Instruction corresponding to the source code in the
              form of modified or extended DGC.
    ______________________________________


Referring to "source block" in Table 1, block numbers are sequences of decimal numbers separated by periods. The block numbers of two different source code blocks are always different (that is, block numbers must be unique). Block numbers which represent sub-blocks of a given block have block numbers with the same leading fields as the parent block. Blocks which are siblings differ only in the last field. This permits block parent, child and sibling relationships to be quickly tested using string compare operations. Blocks are also used to delimit variable scopes (see, Appendix B).

New block numbers are created at "if" statements, "do" statements, source language blocks and procedures. "if" statements and "do" statements always create new blocks with unique numbers (within the parent block). Specifically, if the current block level is "0.1.2" and an "if" is encountered, the condition and the code in the "then" clause will get a block number of "0.1.2.n+1", where n is a previously unused decimal number at that level.

Procedures create new blocks. A "." is appended followed by a previously unused decimal number at that level. Unnamed blocks within the code (such as C's `{. . . }`) are equivalent to blocks and can be implemented as a pair of proc statements.

                  TABLE 2
    ______________________________________
    IL Statement Types
    ______________________________________
    assign: basic assignment (DGC based)
    if:     corresponding to conventional if-then-else
    do:     standard iteration
    call:   standard call
    goto:   standard goto
    return: classic return
    LABEL:  defines a label in the intermediate language
    L-entry:
            to introduce labeled procedure entries
    L-procedure:
            to introduce labeled procedures
    file:   to introduce new source files analyzed
    null:   nop (no operation)
    proc:   general purpose no-op
    setRet  mark the end of a paragraph as the end of a perform state-
            ment
    SetNull:
            clear the end of a paragraph
    ______________________________________


The IL described above is processed within engine 250 as a text string contained within a statement link list, as described below. The source block number identified in Table 1 identifies a specific node for a specific IL statement (e.g., node 1200 in FIG. 7).

2. Translator 410

Referring to FIG. 5a, translator 410 includes the same functional units as translator 220, i.e., filter 320, scanner 330 and parser/code generator 340. Accordingly, the foregoing discussion applies equally to these components as used within analyzer 400. However, in this embodiment, parser/code generator 340 outputs a single file 420 containing translated source code and a symbol table. An exemplary set of rules and definitions necessary to carry out this operation for a "C" source language on a scanner and parser/code generator (i.e., Flex and BISON, respectively, as described above) are provided in the files listed in Table 3 below, and contained in Appendix A attached hereto.

                  TABLE 3
    ______________________________________
    Exemplary rules and definitions for translator 410
    File                 Description
    ______________________________________
    parser.y             State definition rule 520
    scanner.1            Token definitions 530
    parcer.y             BNF 540
    fns.sub.-- : a2d.c, e2h.c, i2l.c, m2p.c, q2t.c, u2z.c
                         Semantic Rules 550
    Same as Semantic Rules
                         Symbol table
    ______________________________________


The foregoing rules and definitions generate IL statements that correspond with Tables 1 and 2 above. Further, translator 410 also generates statement types identified in Table 4 below.

                  TABLE 4
    ______________________________________
    Additional IL Statement Types
    ______________________________________
    var:      to declare variables
    type:     to specify type information
    member:   to declare a member of a structured type or field
    Version:  provides information about version of language
    ______________________________________


Moreover, the IL statement produced by translator 410 includes an extra field to identify the source file from which it comes. This field is illustrated in equation (2) below and Table 5.

                  TABLE 5
    ______________________________________
    Extra IL Statement Field
    ______________________________________
    0 23 0.0.1.2 assign x = y + 3 (2)
    0:     The zero based file number of source file 210 (FIGS. 2-4)
           The remaining fields are described in Table 1.
    ______________________________________


The symbol table contained in file 420 is encoded in the var, member and type statements identified above.

The IL described above (i.e., in Tables 1, 2, 4 and 5) is more accurately referred to as "external IL" since the operator can access this during code analysis. A specification setting out the parameters of the external IL language upon which files 230 and 420 are based is attached as Appendix B hereto (i.e., "Specification of PIL" ). As Tables 2 and 4 make clear, the IL of analyzer 200 is essentially a subset of the IL for analyzer 400.

Translator 410 also generates a second representation of IL to facilitate processing and analysis of source code by engine 440. This second type of IL, which is also included in IL file 420, is called "internal IL". Internal IL is represented in an Abstract Syntax Tree (AST) data structure. This structure is invisible to the user and is accessed by engine 440 to carry out its operations. The AST structure is described in A. V. Aho, et al., Compilers: Principles, Techniques, and Tools, Addison-Wesley (1986) (hereinafter "Aho"), which is hereby incorporated by reference in its entirety for all purposes. A graphical representation of an AST 500 for the instruction in equation (1) is provided in FIG. 5b. Further, a specification setting out the parameters of the internal IL is attached as Appendix B hereto (i.e., "Internal PIL and Symbol Table Design").

C. UNITY

After IL conversion, the operator in analyzer 200 (via interface 130) or the UNITY format generator 430 in analyzer 400 (FIGS. 2 and 4) may organize the resulting IL file in accordance with the UNITY computational model. In analyzer 200, this operation is performed manually. Conversely, in analyzer 400, it is anticipated that this operation will be automated. The UNITY model intended to be used in these embodiments is described in J. Misra, A Discipline of Multi-Programming, University of Texas Austin Report (1994), which is hereby incorporated by reference in its entirety for all purposes. A copy of this report is available from the University of Texas upon request.

Code analyzers 200 and 400 may employ the UNITY computational model to represent any program that is analyzed. Through UNITY, no distinction need be made between sequential code and (1) iterating code in a sequential program or (2) the repeated non-deterministic execution of interacting sequential programs in a system. The same computational model and formalism may be used for analyzing parallel programs, multitasking and other forms of concurrence.

With the incorporation of the UNITY model, translation of source code by analyzers 200 and 400 requires two general steps:

(1) The entire source code is translated into sequences of assignments and alternation statements (i.e., intermediate language). Iterative statements (do while's and any other code that can be iterated) are isolated and presented as separate sequences.

(2) These isolated sequences are organized as non-deterministically executing portions of a UNITY program. The execution of the sequences, however, depends on some conditions for their execution. Some of these conditions may contain artificially created variables that have the sole purpose of forcing the original sequence of execution of the source program (e.g., the variable "c" in the example of Table 7 below).

These sequences represent a natural form programmers use for an informal understanding of the code. They essentially are ordered in the same way programmers mentally organize program statements when reading code with the purpose of obtaining information about its meaning.

Specifically, the UNITY model places iterating code in a more accommodating format for code analysis. Some code that iterates is contained within a program that executes sequentially. The iterating portion of this program (e.g., a loop) must ultimately end; therefore, this type of code is called "terminating." The entire program, in turn, may be iterating in the sense that it can be used repeatedly against some permanent data (i.e., information held in a data base). The iterations of the entire program can be any number (i.e., there is no coded mechanism to terminate the repeated execution of a program), therefore, this type of code is called "non-terminating."

It is anticipated the UNITY model will be applied to a subset of all iterating code. Specifically, UNITY will be applied when iterations are caused by a set of cooperating processes executing concurrently (e.g., repeated and simultaneous access to a common database) or by a program whose global framework is based upon one or more large loops. In the context of the UNITY model, small, deterministic loops presented in a sequential program are typically of simple construction and therefore can be manually substituted with specification (i.e., preconditions and postconditions) that define their meaning without being separated from the main program (i.e., the sequential code).

The study of iterating code, both terminating and non-terminating, involves the use of "invariants," as discussed below. The partitioning of sequential code into the UNITY computational model is one way to permit the verification of invariants for the entire program or its relevant portions (i.e., a loop body) using only "weakest precondition calculation" and expression simplification, as discussed below.

The UNITY computational model may be generally represented in the expression of Table 6 below.

                  TABLE 6
    ______________________________________
    UNITY Computational Model
    ______________________________________
            initially x, y, z, ... = e.sub.1, e.sub.2, e.sub.3, ...
            <B1 --> P1 [ ] B2 --> P2 [ ] ...>
    ______________________________________


Referring to Table 6, x, y, z, etc. are program variables that assume the initial values of the expressions e.sub.1, e.sub.2, e.sub.3, etc. Further, P1, P2, etc. are sequential programs executing non-deterministically and repeatedly when the leading conditions (i.e., "guards") B1, B2, etc. hold (any of B1, B2, etc. may be the constant predicate true: in that case the corresponding program would execute at any time). A fat bar ("[]") separates each B-P pair. The UNITY model also requires that the non-deterministic execution is chosen fairly (this can be assumed in all practical cases; i.e., there exists no biasing for any particular P) and that P1, P2, etc. execute as a whole piece of non-interrupted code.

The UNITY logic supplies effective ways to specify and prove properties, and is a convenient way to present code to an operator for informal understanding. Therefore, it is useful to transform sequential programs that have significant portions of their code that can be iterated into a UNITY form. Such iterated code, be it a large loop body in a sequential program or a set of cooperating processes that can even execute concurrently, can be treated exactly in the same way as sequential code after transformation into the UNITY model.

The transformation of sequential code into UNITY requires (in general) the use of auxiliary variables for forcing the sequential execution of these code portions that are represented as non-deterministically executed sequences. Exemplary cases are provided below. These cases represent most of the situations found in practice. The approach can easily be generalized and extended to other more unusual situations.

FIG. 6 provides a graphical representation of a single "do while" loop, which in UNITY form is represented as equation (3). When B holds, the UNITY program reaches a Fixed Point (FP) that corresponds to the loop termination. (A FP is a state where the execution of any of the sequences P.sub.i in the program is either impossible, or does not produce any state change.)

<B.fwdarw.S> (3)

FIG. 7 provides a graphical representation of a single "do until" loop, which in UNITY form is shown in Table 7. In this table, "c" is an auxiliary variable (set initially to zero) which forces the execution sequence S; B.fwdarw.S; etc. The program reaches an FP when B holds.

                  TABLE 7
    ______________________________________
    "do until" Loop
    ______________________________________
    initially c=0
    <c=0 --> S; c:=1 [ ] c=1 B --> S>
    ______________________________________


FIG. 8 provides a graphical representation of a "nested do while" loop, which in UNITY form is shown in Table 8. The two loops identified in Table 8 terminate at an FP when c=0 B holds.

                  TABLE 8
    ______________________________________
    "nested do while" Loop
    ______________________________________
    initially c=0
    <c=0 B1-->S1; c:=1[ ]
    c=1 B2-->S2[ ]
    c=1  B2-->S3; c:=0>
    ______________________________________


The mechanics for converting sequential programs into UNITY form, such as the foregoing, is relatively straight forward. The starting point is the sequential program in BLOCK form. Each block is a sequence of assign, alternation, loop, abort, and skip statements. Here are the block rules; the blocks should be created as a result of the flow analysis:

Blocks cannot be contained within blocks;

A GOTO, if any, must occur at the end of a block (this defines a block boundary, if fact);

Any label that is the target of a GOTO must be on the first statement in a block (defining the other block boundary). This is because the other basic rule for a block is that the only entrance is at the beginning;

The start of a loop body is a block;

The start of each alternation is a block;

The blocks are numbered sequentially from 1 to N as they occur in the code text.

Converting a sequential program in block form into UNITY involves the following steps:

1. Create a global variable called BlockNumber, which is initialized to 1;

2. The last statement of each block must set the next block number;

For a GOTO, set BlockNumber to correspond to the label that was the target of the GOTO (the GOTO does not appear in the UNITY program);

If the block appears at the end of a loop (immediately before the od), set BlockNumber to be the block for the beginning of the loop body;

If the block appears at the end of a guarded statement as part of an alternation, the set BlockNumber to number the block immediately following the alternation;

If the block is followed immediately by another, possibly within a loop or alternation, set BlockNumber to the number of the following block.

3. Every guard in the UNITY program tests the value of BlockNumber for equality to the guarded block, and this test is conjoined with whatever other guard (from an alternation or loop) might be associated with the block.

4. Every guarded statement in the UNITY program is the code of the block with successor BlockNumber computations added (as specified above), GOTOs removed, and all do, od, if, and fi control statements removed.

This process is graphically displayed in FIGS. 8a-8c. Specifically, FIG. 8a illustrates original sequential code, FIG. 8b shows the same code augmented with blocks, and FIG. 8c illustrates a UNITY program created by the foregoing rules.

An operator of code analyzer 200 or UNITY format generator 430 of code analyzer 400, breaks a program into segments as required by the UNITY model. The operator may do this through a manual review and identification of iterating portions of code. Conversely, generator 430 may carry out this operation through the use of internal IL. (Although generator 430 is shown as a separate block in FIG. 4, it may be embodied as a module in engine 440.) This representation as noted above, is in the form of an AST data structure, such as that shown in FIG. 5b. Generator 430 utilizes this data structure to construct a program dependency graph, a construct that is commonly used for data flow analysis of programs and is well known to those having ordinary skill in the art. The program dependency graph contains the dependency relationships among various statements within a program. These dependencies are of three types: (1) control dependencies, (2) data flow dependencies, and (3) declaration dependencies.

Control dependencies indicate the order that statements may be executed. Data flow dependencies indicate how data is propagated through a dependency graph. Declaration dependencies indicate the dependency relationship between data declarations and the statements where their states are changed. FIG. 9 illustrates an example of a program dependency graph constructed from the C- language program fragment provided below in Table 9.

                  TABLE 9
    ______________________________________
    C-Language Program Fragment
    ______________________________________
             main ( )
             {
           [1]         int fact
           [2]         int n
           [3]         n=4
           [4]         fact=1
           [5]         While (n!=0) {
           [6]         fact = fact * n
           [7]         n = n-1
                       }
             }
    ______________________________________


In the fragment of Table 9, the two declarations in lines [1] and [2] are related to equations in lines [3], [4], [6], and [7]. This relationship is graphically represented by lines 936-942 in FIG. 9. Regarding control flow, "main ()" precedes every statement while the inequality in line [5] must precede the statements of the loop body (i.e., [6] and [7]). This relationship is represented by lines 944-956 in FIG. 9. Regarding data flow, the equations in lines [6], [7], and the inequality in line [5] (i.e., n!=0) initially use the value of n set in line [3] and fact in line [4]. These equations also use the value of fact set in the previous cycle of the while loop. These relationships are represented by lines 958-966 in FIG. 9.

The use of a program dependency graph permits the separation of program segments as required by the UNITY model. In the foregoing example, the loop (lines [5], [7]) should be a separate segment because the loop iterates to termination for each use of "main ()." Moreover, this loop can be separated since it is an isolated sub-tree in the dependence graph of FIG. 9. Referring to this figure, if node 920 were removed from dependence graph 900, then the entire loop would be removed from the program fragment.

Significantly, only one statement in "main ()" must precede the loop. However, all "main ()" declarations must be carried in the separate segment because of declaration dependencies. Similarly, statements [3] and [4] just prior to the loop also depend on these declarations. This is an observation necessary for the propagation of weakest precondition calculation from one segment to another (i.e., "backwards" analysis, as discussed below).

Code analyzer 400 uses the program dependency graph data structure to segment a sequential program into UNITY segments, as illustrated in FIG. 10. Referring to FIGS. 4 and 10, IL file 420 (containing "internal IL" in AST form and a symbol table) is received by code analysis engine 440, and combined to form a program dependency graph, in accordance with block 1010 of FIG. 10. Engine 440 then searches the newly created dependency graph for statements that define iterations (e.g., "do while," "goto," etc.), as indicated in block 1020. Thereafter, engine 440 creates alternate UNITY statements for each interaction directed to repeating execution and exiting operation, pursuant to block 1025. The engine also introduces sequencing variables (e.g., variable "c", in Tables 7 and 8) where necessary. Finally, pursuant to block 1030, engine 440 utilizes the control flow dependencies of the subject dependence graph for isolating the subgraph associated with each of the iterations found in block 1020. The operations carried out in blocks 1010, 1020, and 1030 are well known to a person having ordinary skill in the art since such operations are also performed by any commercially-available compiler that optimizes program loops. These operations are also discussed in Aho.

As the foregoing illustrates, control flow dependencies are essential for isolating iterative segments. Conversely, declaration and data flow dependencies supply useful information for efficient execution of forward and backward analyses.

D. Analysis Sequence: Link Lists

FIG. 11 illustrates the software analysis sequence of code analysis engine 250 under the control of user interface 130 (FIG. 2). As a general summary of operation, engine 250 initially generates a statement link list from translated code, in accordance with block 1110. Thereafter, an operator designates the relevant code fragment to be analyzed in block 1112. Finally, flow analysis (either forward or backward) is carried out in blocks 1114-1134. Source code representing this embodiment of the invention is provided in a file entitled "PCA2 main routine," located in the Provisional Appendix.

1. Statement Link List

Analysis engine 250 builds a link list from IL File 230 and a corresponding symbol table in SYM file 240. This operation is well known to those having ordinary skill in the art. An overview of the resulting data structure is provided in FIG. 12.

FIG. 12a illustrates a link list node 1200. Every statement in source code file 210 is represented by a text string disposed within a node (i.e., "struct line" in the source code). A text string simply represents a statement (i.e., "y:=x+1") as a string of characters. Node 700 includes four sets of pointers (i.e., addresses to other nodes in memory) to other structures; i.e., previous node ("prev"), next node ("next"), a parent of the node ("parent"), and children ("children"). The children structure is made up of an array to facilitate any type of branching flow in the source code.

FIG. 12b illustrates a basic data structure of analyzer 300 using link list nodes. More specifically, FIG. 12b illustrates a double-link list 1210 including three nodes; 1202, 1204 and 1206. As illustrated by this data structure, code analysis engine 250 can traverse a link list of nodes representing a portion of source code by moving down a "next" pointer or up a "prev" pointer. An illustration of this data structure within a more complex source code construct is shown in FIG. 12c.

FIG. 12c illustrates a branch construct having multiple levels to accommodate an if-then-else statement. Referring to this figure, fork node 1212 is coupled to "end-if-node" 1220 via previous pointer 1252 and next pointer 1254. Fork node 1212 is also coupled to "if-condition" node 1214 via child pointer 1224, previous pointer 1222, and parent pointer 1226. Control flows to node 1214 when the subject if-then-else condition is true.

Node 1214 is coupled to node 1216 via previous pointer 1256 and next pointer 1258. Node 1216 is also coupled to fork node 1212 via parent pointer 1228. An identical data structure follows from node 1216 to node 1218 with any number of nodes in between. Node 1218 is the last node of this sequence, being coupled to fork node 1212 via parent pointer 1230 and to end-if-node 1220 via child pointer 1234 and next pointer 1232.

An identical data structure exists on the "else" (i.e., left-hand) side of the figure wherein nodes 1222, 1224, 1226, 1212 and 1220 are coupled to each other via previous pointers 1250, 1260; next pointers 1262, 1236; child pointers 1248, 1238; and parent pointers 1240, 1242, and 1246 as shown therein. Control flows to node 1222 when the subject if-then-else condition is false.

Engine 250 will traverse an if-then-else statement in IL file 230 on the data structure shown in FIG. 12c. Fork node 1212 is a "virtual" node supported by the data structure but not existing in the original source code file 210. Similarly, end-if-node 1220 is a virtual node (signifying the end of an if-then-else statement) supported by the data structure but not existing in the source code file 210.

Referring to FIG. 12c, engine 250 can traverse the branch statement represented by this structure simply moving down next pointer 754 or up previous pointer 752. Next or previous pointers allow for movement to the same or higher level. Conversely, child pointers allow for movements to lower levels.

The left-hand side of FIG. 12c (i.e., the "else" side) will be created by engine 250 even if there is no else statement within the source code file 210. Parent pointers allow for immediate movement to fork node 1212, as shown in FIG. 12c. Alternatively, engine 250 may traverse one or more previous pointers to return to fork node 1212. If any child node (i.e., 1214-1218 and 1222-1226) includes an if statement, the same structure as 7c may be found within that node.

FIG. 12d illustrates the components of node 1200. Referring to FIG. 12d, node 1200 includes a type field to identify the type of intermediate language statement (i.e., if, assign, goto, etc.). This node also includes an "LHS" field to identify the left-hand side of an assignment statement and an "RHS" field to identify a right-hand side of an assignment statement. (The LHS and RHS fields are ASCII format for display to a user on monitor 120.) Finally, node 1200 includes a hash field which is an array of integers, wherein each integer is an index to a symbol table (i.e., in SYM file 240) containing attributes of the subject variable. As noted above, an exemplary symbol table may be found in co-pending application Ser. No. 08/550,869, (attorney docket no. 5773-3(P1)).

The link list data structure disclosed in FIGS. 12a-12d is of conventional type well known to those having ordinary skill in the art. Accordingly, it will be apparent to those of skill in the art that this structure may be readily applied to other source code constructs (i.e., assign, goto, etc.). Similarly, the text string data structure disposed in each node is a standard structure well known to those having ordinary skill in the art. Such structure may consist of a string of ASCII characters.

Returning to FIG. 11, once a statement link list has been generated, the code-analyzer operator may designate a relevant code fragment within that list pursuant to block 1112. Specifically, the operator may construct a desired fragment by selecting elements from an existing link list, or directly from the corresponding source or IL files. Responding to operator instructions, engine 250 constructs a code fragment by selecting a (1) starting statement and (2) ending statement.

Upon designating a relevant code fragment, the operator of code analyzer 200 has the option of performing a forward or backward analysis pursuant to blocks 1114-1134 of FIG. 11.

2. Flow Analysis

Referring to FIG. 11, upon selection of a program fragment in block 1112, the operator may now begin flow analysis pursuant to block 1114 in FIG. 11. A flow analysis begins with the operator selecting either a forward analysis in block 1116 or a backward analysis in block 1118. If forward analysis is chosen, control flows to block 1120. Conceptually, forward analysis is a classical application of data flow analysis. Forward analysis is the process of following the execution path of the code fragment being analyzed and either displaying information to the operator or building a data base described in the flow of control. In this process, engine 250 traverses a link list of nodes defining translated source code in a forward direction. At each node, engine 250 textually identifies a statement of the source code represented by a text string to determine the next node in the execution path. More specifically, engine 250 processes each node in forward analysis by interpreting associated IL statements through character matching.

To begin forward analysis, the operator must select corresponding starting and ending points pursuant to block 1120. Thereafter, control flows to block 1122 which represents forward analysis operation. A forward analysis may be used to determine the various paths of operation to get to some point downstream in the code. Variations on this operation include a display of all intervening branching conditions, a display of multiple end points and a display of the entire code and all paths between selected starting and ending points.

Forward analysis may also be used to show only those statements that are involved in the manipulation of one or more variables. This is a slice of the code that represents the path. More specifically, a slice is obtained by forward analysis under control of the operator. It begins at the starting point and continues until it reaches the ending point designated by the user, or the end of the relevant code fragment or the code fragment being analyzed. The resulting forward slice is saved for use as a relevant code fragment for slicing or weakest precondition analysis.

A top level flow diagram illustrating forward analysis in code analyzer 200 is provided in FIG. 13a. Referring to this figure, engine 250 begins forward analysis at the first selected statement pursuant to block 1336. This selected statement is defined by the starting point identified by the operator in block 1120. Engine 250 places the initial statement number (i.e., starting point) in a LIST in memory and initializes additional space in memory (pathstorage) for the first execution path to be analyzed. Control then flows to block 1338 where engine 250 produces a list of facts (i.e., data utilized by CLIPS commands), each fact being a list of node identities representing a possible path. This process is described in more detail in the pseudo code of Table 10.

                  TABLE 10
    ______________________________________
    enter initial and end statement number;
    put initial statement number in a LIST,
    initialize storage for the first path;
    position at the initial statement number in the
    source file -- this is now the current
    statement;
    repeat as long as there is a statement number
    in LIST -->
            repeat as long as next statement is not the
              end point -->
                if current statement branches -->
                  store the statement number where flow
                  of control goes in LIST;
                [ ] current statement does not branch -->
                  skip
                fi;
                put current statement in pathstorage;
                get next statement;
                check if next statement closes a loop
                (isolate the whole loop in this case)
            end repeat.
    Store path in file (path is a flat list with
    condition statements linked together).
    Top element in LIST being initial statement number
    initializes storage for another path; remove top
    element from LIST;
    end repeat.
    ______________________________________


Referring to Table 10, the initial statement number (selected by the operator) is initially treated as the "current statement" for purposes of analysis. Engine 250 then enters two repeating or looping operations. The outer loop enables engine 250 to enter multiple paths, while the inner loop enables the engine to traverse a current path undergoing analysis.

Starting with the initial statement number, engine 250 processes the first path within the inner loop. If the current statement branches, the statement number is stored in the LIST. Otherwise, this step is skipped. Thereafter, the current statement is stored in memory (i.e., pathstorage) with the corresponding "condition" for continuing down the current path.

Engine 250 retrieves the next statement (i.e., link list node) along the current path and checks to determine if the statement closes a loop. If yes, the entire loop is isolated for purposes of imposing a UNITY model or for reviewing manually. The path analysis within the inner loop is then repeated, so long as the next statement is not an end point. As noted above, the IL statements are not executed in forward analysis but rather are interpreted through character matching.

Upon completing a path (i.e., reaching an end point), the entire path is stored in pathstorage as a flat list with condition statements linked together. Engine 250 then returns to the LIST and removes the top element. The top element, representing the statement number of the first branch in the previous path, is used to initialize analysis and storage for another path. Starting at this new statement number, engine 250 repeats inner loop operations; i.e., storing corresponding conditions in pathstorage and recording branches in LIST. These inner and outer loop operations are repeated until every statement and every path is processed by engine 250.

After a list of facts are produced pursuant to block 1338, control flows to block 1340 in FIG. 13a where the user may filter or manipulate the fact contents through the use of CLIPS commands.

More specifically, for statements disposed between starting and ending points, the operator may retrieve (1) the stored conditions and corresponding IL code (linked by statement number), (2) all the statements and (3) all the corresponding paths.

Additionally, specific variables may be identified for purposes of textually slicing the code based upon direct textual matching. This is a simple form of textual slicing --accommodating only direct hits--and therefore different from the more complex textual slicing described in FIG. 27, below.

If no end point is specified for forward analysis, analyzer 200 will continue to process code until a return (i.e., if in a procedure), the end of the program, or a termination/exit statement is reached.

Analyzer 200 also facilitates a "suppress" feature whereby all calls are ignored when performing forward analysis. Accordingly, only a single, primary path is processed.

Control then flows to block 1342 where the operator may selectively display information on monitor 120. As noted above, such display includes all intervening branching conditions, multiple end points and display of the entire code and all paths between selected starting and ending points.

Examples of forward analysis output are provided in FIGS. 13d-13i, based upon source code in FIGS. 13b and 13c. In the examples, a forward analysis is performed on a routine in "samplel.cob" called P010-CALC-LEAP-YEAR (FIG. 13b) which is called from another routine in the same program called 0300-INITIALIZE-RTN (FIG. 13c). These examples were generated on a early prototype to analyzer 200.

FIG. 13d illustrates forward analysis to determine the ability to reach one point from another. This example shows the conditions under which point 2226 is reached and in which lines those conditions appear in the source code. "ENDLine 1" indicates that line 2226 has been reached. Note that two paths are available to reach this point, denoted by "[1]" and "[2]."

FIG. 13e demonstrates that more than one end point may be identified; "ENDLine 1" indicates that a first end point is reached and "ENDLine 2" indicates a second end point is reached.

FIG. 13f illustrates the display of all code between the start and end points for four different execution paths.

FIG. 13g illustrates the textual, forward slicing of a portion of the code based upon variables "C-FEB" and "C-TEMP-1."

FIG. 13h illustrates the extraction of variables upon which C-TEMP1 is dependent. This extraction is achieved through reliance on a data dependency graph such as shown in FIG. 9, below. Significantly, the results of FIG. 13h may then be used in FIG. 13i to perform a slice of all such variables from the subject code.

Returning to FIG. 11, in addition to forward analysis, an operator may choose to perform backward analysis pursuant to block 1118. If backward analysis is selected, control flows to block 1126 wherein the operator must choose starting and ending points in the same manner as for forward analysis. Once valid starting and ending statements have been identified, engine 250 prompts the operator to enter a postcondition pursuant to block 1128. Upon presentation of the postcondition, backward analysis may begin pursuant to block 1130, wherein engine 250 traverses a link list of nodes defining translated source code in a backward direction.

Backward analysis is used in two different operations: one for calculating the "weakest precondition" (wp) of a non-iterating code segment and the other for verifying invariants for one or more iterating segments of the code. In backward analysis, each statement is processed by using the content of the statement to modify a postcondition to obtain a weakest precondition (such operation is described below). Engine 250 interprets each statement though character matching.

All program constructs included in the intermediate language are formed by a variety of statement types including alternation statements, assignments, skips, aborts and iterations connected together by the sequencing s operator (";"). Iterating code segments associated with large loops or cooperating processes executing concurrently will not be processed within the text of the main program because of the manual transformation of sequential programs into the UNITY model. Small, terminating loops, will typically be left within the main program. However, these segments may be analyzed separately requiring some operator input, such as an invariant. This determination is made through visual inspection of the code.

Verification of invariants for iterating statements uses the weakest precondition calculation code. This code represents most of the work for invariant verification. A discussion of weakest precondition computation follows.

3. wp Computation

The concept of the weakest precondition was introduced by E. W. Dijkstra (see, Dijkstra) and so far has been used primarily for theoretical work on the semantics of programming languages. Weakest precondition ("wp") may be defined as follows: given a program S and a condition P (i.e., a logical predicate which the program must satisfy when it terminates), the weakest precondition of S for P (i.e., "wp.S.P.") is the least restrictive condition Q that must hold before the execution of S to guarantee its termination in one of the states defined by P. This relationship is succinctly expressed in equation (4) below.

wp.S.P.=Q (4)

The weakest precondition of many useful programming language statements is calculated by a simple syntactic manipulation of the programming language statements themselves and of the formula describing the postcondition. Several examples are provided below to illustrate the rules for wp computation of six basic statement types defined in DGC language (i.e., assignment, composition, alternation, skip, abort and repetition statements). In these examples, the definitions set out in Table 11 apply.

                  TABLE 11
    ______________________________________
    S, T:
         denotes programs (statements), possibly compound statements.
    P, Q:
         denote predicates, often in simple propositional calculus and
         occasionally in first order predicate calculus with quantifiers.
         These
         predicates are not part of a program; rather, they are generated by
         the operator and used to express the state of program variables.
    B:   denotes a predicate, or logical expression, that is part of a pro-
         gram and is evaluated as part of the execution of alternation and
         repetition statements.
    ______________________________________


As a first example, S1 is a single assignment statement provided as equation (5) below.

x:=x+1 (5)

Weakest preconditions of S.sub.1 for two predicates are shown in equations (6) and (7) below.

wp.S.sub.1 .multidot.(x=2)=(x=1) (6)

wp.S.sub.1.(x>0)=(x>-1) (7)

In words, program S will terminate in the state x=2 if and only if it starts in the state x=1, and it will terminate with x>0 if and only if it starts with x>-1. As a general rule, the semantics of the assignment statement x:=e, for any expression e, is defined in equation (8) where the notation P=P(x(-e) means that all occurrences of the variable x in predicate P are textually replaced by "e". If the variable does not occur in the predicate, then the predicate remains unchanged and, referring to equation (4), predicate Q=predicate P.

wp.(x:=e).P=P(x.fwdarw.e) (8)

As a second example, S.sub.2 is the composition of two assignment statements provided as equations (9) and (10) below.

x:=x+1; (9)

y:=y*y (10)

Examples of weakest preconditions of S.sub.2 for several predicates are shown in equations (11) through (13) below. ##EQU1##

As a general rule, for any two statements S.sub.a and S.sub.b, the semantics of the composition of the two statements in connection with weakest precondition is shown in equation (14).

Wp.S.sub.a ;S.sub.b.P=wp. S.sub.a.(wp.S .sub.b.P) (14)

This is a "backward computation." The wp of the last statement is the predicate used in the wp computation of the first statement.

As a third example, S.sub.3 is a guarded alternation statement provided as equation (15) below.

if x>0.fwdarw.x:=x-1;

[]x.ltoreq.0.fwdarw.skip;

fi (15)

Equation (15) is written in DGC form and in words means if x is greater than zero, x is set equal to x-1, and if x is less than or equal to zero, no operation is performed and control moves to the next statement (i.e., "skip"). Weakest precondition of S.sub.3 for an exemplary predicate is provided in equation (16).

wp.S.sub.3 .multidot.(x>2)=(x>3) (16)

Generally, the semantics of an alternation statement in the form of "if B then S.sub.a else S.sub.b" in connection with weakest precondition is shown in equations (17) (disjunctive form) and (18) (conjunctive form). ##EQU2##

As a fourth example, S.sub.4 is an abort statement--whose wp for any predicate is always "false" (i.e., results in program failure) because the program never reaches the end. The weakest precondition semantic is provided in equation (19).

wp.abort.P=false (19)

As a fifth example, S.sub.5 is a skip statement--which is always equal to the predicate "P" of equation (4) because the skip does not change anything. The corresponding weakest precondition semantic is provided in equation (20).

wp.skip.P=P (20)

Finally, a repetition statement in DGC form (i.e., "do loop") is provided in equation (21) below.

do B.fwdarw.S od (21)

Referring to equation (21), "B" guards the loop body "S," which is repeated so long as B evaluates to true. If B is initially false, then S is never executed. To be useful, S must have "progress properties" that assure that B will eventually be false (although many systems, such as an operating system, loop forever).

Repetition semantics must be defined in terms of a "loop invariant" which is a predicate that holds before and after S; that is, S does not transform the invariant predicate. A necessary and sufficient condition for invariant predicate P.sub.I is shown in equation (22). ##EQU3##

Equation (22) states that once past the guard, both B and the invariant hold and if the invariant is to hold after the loop body, B and P.sub.I must imply the weakest precondition.

When the loop statement terminates, the guard will be false but the invariant must still hold. Furthermore, the invariant must hold initially. These observations lead to the weakest precondition semantics of the repetition statement as shown in equation (23). ##EQU4##

FIG. 14a provides a top-level flow chart of backward analysis in analyzer 200. At block 1430, engine 250 performs a forward analysis between the starting and ending points selected in block 1126 of FIG. 11 to create path fact(s) (see, FIGS. 13a-13i and related discussion). In block 1432, engine 250 massages the resulting path fact(s) by performing such operations as textual or transitive-closure slicing. Such slicing of linked list data structures is well known in the art. An analogous operation for basic block data structures is provided below (i.e., FIGS. 27, 28a, 28b).

Control then flows to block 1434 where wp computation is performed on each selected path. This operation is described in FIG. 14b below. Finally, the wp of every two paths newly calculated are ORed together in block 1436 and then simplified pursuant to FIG. 17a (discussed below).

FIG. 14b illustrates flow chart 1400 which describes weakest precondition calculation in analyzer 200 for each program statement (S) of interest within a path defined by an initial line number and final line number. This operation may be applied to a fragment or the entire original program. In block 1410 of flow chart 1400, the operator identifies initial and final line numbers and a postcondition P. Engine 250 responds by identifying the last statement (S) contained between the initial and final line numbers. After these initial values are established, control flows to block 1412 where engine 250 determines whether S is an alternation. If S is not an alternation, control flows immediately to block 1416 where a weakest precondition is calculated based on postcondition P and statement S. Alternatively, if S is an alternation, then control flows to block 1414 where engine 250 determines whether any assignment in statement S has a left-hand side (LHS) variable in an assignment statement that is contained in the variables of P. If there is, control flows to block 1416 for weakest precondition calculation. Alternatively, if there is not, control flows to block 1415 where statement S is substituted with a skip statement and then control flows to block 1416.

In block 1416, weakest precondition is calculated based on statement S and postcondition P. Upon calculating a new precondition (which will be a postcondition for the previous statement), a new set of variables are identified based on the latest postcondition. (A flow chart describing the operation in block 1416 is provided in FIG. 15.) Control then flows to block 1418 where engine 250 determines whether P contains only constants and free variables (i.e., variables that do not occur in the program). If this is answered in the affirmative, control flows to block 1422 where expression simplification is performed and the process ends pursuant to block 1426 since any remaining statements will have no effect on P.

However, if block 1418 is answered in the negative, control then flows to block 1420 where engine 250 inquires whether S is the first statement between initial and final line numbers. If yes, control again flows to block 1422. If no, control flows to block 1424 where S is decremented by 1, thereby identifying the next statement to be processed. This operation continues until block 1418 or block 1420 is answered in the affirmative. Referring again to FIG. 14b, the expression simplification for block 1422 is optional and described in greater detail below.

Rather than repeatedly inspect the postcondition P for variables of interest during wp computation, engine 250 may alternatively employ a separate list of variables ("V") in accordance with the program variable list shown in FIG. 25. If such a list were employed, the process in FIG. 14b would be modified only slightly. In block 1410, engine 250 would generate the list based on the initial contents of P. In block 1414, the variable list V rather than the postcondition P would be searched for matching variables with a new statement.

Use of a variable list could be significantly more efficient than a postcondition text string when performing the searching operation in block 1414, especially if the latter has a significant number of terms. For example, a statement is shown in FIG. 14c containing 21 terms but only 7 variables. Presumably, the time required to search a list with 7 entries is less than searching a 21-term text string.

Returning to FIG. 14b, upon computing a weakest precondition in block 1416, variable list V is updated by adding and deleting variables based upon the latest postcondition. In summary, incorporating a variable list such as shown in FIG. 25 would affect operations in blocks 1410, 1414 and 1416 of FIG. 14b.

FIG. 15 illustrates the wp computation carried out in block 1416 of FIG. 14b. In this analysis, it is presumed that Hoare's axiom holds which, in a practical sense, means variable aliasing is not accommodated. Turning to FIG. 15, engine 250 determines whether S is an alternation. If yes, control flow to block 1514 where a result (i.e., weakest precondition) is equal to a guard (i.e., B1 or B2) logically ANDed with the weakest precondition of a corresponding statement (i.e., S1 or S2) based on postcondition P. As discussed above, B's are guards to their corresponding statements. The result in block 1514 assumes that only two statements are present in the associated alternation statement. In this instance, engine 250 calls the code recursively to evaluate the weakest precondition for each branch of the if (i.e., alternation). If S is not an alternation, control flows to block 1512 to determine whether S is an assignment. If yes, engine 250 performs a textual substitution of the predicate P (i.e., postcondition), replacing all occurrences of the subject variable in the predicate with the expression on the right-hand side of the assignment.

If statement is not an assignment, control flows to blocks 1518 where engine 250 determines whether statement S is a procedure call. If yes, control flows to block 1520 where the result is a product of recursively performing the weakest precondition computation with the subject procedure as its argument.

If S is not a procedure call, control flows to block 1522 where engine 250 determines whether S is an iteration. If yes, control flows to block 1524 where weakest precondition is computed for the loop body. In block 1524, this computation is performed by engine 250 processing the loop body. After which, control flows to block 1526 where it is determined whether the precondition of iteration S equals the postcondition. If yes, control flows to block 1528 where the result is set equal to P (i.e., P is a true invariant). If no, control flows to block 1530 where engine 250 asks the operator for the weakest precondition of S because the value cannot be computed. In responding to analyzer 200 in block 1530, the operator will also determine manually--by inspection--whether P is false. If S is none of the above, it is treated as a skip and the routine returns to the process in FIG. 14b pursuant to block 1538.

After computation is performed in blocks 1514, 1516, 1520, and 1528, expression simplification is performed pursuant to block 1534 to simplify the result before returning to block 1416 in FIG. 14b pursuant to block 1536.

Although not shown in FIG. 15, if S is a "goto" statement, wp calculation is performed on each associated path in a backwards flow--ultimately generating a value in accordance with the other statement types. More specifically, engine 250 performs a separate wp computation on each path associated with the goto. The individual results are ORed together (and simplified) thereby generating a single wp for the statement.

Invariant verification, an operation performed on iterating statements, is illustrated in FIG. 16. Referring to FIG. 16, invariant verification begins at block 1610 with the operator providing an invariant. Control flows to block 1612 where weakest precondition calculation for a first segment S using the invariant as postcondition P is initialized. In block 1614, engine 250 applies the relationship for invariant calculation as shown in equation (22) below: ##EQU5##

In equation (24), P is the supplied invariant and B is the condition (i.e., guard) for the segment execution. After performing the calculation in block 1614 (pursuant to the wp computation process of FIG. 15), engine 250 simplifies the result in block 1616. The simplification process may require user interaction as shown in block 1618. (See Library of Lemmas discussion, below) Simplification is described in greater detail below. In block 1620, engine 250 determines whether the resulting formula is reduced to "true". If no, control flows to block 1622 where the user may introduce another variant and begin the process again at block 1612. However, if the formula is reduced to true, the invariant is verified and the operator is free to consider the next segment pursuant to block 1624.

4. Simplification

FIG. 17a illustrates the expression simplification process of blocks 1436 in FIG. 14a, 1422 in FIG. 14b, 1534 in FIG. 15, and 1616 in FIG. 16. The process of FIG. 17a is applicable to any predicate (i.e., precondition, postcondition, etc.) and is embodied in a simplifier module contained within engine 250. The source code for an exemplary simplifier is included in a file entitled "PCA2 simplifier," included in the Provisional Appendix. Referring to FIG. 17a, engine 250 initially accepts an input string in block 1710. This input string may represent, for example, a postcondition. The string is processed in block 1720 for ultimate reduction to a single OR chain 1773 shown in FIG. 17f. Processing is performed by a parser disposed within engine 250. The steps carried out in block 1720 are illustrated in FIGS. 17b-17f.

At the outset, engine 250 parses an input string into tokens; each token holding a number or a variable as shown in FIG. 17b. Parsing a string into tokens enables engine 250 to "understand" the expression; i.e., learn what variables are present. Engine 250 next combines tokens into "terms;" each term data structure 1768 holding a number and/or variable (i.e., a term of a polynominal). As shown in FIG. 17c, a sequence of terms 1768 added together forms polynominal 1769.

Engine 250 reads the input string at block 1710 sequentially and, therefore, can identify and situate operators (i.e., <, >, .ltoreq., .gtoreq., =, .noteq., etc.) correctly between polynominals. As shown in FIG. 17d, when two polynominals 1769 are disposed on either side of an operator 1770, the result is an equation or "assign" 1771.

Upon identifying and assembling assigns, engine 250 concurrently places these data structures in a standardized format by moving integers to the right hand side (RHS) of the operator and variables to the left hand side (LHS) pursuant to block 1730. Engine 250 also conducts some rudimentary simplification by combining terms (i.e., same variable) and determining whether the newly created assign may logically reduce to a "TRUE" (i.e., y=y) or "FALSE" (i.e., 1=0) for all cases.

Once the current assign has been reduced to its simplest form, engine 250 applies standard YACC Grammar (See Aho) in block 1720 to link this assign with other assigns which are related via a logical AND (i.e., an ampersand). As a result, engine 250 creates an AND chain 1772 made up of two or more assigns 1771, as shown in FIG. 17e.

Upon creation of AND chain(s) 1772, engine 250 will begin the next level of simplification: (1) removal of equivalent assigns in each chain pursuant to block 1740 (i.e., eliminate redundant assigns), and (2) application of logical rules to each chain for possible reduction to false, pursuant to block 1750. As an example of the latter operation, if any assign reduced to false in block 1730, its corresponding AND chain (if any) would also reduce to false.

Concurrent with the operations in blocks 1740 and 1750, engine 250 also assembles AND chains 1772 into a single logical OR chain 1773, again pursuant to standard YACC Grammar. After AND chains 1772 are simplified, OR chain 1773 is evaluated with logical rules in block 1760 to determine whether further simplification is possible. For example, if any AND chain reduced to true, then OR chain 1773 would also reduce to true. Four fundamental boolean relationships useful for reduction are shown below in Table 12.

                  TABLE 12
    ______________________________________
    True   Anything = Anything
    True   Anything = True
    False   Anything = False
    False   Anything = Anything
    ______________________________________


The logical simplification applied to OR chains in block 1760 is further illustrated in FIG. 17g. Referring to FIG. 17g, engine 250 builds a table of assignments in block 1761, reducing each statement (i.e., assign) to a single variable. In this way, relationships between different statements may be tracked (i.e., x=a and x.noteq.y NOT a).

Engine 250 then builds logic vectors in block 1762 based on the relationship between variables (i.e., statements). Karnaugh Maps may be applied here to simplify relationships. Referring to block 1762 in FIG. 17g, the relationship "b AND c OR a AND b AND c OR NOT a AND b AND c" is ultimately reduced to a series of 1's, 0's and "don't cares" (i.e., "X"). These vectors are simplified in block 1763 resulting in the vector "X11." Finally, this simplified vector is changed back to a statement (i.e., input string) through the variables b and c, pursuant to block 1764.

In the example of FIG. 17g, "X11" converts to b AND c, which becomes "x=yx>y." Since the relationships on either side of the conjunction cannot both be true, this statement simplifies to "false."

Alternatively, if the vectors simplified to a relationship that was true under all circumstances (i.e., x=xy=y), then the statement would reduce to "true."

If the input string cannot be simplified to True or False, some valid, logical relationship is output from the simplifier.

The rules applied in this simplification operation include the boolean relationships provided in FIGS. 43a-43c.

How this value is subsequently used will depend upon the on-going operation. For example, if the string is simplified during the course of a wp computation pursuant to block 1532 of FIG. 15, the simplified value serves as a postcondition for the next (i.e., upstream) statement. Alternatively, if the result is a final wp computation in a backward analysis pursuant to block 1422 of FIG. 14, the simplified expression is displayed to the operator on monitor 120. Finally, if the result is a simplified invariance computation pursuant to block 1616 of FIG. 16, the expression is displayed to the operator on monitor 120 for consideration pursuant to block 1620, as described therein.

Improvements to the simplification process of FIG. 17a include inputting tokens rather than strings. Such a configuration would eliminate two transformation steps (i.e., from and to strings) within the process. A second improvement is creating a symbol table (i.e., in block 1761 of FIG. 17g) that is permanent. Current operation requires the creation of a new table for each statement that is simplified. However, it would be more efficient to maintain the table in memory and simply update when new relationships are produced through simplification of new statements. Such improvements are currently being implemented in the most recent version of analyzer 200.

Referring again to the exemplary source code in FIGS. 13b and 13c, several examples of backward analysis output are provided in FIGS. 42a-42c. In the example of FIG. 42a, the postcondition chosen was "1" (i.e., TRUE; a constant without any dependency on the source code). Accordingly, the weakest precondition is TRUE; i.e., line 2217 will always be reached from line 2207.

In the example of FIG. 42b, the variable C-FEB is assigned a free variable "n" as the postcondition. The weakest precondition tells that line 2217 can be reached from line 2207 by two different paths. One path reports that if C-YY is divisible by 4, then C-FEB will have a value of 29. The other path reports that if C-YY is not divisible by 4, the C-FEB will have a value of 28.

Finally, in the example of FIG. 42c, the postcondition is used to check at what condition C-FEB will be assigned a value greater than 29. The weakest precondition result reports that C-FEB cannot be assigned a value greater than 29 from line 2207 to line 2217. Accordingly, the weakest precondition is FALSE.

E. Analysis Sequence: Flow Graph

FIGS. 18a and 18b illustrate the software analysis sequence of code analysis engine 440 under the control of user interface 130 (FIG. 4). As a general summary of operation, engine 440 initially generates a flow graph from translated code, in accordance with block 1810. Thereafter, an operator designates the relevant code fragment to be analyzed in block 1812. The operator must then decide whether to slice the subject fragment and, if so, under what criteria in accordance with blocks 1814-1826. Finally, flow analysis (either forward or backward) is carried out in blocks 1828-1846.

1. Flow Graph Generation

Analysis engine 440 builds a flow graph from the internal IL and a corresponding symbol table. A flow graph is a tuple (B, E) where B is a set of "basic blocks" representing a procedure P, and E is a set of directed arcs between the basic blocks. A basic block is a data structure constructed to represent a list of statements in P such that the flow of control of P enters at the beginning of a basic block and leaves at the end without halting or branching until the last statement of the block. A basic block is represented as a structure containing a linked list of statements.

A basic block, within a flow graph, has successors and predecessors. For example, if a basic block B ends in an unconditional branch, then the basic block B2 containing the target of the branch is its only successor, and B1 will be a predecessor of B2. If B1 ends in a conditional branch then it will have two successors.

A directed arc exists from any basic block B1 to any basic block B2, if B2 is a successor of B1.

An example of a procedure (i.e., procedure "foo") and its corresponding flow graph is shown in FIGS. 19a and 19b. Referring to FIG. 19b, boxes 1910 to 1940 represent the basic blocks of procedure foo while arrows 1950 to 1980 represent the arcs between the basic blocks.

The construction of basic blocks in a flow graph architecture is standard compiler technology and, therefore, well known to those having ordinary skill in the art. Additional description may be found in Aho.

FIGS. 20a-20c illustrate how a second sample C function "foo" is modified and configured within engine 440. FIG. 20 a shows foo in its native source code; C. FIG. 20b illustrates this function in external IL format. Finally, FIG. 20c shows a foo in flow graph format.

The flow graph of FIG. 20c is made up of a number of blocks or "graph nodes." For example, graph node 2010 is a basic block, as described above, containing a block header 2012, an internal IL statement 2014, identification of block predecessors 2016 and identification of block successors 2018. Block header consists of a block number (integer obtained by enumerating the number of basic blocks in the flow graph) and remarks (information regarding the basic block). Internal IL statement is a sequence of internal IL statements that make up the block. Block predecessors and successors are lists of block numbers of the predecessors and successors of the current basic block.

An additional datastructure that is created and is used in backward analysis is a control flow chart (i.e., control tree). This is generated using the mechanism described in Shirir, "Structual Analysis: A New Approach to Flow Analysis in Optimizing Compilers", Computer Languages, Vol. 5, pp. 151-153 (1980). This structure creates superblocks (Shirir refers to them as "intervals") which contain lists of basic blocks. These represent sequential (i.e., non-iterative) segments of code. These are the same segments of code described in the UNITY discussion. The control tree is further discussed in Appendix C.

Engine 440 will display the information in FIGS. 20a-20c on monitor 120 in accordance with display 2100 of FIG. 21. To help the operator correlate flow graph information with source code and IL code information, a graph window 2110 will be aligned with a source code window 2120 and an external IL window 2130 automatically, as shown in FIG. 21. Selections made between the source, IL and flow graph elements are always "in sync."

Returning to FIG. 18a, once a flow graph has been generated, the code-analyzer operator may designate a relevant code fragment within that flow graph pursuant to block 1812. Specifically, the operator may construct a desired fragment by selecting elements from an existing flow graph, or directly from the corresponding source or external IL files, as shown in FIG. 21. Responding to operator instructions, engine 440 constructs a code fragment by selecting a (1) starting basic block, (2) starting statement, (3) ending basic block, and (4) ending statement.

Upon designating a relevant code fragment, the operator of code analyzer 400 has the option of "slicing" the fragment, pursuant to block 1814 of FIG. 18a.

2. Slicing

The translation of source code into intermediate language helps disentangle the possible complex flow of control of sequential programs. In fact, the presentation of a program in intermediate-language form (i.e., in such formats as assignments, alternations, iterations, etc.) is by itself sufficient to clarify its meaning in many cases. The creation of sequences of condition-actions (i.e., B.fwdarw.P) as required by the UNITY computation model is a further aid to informal understanding of the code.

The latter, however, has a drawback. The number of sequences as well as their length is, in most practical cases, very large. This is of course also true for the source code of any large programs, but the condition/action sequences that can be extracted from any program are a much less compact presentation (albeit more suitable for human understanding) than source code.

For example, in a program made up only by one set of nested if statements, the number of sequences that can be extracted may only be equal to the number of conditions existing in the code itself. However, if two program fragments are connected in sequence (via the ";" operator), the number of sequences to examine for the composite program equals the product of the number of sequences of the two components. In fact, the sequential composition of program statements can produce an explosive growth on the number of sequences that need to be examined during program reading.

To cope with large size, code analyzer 400 employs a slicing function, which enables the operator to extract portions of a subject program relevant to the analysis. Condition/action sequences extracted from a slice of the program, are usually much smaller and fewer in number than those extracted from the original source code.

Referring again to FIG. 18a, the operator has an option of slicing a relevant code fragment at block 1814. If no slicing is requested, control flows to block 1828. However, if slicing is requested, the operator must then select starting and ending points in block 1816 (i.e., define the scope of the slicing operation).

The ending point is the last statement of the relevant code fragment considered in the computation of a slice. It is designated by the operator and must lie within the relevant code fragment. It is the statement at which backward analysis begins or forward analysis ends.

The starting point is a statement lying in the relevant code fragment and preceding the ending point in the flow of execution. It is the statement at which forward analysis begins or backward analysis ends.

Engine 440 can define a slice by selecting an appropriate (1) starting basic block, (2) starting statement, (3) ending basic block, and (4) ending statement, as designated by the operator.

After the initial slice has been defined, the operator selects desired slicing criteria in block 1818 of FIG. 18a. Engine 440 is capable of performing three different types of slicing operations: (1) semantic slicing using weakest precondition computation, (2) simple textural slicing, and (3) transitive closure slicing pursuant to blocks 1820, 1822, and 1824, respectively.

a. Semantic Slicing

The slicing of a program using the weakest precondition is called "semantic slicing." The precondition for slicing need not be a complex relation. It is possible to use free variables (i.e., those that do not occur in the program at issue) in the form x=m (x is a program variable) to obtain the semantic slice for any state involving x. In fact, by calculating backward the weakest precondition of all statements preceding the starting point, all statements that do not produce a weakest precondition different from the postcondition and the calculated weakest precondition can be safely eliminated. The remaining ones are those constituting the semantic slice for x, starting at the chosen point where the precondition x=m has been assumed.

The slice thus calculated is (with the exception of a pathological case of almost no practical interest) the smallest program that shows all statements contributing to the state of x at the point where the postcondition has been assumed.

The pathological case is that of programs containing statements changing the state of a variable and then changing it back to the previous one. For example, statements (25) and (26) provided below would be left in the slice, although they are useless:

x:=x+1 (25)

;x:=x-1 (26)

In practice, the calculation of semantic slices does not require the full evaluation of the weakest precondition. This is because real world programming languages use as alternation only "if-then-else" statements. For these statements, the weakest precondition would be different only if there is an assignment to a variable named in the precondition. This may be explained by reference to the proof shown in Table 13.

                  TABLE 13
    ______________________________________
    Example of Alternation Statement Not Affecting Postcondition
    ______________________________________
    wp. (if B then S1 else S2). Q
    (B   wp.S1.Q)   (  B   wp.S2.Q)
    = {S2 does not name any variable in Q}
    (B   wp.S1.Q)   (  B   Q)
    = {S1 does not change the state of any variable
            named in Q}
    (B   B)  Q
    =
    Q
    ______________________________________


Consequently, semantic slices can be effectively calculated by automatic inspection of the intermediate-language code.

Semantic slices, when applied in large programs, appear to be very effective because in many programs, those portions that are connected sequentially usually deal with different variables. These portions may contain several alternation blocks. Consequently, the number of interesting sequences is reduced, via the elimination of whole alternation blocks. Furthermore, it is likely the length of these sequences is shorter because of the elimination of many statements that do not affect the states of the variables of interest.

(1) Semantic Slice Examples

Consider the program fragment provided in Table 14 below.

                  TABLE 14
    ______________________________________
    Exemplary Program Fragment
    ______________________________________
           y:=x
           ;if x = 0 then (x:=x+1; z:=0) else z:=y end
           ;z:=z+1
           ;y:=0
    ______________________________________


The semantic slices for x, y and z (i.e., those portions of the program which affect the states of x, y and z, respectively) are provided in statements (27), (28) and (29), respectively.

;if x=0 then x:=x+1 else skip end (27)

y:=0 (28)

;if x=0 then z:=0 else z:=y end ;z:=z+1 (29)

The semantic slice for a pair of variables (i.e., x, y) is provided in statement (30).

;if x==0 then x:=x+1 else skip end ;y:=0 (30)

Note that some statements containing y are not in the semantic slice because they do not effect the states of y at the point of interest (i.e., the end of the fragment under consideration).

(2) Program slicing operation

Code analysis engine 440 may carry out semantic slicing operations pursuant to two methods. In either case, the body of a program intended to be searched is defined as "S", where S is a composition of individual statements (i.e., S.sub.1, S.sub.2 . . . S.sub.N), each of which (Si) is an assignment, skip, abort, alternation, composition or repetition. Further, the goal to both methods is slicing S with respect to the slicing predicate P. In other words, engine 440 determines a member of the set "PSlice (S,P)."

In a first method, the steps carried out by engine 440 are described generally below.

1. Compute P.sub.i =wp.(S.sub.i S.sub.i+1 ; . . . ; S.sub.N).P for 1.ltoreq.i.ltoreq.N.

2. Determine all pairs <i,j> with i<j such that P.sub.i .ident.P.sub.j.

3. Starting with j=N, find the smallest i so that <i,j> is one of these pairs and slice all the statements from S.sub.i to S.sub.j inclusive.

4. Set j=i-1 and repeat step (3) while j is positive.

5. Next, examine all remaining alternation statements, performing steps (1) to (5) to slice each guarded statement and loop body. If examining statement S.sub.i (in the original labeling), then P.sub.i plays the role of P (the slicing predicate). In simplifying alternations, combine guard predicates where the guarded statements are the same.

This first method, discussed in detail in FIGS. 22 and 23, has proven effective for moderately-sized code sequences, although its time complexity is O(N.sup.2). A second method, described in FIGS. 23 and 24, is faster although less comprehensive since step 2 only uses pairs where i=j-1 and step 3 is omitted. Both methods are described in detail below. The second method will probably be incorporated in analyzer 400.

FIGS. 22a and 22b disclose flow chart 2200 which describes a first method of semantic slicing of program segment S with respect to predicate P. Each statement S.sub.i in segment S is a skip, abort, composition, assignment, alternation, or repetition. The weakest precondition for each statement within segment S is calculated based upon predicate P. The rules defined above for the various statement types are followed and P.sub.N+1 is set equal to slicing predicate P.

The method begins at block 2214 where a counter i is set equal to N. Pursuant to blocks 2214 through 2220, engine 440 calculates the precondition predicate P.sub.i for each statement S.sub.i based upon a postcondition of P.sub.i+1 (i.e., backwards computation). As shown in FIG. 22a, this operation iterates through i=N to i=1. Referring to block 2218, if a statement S.sub.i is an alternation statement, control flows to block 5100 (i.e., block 2324) in FIG. 23b which will perform a slicing operation in an attempt to simplify this statement. This operation is discussed further below. After the attempted simplification is complete, control returns to block 2218 where the predicate P.sub.i is calculated for the current statement.

If a statement Si is an iterative statement, and wp.(loop body).P.sub.i+1 =P.sub.i+1, then this statement has no effect on the variables in the slicing predicate and the entire loop may be sliced. Accordingly, P.sub.i may be set equal to P.sub.i+1. However, if this first equality does not hold, then the operator searches for a logical conjunction of the P.sub.i+1 such that no variables are shared between that conjunction and the body of the loop or the guard of the loop. These conjunctions contribute directly to P.sub.i. For each variable in the loop or in the guard, a condition of the form is added: variable=free.sub.-- variable. All statements of the loop are left unsliced. For example, if P.sub.i+1 =x=7y=23 and x does not appear in the body of the loop, and the only variables which appear in the loop and guard are y and z, P.sub.i would be set equal to x=7=lambdaz, =mu. This approach results in a conservative slice. The assumption is made that any of the named variables set equal to a free variable could potentially be present in the correct weakest precondition. Therefore, slicing with this precondition results in a superset of the correct slice.

After calculating the predicate P.sub.i for each statement S.sub.i in blocks 2214-2220, control flows to blocks 2222-2232 wherein slicing operations are performed.

Referring to FIG. 22b, all statements S.sub.i contained within segment S are set as "unmarked" for slicing in block 2224. Blocks 2226-2232 attempt to identify equivalent predicate values (P.sub.i and P.sub.j) that will establish a sequence of statements that have no effect on the variables making up the predicates. Blocks 2226 and 2232 represent a first outer loop used to sequence through values of P.sub.j where j is initially set equal to N. Blocks 2228 and 2230 represent a second inner loop where predicate values P.sub.i are sequenced through and i is initially set equal to 1. Starting with the smallest i value and largest j value, P.sub.i is compared with P.sub.j in block 2230. If there is no match, i is incremented and the equivalence check is performed again. However, if there is a match, then control flows to block 2232 which marks all statements from S.sub.i to S.sub.j for slicing. After this sequence of statements have been identified, j is decremented 1 and flow returns to block 2226 to repeat this operation.

Alternatively, if there is no corresponding equivalent P.sub.i for a particular P.sub.j, then j is decremented by 1 in block 2228 and the operation repeats itself for the next j value starting at block 2226. After the j values are exhausted, (i.e., j is decremented to 1) control flows from 2226 to block 2234 where every marked statement in segment S is replaced with a skip statement. By this operation, marked statements are sliced from segment S. Upon completing the slicing operation in segment S, a secondary slicing operation may be performed on each alternation statement. This operation is carried out according to flow chart 2300 in FIGS. 23a and 23b.

In block 2310, counter i is said equal to 1 and in 2312 this value is checked to determine if all S.sub.i values have been considered. If they have, then slicing is complete and the routine may be exited via block 2320. If not, control flows to block 2314 where all skip, abort, repetition and assignment statements are passed over for secondary slicing. In which case, control flows to block 2318 where i is incremented and the exercise is repeated for the next S.sub.i.

However, if S.sub.i is an alternation statement, then engine 440 will apply an alternation slicing theorem, as shown in FIG. 23b, to further simplify this statement pursuant to block 2316. More specifically, as shown in blocks 2316 and FIG. 23a, control flows to address 5100 (i.e., block 2324) to carry out alternation slicing theorem. The alternation slicing theorem provides that for any alternation statement (i.e., "if B then S1 else S2") where none of the alternative statements (i.e., neither S1 nor S2) contain any assignment to a variable listed in the postcondition predicate, its weakest precondition is equal to its postcondition (i.e., the effect of this statement is the equivalent of a skip).

Referring to FIG. 23b, block 2324 represents the first step in the alternation slicing theorem. As indicated in this block, the weakest precondition of S.sub.i based upon a post condition of P.sub.i+1 does not equal P.sub.i+1 because the alternation statement is not a skip. Had this statement been true, alternation statement S.sub.i would have been converted into a skip in block 2234 as discussed above. In blocks 2326-2332, the weakest precondition of each statement associated with a particular guard (B) is evaluated against predicate P.sub.1+1. If the predicate is true, then statement T.sub.j is replaced with a skip as indicated in block 2328. However, if T.sub.j is not a skip statement, then statement T.sub.j is sliced with respect to predicate P.sub.i+1 by invoking a slow complete slice starting at block 2210 (i.e., address 2000) in FIG. 22a or invoking a fast slice at block 2210 (i.e., address 1000) in FIG. 24.

After this operation is complete, control flows to block 2332 where j is incremented by 1 and the analysis is repeated for the next guard and statement pair. Upon completion of the operation over all, control is returned via block 2334 to block 2318 in FIG. 23a. In block 2318, i is incremented by 1 and the next S.sub.i statement is evaluated pursuant to the foregoing discussion. Once i exceeds the value N (i.e., the number of statements S.sub.i) control flows to block 2320 which results in an exit of this routine with segment S now being sliced with respect to predicate P.

Flow chart 2400 in FIG. 24 illustrates a portion of a fast slicing operation which is similar to the first half of the slow slicing operation shown in flow chart 2200 of FIG. 22a. Referring to FIGS. 22 and 24, blocks 2210-2218 and block 2222 in both figures have the same meaning and function. In FIG. 24, however, block 2410 replaces block 2220 from FIG. 22a.

In block 2410, predicate P.sub.i (representing a precondition) is compared with predicate P.sub.i+1 (representing a postcondition). If these values are equal, then the corresponding statement has no effect on the variables defining these predicates and the statement may be sliced from the corresponding segment S. This operation is repeated for each P.sub.i value within the segment. As shown in FIG. 24, once i is decremented to a value less than 1, control flows from block 2216 to block 2222 which then passes to block 2310 at address 4100 in FIG. 23a. The remaining operation is then identical to that described above for the slow slicing operation.

As an alternative embodiment, engine 440 may create a variable list "V" to carry out the foregoing alternation slicing theorem, pursuant to FIGS. 25 and 26. Referring again to flow charts 2200 and 2400 in FIGS. 22a and 24, respectively, engine 440 may create a variable list V in block 2210. During this operation, the engine places variables named in predicate P into the list. In block 2212 of FIGS. 22a and 24, engine 440 adds new variables and deletes irrelevant variables from the list after a new computation of predicate P.sub.i. This list is updated again in block 2218 of FIGS. 22a and 24. An exemplary variable list and its association with a program under analysis is shown schematically in FIG. 25.

Referring to FIG. 25, statement 2530 is associated with variables listed in program variable list 2510 through postcondition 2520. The variables themselves are linked to each other within variable list 2510 as shown by arrows 2540 and 2550.

Referring to FIG. 26, an exemplary postcondition 2610 is shown associated with the variables contained therein; x and y. These variables are shown listed in variable list 2510. Significantly, postcondition contains four terms but only two variables. Engine 440 can analyze the previous statement S by verifying whether S contains on any left-hand side ("LHS") of a statement the variables x or y. The placement of these variables on the left-hand side of a statement indicates a new value is being assigned to them. The names of the variables are linked to the postcondition as shown in FIG. 26.

It is currently anticipated that engine 440 will not use a variable list "V" to carry out the foregoing alternation slicing theorem. Rather, engine 440 will check for the variables involved in a change of state for the s alternation slicing theorem directly from the postcondition

b. Textual Slicing

Returning to FIG. 18a, the foregoing discussion on semantic slicing covers blocks 1820 and 1826. Alternatively, the operator may choose to textual slice a selected code fragment pursuant to blocks 1822 and 1826. This operation is straightforward requiring simple textual matching of the slice criteria (i.e., "seed variables") and the program fragment. Those statements that cannot assign a value to a seed variable will simply be extracted from the slice by engine 440, as described below. This operation is analogous to a word search in a conventional word processing program.

More specifically, the operator specifies a simple textual slice by naming a set of program variables called "seed variables" and designating a code fragment as described above. The result consists of every source or IL statement lying within the designated code fragment that can assign to one or more of the seed variables. Assignments made by called procedures are included. A flow chart describing this operation is provided in FIG. 27.

Flow chart 2700 in FIG. 27 details the process of textual slicing. In block 2710, a list of all basic blocks in any path between a starting and ending statement (identified in block 1816 of FIG. 18a) is created by engine 440. This operation is a simple graph coloring procedure described in Aho. Upon creation of the list in block 2710, control flows to block 2712 to determine whether all basic blocks have been considered. If yes, the procedure is terminated at block 2714, resulting in a sliced fragment. This slice will then be the subject of flow analysis pursuant to blocks 1828-1846 in FIGS. 18a and 18b.

If all basic blocks are not done, control flows to block 2716 in FIG. 27 to determine whether all statements in a basic block have been considered. If yes, control flows to block 2718 where the next basic block is selected from the list created in block 2710. Thereafter, control returns to block 2712 to repeat the foregoing process.

If there is a statement in a basic block under consideration that has not yet been reviewed, control flows to block 2720 to determine whether such statement could possibly assign a value to one of the seed variables. The analysis in block 2720 is carried out in a conservative manner. As such, if an assignment is identified through a pointer, engine 440 assumes it can change one of the seed variables and, therefore, is not removed from the slice. Similarly, for a basic block in a subroutine, any statement that assigns to a formal parameter (i.e., identifier appearing in a procedure definition; called a "formal argument" in C and a "dummy argument" in Fortran) that can correspond to an actual parameter (i.e., argument passed to a called procedure) which is one of the seed variables is kept. Based on this conservative approach, if the subject statement can possibly assign to one of the seed variables, the statement remains in the slice and control flows to block 2724 where the next statement is selected. Alternatively, if no assignment can be made by the subject statement, control flows to block 2722 where the subject statement is removed from the slice and control then flows to block 2724.

As shown in FIG. 27, this operation is carried out repeatedly until all statements within all blocks of a selected list are reviewed and either kept within the slice or discarded.

c. Transitive Closure

Referring again to FIG. 18a, the operator may select yet a third choice for slicing criteria in the form of transitive closure as shown in block 1824. The operator specifies a slice for transitive closure by naming a set of program variables called seed variables. The result consists of every source or PIL statement, lying within the designated code fragment, which would be obtained by the series of steps outlined in FIGS. 28a and 28b.

Flow chart 2800 in FIGS. 28a and 28b describe a slicing operating based upon transitive closure. In block 2810, engine 440 creates a list of all basic blocks on any path between a starting and ending statement that was designated in block 1816 of FIG. 18a. This operation is the same as block 2710 in FIG. 27. In block 2812, engine 440 creates a list of variables representing an initial seed list as defined by the operator. Every seed variable is a current working variable. Control then flows to block 2814 where engine 440 determines whether all basic blocks have been reviewed. If yes, control flows to blocks 2816-2820, which are described below. If no, control flows to block 2822 where engine 440 determines whether all statements in the current basic block undergoing review have been considered. If yes, the next basic block is selected in block 2824 and control returns to block 2814. If no, control flows to block 2826 to determine whether the subject statement could possibly assign a value to one of the seed variables. If no, the statement is removed from the slice in block 2834 and control thereafter flows to block 2832 to select the next statement. If yes, the subject statement is maintained within the slice. In brief, every source or IL statement lying within the designated slice which causes an assignment to one or more of the seed variables is added to the current slice. Assignments made by called procedures, including their side effects, are included.

Control next flows to block 2830 where all referenced variables in the statement being kept within a slice are added to the seed list initially generated in block 2812. Additionally, a flag is set indicating that a variable has been added if at least one was not in the previous list. As a result, every program variable which appears in a statement to be included in the slice and which is not a current seed variable becomes a new seed variable. Global variables and variables declared in called procedures are included.

Engine 440 selects the next statement pursuant to block 2832 and control returns to block 2822, as described above. After all statements in all blocks have been reviewed, control flows to block 2816 to determine whether any seed variables were added to the slicing criteria. If there are no new seed variables, the result is the current slice and control flows to block 2818 which terminates the procedure and returns control to block 1828 in FIG. 18a. Alternatively, if new seed variables have been added to the seed list of block 2812, control flows to block 2820 where the list of basic blocks is reset (i.e., markers indicating that a basic block has been reviewed are cleared) and the flag indicating new seed variables have been added to the seed list is reset. In so doing, the new seed variables are converted to current seed variables and subsequently used as slicing criteria. Control then flows to point A where the slicing operation with the newly converted variables is repeated starting at block 2814.

As the foregoing illustrates, slicing is repeatedly executed under transitive closure until all possible variables, included within the affected statements, are used as slicing criteria. In so doing, a "closure" is obtained with respect to the affected variables.

The operation as described in flow chart 2800 cover blocks 1824 and 1826 in flow chart 1