System and method for lexing and parsing program annotations
6353925

Abstract

When a source program containing annotations is processed by a user-selected tool, the annotations in the source program are detected by a lexer and passed to an annotation processor corresponding to the selected tool. The system contains a number of annotation processors and a number of program processing tools, and the annotation processor to which the annotations are passed is selected based upon the user-selected tool. The selected annotation processor converts annotations compatible with the user-selected tool into annotation tokens and returns the annotation tokens to the lexer. The lexer generates tokens based upon the programming-language statements in the source program, and passes both the tokens and annotation tokens to a parser. The parser, in turn, assembles the tokens and annotation tokens into an abstract syntax tree, which is then passed to the user-selected tool for further processing.

Claims

What is claimed is:

Description

The present invention relates generally to compilers and program analyzers, and more particularly to an improved system and method for lexing and parsing computer programs that include tool-specific annotations.
Program ::= ExprA EOS ;where EOS denotes end-of-stream
ExprA ::= ExprB "+" ExprA ;addition, or
| ExprB "-" ExprA ;subtraction, or
| ExprB ;
ExprB ::= Variable ;variable value, or
| Number ;numeric value, or
| "-" ExprB ;unary minus, or
| "(" ExprA ")" ;parenthetical expression.
The tokens for this hypothetical language may be:
NUMBER(n) ;where "n" denotes a non-negative integer
IDENTIFIER(s) ;where "s" denotes a name
PLUS ;addition operator ("+")
MINUS ;subtraction operator ("-")
OPEN_PAREN ;open parenthetical expression
CLOSE_PAREN ;close parenthetical expression, and
END_OF_STREAM ;end of input stream
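The grammar and token set above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation described in this document: the names Token, lex and parse, and the tuple shapes used for tree nodes ("add", "sub", "neg", "var", "num"), are assumptions made for the example.

```python
from typing import List, NamedTuple, Union

DIGITS = "0123456789"
SINGLE_CHAR = {"+": "PLUS", "-": "MINUS", "(": "OPEN_PAREN", ")": "CLOSE_PAREN"}


class Token(NamedTuple):
    label: str                            # e.g. "NUMBER" or "PLUS"
    value: Union[int, str, None] = None   # parameter value, if the token has one


def lex(text: str) -> List[Token]:
    """Convert a character stream into tokens; whitespace yields no token."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch in DIGITS:
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            tokens.append(Token("NUMBER", int(text[i:j])))
            i = j
        elif ch.isalpha() or ch == "_":
            j = i
            while j < len(text) and (text[j].isalnum() or text[j] == "_"):
                j += 1
            tokens.append(Token("IDENTIFIER", text[i:j]))
            i = j
        elif ch in SINGLE_CHAR:
            # The lexer always emits MINUS for "-"; it does not decide
            # whether the minus is unary or binary.
            tokens.append(Token(SINGLE_CHAR[ch]))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    tokens.append(Token("END_OF_STREAM"))
    return tokens


def parse(tokens: List[Token]):
    """Recursive-descent parser for Program ::= ExprA EOS."""
    pos = 0

    def advance() -> Token:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr_a():
        left = expr_b()
        if tokens[pos].label in ("PLUS", "MINUS"):
            # A MINUS seen here, after a complete ExprB, is binary subtraction.
            op = advance().label
            return ("add" if op == "PLUS" else "sub", left, expr_a())
        return left

    def expr_b():
        tok = advance()
        if tok.label == "IDENTIFIER":
            return ("var", tok.value)
        if tok.label == "NUMBER":
            return ("num", tok.value)
        if tok.label == "MINUS":
            # A MINUS seen where an ExprB must start is unary minus.
            return ("neg", expr_b())
        if tok.label == "OPEN_PAREN":
            inner = expr_a()
            if advance().label != "CLOSE_PAREN":
                raise SyntaxError("expected CLOSE_PAREN")
            return inner
        raise SyntaxError(f"unexpected token {tok.label}")

    tree = expr_a()
    if advance().label != "END_OF_STREAM":
        raise SyntaxError("expected END_OF_STREAM")
    return tree
```

For instance, parse(lex("-x-5")) yields a subtraction node whose left operand is a negation of the variable x, even though the lexer emitted the same MINUS token for both "-" characters.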
Every token in the hypothetical language has a label such as "NUMBER" or "PLUS," and some tokens also have a parameter value, such as an integer (n) or a string (s). Now consider a particular one-line source file 102 (FIG. 1) written in the hypothetical language:

size +13

The stream of characters corresponding to this one-line source file is:

"s" "i" "z" "e" " " "+" " " "1" "3" EOS

Referring to FIG. 1, lexer 104 converts this stream of characters into the following sequence of tokens:

IDENTIFIER("size") PLUS NUMBER(13) END_OF_STREAM

The parser 130 (FIG. 1) then conceptually generates parse tree 300 (shown in FIG. 3A) from these tokens. In practice, the parser actually generates the AST data structure 132 shown in FIG. 3B. In a preferred embodiment, AST 132 does not contain the unneeded and redundant information present in parse tree 300. In FIG. 3B, AST 132 includes a program node 302, a node 304 for the addition expression, a node 306 for the variable "size", and a node 308 for the number "13."

As another example, consider a second one-line source file written in the hypothetical language described above:

-x-5

The sequence of tokens corresponding to this second one-line source file is:

MINUS IDENTIFIER("x") MINUS NUMBER(5) END_OF_STREAM

which is converted by parser 130 (FIG. 1) into parse tree 350 and AST 132 shown in FIGS. 3C and 3D, respectively. In this example, lexer 104 converts any occurrence of the character "-" into the token MINUS; however, the parser 130 may interpret MINUS as either the unary minus (negation) operator or the binary subtraction operator, depending on which tokens precede or follow the MINUS token. In this sense, parser 130 is context-sensitive whereas lexer 104 is context-free.

Comments and Whitespace

Note that in the two examples above, lexer 104 does not produce any tokens for the " " (whitespace) characters of source file 102. Most modern programming-languages are designed in that way.
In fact, most languages also allow the program to include "comments" that the programmer writes to document the source program. A standard lexer 104 also ignores comments, and therefore, comments are never processed by parser 130. This has the advantage that a programmer can include whitespace and comments anywhere in the program, as long as a comment or whitespace is not inserted between consecutive characters that make up a token. It also means that the grammar for a programming-language need not specify all places where comments or whitespace can be placed.

Comments are usually delimited by a sequence of characters that begins the comment and a sequence of characters that ends the comment. For example, in the programming-languages C++ and Java, a comment can begin with the characters "/*" and end with the characters "*/". Thus, if a lexer for Java detects the character "/" followed by the character "*" in a source file, the lexer (assuming that it does not incorporate the present invention) will ignore all following characters up until the next consecutive occurrence of the characters "*" and "/".

Annotations

Referring to FIG. 1, annotations are used by tools 122. Each tool 122 may have a set of annotations that it recognizes and supports. The annotations are placed in the source file along with the programming-language statements. A standard lexer, one that does not incorporate the present invention, treats the annotations as comments and does not process them. However, in accordance with the present invention, the lexer 104 is modified to either (A) send all comments to an annotation processor 124 for processing, or (B) recognize the beginning of a string in a comment that appears to represent an annotation, and pass that string to the annotation processor 124. As indicated earlier, each tool 122 (FIG. 1) in the back-end 120 may use a different set of annotations than the other tools.
If a program (source file) contains annotations for use with more than one tool, it may contain annotations not recognized by the user-specified tool. Stated in another way, each tool 122 only processes the annotations that belong to the set of annotations recognized by the particular tool 122. Each tool 122 is preferably coded to ignore annotations in the AST that are not supported by the tool. Furthermore, each annotation processor 124 is preferably coded to return NULL values to the lexer 104 for annotations that are not supported by the corresponding tool 122. As a simple way to define which comments are to be interpreted as annotations, the annotation language of a tool 122 may say that any comment whose first character is the character "@" is an annotation. Thus, for example, the input program fragment: "/* this is a comment */" would be considered a comment that is ignored whereas the input program fragment: "/*@ this is an annotation */" would be considered an annotation. One of skill in the art will recognize that there are many schemes, in addition to the /*@*/ example described above, for distinguishing annotations, which are processed by an annotation processor, and comments, which are simply ignored. The following are examples of annotations that are useful in particular tools 122: /*@ NON_NULL*/ /*@ FREQUENTLY_USED */ /*@ EVEN */ /*@ INVARIANT x<y+10*/ /*@ This is a comment used in some special way by a program documentation system */ /*@ DEPENDS a[t: T] ON c[t] */ In most systems, each tool that uses annotations has a custom designed lexer and parser that are used only with that tool. As discussed above, in the present invention the lexer 104 and parser 130 are generic and are used with all the tools (or at least a set of several tools) for processing programs written in a particular programming-language. 
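The "@" convention and the NULL-return behavior described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names annotation_text, make_processor and handle_comment, the tool names "null_checker" and "optimizer", and the tuple used as an annotation token are all hypothetical, and each processor here recognizes an annotation only by its first word.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

AnnotationToken = Tuple[str, str]
Processor = Callable[[str], Optional[AnnotationToken]]


def annotation_text(comment: str) -> Optional[str]:
    """Return the annotation body if the comment starts with "@", else None."""
    if len(comment) < 4 or not (comment.startswith("/*") and comment.endswith("*/")):
        raise ValueError("not a /* ... */ comment")
    body = comment[2:-2]
    if body.startswith("@"):
        return body[1:].strip()
    return None  # ordinary comment


def make_processor(supported: FrozenSet[str]) -> Processor:
    """Build a (hypothetical) annotation processor for one tool."""
    def process(text: str) -> Optional[AnnotationToken]:
        words = text.split()
        label = words[0] if words else ""
        # Unsupported annotations yield None, the analogue of a NULL value.
        return ("ANNOTATION_TOKEN", label) if label in supported else None
    return process


# Hypothetical tools, each with its own annotation processor.
PROCESSORS: Dict[str, Processor] = {
    "null_checker": make_processor(frozenset({"NON_NULL"})),
    "optimizer": make_processor(frozenset({"FREQUENTLY_USED", "EVEN"})),
}


def handle_comment(comment: str, tool: str) -> Optional[AnnotationToken]:
    """Ignore plain comments; route annotations to the selected tool's processor."""
    text = annotation_text(comment)
    if text is None:
        return None
    return PROCESSORS[tool](text)
```

With these definitions, the comment "/*@ EVEN */" produces an annotation token when the selected tool is the hypothetical "optimizer" and produces nothing when the selected tool is "null_checker".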
When new tools are developed, or new types of annotations are developed for an existing tool, the lexer 104 and parser 130 remain unchanged, since annotation lexing and parsing have been compartmentalized and delegated to the annotation processors 124.

Annotation Tokens

Referring to FIG. 1, the present invention introduces the concept of an "annotation token." An annotation token is like a token in that it has a label and can be passed by lexer 104 to parser 130. An annotation token is distinguished from other tokens in that its parameter value can be not only a simple integer or string, but also a more complex structure, for example, an abstract syntax tree. Furthermore, the structure of the annotation token is not defined by the lexer 104 or parser 130, but rather by a specific tool 122. That is, the lexer 104 never "looks inside" an annotation token, and is not dependent upon the internal structure of the annotation token. This lets the lexer 104 remain independent of the tools 122.

Generating Annotation Tokens

As mentioned above, when lexer 104 detects an annotation (or a comment that might contain an annotation), it passes the annotation to an annotation processor 124. Annotation processor 124 takes annotations 106 as input and returns annotation tokens 126 to the lexer. Lexer 104 passes the annotation tokens received from annotation processor 124 to parser 130.

FIG. 2 shows the details of one embodiment of an annotation processor 234 (124 FIG. 1). An annotation lexer 236 receives an annotation from the lexer 226. The annotation lexer determines the lexical content of the annotation and passes one or more tokens to annotation parser 238. Annotation parser 238 generates an annotation token based upon the tokens passed to it by the annotation lexer 236. This annotation token is then returned to lexer 226. Note that, with this structure, lexer 226 (104 FIG.
1) does not need to know all possible annotation tokens that can be returned by annotation processor 234. Lexer 226 simply passes the annotation tokens to parser 228 as it would any other token. In some embodiments, the annotation processor 124 for some tools may have a combined lexer and parser. This combined lexer/parser is preferred when all annotations defined for the tool are extremely simple in structure, each annotation typically consisting of a label or a label and one or two parameters. For more complex annotations, the separate lexer and parser arrangement shown in FIG. 2 is preferred. The present invention works most cleanly when the annotation processor 124 is context-free. That is, the annotation processor 124 produces annotation tokens according to the given annotation 106, without regard to where in the source file 102 the annotations 106 occur. Although the annotation processor 124 is preferably context-free, the context of the annotation in source file 102 may have meaning because the context of the annotation in source file 102 will affect how an annotation token 126, corresponding to the annotation 106, is assembled into AST 132 by parser 130 and processed by tools 122. Put more simply, the position of each annotation token in the sequence of tokens sent by the lexer 104 to the parser 130 will provide context information for the annotation. As a simple example, consider a programming-language whose syntax is given by the following grammar:
Program ::= Statements EOS
Statements ::= Statement ";" Statements
| Statement
Statement ::= "VAR" Variable "IN" Statements "END"
| Variable ":=" Expr ;variable assignment
Expr ::= Number ;numeric value, or
| Variable ;variable value
| Expr "+" Expr ;addition
Now, consider a simple example in which a tool 122 allows a variable declaration to be annotated to, for example, indicate that the variable declared is frequently used, and that allows an assignment to be annotated to indicate that the numeric value assigned to the variable is even or will be frequently used in the rest of the program, or both even and frequently used. To allow the use of such annotations, the portion of the grammar for the programming-language for defining a Statement, where G* denotes any number of occurrences of G's (including none), is modified to read as follows:
Statement ::= "VAR" Annotation* Variable "IN" Statements "END"
| Annotation* Variable ":=" Expr
Annotation ::= FrequentlyUsed | Even
where FrequentlyUsed and Even denote the respective annotations. Using our invention, the precise grammar for annotations is known only to the tool-specific annotation processor 234 (FIG. 2); the non-tool-specific lexer 226 and parser 228 treat Annotation as denoting any annotation (token). We allow multiple annotations for a given statement to allow a variable to be declared both frequently used and even. It is the job of the tool 122 (FIG. 1) to disallow the use of an Even annotation on a variable declaration. Note that this factoring of the grammar allows us to change the set of legal annotations later without changing the non-tool-specific lexer 226 and parser 228, so long as the new annotations can only appear in the same places as the old annotations.

In accordance with the programming-language grammar specified above, the tokens that can be returned by the lexer are:

NUMBER(n)
IDENTIFIER(s)
PLUS
VAR
IN
END
BECOMES ;the token for ":="
SEMICOLON
END_OF_STREAM

and the forms of annotation tokens that the annotation processor (and thus also the lexer) can return are:

ANNOTATION_TOKEN(FREQUENTLY_USED)
ANNOTATION_TOKEN(EVEN)

Thus, consider the following annotated program that is written in the programming-language defined above:

VAR x IN
x := 5;
VAR /*@ FREQUENTLY_USED */ y IN
y := x+3;
/*@ EVEN */ y := y+2;
/*@ FREQUENTLY_USED */ x := 1;
y := y+x+y+x+y+x
END
END

The lexer 104, after returning the second VAR token and upon recognizing the characters "/*@", will generate a substream consisting of the following characters:

"F" "R" "E" "Q" "U" "E" "N" "T" "L" "Y" "_" "U" "S" "E" "D" " " EOS

This substream is sent by the lexer 104 to the annotation processor 124. The annotation processor 124 will then produce the following annotation token:

ANNOTATION_TOKEN(FREQUENTLY_USED)

This annotation token is returned by the annotation processor 124 to lexer 104, and lexer 104 passes it on to parser 130.
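The hand-off between the generic lexer and a tool-specific annotation processor can be sketched as follows. This is an illustrative sketch only: the function names lex and simple_processor and the (label, value) tuple representation of tokens are assumptions for the example. The lexer takes the annotation processor as a parameter and appends whatever token it returns without inspecting it.

```python
from typing import Any, Callable, List, Optional, Tuple

Token = Tuple[str, Any]
DIGITS = "0123456789"
KEYWORDS = {"VAR", "IN", "END"}


def lex(source: str,
        annotation_processor: Callable[[str], Optional[Token]]) -> List[Token]:
    """Generic lexer: delegates the text of every "/*@ ... */" comment."""
    tokens: List[Token] = []
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            i += 1
        elif source.startswith("/*", i):
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated comment")
            body = source[i + 2:end]
            if body.startswith("@"):
                # The lexer never looks inside the returned token.
                token = annotation_processor(body[1:])
                if token is not None:
                    tokens.append(token)
            i = end + 2  # ordinary comments produce no token
        elif source.startswith(":=", i):
            tokens.append(("BECOMES", None))
            i += 2
        elif ch == "+":
            tokens.append(("PLUS", None))
            i += 1
        elif ch == ";":
            tokens.append(("SEMICOLON", None))
            i += 1
        elif ch in DIGITS:
            j = i
            while j < n and source[j] in DIGITS:
                j += 1
            tokens.append(("NUMBER", int(source[i:j])))
            i = j
        elif ch.isalpha():
            j = i
            while j < n and (source[j].isalnum() or source[j] == "_"):
                j += 1
            word = source[i:j]
            tokens.append((word, None) if word in KEYWORDS else ("IDENTIFIER", word))
            i = j
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    tokens.append(("END_OF_STREAM", None))
    return tokens


def simple_processor(text: str) -> Optional[Token]:
    """Hypothetical context-free processor for a tool supporting two annotations."""
    label = text.strip()
    if label in {"FREQUENTLY_USED", "EVEN"}:
        return ("ANNOTATION_TOKEN", label)
    return None
```

Because the processor is passed in as an argument, supporting a different tool in this sketch means supplying a different callable; the lex function itself stays unchanged.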
After lexer 104 reaches the second FREQUENTLY_USED annotation, lexer 104 will pass a substream also consisting of the characters above to the annotation processor 124, which will again return:

ANNOTATION_TOKEN(FREQUENTLY_USED)

It is noted that the same annotation token is returned, even though the parser 130 will use the annotation token in different ways in these two cases. The first FREQUENTLY_USED annotation applies to a variable that is declared, whereas the second annotation applies to the result of an assignment statement. Accordingly, the two FREQUENTLY_USED annotations are assembled by the parser 130 into the AST in a context-sensitive manner and processed by a tool that supports the FREQUENTLY_USED annotation.

In the example above, annotation processor 124 is quite simple. In general, however, annotation processor 124 may construct more complex annotation tokens. For example, to create an annotation token for the INVARIANT annotation:

/*@ INVARIANT x<y+10 */

the annotation processor must parse the expression that follows the keyword INVARIANT to generate the annotation token depicted in FIG. 4A. More specifically, for tools using complex annotations of this type, the annotation processor will preferably include a lexer 236 (FIG. 2) that converts the annotation text into a sequence of tokens, and a parser 238 that assembles the tokens into an abstract syntax tree in accordance with the grammar of the "annotation language" for the tool.

Even a complex annotation token such as that depicted in FIG. 4A is not processed by lexer 104 (FIG. 1) or parser 130. Rather, parser 130 assembles the annotation token into an AST without "looking inside" the annotation token. Then, when the parser passes the AST to the back-end, a tool capable of processing the INVARIANT annotation analyzes the token depicted in FIG. 4A.

Tools may support annotations that are highly complex mathematical formulas. For example, a tool may support the annotation: /*@ x=quad(a,b,c) */.
In such an example, lexer 104, noting the /*@ */ structure, will pass the annotation to an annotation processor 124. An annotation processor that supports the quadratic function will then generate an abstract syntax tree in accordance with the quadratic equation: ##EQU1## For tools that utilize complex annotation tokens, preferred embodiments of the annotation processor include an annotation lexer and an annotation parser.

EXAMPLES

The advantages of the system and methods of the present invention can further be illustrated by considering the following examples.

Example 1

Consider the hypothetical programming-language:
P ::= ε | S ";" P
S ::= ε
| "var" X
| X "=" E
| "if" E "then" S1 "else" S2
E ::= X
| E1 "+" E2
| E1 "-" E2
| E1 "÷" E2
| E1 "*" E2
where ε represents a null element and X represents a variable name such as x. Now consider the following two-line program written in the hypothetical programming-language:

var x_j ;
x = x + x_j ;

A parser will build the AST depicted in FIG. 4B. Now, suppose that we desire to change the programming-language to support context-sensitive annotations such as:

var x annotation;

and

annotation;

These annotations are context-sensitive in the sense that in the first case they apply to the variable immediately preceding them and in the second case they act as a new kind of statement. Such annotations may be used to convey special meaning to a back-end tool such as an error-checker. For example, the annotation "/*@ non_null */", when placed in a variable declaration, might mean that the declared variable should never be assigned a null value, and the annotation "/*@ assert x>0 */", placed where a statement could go, might instruct the error checker to make sure that x is greater than 0 whenever the program reaches that point. To support such annotations, the grammar of the hypothetical programming-language may be modified to read as follows:
P ::= ε | S ";" P
S ::= ε
| Annotation
| "var" X Annotation* ;
| X "=" E
| "if" E "then" S1 "else" S2
E ::= X
| E1 "+" E2
| E1 "-" E2
| E1 "÷" E2
| E1 "*" E2
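A parser for the modified grammar above can attach annotation tokens to the tree without examining their contents. The following is a minimal sketch under stated assumptions: the function name parse_program, the token labels (VAR, IDENTIFIER, EQUALS, SEMICOLON, ANNOTATION_TOKEN, and so on) and the tuple-shaped tree nodes are hypothetical; the "if" statement is omitted for brevity; binary operators are grouped left to right because the grammar does not specify precedence; and the token list is assumed to end with END_OF_STREAM.

```python
from typing import Any, List, Tuple

Token = Tuple[str, Any]
OPERATORS = ("PLUS", "MINUS", "DIVIDE", "TIMES")


def parse_program(tokens: List[Token]):
    """Parse P ::= (S ";")* for a subset of S, keeping annotations opaque."""
    pos = 0

    def peek() -> str:
        return tokens[pos][0]

    def take(label: str) -> Any:
        nonlocal pos
        kind, value = tokens[pos]
        if kind != label:
            raise SyntaxError(f"expected {label}, got {kind}")
        pos += 1
        return value

    def expr():
        node = ("var_ref", take("IDENTIFIER"))
        while peek() in OPERATORS:
            op = peek()
            take(op)
            node = (op, node, ("var_ref", take("IDENTIFIER")))
        return node

    def statement():
        if peek() == "ANNOTATION_TOKEN":
            # Stand-alone annotation used as a statement; payload is untouched.
            return ("annotation_stmt", take("ANNOTATION_TOKEN"))
        if peek() == "VAR":
            take("VAR")
            name = take("IDENTIFIER")
            annotations = []
            while peek() == "ANNOTATION_TOKEN":
                annotations.append(take("ANNOTATION_TOKEN"))
            return ("var_decl", name, annotations)
        if peek() == "IDENTIFIER":
            target = take("IDENTIFIER")
            take("EQUALS")
            return ("assign", target, expr())
        return ("empty",)  # S ::= ε

    statements = []
    while peek() != "END_OF_STREAM":
        statements.append(statement())
        take("SEMICOLON")
    return ("program", statements)
```

In this sketch the parser stores each annotation payload exactly as received, so a payload could be a string, a dictionary, or an entire tree built by a tool-specific annotation processor; only the tool that later walks the tree needs to understand it.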
The two-line program written in the hypothetical programming-language may then read:

var x_j annotation;
x = x + x_j ;

When the lexer detects the annotation, it will send it to the appropriate annotation processor. The annotation processor will generate an annotation token and return the annotation token to the lexer. The lexer will pass the annotation token along with the tokens generated by the lexer to the parser. The parser will then generate the AST depicted in FIG. 4C. The annotation token assembled into the AST by the parser will not be processed until the AST is passed to the appropriate tool. Thus, the lexer need only be recoded to the extent that it distinguishes "annotations" from comments in order to support the newly modified hypothetical language.

Example 2

In practice, each type of annotation often makes sense only when placed in certain annotation slots of the modified programming-language grammar. For example, the non_null annotation is meaningless when used as a statement, and since assert annotations act like statements, it does not make much sense to allow them to be attached to variable declarations. While tool 122 can scan an AST and complain about ill-placed annotations, it is easier, in such cases, to put the information about where annotations can occur directly in the non-tool-specific grammar. For example:
P ::= ε | S ";" P
S ::= ε
| StatementAnnotation
| "var" X DeclAnnotation* ;
| X "=" E
| "if" E "then" S1 "else" S2
E ::= X
| E1 "+" E2
| E1 "-" E2
| E1 "÷" E2
| E1 "*" E2
Here, we have divided up the set of possible annotations into those that may be attached to variable declarations (DeclAnnotations) and those that may be used like statements (StatementAnnotations). To handle this, annotation tokens now contain information about what kind they are (DeclAnnotation or StatementAnnotation). The non-tool-specific lexer 104 works as before. The non-tool-specific parser 130 is modified to use this kind information when generating parse trees. It ignores all other information in annotation tokens. This means that we may change the set of annotations in any way without changing the non-tool-specific lexer 226 or parser 228, so long as every annotation can appear only where DeclAnnotation appears in the grammar or where StatementAnnotation appears in the grammar. For most programming languages, given a reasonable choice of grammar slots, this limitation is seldom an issue. Putting information about the kinds of annotations into the grammar also has the advantage of making the parser's job easier, because the parser may need to do less lookahead to determine what it is seeing; the kind information may also enable the parser to produce better error messages.

Example 3

The system and method of the present invention are particularly advantageous when annotations are represented as object-oriented classes. Referring to FIG. 5A, consider a particular tool that represents annotations as subclasses 510 of a class named BASE_CLASS 500. Because a subclass can be used anywhere its superclass is required, the lexer 104 and parser 130 need deal with annotation tokens only of type BASE_CLASS. Moreover, if new annotations are added later that require new classes, we can avoid having to make any change to the original lexer and parser by making the new classes subclasses of BASE_CLASS. Referring to FIG. 5B, in some embodiments, the original lexer may be designed to recognize multiple kinds of annotations (e.g., Example 2).
In this case, it is most advantageous to have a separate base class 550 for each kind of annotation. Thus, for example, all annotations of variable declarations might be subclasses of BASE_CLASS 1 and all stand-alone statement annotations might be subclasses of BASE_CLASS 2. Here, the lexer 104 and parser 130 need deal with annotation tokens only of types BASE_CLASS 1 and BASE_CLASS 2. This means that they do not need to be changed as new annotation subclasses are added to the two base classes. The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that variations and modifications may be made to the described embodiments, with the attainment of all or some of the advantages. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the spirit and scope of the invention.
