Software implemented method for automatically validating the correctness of parallel computer programs6286130Abstract A software-implemented method for validating the correctness of parallel computer programs, written in various programming languages, with respect to these programs' corresponding sequential computer programs. Validation detects errors that could cause parallel computer programs to behave incorrectly or to produce incorrect results, and is accomplished by transforming these parallel computer programs under the control of a general purpose computer and sequentially executing the resulting transformed programs. The validation method is system-independent and is portable across various computer architectures and platforms since validation is accomplished via program transformation; thus, the method does not depend on the features of a particular hardware architecture or configuration, operating system, compiler, linker, or thread environment. The input to the validation method is a parallel computer program. The parallel computer program results from annotating its corresponding sequential computer program with a parallelism specification; the annotations describe constraints in the sequential program to be relaxed to allow the exploitation of parallelism. Validation is accomplished by detecting semantic inconsistencies between the parallel computer program and its corresponding sequential computer program. The validation method translates the input parallel computer program into a second, sequential computer program such that the second sequential computer program, when executed, detects and reports the semantic inconsistencies between the parallel computer program and its corresponding sequential computer program. Claims We claim: Description COPYRIGHT NOTICE
C$PAR PARALLEL SHARED(Y,Z) LOCAL(X)
C$PAR PDO STATIC
DO I = 1, N
X(I) = Y(I)
END DO
C$PAR PDO DYNAMIC
DO I = 1, N
Z(I) = X(I)
END DO
C$PAR END PARALLEL
LOCAL_VAR_COMM_PRVCOM--Similar to LOCAL_VAR_COMM except for private COMMON variables instead of LOCAL variables. Construct Nesting ORDERED_TWICE--A C$PAR ORDERED SECTION was executed more than once in a particular iteration of a C$PAR PDO. ORDERED_MISSED_ONE--A C$PAR ORDERED SECTION was not executed in all iterations of a C$PAR PDO. NESTED_WORKSHARING--An attempt was made to execute a C$PAR PDO or C$PAR SINGLE PROCESS when one was already active. BARRIER_IN_WORKSHARING--An attempt was made to execute a C$PAR BARRIER when a C$PAR PDO or C$PAR SINGLE PROCESS was already active. WS_OR_BAR_IN_SYNC--An attempt was made to execute a C$PAR PDO, C$PAR SINGLE PROCESS, or C$PAR BARRIER within an active C$PAR CRITICAL SECTION or C$PAR ORDERED SECTION. I. Overall Structure FIG. 1 illustrates the overall structure of the preferred method of the present invention. The primary input is a parallel computer program 100, which results from annotating its corresponding sequential computer program with a parallelism specification. In addition, input is also taken from a program database 105, which comprises information about the structure, procedures, and variables of input program 100. Input program 100 is augmented via a translation means 110 to produce an instrumented, sequential computer program 115. In the preferred embodiment, translation means 110 performs a source-to-source transformation of input program 100 to produce instrumented program 115, and is implemented by a general purpose computer having a memory and operating under the control of a computer program. Translation means 110 also causes information about input program 100 to be fed back into program database 105. Instrumented program 115 is linked with a runtime support library 120 by a general purpose computer's linker 125 to produce an executable, sequential computer program 130. Sequential execution (135) of executable program 130 on a general purpose computer produces a simulator output 140. Simulator output 140 is processed by a postprocessor 145, with additional input from program database 105, and produces a semantic inconsistency report 150. Report 150 describes the semantic inconsistencies between input program 100 and its corresponding sequential computer program. II. Program Database Program database 105 stores information about the various procedures in input program 100. Postprocessor 145 uses this information to translate simulator output 140 into report 150 so that semantic inconsistencies are reported in a useful, readable format. In the preferred embodiment, program database 105 is implemented as a disk-based persistent store with a hierarchical, directory/file structure that is organized into global and per-procedure information areas. Runtime library 120 represents procedure names, symbol (variable) names, names of COMMON blocks, names of C$PAR CRITICAL SECTIONs, names of C$PAR ORDERED SECTIONs, and other names associated with identifiers in input program 100 as symbol numbers. Symbol numbers are integers that uniquely identify these names (on a per-procedure basis, except for global names such as procedure or COMMON block names). The correspondence between these names and numbers is stored and kept up-to-date in program database 105. Similarly, the parallel programming constructs in each procedure are assigned construct numbers by translation means 110. These construct numbers are stored in program database 105; runtime library 120 represents particular instances of constructs in input program 100 via their construct numbers. III. Translation Means Translation means 110 produces instrumented program 115 via a source-to-source translation of input program 100. In the preferred embodiment of the present invention, translation means 110 inserts calls to runtime library 120 to process a stream of simulator input opcodes generated by instrumented program 115. This opcode stream is presented to the runtime library by adding instructions to instrumented program 115 that store simulator input opcodes in an opcode buffer, and by inserting runtime library calls that pass this buffer as an argument at the appropriate points in input program 110. Simulator input opcodes: Table 1 presents a list of simulator input opcodes that are introduced by translation means 110 and processed by runtime library 120. Many of the opcodes correspond directly to particular C$PAR directives. This list defines the set of program events recognized during simulation and required in order to identify semantic inconsistencies in input program 100. Table 2 illustrates the operands that are associated with each of the simulator input opcodes listed in Table 1. Opcodes and their arguments are encoded into words in an opcode buffer, for presentation to runtime library 120, in one of two forms. In form A, an opcode and up to two small-valued arguments are encoded in the first buffer word. In form B, an opcode and one larger-valued argument are encoded in the first buffer word. In either form, subsequent, extra buffer words are used for additional or larger-valued arguments. For the READn, WRITEn, S_READn, and S_WRITEn opcodes in Table 2, n is a non-negative integer, the size is the number of elements being read or written, the element size is 2.sup.n bytes, and the stride is the number of bytes between successive elements. For scalar operations such as READ, size=1 and stride=1 are implied. FIG. 2 is a flowchart of the algorithm performed by translation means 110 for each procedure in input program 100. Step 200 allocates and initializes global data structures to be used in the translation of the current procedure. Step 205 performs several program transformations on the procedure to facilitate the translation process, including: converting ELSEIF statements into ELSE clauses containing IF statements; restructuring multiple ENTRY points into a single entry combined with branches back to the original ENTRY points; and splitting function calls out of statement expressions and storing their return values in temporary variables. Step 210 divides the procedure into basic blocks. Steps 215 and 220 insert a initialization library call into the main program with an argument that specifies the name of program database 105. Step 225 processes all the procedure's ENTRY points, step 230 processes all the statements in each of the procedure's basic blocks (as determined by step 210), and step 235 processes each of the procedure's exit or return points. Step 240 inserts a declaration for the opcode buffer used in steps 225-235. In the preferred embodiment, this buffer is an array of words, the size of which is determined by a buffer size computed in steps 225-235; the buffer is placed in a COMMON block so that it may be used in multiple procedures.
TABLE 1
opcode meaning
Simulator Input Opcodes (part 1 of 2)
CALL procedure call describes call site in calling procedure
ENTRY procedure entry, beginning of called procedure
EXIT procedure exit, end of called procedure
STACKSIZE estimates local stack size requirements
STOP end of main program, STOP statement, etc.
IO I/O statement
POISON_IO incompletely-analyzed I/O statement (e.g., with
implied DO)
MALLOC ALLOCATE statement
FREE FREE statement
LINE source position change
NOP no-operation
RENTRY C$PAR PARALLEL
REXIT C$PAR END PARALLEL
LENTRY C$PAR PDO
LEXIT C$PAR END PDO
IENTRY beginning of parallel loop iteration
LAST_IENTRY beginning of last parallel loop iteration
IEXIT end of parallel loop iteration
SENTRY C$PAR SINGLE PROCESS
SEXIT C$PAR END SINGLE PROCESS
CENTRY C$PAR CRITICAL SECTION
CEXIT C$PAR END CRITICAL SECTION
Simulator Input Opcodes (part 2 of 2)
OENTRY C$PAR ORDERED SECTION
OEXIT C$PAR END ORDERED SECTION
BARRIER C$PAR BARRIER
EXT_LOCK CALL LOCK
EXT_UNLOCK CALL UNLOCK
EXT_BARRIER CALL BARRIER
INIT write variable but ignore semantic inconsistencies
AUTO declaration of stack-allocated local variable
LOCAL C$PAR PARALLEL, LOCAL declaration
FIRST_LOCAL C$PAR PDO, FIRSTLOCAL declaration
LAST_LOCAL C$PAR PDO, LASTLOCAL declaration
IP_NEW C$PAR NEW for C$PAR INSTANCE TASK
TC_NEW or for C$PAR INSTANCE PARALLEL
IP_COPYNEW C$PAR COPY NEW for C$PAR INSTANCE TASK
TC_COPYNEW or for C$PAR INSTANCE PARALLEL
READ read variable
WRITE write variable
PDO_WRITE write C$PAR PD6 loop index variable
READn read vector with stride, vectorized loop
WRITEn write vector with stride, vectorized loop
S_READn read vector with stride, summarized loop
S_WRITEn write vector with stride, summarized loop
Global data structures: The following data structures are allocated and initialized in step 200. The symbol sets store sets of variables for a procedure, ENTRY point, C$PAR PARALLEL region, or C$PAR PDO with the properties specified below. Opcode Buffer buffer index--initially zero, the most recently filled position in the opcode buffer for a particular basic block buffer size--initially zero, the maximum value of the buffer index over all basic blocks, for step 240 Procedure Symbol Sets procedure AUTOs--all stack-allocated or AUTOMATIC variables procedure globals--all global variables accessible by the procedure procedure instance task COMMONs--all variables in C$PAR INSTANCE TASK private COMMON blocks procedure newed private COMMONs--all variables in C$PAR INSTANCE TASK or C$PAR INSTANCE PARALLEL private COMMON blocks that have C$PAR NEW specified in the procedure's declarations procedure copy newed private COMMONs--all variables in C$PAR INSTANCE TASK or C$PAR INSTANCE PARALLEL private COMMON blocks that have C$PAR COPY NEW specified in the procedure's declarations ENTRY Symbol Sets this entry formals--all formal parameters for a particular ENTRY point C$PAR PARALLEL region symbol sets: region privates--all variables on the region's LOCAL list region escaped privates--subset of region privates that are passed off as actual parameters in calls to other procedures within the region region newed private COMMONs--all variables in C$PAR INSTANCE TASK or C$PAR INSTANCE PARALLEL private COMMON blocks that have C$PAR NEW specified for the region region copy newed private COMMONs--all variables in C$PAR INSTANCE TASK or C$PAR INSTANCE PARALLEL private COMMON blocks that have C$PAR COPY NEW specified for the region C$PAR PDO Symbol Sets PDO first locals--all variables on the PDO's FIRSTLOCAL list PDO last locals--all variables on the PDO's LASTLOCAL list Procedure entries: FIG. 3 is a flowchart of the algorithm performed by step 225 to process procedure entries. Step 300 determines if there are any ENTRY points left to process; if not, the algorithm terminates. Step 305 initializes the set of this entry formals for the current ENTRY point. Steps 310-330 add opcodes to the opcode buffer at the ENTRY. Steps 310 and 315 add the LINE and ENTRY opcodes. Step 320 adds TC_NEW opcodes for the set of procedure instance task COMMONs, TC_NEW and IP_NEW opcodes for the set of procedure newed private COMMONs, and TC_COPYNEW and IP_COPYNEW opcodes for the set of procedure copy newed private COMMONs. Step 325 adds AUTO opcodes for the set of procedure AUTOs, and READ opcodes for all variables, found in the bounds expressions of the declarations for the set of this entry fornals, that are in the set of procedure globals or the set of this entry formals. Step 330 adds a STACKSIZE opcode with an argument that specifies the total size of all variables in the set of procedure AUTOs. Step 335 inserts a simulator library call, with arguments that specify the opcode buffer and current buffer index, and resets the buffer index to zero. Basic blocks: FIG. 4 is a flowchart of the algorithm performed by step 230 to process basic blocks. Step 400 determines if there are any basic blocks left to process; if not, the algorithm terminates. Step 405 determines if there are any statements left to process in the next basic block; if so, step 410 adds a LINE opcode for the next statement. Step 415 determines if the statement is a C$PAR directive or a program statement. For directives, step 420 adds a directive opcode (RENTRY, REXIT, LENTRY, LEXIT, SENTRY, SEXIT, CENTRY, CEXIT, OENTRY, OEXIT, or BARRIER) corresponding to the directive. Steps 425 and 430 handle C$PAR PARALLEL region entry directives, steps 435 and 440 handle C$PAR PDO directives by adding FIRST_LOCAL opcodes for the set of PDO first locals, and steps 445 and 450 handle C$PAR END PDO directives by adding LAST_LOCAL opcodes for the set of PDO last locals. For program statements, steps 455 and 460 handle the beginning of loop iterations by adding IENTRY opcodes, steps 465 and 470 handle the end of loop iterations by processing ENDDO statements, steps 475 and 480 handle I/O statements, and step 485 handles the remaining types of statements. Step 490 inserts a simulator call as in step 335. Procedure exits: FIG. 5 is a flowchart of the algorithm performed by step 235 to process procedure exits. Step 500 determines if there are any exit or return points left to process; if not, the algorithm terminates. If steps 505 and 510 determine that the exit point is not a program exit, step 515 adds an EXIT opcode denoting the end of the procedure and step 520 inserts a simulator call as in step 335. For program exits, step 525 adds a STOP opcode, step 530 inserts a simulator call as in step 335, and step 535 inserts a termination library call. Directives and statements: FIG. 6 is a flowchart of the algorithm performed by step 430 to process a C$PAR PARALLEL region entry. Step 600 adds private COMMON opcodes as in step 320: TC_NEW opcodes for the set of procedure instance task COMMONs, TC_NEW and IP_NEW opcodes for the set of region newed private COMMONs, and TC_COPYNEW and IP_COPYNEW opcodes for the set of region copy newed private COMMONs. Step 605 adds LOCAL and READ opcodes as in step 325: LOCAL opcodes for the set of region privates and the set of region escaped privates, and READ opcodes for all variables found in the bounds expressions of the declarations for the set of region privates. Step 610 adds a STACKSIZE opcode, as in step 330, with an argument that specifies the total size of all variables in the set of region privates plus the set of PDO first locals and the set of PDO last locals for each C$PAR PDO within the C$PAR PARALLEL region. FIG. 7 is a flowchart of the algorithm performed by step 470 to process an ENDDO statement. Step 700 performs processing as if step 485 had processed a DO statement. Steps 705 and 710 add an IEXIT opcode if the ENDDO is associated with a C$PAR END PDO directive. FIG. 8 is a flowchart of the algorithm performed by step 480 to process an I/O statement. Step 800 determines if any implied DO loops or function calls appear in the statement; if not, step 805 adds an IO opcode and step 810 performs processing as if step 485 had processed the I/O statement. Otherwise, step 815 inserts a simulator call as in step 335, and step 820 adds a POISON_IO opcode. Statements and opcodes: FIG. 9 is a flowchart of the algorithm performed by steps 485, 700, and 810 to process the variable references in a statement. The algorithm traverses the items in the statement's expression in operator and operand precedence order. Step 900 determines if there are any items left to process during this traversal; if not, the algorithm terminates. Steps 905 and 910 handle ALLOCATE statements and MALLOC calls by adding MALLOC opcodes. Steps 915 and 920 handle DEALLOCATE statements and FREE calls by adding FREE opcodes. Step 925 determines if the item is a symbol (variable) reference; if so, step 930 determines if the variable is being read or written. For reads, step 935 adds a READ opcode; otherwise, step 940 determines if the write is for a C$PAR PDO loop index variable. If not, step 945 adds a WRITE opcode; otherwise, step 950 adds a PDO_WRITE opcode. FIG. 10 is a flowchart of the algorithm performed by steps 310-330, 410, 420, 440, 450, 460, 515, 525, 600-610, 710, 805, 820, 910, 920, 935, 945, and 950 to add the specified simulator input opcode (see Table 1) to the opcode buffer in preparation for adding a simulator call (as in step 335). Step 1000 allocates a new buffer index for which the opcode will be stored. Step 1005 encodes the opcode word (see Table 2 and its associated explanation of opcode encoding forms A and B). Step 1010 determines if the opcode is IENTRY; if so, step 1015 encodes a second word with a LAST_IENTRY opcode, and step 1020 inserts statements to store either the IENTRY word or the LAST_IENTRY word in the buffer depending on whether the enclosing loop is in its last iteration. For most opcodes, step 1025 inserts a statement to store the encoded opcode word in the buffer. Step 1030 determines if any additional arguments need to be stored in subsequent words following the opcode word in the buffer (see Table 2 for additional arguments for each opcode). If not, the algorithm terminates; otherwise, step 1035 allocates a new buffer index, and step 1040 inserts a statement to store the next argument word in the buffer. FIG. 11 is a flowchart of the algorithm performed by steps 1000 and 1035 to allocate a new buffer index. Step 1100 increments the buffer index, and steps 1105 and 1110 set the buffer size to the maximum of the current buffer size and the new buffer index. IV. Runtime Library: A. Introduction, Data Structures, Initialization In the preferred embodiment of the present invention, calls to procedures in runtime library 120 are inserted by translation means 110 that are executed during step 135 to generate simulator output 140. This description of the runtirne library is organized into several parts: A. This introduction, including descriptions of the global data structures and the initialization procedure. B. A description of how each of the various semantic inconsistencies in the preceding list is identified by the runtime library by using the global data structures. C. Description of the main simulator entry point, including how the various simulator input opcodes are processed. D. A description of low-level procedures used to facilitate opcode processing. E. A description of the termination and simulator output generation procedure. Global data structures: The following data structures are maintained and used by the various procedures in runtime library 120. Several global data structures are used to maintain and manipulate the state of executable program 130; this state comprises several major parts: The parallel constructs that have been executed. This is represented by a set of stack entries, one for each dynamic instance of each construct. A stack describes the current set (and nesting) of active constructs during any particular point in sequential execution (step 135). Stack entries are retained after being popped off this stack for use in associating semantic inconsistencies with particular constructs for generating simulator output 140. The dynamic call tree. Like stack entries, dynamic call tree nodes are retained for use in generating simulator output 140. The context in which a particular operation occurs. A context is defined by the set of C$PAR CRITICAL SECTIONs and C$PAR ORDERED SECTIONs that are held at the time of the operation. The state of memory. This state is represented by a shadow space in memory. The shadow space is divided into shadow pages, each of which contains a number of shadow variables. A shadow variable keeps track of accesses to a particular memory location at a particular access size. The shadow access size is kept as large as possible given the sequence and type of accesses to that memory location. Multiple shadow pages exist for each context in which memory accesses to the shadowed memory occur. Shadow variables are used to check for semantic inconsistencies such as those identified by using dynamic dependence analysis techniques known in the art. Each shadow variable contains a read timestamp (or read shadow) and a write timestamp (or write shadow) which indicate the time of the most recent read or write to the corresponding memory location. These timestamps do not store values of (simulated) execution time; rather, they store integer values, called thread ids, that represent the different threads of execution within a parallel program. A new thread id is allocated by incrementing a global thread id counter and then assigning the new thread id from the global counter; thread ids are allocated for each thread within worksharing constructs and at global program synchronization points. The following paragraphs describe the structure and content of the global data structures in more detail. In this description, procedure names, symbol (variable) names, names of COMMON blocks, names of C$PAR CRITICAL SECTIONs, names of C$PAR ORDERED SECTIONS, and other names associated with identifiers in the source code are represented by symbol numbers; the correspondence between the various names and symbol numbers is stored in program database 105. Hash tables store the addresses of structures that must be saved for simulator output generation or that are associated with particular instances of constructs. Source position: Describes a particular static location in input program 100 that corresponds to a particular point during dynamic execution. A source position is stored as a structure (record) that comprises the following fields: procedure name source line number symbol name (if the particular point in dynamic execution pertains to a variable access) context (indexes a sync node, to be described subsequently) Shadow variable (mem_) flags: Describe particular types of memory accesses and other states and conditions. Table 3 lists the various mem_flags and their meanings; many of the flags correspond to particular simulator input opcodes (see Table 1). An abbreviation for each flag is given that is used in subsequent tables. Private flags refer to any of the mem_flags that denote private, as opposed to shared, memory accesses. Shadow variable: The shadow storage for a single memory location and context. A shadow variable is stored as a structure that comprises the following fields: read shadow (holds a thread id) write shadow (holds a thread id) read source position write source position set of mem_flags (see Table 3) Shadow page: The shadows for a particular shadow page and a single context. A linked list of shadow pages stores the shadows for all relevant contexts. A pointer to each shadow page is saved in a hash table. A shadow page is stored as a structure that comprises the following fields: pointer to the next shadow page (for the next context) access shadow (latest read or write shadow for this context) context (indexes a sync node, to be described subsequently) shadow access size (in bytes) number of shadow variables in this shadow page variable-sized array of shadow variables for this context Osec node: A descriptor for a particular dynamic instance of a C$PAR ORDERED SECTION. A pointer to each osec node is saved in a hash table. An osec node is stored as a structure that comprises the following fields: source position ordered section name
TABLE 3
Shadow Variable Flags and Abbreviations
flag (abbreviation) meaning
mem_undef (u) variable not currently defined
mem_defined (d) variable has been defined
mem_summary (s) S_READn, S_WRITEn opcodes
mem_auto (a) AUTO opcode
mem_local (l) LOCAL opcode
mem_ip_new (pn) IP_NEW opcode
mem_ip_copynew (pc) IP_COPYNEW opcode
mem_tc_new (tn) TC_NEW opcode
mem_tc_copynew (tc) TC_COPYNEW opcode
mem_read (r) read operation
mem_write (w) write operation
mem_heap (h) allocated from global heap, shared storage
mem_thread (t) allocated from local stack, private storage
mem_firstlocal (fl) FIRST_LOCAL opcode
mem_lastlocal (ll) LAST_LOCAL opcode
mem_writelocal (wo) parallel write of LOCAL variable
mem_writelast (wa) parallel write of LASTLOCAL variable
mem_beg_iter (b) last access in first iteration of C$PAR PDO
mem_mid_iter (m) last access in middle iteration of C$PAR PDO
mem_end_iter (e) last access in last iteration of C$PAR PDO
mem_def_part (dp) for partial_loc semantic inconsistency check
mem_cmn_part (cp) for partial_cmn semantic inconsistency check
Notes:
1 "Private flags" refer to any of the a, t, l, fl, ll, pn, pc, tn, or tc
flags.
Simulator internal opcodes: Define the various types of semantic inconsistencies that can be identified during execution. For each type of semantic inconsistency described previously (simulator output opcode, see also the preceding list of semantic inconsistencies identified), that condition is represented internally by the internal opcode given in the right column of Table 4. Some conditions are further represented by the presence of absence of a REDUCTION flag. The presence of this flag indicates a probable reduction operation that is typically not a semantic inconsistency; the subsequent discussion of FIG. 54-FIG. 55 elaborates on the meaning and use of the REDUCTION flag. Dependence node: Used to store a particular semantic inconsistency; lists of these store the semantic inconsistencies associated with a particular parallel construct. A dependence node is stored as a structure that comprises the following fields: pointer to the next dependence node in a list semantic inconsistency type (simulator internal opcode; see Table 4) source position 1 (e.g., source) source position 2 (e.g., sink) Memory node: Describes the memory accessed by a particular simulator input opcode; lists of these describe the memory usage associated with particular constructs. A memory node is stored as a structure that comprises the following fields: pointer to the next memory node in a list type (a single mem_flag; see Table 3) base address size (of access in bytes) symbol name being accessed
TABLE 4
Simulator Output vs. Internal Opcodes
semantic inconsistency simulator internal
(output opcode) opcode
ANTI_DEPENDENCE anti
ANTI_DEPENDENCE_FILTERED anti_sync
FLOW_DEPENDENCE flow
FLOW_DEPENDENCE_FILTERED flow_sync
OUTPUT_DEPENDENCE outp
OUTPUT_DEPENDENCE_FILTERED outp_sync
FLOW_ANTI_DEPENDENCE summ
FLOW_ANTI_DEPENDENCE_FILTERED summ_sync
PARALLEL_IO io
PARALLEL_IO_FILTERED io_sync
LAST_LOCAL last
LAST_COPY_OUT par_def
NOT_LAST_LOCAL not_last
LAST_LOCAL_NOT_LAST_ITER was_last
LAST_LOCAL_NOT_SHARED share_last
FIRST_LOCAL first
FIRST_COPY_IN par_undef
NOT_FIRST_LOCAL not_local
NOT_FIRST_LOCAL_FILTERED not_local &
REDUCTION
FIRST_LOCAL_NOT_SHARED share_first
LOCAL_VAR_COMM partial_loc
LOCAL_VAR_COMM_FILTERED partial_loc &
REDUCTION
LOCAL_VAR_COMM_PRVCOM partial_cmn
ORDERED_TWICE osec_many
ORDERED_MISSED_ONE osec_skip
NESTED_WORKSHARING nested_ws
BARRIER_IN_WORKSHARING ws_barrier
WS_OR_BAR_IN_SYNC sync_ws
COMMON node: Describes the COMMON memory accessed by a particular simulator input opcode; lists of these describe the COMMON memory usage associated with particular constructs. A pointer to each COMMON node is saved in a hash table. A COMMON node is stored as a structure that comprises the following fields: pointer to the next COMMON node in a list type (regular COMMON vs. C$PAR INSTANCE TASK vs. C$PAR INSTANCE PARALLEL) base address of COMMON block size of COMMON block (in bytes) COMMON block name new flag (boolean) specifying whether C$PAR NEW or C$PAR COPY NEW has occurred Heap node: Describes the heap memory allocated for a MALLOC opcode; lists of these describe the heap memory usage associated with particular constructs. A pointer to each heap node is saved in a hash table. A heap node is stored as a structure that comprises the following fields: pointer to the next heap node in a list base address size of allocated area (in bytes) Sync node: Defines a context in which an operation occurs; lists of these describe the contexts associated with a particular construct. A pointer to each sync node is saved in a hash table. A sync node is stored as a structure that comprises the following fields: pointer to the next sync node in a list source position set of C$PAR CRITICAL SECTIONs (names) held set of C$PAR ORDERED SECTIONS (names) held C$PAR SINGLE PROCESS held flag (boolean) Stack entry: Describes a particular dynamic instance of a particular construct; a stack of these describes the current set (and nesting) of active constructs during some point in dynamic execution. The semantic inconsistencies and memory usage associated with a particular construct are bound to the construct via lists of nodes whose heads are stored here. A stack entry is stored as a record and comprises the following fields: source position (of the start of the construct) construct type (ENTRY, RENTRY, LENTRY (possibly serialized), IENTRY (possibly serialized), SENTRY (possibly serialized), CENTRY, or OENTRY) pointer to head of list of dependence nodes pointer to head of list of memory nodes pointer to head of list of COMMON nodes name of C$PAR CRITICAL SECTION or C$PAR ORDERED SECTION for these construct types start timestamp (holds a thread id) current timestamp (holds a thread id) saved thread id saved context (indexes a sync node) local stack size estimate (for this construct only) global stack size estimate (for this construct and the constructs within it) Dynamic call tree node: Describes a particular procedure in the dynamic call tree, the call site in the calling procedure that called it, and the list of its call sites and called procedures. A pointer to each dynamic call tree node is saved in a hash table. A dynamic call tree node is stored as a structure that comprises the following fields: pointer to parent dynamic call tree node (for calling procedure) pointer to next child dynamic call tree node (for next call site from current procedure to subsequent called procedures) source position (of current procedure) call site source position (in calling procedure) procedure name (of current procedure) variable-sized array of stack entries for the constructs in the current procedure Global variables: The following are additional global variables used in runtime library 120: in parallel--boolean flag indicating whether execution is currently in a C$PAR PARALLEL region thread id--urrent, most recently allocated, thread id global thread id--thread id sequence number used to allocate new thread ids global fence--thread id of the most recent global synchronization point (e.g., C$PAR BARRIER, C$PAR END PDO) global context--current context (indexes a sync node) global select--set of mem_flags (see Table 3), used to keep track of the current state of the mem_beg_iter, mem_mid_iter, and mem_end_iter flags global locks--set of C$PAR CRITICAL SECTIONS and C$PAR ORDERED SECTIONS that are currently held global ordered seen--set of C$PAR ORDERED SECTIONs that have been seen in a C$PAR PDO thus far global ordered used--set of C$PAR ORDERED SECTIONs that have been used in a C$PAR PDO iteration thus far stack, top of stack--array of pointers to stack entries that describes the current set (and nesting) of active constructs In addition, a separate hash table is used to store the addresses of each one of each of the following types of nodes: shadow pages, dynamic call tree nodes, sync nodes, osec nodes, COMMON nodes, and heap nodes. Initialization procedure: FIG. 12 is a flowchart of the algorithm performed when calls inserted by step 220 are executed during step 135. An argument for this algorithm is the name of program database 105. Step 1200 reads the state of pertinent user settings from the environment. In particular, the state of the KDD_MALLOC variable is determined for use in processing MALLOC opcodes. Step 1205 sets the name of the simulator output file; a filename analogous to that of program database 105 is used. Step 1210 initializes global data structures before opcode processing begins. V. Runtime Library: B. Identification of Semantic Inconsistencies Semantic inconsistencies are identified by runtime library 120 by examining and manipulating the previously-described global data structures. This fimctional description is expressed in terms of these data structures and the simulator internal opcodes listed in Table 4. The subsequent discussion of structure and operation will describe the implementations of these checks in more detail. Several of the semantic inconsistencies have_sync variants; filterable semantic inconsistencies are identified when the references involved are consistently synchronized, e.g., within C$PAR CRITICAL SECTIONs or C$PAR ORDERED SECTIONS with the same name. Data and I/O Dependence anti, anti_sync--Identified on a WRITE operation: in a C$PAR PDO, an anti dependence exists if the read shadow is not equal to the current timestamp and is less than the global fence; in a C$PAR PARALLEL region (not in a C$PAR PDO or C$PAR SINGLE PROCESS), if the read shadow is equal to the current timestamp. In the latter case, a (reverse) flow semantic inconsistency is also identified. flow, flow_sync--Identified on a READ operation: in a C$PAR PDO, a flow dependence exists if the write shadow is not equal to the current timestamp and is less than the global fence; in a C$PAR PARALLEL region, if the write shadow is equal to the current timestanp. In the latter case, a (reverse) anti semantic inconsistency is also identified. outp, outp_sync--Identified on a WRITE operation: in a C$PAR PDO, an output dependence exists if the write shadow is not equal to the current timestamp and is less than the global fence; in a C$PAR PARALLEL region, if the write shadow is equal to the current timestamp. summ, summ_sync--Identified on a S_READn or S_WRITEn operation in a summarized loop where the mem_summary flag is set in the shadow. A summarized loop may effectively change the order of reads and writes. In these cases, flow and anti semantic inconsistencies cannot reliably be distinguished so they are both identified as summ semantic inconsistencies. For a S_READn operation in a C$PAR PDO, a dependence exists if the write shadow is not equal to the current timestamp and is less than the global fence; for a S_WRITEn operation in a C$PAR PDO, if the read shadow is not equal to the current timestamp and is less than the global fence; for a S_READn operation in a C$PAR PARALLEL region, if the write shadow is equal to the current timestamp; for a S_WRITEn operation in a C$PAR PARALLEL region, if the read shadow is equal to the current timestamp. io, io_sync--Identified on an IO operation: in a C$PAR PDO, a semantic inconsistency exists if the (write) shadow is not equal to the current timestamp and is less than the global fence; in a C$PAR PARALLEL region, if the (write) shadow is equal to the current timestamp. The write shadow keeps track of the last read or write to a given I/O unit (or stream, file descriptor, etc.). The preferred embodiment assumes a non-reentrant I/O library implementation (which is typically found in current practice), and so treats accesses to all I/O units as equivalent. Last Local last--Identified on a READ operation after a C$PAR PARALLEL region for a LOCAL (not LAST_LOCAL) variable that was written in the last iteration of a C$PAR PDO (the mem_end_iter flag is set in the shadow). par_def--Identified on a READ operation after a C$PAR PARALLEL region for a LOCAL variable that was written inside of a C$PAR PARALLEL region (not in a C$PAR PDO). not_last--Identified on a READ operation after a C$PAR PARALLEL region for a LOCAL (not LAST_LOCAL) variable that was written other than in the last iteration of a C$PAR PDO (the mem_end_iter flag is not set in the shadow). was_last--Identified on a READ operation after a C$PAR PARALLEL region for a LAST_LOCAL variable that was written other than in the last iteration of a C$PAR PDO (the mem_end_iter flag is not set in the shadow). share_last--Identified on a LAST_LOCAL operation for a LOCAL variable. First Local first--Identified on a READ operation inside of a C$PAR PDO for which the corresponding shadow variable does not exist or has the mem_undef flag set. par_undef--Identified on a READ operation inside of a C$PAR PARALLEL region (not in a C$PAR PDO) for which the corresponding shadow variable does not exist or has the mem_undef flag set. not_local--Same as a flow dependence except that the mem_local flag is set in the shadow. share_first--Identified on a FIRST_LOCAL operation for a LOCAL variable. Local Communication partial_loc--Identified on a READ operation for a LOCAL variable that was written inside of a worksharing construct and then read outside (the same iteration of) that construct. partial_cmn--Same as partial_loc except for a private COMMON variable. Construct Nesting osec_many--Identified on an OENTRY operation if the C$PAR ORDERED SECTION being entered has already been entered in the current C$PAR PDO iteration. osec_skip--Identified on an IEXIT operation if the set of ordered sections used in the C$PAR PDO iteration just being exited is not equal to the set of ordered sections seen in previous loop iterations. nested_ws--Identified on an LENTRY or SENTRY operation if a previous LENTRY or SENTRY operation indicates that a C$PAR PDO or C$PAR SINGLE PROCESS is already active. ws_barrier--Identified on a BARRIER operation if a previous LENTRY or SENTRY operation indicates that a C$PAR PDO or C$PAR SINGLE PROCESS is already active. sync_ws--Identified on an LENTRY, SENTRY, or BARRIER operation if a previous CENTRY or OENTRY operation indicates that a C$PAR CRITICAL SECTION or C$PAR ORDERED SECTION is already active. VI. Runtime Library: C. Main Simulator Entry Point, Opcode Processing Main simulator entry point: FIG. 13 is a flowchart of the algorithm performed when calls inserted by steps 335, 490, 520, 530, and 815 are executed during step 135. The argument for this algorithm is a buffer of words containing simulator input opcodes and their associated arguments (see Table 1 and Table 2) produced by the execution of instructions, added by translation means 110, during sequential execution 135. Step 1300 determines if there are words left in the buffer to decode and process. If not, the algorithm terminates; otherwise, step 1305 retrieves the next opcode word from the buffer. Step 1310 decodes this opcode word and determines the opcode and the argument encoding form. Step 1315 retrieves zero or more additional argument words and decodes arguments based on the opcode and form determined in step 1310, according to Table 2, and step 1320 processes the opcode and arguments to check for semantic inconsistencies. Read and write opcodes: FIG. 14 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes a READ, WRITE, PDO_WRITE, READn, WRITEn, S_READn, or S_WRITEn opcode. Each opcode describes a reference vector of one or more elements being read or written. The arguments for these opcodes are given in Table 2; bytes is the size of an element, sym denotes the symbol (i.e., variable in input program 100) name being accessed, size is the number of elements being accessed, adr is the address of the first element, and stride is the distance between elements. Step 1400 sets the current source position symbol from sym. Step 1405 determines if there are any elements left to process. If not, the algorithm terminates; otherwise, step 1410 translates the next portion of the reference vector into shadow variable space and checks for some types of semantic inconsistencies. Step 1415 decrements size and advances adr to account for the portion of the reference vector translated in step 1410. For this translated portion of the reference vector, step 1420 determines if any data dependences arise, and step 1425 updates the corresponding shadow variables, for these references. I/O opcodes: FIG. 15 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an IO or POISON_IO opcode. Step 1500 determines if the opcode denotes poison I/O. If so, steps 1505-1535 set poison I/O flags for constructs on the stack; otherwise, the algorithm continues at step 1540. Step 1505 sets a pointer to the top of the stack. Step 1510 determines if there are any entries left on the stack to consider; if not, the algorithm continues at step 1540. Step 1515 determines if execution is currently in a C$PAR PARALLEL region. If so, step 1520 sets the parallel poison I/O flag in the current stack entry; otherwise, step 1525 sets the sequential poison I/O flag in the current stack entry. Step 1530 determines if the current stack entry is for a C$PAR PARALLEL region; if so, the algorithm continues at step 1540. Step 1535 advances the pointer to the next stack entry and returns to step 1510. Step 1540 determines if execution is currently in a C$PAR PARALLEL region; if not, the algorithm terminates. Step 1545 fetches the appropriate I/O shadow variables, and step 1550 determines if any data dependences arise. Storage allocation opcodes: FIG. 16 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes a MALLOC opcode. The arguments for this opcode are given in Table 2; adr and size describe the base address and size of the area being allocated. Step 1600 creates a heap node, if it does not already exist, for adr and size. Step 1605 removes mem_undef flags from the shadow variable space entries corresponding to adr and stride. Steps 1610-1625 determine if private or shared storage is being allocated. Step 1610 determines if execution is currently in a C$PAR PARALLEL region; if not, shared storage is being allocated. Step 1615 determines if the most recently entered parallel construct is C$PAR SINGLE PROCESS; if so, shared storage is being allocated. Step 1620 determines if the most recently entered parallel construct is C$PAR PDO; if so, shared storage is being allocated. Step 1625 determines if private or shared storage for memory allocated within a C$PAR PARALLEL region, but outside of a worksharing construct, is being allocated. If the KDD_MALLOC variable, whose state was determined in step 1200, is set, shared storage is allocated. Steps 1630 and 1635 translate the allocated area for adr and size into shadow variable space and check for semantic inconsistencies; step 1630 is used when private storage is being allocated and step 1635 is used when shared storage is being allocated. FIG. 17 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an AUTO, LOCAL, FIRST_LOCAL, or LAST_LOCAL opcode. The arguments for these opcodes are given in Table 2 and are similar to those for READ opcodes, as discussed in the description of FIG. 14. Steps 1700 and 1705 determine if an AUTO opcode is being processed when not executing in a C$PAR PARALLEL region; if so, step 1710 is executed, otherwise step 1715 is executed. Steps 1710 and 1715 translate the declared area for adr and size into shadow variable space and check for semantic inconsistencies; step 1710 is used for AUTO opcodes when not executing in a C$PAR PARALLEL region and step 1715 is used otherwise. In both cases, step 1720 creates a memory node for sym, adr, and size, and adds this node to the head of the memory linked list for the parallel construct on the top of the stack. Private COMMON opcodes: FIG. 18 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an IP_NEW, TC_NEW, IP_COPYNEW, or TC_COPYNEW opcode. The arguments for these opcodes are given in Table 2 and are similar to those for READ opcodes, as discussed in the description of FIG. 14. Step 1800 sets the current source position symbol from sym. Step 1805 initializes a reset flag to TRUE for IP_COPYNEW and TC_COPYNEW opcodes or FALSE for IP_NEW and TC_NEW opcodes. Step 1810 determines if a common node already exists for adr; if not, step 1815 creates a common node for sym, adr, size, and opcode type (i.e., IP_ vs. TC_ opcode), sets the reset flag initialized in step 1805, and clears the new flag in the common node. Step 1820 determines if execution is currently in a C$PAR PARALLEL region; if not, the algorithm terminates. Step 1825 determines if the new flag in the common node for adr is set; if so, the algorithm terminates. Step 1830 sets the new flag in the common node for adr. Step 1835 checks the state of the reset flag set in steps 1805 and 1815. Steps 1840 and 1845 translate the declared area into shadow variable space and check for semantic inconsistencies; step 1840 is used when the reset flag is set and step 1845 is used otherwise. In both cases, step 1850 creates a memory node for sym, adr, and size, locates the topmost entry on the stack for a C$PAR PARALLEL region, and adds the memory and common nodes to the memory and common linked lists for this stack entry. Procedure entry opcode: FIG. 19 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an ENTRY opcode. The arguments for this opcode are given in Table 2; num_blocks is the number of parallel constructs in the called procedure, scope denotes the scope of the called procedure, sym denotes the called procedure name, and line is the source line number of the beginning of the called procedure. Step 1900 adds the called procedure as a new leaf in the dynamic call tree. Step 1905 allocates a stack entry for the called procedure and pushes it onto the stack. Step 1910 sets the current source position procedure and line from scope and line. Step 1915 creates a dependence node, if it does not already exist, for the stack entry created in step 1905. Step 1920 adds the appropriate stack padding to the local stack size estimate for the stack entry created in step 1905. Procedure and region exit opcodes: FIG. 20 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an EXIT or REXIT opcode or when invoked from steps 3210 or 3220. Step 2000 updates the global stack size estimates on the stack based on the construct being exited, and step 2005 pops the stack entry for this construct off of the stack. Step 2010 frees the storage for the construct's memory list, and step 2015 frees the storage for the corresponding COMMON list. Step 2020 determines which opcode is being processed. For EXIT, step 2025 removes the called procedure leaf node, added in step 1900, from the dynamic call tree; steps 2030-2045 are used for REXIT. Step 2030 clears a flag to indicate that execution is no longer in a C$PAR PARALLEL region. Step 2035 frees the storage for the global sync data structure. Step 2040 allocates a new thread id, and step 2045 sets the global fence from this thread id. FIG. 21 is a flowchart of the algorithm performed by step 2000 to update the global stack size estimates for a stack entry and those of enclosing constructs and procedures. Step 2100 accesses the stack entry for the construct being exited. Steps 2105 and 2110 set the stack entry's global stack size estimate to the maximum of the global and local estimates. Step 2115 accesses the stack entry for the nearest enclosing C$PAR PARALLEL region or procedure. Step 2120 adds the estimate produced in steps 2105 and 2110 to this enclosing stack entry's local estimate, and steps 2125 and 2130 set this entry's global estimate to the maximum of the global estimate or the estimate produced in step 2120. Region entry opcode: FIG. 22 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an RENTRY opcode. The arguments for this opcode are given in Table 2; loc is the construct number and line is the source line number of the beginning of the parallel construct. Step 2200 sets a flag to indicate that execution will now be in a C$PAR PARALLEL region. Step 2205 allocates a stack entry for the C$PAR PARALLEL region and pushes it onto the stack. Step 2210 creates a sync node, if it does not already exist, for the current set of global locks. Step 2215 saves the current global context in the stack entry created in step 2205 and then sets the new global context from the sync node. Step 2220 sets the source position line and context from line and the global context. Step 2225 allocates a new thread id and performs other stack entry initializations required for region entry. Step 2230 adds the appropriate stack padding to the local stack size estimate for the stack entry, and step 2235 sets the region thread id and global fence from the thread id allocated in step 2225. Worksharing construct opcodes: FIG. 23 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an LENTRY or SENTRY opcode. The arguments for these opcodes are given in Table 2 and are similar to those for RENTRY opcodes, as discussed in the description of FIG. 22. Step 2300 sets the source position line from line. Step 2305 determines if execution is currently in a worksharing construct; if so, steps 2310 and 2315 record a semantic inconsistency for nested worksharing constructs. Step 2320 determines if execution is currently in a synchronization construct; if so, steps 2325 and 2330 record a semantic inconsistency for worksharing nested inside of synchronization constructs. Execution proceeds at step 2335 for the beginning of a legal worksharing construct; a C$PAR PDO or C$PAR SINGLE PROCESS stack entry is allocated and pushed onto the stack. Step 2340 allocates a new thread id and performs other stack entry initializations required for construct entry. Step 2345 determines if a C$PAR PDO or C$PAR SINGLE PROCESS is being entered. For C$PAR PDO, step 2350 initializes the set of ordered sections seen in the loop to empty, step 2355 sets the global select flags to mem_beg_iter, and step 2360 saves the thread id in the stack entry created in step 2335. For C$PAR SINGLE PROCESS, step 2365 creates a sync node, if it does not already exist, for the current set of global locks, step 2370 saves the current global context in the stack entry created in step 2335 and then sets the new global context from the sync node, and step 2375 sets the source position context from the global context. FIG. 24 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an IENTRY or LAST_IENTRY opcode. The arguments for these opcodes are given in Table 2 and are similar to those for RENTRY opcodes, as discussed in the description of FIG. 22. Step 2400 determines, from the stack entry on the top of the stack, if execution is currently in a sequential loop. If so, step 2405 changes the stack entry type to a sequential iteration; otherwise, execution continues at step 2410, which changes the stack entry type to a parallel iteration. Steps 2415 and 2420 set the global select flags to mem_end_iter when entering the last loop iteration. Step 2425 sets the stack entry source position from the current source position. Step 2430 determines if any loop iterations have yet been executed; if not, step 2435 allocates a new thread id, and step 2440 sets the stack entry current thread id from this new thread id. In either case, step 2445 initializes the set of ordered sections used in the upcoming loop iteration to empty. FIG. 25 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an IEXIT opcode or when invoked from step 3240. Step 2500 determines, from the stack entry on the top of the stack, if execution is currently in a sequential loop iteration. If so, step 2505 changes the stack entry type, which was changed in step 2405, back to a sequential loop; otherwise, execution continues at step 2510, which changes the stack entry type, which was changed in step 2410, back to a parallel loop. Step 2515 restores the thread id to the value saved in step 2360, and step 2520 sets the global select flags to mem_mid_iter. Step 2525 checks for the possibility of C$PAR ORDERED SECTION semantic inconsistencies; this is possible if the set of sections seen in the loop thus far is not equal to the set of sections used in the current iteration. Step 2530 records a semantic inconsistency for each C$PAR ORDERED SECTION that was not executed in all loop iterations. Synchronization opcodes: FIG. 26 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes a CENTRY, OENTRY, EXT_LOCK, or EXT_UNLOCK opcode. The arguments for these opcodes are given in Table 2; loc is the construct number and line is the source line number of the beginning of the construct, and sym or arg denote the name of the lock (construct). Step 2600 allocates a stack entry for the construct. The stack entry is pushed onto the stack for CENTRY and OENTRY only; EXT_LOCK and EXT_UNLOCK opcodes do not use the stack. Step 2605 creates a dependence node, if it does not already exist, for the stack entry created in step 2600. Step 2610 creates a sync node, if it does not already exist, for the current set of global locks. Step 2615 adds the lock to the set of global locks. Step 2620 sets the global context from the sync node, and then sets the source position line and context from line and the global context. Step 2625 determines if the opcode is EXT_LOCK or EXT_UNLOCK; if so, the algorithm terminates. Step 2630 saves the previous global context, from before step 2620, in the stack entry created in step 2600 and sets the stack entry source position from the current source position. Step 2635 allows the algorithm to proceed only for C$PAR ORDERED SECTION entry. Step 2640 creates an osec node, if it does not already exist, for the current set of global locks, and then sets the osec node source position from the current source position. Step 2645 determines if the lock is in the set of ordered sections used in the current loop iteration; if so, step 2650 records a semantic inconsistency for executing a C$PAR ORDERED SECTION more than once in a loop iteration. Step 2655 determine if the block is in the set of ordered sections seen in any loop iteration; if so, and step 2660 determines that execution is in the first iteration of the loop, step 2665 records a semantic inconsistency for not executing a C$PAR ORDERED SECTION in all loop iterations. Step 2670 adds the lock to the set of ordered sections seen in any loop iteration and the set of ordered sections used in the current loop iteration. Other construct exit opcodes: FIG. 27 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an LEXIT, SEXIT, CEXIT, or OEXIT opcode or when invoked from steps 3230, 3250, 3260, or 3270. The arguments for these opcodes are given in Table 2 and are similar to those for the corresponding entry opcodes as discussed in the descriptions of FIG. 23 and FIG. 26. Step 2700 pops the stack entry for the construct being exited off of the stack. Step 2705 determines if execution is currently in a sequential loop or a serialized C$PAR SINGLE PROCESS; if so, the algorithm terminates. For LEXIT, steps 2710 and 2715 free the storage for the construct's memory list. For CEXIT and OEXIT, steps 2720 and 2725 remove the lock from the set of global locks. For opcodes other than LEXIT, step 2730 restores the global context from the global context saved in the stack entry by steps 2370 and 2630, and step 2735 restores the source position context from the restored global context. Construct entry/exit sub-procedures: FIG. 28 is a flowchart of the algorithm performed by steps 2010 and 2715 to free the storage for items on the memory list of a stack entry. Step 2800 determines if there are any more memory list items to free; if not, the algorithm terminates. Step 2805 determines if the next item at the head of the memory list is associated with a LOCAL, FIRST_LOCAL, or LAST_LOCAL opcode. If so, step 2810 removes the list item's type flags from the shadows for the list item's address and size; otherwise, step 2815 removes the mem_undef flags from the shadows for the list item's address and size. In either case, step 2820 unlinks and frees the list item at the head of the memory list. FIG. 29 is a flowchart of the algorithm performed by step 2015 to free the storage for items on the COMMON list of a stack entry. Step 2900 determines if there are any more COMMON list items to free; if not, the algorithm terminates. Step 2905 clears the new flag in the item at the head of the COMMON list, and step 2910 unlinks and frees the item. FIG. 30 is a flowchart of the algorithm performed by steps 2225 and 2340 to allocate a new thread id and perform other stack entry initializations during construct entry. Step 3000 sets the stack entry source position from the current source position. Step 3005 allocates a new thread id, and step 3010 sets the stack entry current thread id and start thread id from this new thread id. Step 3015 creates a dependence node, if it does not already exist, for the stack entry. Barrier opcodes: FIG. 31 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes a BARRIER or EXT_BARRIER opcode. The arguments for these opcodes are given in Table 2; for BARRIER, arg denotes the source line number of the C$PAR BARRIER construct, and for EXT_BARRIER, arg denotes the name of the barrier. Step 3100 determines if the opcode is EXT_BARRIER; if so, execution continues at step 3130. For BARRIER, step 3105 sets the source position line from arg. Step 3110 determines if execution is currently in a worksharing construct; if so, step 3115 records a semantic inconsistency for executing a barrier nested inside of a worksharing construct. Step 3120 determines if execution is currently in a synchronization construct; if so, step 3125 records a semantic inconsistency for executing a barrier inside of a synchronization construct. Step 3130 allocates a new thread id, and step 3135 sets the stack entry current thread id and the global fence from this new thread id. Other opcodes: FIG. 32 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes a STOP opcode. The algorithm repeatedly executes the algorithms for the exit opcodes corresponding to constructs on the stack until the stack is empty. For each stack entry, actions are performed as if step 1320 was executed after step 1310 decoded the corresponding opcode. Each exit opcode algorithm causes a construct to be popped off of the stack. Step 3200 determines if there are any stack entries remaining to process; if so, execution continues at step 3205. Steps 3205 and 3210 use the EXIT opcode algorithm for stack entries for procedures. Steps 3215 and 3220 use the REXIT opcode algorithm for stack entries for C$PAR PARALLEL regions. Steps 3225 and 3230 use the LEXIT opcode algorithm for stack entries for C$PAR PDO or sequential loops. Steps 3235 and 3240 use the IEXIT opcode algorithm for stack entries for parallel or sequential loop iterations. Steps 3245 and 3250 use the CEXIT opcode algorithm for stack entries for C$PAR CRITICAL SECTIONs. Steps 3255 and 3260 use the OEXIT opcode algorithm for stack entries for C$PAR ORDERED SECTIONs. Steps 3265 and 3270 use the SEXIT opcode algorithm for stack entries for parallel or serialized C$PAR SINGLE PROCESS. When the stack is empty, step 3275 terminates the simulator and generates simulator output 140. FIG. 33 is a flowchart of the algorithm performed by step 1320 when step 1310 decodes an INIT, LINE, STACKSIZE, FREE, CALL, or NOP opcode. The arguments for these opcodes are given in Table 2. The arguments for INIT are similar to those for READ opcodes, as discussed in the description of FIG. 14. For LINE, arg is a source line number. For STACKSIZE, size is a stack size estimate for the current procedure. The adr argument for FREE is similar to the same argument for MALLOC opcodes, as discussed in the description of FIG. 16. For NOP, arg specifies the number of buffer words to consume and ignore before decoding and processing the next opcode. For INIT (step 3300), step 3305 sets the source position symbol from sym, and step 3310 translates the initialized area for adr and size into shadow variable space and checks for semantic inconsistencies. For LINE (step 3315), step 3320 sets the source position line from arg. For STACKSIZE (step 3325), step 3330 locates the topmost entry on the stack for a C$PAR PARALLEL region or procedure, and step 3335 sets that stack entry's local stack size estimate from size (plus the appropriate stack padding). For FREE (step 3340), step 3345 frees the storage for the heap node created in step 1600, and step 3350 removes the mem_undef flags from the shadows for adr and the size specified in the heap node. For CALL (step 3355), step 3360 sets the global call position from the current source position. For NOP (step 3365), the number of buffer words specified by arg are consumed before completing step 1320 and proceeding to step 1300. VII. Runtime Library: D. Low-Level Opcode Processing Translate buffer to shadows: FIG. 34 is a flowchart of the algorithm performed by steps 1630, 1635, 1710, 1715, 1840, 1845, and 3310 to translate a buffer of memory into shadow space and check for semantic inconsistencies. The arguments for this algorithm are: mode--depends on the particular calling step sym--enotes the symbol name being accessed size--the number of buffer elements adr--the address of the first buffer element type1, type2--flags passed on to subsequent procedures The arguments used for each particular calling step and simulator input opcode (see Table 1) are given in Table 5; sym, size, and adr are from the arguments listed in Table 2 for the specified opcodes, except that sym unused for MALLOC opcodes. Step 1710 uses type2=mem_undef, whereas step 1715 uses type2=type1. Step 1840 uses mode=MARK, whereas step 1845 uses mode=COMMON.
TABLE 5
Input Arguments for Algorithm of FIG. 34
step opcode mode type1 type2
1630 MALLOC MARK t type1
1635 MALLOC MARK h type1
1710, AUTO MARK a u,
LOCAL MARK l
1715 FIRST_LOCAL FLOCAL fl type1
LAST_LOCAL LLOCAL ll
1840, IP_NEW MARK, pn type1 .vertline. cp
TC_NEW tn type1 .vertline. u
1845 IP_COPYNEW COMMON pc type1 .vertline. d
TC_COPYNEW tc type1 .vertline. d
3310 INIT DEFINE d
Notes:
1 type1 and type2 flags are mem_flags from Table 3.
The algorithm attempts to translate the buffer specified by adr and size into shadow space while using as few shadow variables, with the largest shadow access sizes, as possible. Steps 3400 and 3405 are used, except for FIRST_LOCAL and LAST_LOCAL opcodes, to translate the leading, misaligned area of the buffer so that the major area of the buffer can be translated with fewer, larger shadow variables. Step 3410 translates this major area, and steps 3415 and 3420 translate any area that might remain after the major area with smaller shadows. FIG. 35 is a flowchart of the algorithm performed by step 3405 to translate the leading, misaligned buffer area. Step 3500 initializes the shadow access size. Steps 3505 and 3510 determine if there is any leading area left to process; if not, the algorithm terminates. Step 3515 determines if a shadow of the current access size is required to align the leading buffer area; if so, step 3520 translates a single buffer element with the current shadow access size, step 3525 updates the shadow variable, and step 3530 decrements the buffer size and advances the address to the next buffer element. Step 3535 doubles the shadow access size and repeats the process. FIG. 36 is a flowchart of the algorithm performed by step 3410 to translate the major area of the buffer. Step 3600 initializes the shadow access size to its maximum value (in the preferred embodiment, eight bytes). Step 3605 determines if the buffer area has yet been translated at some shadow access size; if so, the algorithm terminates. Step 3610 determines if the current shadow access size can be used; if so, step 3615 selects a (wide) size to use for actual buffer translation and decreases the buffer size accordingly. Step 3620 determines if there is major buffer area remaining to be translated; if so, step 3625 translates a vector of buffer elements with the specified (wide) size and shadow access size, and step 3630 decrements the (wide) size and advances the address to the next portion of the buffer. Steps 3635 and 3640 check for semantic inconsistencies for FIRST_LOCAL and LAST_LOCAL opcodes. Step 3645 updates the shadow variables for the translated portion of the buffer, and step 3650 decreases the shadow access size by half and repeats until the major area of the buffer is translated with the appropriate shadow access size. FIG. 37 is a flowchart of the algorithm performed by step 3420 to translate any remaining buffer area. Step 3700 determines if there is any buffer area remaining to translate; if not, the algorithm terminates. Step 3705 translates a vector of buffer elements with the specified size and a shadow access size of one byte. Step 3710 decrements the buffer size and advances the buffer address. Steps 3715 and 3720 check for semantic inconsistencies for FIRST_LOCAL and LAST_LOCAL opcodes. Step 3725 updates the shadow variables for the translated portion of the buffer. FIG. 38 is a flowchart of the algorithm performed by steps 3640 and 3720 to check for semantic inconsistencies for FIRST_LOCAL and LAST_LOCAL opcodes. Step 3800 sets a pointer to the first shadow variable for the translated vector from step 3625 or 3705. Step 3805 determines if there are any shadow variables left to check; if not, the algorithm terminates. Step 3810 checks if any of the private flags are set in the current shadow; if so, steps 3815, 3820, and 3825 are used to record a share_first or share_last semantic inconsistency (for FIRST_LOCAL or LAST_LOCAL opcodes, respectively). Step 3830 advances the pointer to the next shadow variable and repeats the process. Translate vector to shadows: FIG. 39 is a flowchart of the algorithm performed by steps 1410, 3520, 3625, and 3705 to translate a vector from memory into shadow space and check for semantic inconsistencies. The arguments for this algorithm are: sym--denotes the symbol name being accessed size--the number of vector elements adr--the base address of the vector num_bytes--the vector element size in bytes stride--the distance between element accesses in bytes type--flags denoting the access type for the vector The arguments used for each particular calling step and simulator input opcode (see Table 1) are given in Table 6; sym and adr are from the arguments listed in Table 2 for the specified opcodes. Step 3900 initializes a count of the number of shadow variables produced during translation to zero. Step 3905 determines if a portion of the vector remains to be translated; if so, step 3910 initializes a temporary address pointer and byte count from the current adr and num_bytes. Step 3915 determines if there are any bytes in the current element access left to translate; if so, step 3920 creates a shadow page, if it does not already exist, for the current address. Step 3925 determines if the shadow page needs to be split because its shadow access size is too large for the current element size; if so, step 3930 splits the shadow page by dividing and replicating each shadow in the page so that the resulting shadows each have a shadow access size equal to the element size. Step 3935 creates a context in the shadow page, if it does not already exist, for the global context. Step 3940 checks for semantic inconsistencies, advances the temporary address pointer and decrements the temporary byte count, and increments the shadow variable count. Step 3945 decrements the vector size and advances the base address by the element stride and repeats the process.
TABLE 6
Input Arguments for Algorithm of FIG. 39
step opcode size num_bytes stride type
1410 READ size bytes stride r
WRITE w
PDO_WRITE w
READn r
WRITEn w
S_READn r .vertline. s
S_WRITEn w .vertline. s
3520 1 width width type1
3625 wide_size width width type1
3705 size 1 1 type1
Notes:
1 type flags are mem_flags from Table 3.
FIG. 40 is a flowchart of the algorithm performed by step 3940 to check for semantic inconsistencies during vector translation. Step 4000 initializes an index to the start of the current shadow page. Step 4005 determines if there are any shadows left to process in the shadow page; if not, the algorithm terminates. Step 4010 determines if any of the private flags are set in the current shadow. If not, step 4015 increments the shadow variable count by the number of contexts that exist for the current shadow; otherwise, step 4020 checks to see if the input type flags indicate a write or definition operation. If so, step 4025 performs the appropriate shadow checks; otherwise, step 4030 checks to see if the input type flags indicate a summary operation or procedure-local variable declaration. If not, steps 4035-4050 are used to record a share_first or share_last semantic inconsistency (for FIRST_LOCAL and LAST_LOCAL opcodes, respectively); in either case, step 4055 performs the appropriate shadow checks for non-write/definition operations. Step 4060 increments the temporary address pointer, decrements the temporary byte count by the shadow page access size, and advances the index to the next shadow in the shadow page. Fetch I/O shadows: FIG. 41 is a flowchart of the algorithm performed by step 1545 to translate an I/O operation into shadow space and check for semantic inconsistencies. This algorithm is similar to that for translating a memory reference into shadow space except: the shadow space used here represents the I/O unit (or stream, file descriptor, etc.) that is accessed, rather than the memory locations accessed; and the preferred embodiment assumes a non-reentrant I/O library implementation (which is typically found in current practice), so the preferred embodiment treats accesses to all I/O units as equivalent. Step 4100 creates a shadow page, if it does not already exist, for the I/O unit being accessed. According to the assumptions above, the preferred embodiment uses the same shadow storage for all I/O units. Step 4105 creates a shadow context, if it does not already exist, for the global context. Step 4110 sets a count of the number of shadow variables produced during translation to the number of contexts in the shadow page, and step 4115 returns this count for use after step 1545. Remove flags from shadows: FIG. 42 is a flowchart of the algorithm performed by steps 1605, 2810, 2815, and 3350 to remove the specified flags from the shadows for a buffer with the specified address and number of bytes. Step 4200 determines if any portion of the buffer remains to be processed; if not, the algorithm terminates. Step 4205 determines if a shadow page for the current address exists. If so, step 4210 determines if the shadow page needs to be split because its shadow access size is too large for the current number of bytes; if so, step 4215 splits the shadow page by dividing and replicating each shadow in the page so that the resulting shadows each have a shadow access size equal to the current number of bytes (see also steps 3925 and 3930). Step 4220 initializes an index to the start of the shadow page. Step 4225 determines if there are any shadows left to process in the shadow page. If so, and the flag to be removed is mem_undef (step 4230), step 4235 adds the mem_undef flag to the current shadow for all its contexts; otherwise, step 4240 removes the specified flags from the current shadow for all its contexts. Step 4245 advances the index to the next shadow in the shadow page. Step 4250 examines the processed shadow page and frees the storage for each context that has the mem_undef flag set for each shadow in that context. Step 4255 decrements the number of bytes and advances the current address by the number of entries in the shadow page. Shadow checks: FIG. 43 is a flowchart of the algorithm performed by step 4025 to check a shadow variable for semantic inconsistencies for operations that write or define memory. Step 4300 attempts to locate a context for the shadow and the global context. Step 4305 determines if the shadow has this context defined; if so, step 4310 determines if execution is currently in a worksharing construct. If so, step 4315 sets the mem_def_part flag in the shadow; otherwise, step 4320 clears this flag. Step 4325 determines if any of the private flags are set in the shadow; if so, step 4330 determines if execution is currently in a C$PAR PARALLEL region. If so, steps 4335-4345 set the mem_writelocal flag (and the mem_writelast flag if the mem_lastlocal flag is set in the shadow) in the shadow; otherwise, steps 4350-4360 clear these flags under the corresponding conditions. Step 4365 sets the write shadow from the current thread id and sets the shadow write source position from the current source position. Step 4370 clears the mem_beg_iter, mem_mid_iter, mem_end_iter, and mem_cmn_part flags, and then sets the global select and mem_defined flags, in the shadow. If execution is currently in a C$PAR PARALLEL region (step 4375), step 4380 clears the mem_def_part and mem_cmn_part flags in the shadow for all its contexts. FIG. 44 is a flowchart of the algorithm performed by step 4055 to check a shadow variable for semantic inconsistencies for operations that do not write or define memory. Step 4400 initializes a pointer to the first context for the shadow. Step 4405 determines if there are any contexts left to examine; if not, the algorithm terminates. Step 4410 determines if any of the private flags are set in the shadow for the current context; if so, step 4415 determines if the mem_defined flag is set in the shadow. If so, step 4420 checks further for semantic inconsistencies; otherwise, step 4425 determines if the mem_firstlocal flag is set in the shadow. If not, step 4430 determines if execution is currently in a C$PAR PDO iteration; if so, a first semantic inconsistency is recorded; otherwise, a par_undef semantic inconsistency is recorded. Step 4445 determines if the current context is the global context; if so, step 4450 sets the read shadow from the current thread id. Step 4455 advances the context pointer to the next context for the shadow. FIG. 45 is a flowchart of the algorithm performed by step 4420 to check further for semantic inconsistencies. Step 4500 determines if the mem_def_part flag is set in the shadow for the current context. If so, steps 4505-4520 are used to record a partial_loc or partial_cmn semantic inconsistency (for mem_def_part or mem_cmn_part flags set in the current shadow, respectively); otherwise, step 4525 determines if execution is currently in a C$PAR SINGLE PROCESS. If not, step 4530 checks further for semantic inconsistencies for C$PAR PDO. Steps 4535-4545 record a partial_loc semantic inconsistency if the mem_def_part flag is set in the current shadow and the write shadow is before the start of the current parallel construct, and steps 4550-4560 record a partial_cmn semantic inconsistency if the mem_cmn_part flag is set in the shadow and the write shadow is before the start of the current parallel construct. FIG. 46 is a flowchart of the algorithm performed by step 4530 to check further for semantic inconsistencies for C$PAR PDO. Step 4600 locates the stack entry for the parallel construct for which a data dependence exists. Step 4605 determines if a dependence does exist; if not, the algorithm terminates. Step 4610 determines if the stack entry for the dependence is for a C$PAR PDO; if so, and step 4615 determines that the current write shadow equals the stack entry's current timestamp, the algorithm terminates. Step 4620 determines if the stack entry is for a C$PAR PARALLEL region; if not, the algorithm terminates. Step 4625 determines if the write shadow is greater than or equal to the stack entry's current timestamp; if not, the algorithm terminates. Step 4630 determines if any of the private flags are set in the shadow; if so, the algorithm terminates. A semantic inconsistency is known to exist at step 4635. Step 4635 is used to record a not_local semantic inconsistency if execution is currently in a C$PAR PDO iteration; otherwise, step 4645 records a par_undef semantic inconsistency. Calculate all dependences: FIG. 47 is a flowchart of the algorithm performed by steps 1420 and 1550 to calculate all the data dependences for an opcode. The arguments for this algorithm are: optype--the operation type (READ, WRITE, IO) num--the number of shadow variables in the translated reference sym--denotes the symbol name being accessed type--flags denoting the access type The arguments used for each particular calling step and simulator input opcode (see Table 1) are given in Table 7. Step 1420 uses num from the value returned from step 1410 and sym from the arguments listed in Table 2 for the specified opcodes. Step 1550 uses num from the value returned from step 1545 and sym and type are unused.
TABLE 7
Input Arguments for Algorithm of FIG. 47
step opcode optype type
1420 READ READ d
WRITE WRITE d
PDO_WRITE WRITE d
READn READ d
WRITEn WRITE d
S_READn READ s
S_WRITEn WRITE s
1550 IO IO
POISON_IO IO
Notes:
1 type flags are mem_flags from Table 3.
Step 4700 determines if execution is currently in a C$PAR PARALLEL region; if not, and step 4705 determines that the operation is not a read, the algorithm terminates. Step 4710 initializes an index (i) to the first translated shadow variable. Step 4715 determines if any shadows remain to be processed; if not, the algorithm terminates. Step 4720 sets a pointer to the with shadow. Steps 4725-4735 calculate an anti dependence if the read shadow is greater than or equal to the global fence and the operation is a write. Step 4740 determines if the write shadow is greater than or equal to the global fence; if not, and execution is not currently in a C$PAR PARALLEL region, step 4750 checks for escaped references; otherwise, steps 4755-4780 check for I/O, flow, and output dependences for I/O, read, and write operations, respectively. Step 4785 checks for other semantic inconsistencies, and step 4790 increments the shadow index and repeats the process. FIG. 48 is a flowchart of the algorithm performed by step 4785 to check for other semantic inconsistencies. Step 4800 determines if the current shadow's context is the same as the global context; if not, the algorithm terminates. For I/O operations (step 4805), step 4810 sets both the read and write shadows from the current thread id, and step 4815 sets the shadow read and write source positions from the current source position. Step 4820 determines if execution is currently in a C$PAR PARALLEL region (but not in a C$PAR PDO); if so, step 4825 records a filterable io semantic inconsistency. For write operations (step 4830), step 4835 determines if execution is currently in a C$PAR PARALLEL region (but, as in step 4820, not in a C$PAR PDO); if so, step 4840 updates the shadow and step 4845 records a filterable outp semantic inconsistency. FIG. 49 is a flowchart of the algorithm performed by step 4750 to check for escaped references. This algorithm records semantic inconsistencies arising from improper accesses outside of C$PAR PARALLEL regions. Step 4900 determines if the mem_writelast flag is set in the shadow; if so, step 4905 determines if the mem_end_iter flag is set. If so, step 4910 clears the mem_writelocal, mem_writelast, and mem_end_iter flags in the shadow; otherwise, step 4915 determines if the mem_beg_iter or mem_mid_iter flags are set. If so, step 4920 records a was_last semantic inconsistency; otherwise, step 4925 records a par_def semantic inconsistency. Step 4930 determines if the mem_writelocal flag is set in the shadow. If not, the algorithm terminates; otherwise, step 4935 determines if the mem_end_iter flag is set. If so, step 4940 records a last semantic inconsistency; otherwise, step 4945 determines if the mem_beg_iter or mem_mid_iter flags are set. If so, step 4950 records a not_last semantic inconsistency; otherwise, step 4955 records a par_def semantic inconsistency. Update shadows: FIG. 50 is a flowchart of the algorithm performed by steps 1425, 3525, 3645, 3725, and 4840 to update shadow flags after translation and semantic inconsistency checking. The arguments for this algorithm are: optype--the operation type (READ, WRITE) num--the number of shadow variables in the translated reference to update clear_flags--flags to clear in each shadow set_flags--flags to set in each shadow The arguments used for each particular calling step and simulator input opcode (see Table 1) or mode (from that calling step) are given in Table 8. Steps 1425, 3525, 3645, and 3725 use num from the values returned from steps 1410, 3520, 3625, and 3705, respectively; step 4840 updates a single shadow. Step 5000 initializes an index (i) to the first shadow variable to update. Step 5005 determines if any shadows remain to be processed; if not, the algorithm terminates. Step 5010 sets a pointer to the with shadow, step 5015 clears the clear_flags in the shadow, and step 5020 sets the set_flags in the shadow. Step 5025 determines if a read operation is being performed; if so, step 5030 sets the read shadow from the current thread id and step 5035 updates the shadow read source position from the current source position; otherwise, step 5040 determines if the mem_defined flag appears in the set_flags. If so, step 5045 sets the write shadow from the current thread id and step 5050 updates the shadow write source position from the current source position. Step 5055 increments the shadow index and repeats the process. Calculate dependence: FIG. 51 is a flowchart of the algorithm performed by steps 4735, 4760, 4770, and 4780 to calculate various types of data dependences. The | ||||||
