|
|
|
Batch or transaction processing |
Distributed multi-version commitment ordering protocols for guaranteeing serializability during transaction processing5701480
Abstract
In a multi-version database, copies of prior committed versions (snapshots) are kept for access by the read-only transactions. The read-write transactions are selectively aborted to enforce an order of commitment of read-write transactions that is the same as an order of conflicts among the read-write transactions. In a preferred embodiment, the read-write transactions are serialized by maintaining and referencing a graph of conflicts among read-write transactions, and the read-only transactions are serialized by a timestamp mechanism for selection of the snapshots to be read. Each time that a read-write transaction is committed, the read-write transaction is assigned a unique timestamp that is used to timestamp all resources committed by the read-write transaction. Upon starting, each read-only transaction is also assigned a timestamp. Each read-only transaction reads only the latest committed versions of all resources, that are timestamped earlier than the timestamp of the read-only transaction. In a multiprocessing system, the timestamps are issued to global coordinators and distributed locally with atomic commit messages and global queries. Moreover, read-write transactions may selectively access a hierarchy of uncommitted versions to prepare for various possible commitment orders. The hierarchy defines a path for record access and for cascading aborts. A plurality of mutually-conflicting uncommitted versions may be prepared for each transaction to prepare for all possible commitment orders.
Claims
What is claimed is:
1. A method of operating a digital computer to process read-write transactions and read-only transactions in a computer system, said method comprising the steps of:
a) beginning preparation of results of said transactions;
b) determining an order of conflicts among said read-write transactions;
c) committing to memory state of said computer system prepared results of a selected one of said read-write transactions;
d) aborting an abort set of said read-write transactions for which commitment is contrary to said order of conflicts and said committing to memory state of said computer system said prepared results of said selected one of said read-write transactions;
e) retaining a prior version of memory state of said computer system existing prior to being updated by said prepared results of said selected one of said read-write transactions; and
f) permitting selected ones of said read-only transactions to read said prior version of memory state after said prepared results of said selected one of said read-write transactions are committed to memory state of said computer system, while preventing said read-write transactions from reading said prior version of memory state after said prepared results' of said selected one of said read-write transactions are committed to memory state of said computer system.
2. The method as claimed in claim 1, wherein said step of determining said order of conflicts includes detecting when a data access operation for one of said read-write transactions addresses data accessed by data access operations for other ones of said read-write transactions.
3. The method as claimed in claim 1, wherein said read-only transactions are prevented from reading any results of said read-write transactions before said results of said read-write transactions are committed to said state memory, so that said read-only transactions do not conflict with said read-write transactions.
4. The method as claimed in claim 1, wherein said read-write transactions include global read-write transactions that are distributed across the computing system and local read-write transactions that are not distributed across the computing system, and said step of determining an order of conflicts among said read-write transactions disregards conflicts between said local read-write transactions but takes into account indirect conflicts between said global read-write transactions that are caused by said local read-write transactions.
5. The method as claimed in claim 1, wherein said step of determining an order of conflicts comprises recording in memory of said digital computer a graph of conflict orders between said transactions, and wherein said method further comprises searching said graph for determining said abort set of transactions, said abort set including a transaction on each path in said graph to said selected one of said global transactions from a global transaction not yet having results aborted or committed to memory state of said computer system.
6. The method as claimed in claim 1, further comprising the step of selecting said selected one of said read-write transactions in response to a commit command from a global coordinator.
7. The method as claimed in claim 1, further comprising the step of selecting said selected one of said read-write transactions in order to minimize the number of read-write transactions that are aborted in said step d).
8. The method as claimed in claim 1, further comprising the step of receiving from a coordinator a request to prepare a specified one of said read-write transactions, and delaying acknowledgement of completion of preparation of said specified one of said read-write transactions until none of said transactions not yet committed in step c) nor aborted in step d) are contrary to said order of conflicts and commitment of said specified one of said read-write transactions.
9. The method as claimed in claim 1, wherein a read operation of a second one of said read-write transactions reads write data written by a write operation of a first one of said read-write transactions before said first one of said read-write transactions is committed, and wherein said method further comprises the step of aborting results of all of said transactions that have read data written by aborted read-write transactions.
10. The method as claimed in claim 1, wherein a sufficient number of prior committed versions of memory state of said computer system are retained to process said read-only transactions by permitting each read-only transaction to read a version of memory state last committed prior to the time that processing of said each read-only transaction is begun and without delaying commitment of read-write transactions that update memory state to be read by said each read-only transaction.
11. The method as claimed in claim 1, wherein said computer system includes a plurality of processors, said read-write transactions include global read-write transactions that are distributed across the computer system and that are each processed and committed at more than one of said processors, and said read-only transactions include global read-only transactions that are distributed across the computer system and are each processed at more than one of said processors, and said method further includes synchronizing the processing of said global read-only transactions with commitment of results of said global read-write transactions at each of said processors so that each global read-only transaction reads results committed at one of said processors that are consistent with results committed, and read by said each global read-only transaction, at another of said processors.
12. The method as claimed in claim 11, wherein said synchronizing is performed by synchronizing transmission of commitment messages from a global update coordinator to said processors with initiation of said each global read-only transaction by a global query coordinator.
13. The method as claimed in claim 11, wherein said synchronizing is performed by serializing performance of said global read-only transactions with commitment of said read-write transactions.
14. The method as claimed in claim 11, wherein the performance of said global read-only transactions is serialized with the commitment of said read-write transactions by issuing a timestamp when each of said global read-write transactions is committed and when each of said global read-only transactions is initiated, and for each global read-only transaction, comparing the timestamp issued to said each global read-only transaction with timestamps issued to global read-write transactions to determine a proper version of said memory state of said computer system to be read by said each global read-only transaction.
15. A method of operating a digital computer to process read-write transactions and read-only transactions in a computer system, said method comprising the steps of:
a) beginning preparation of results of said transactions;
b) determining an order of conflicts among said read-write transactions by detecting when a data access operation for one of said read-write transactions addresses data accessed by data access operations for other ones of said read-write transactions;
c) committing to memory state of said computer system prepared results of a selected one of said read-write transactions;
d) aborting an abort set of said read-write transactions for which commitment is contrary to said order of conflicts and said committing to memory state of said computer system said prepared results of said selected one of said read-write transactions;
e) retaining a prior version of memory state of said computer system existing prior to being updated by said prepared results of said selected one of said read-write transactions; and
f) maintaining a sufficient number of prior committed versions of memory state of said computer system to process said read-only transactions by permitting each read-only transaction to read a version of memory state last committed prior to the time that processing of said each read-only transaction is begun and without delaying commitment of read-write transactions that update memory state to be read by said each read-only transaction, while preventing said read-write transactions from reading any of said prior versions of memory state having been updated by committed ones of said read-write transactions, and preventing said read-only transactions from reading any results of said read-write transactions before said results of said read-write transactions are committed to said memory state of said computer system, so that said read-only transactions do not conflict with said read-write transactions.
16. The method as claimed in claim 15, wherein said computer system includes a plurality of processors, said read-write transactions include global read-write transactions that are distributed across the computing system and that are each processed and committed at more than one of said processors, and said read-only transactions include global read-only transactions that are distributed across the computing system and that are each processed at more than one of said processors, and said method further includes synchronizing the processing of said global read-only transactions with commitment of results of said global read-write transactions at each of said processors so that each global read-only transaction reads results committed at one of said processors that are consistent with results committed, and read by said each global read-only transaction, at another of said processors.
17. The method as claimed in claim 16, wherein said synchronizing is performed by synchronizing transmission of commitment messages from a global update coordinator to said processors with initiation of said each global read-only transaction by a global query coordinator.
18. The method as claimed in claim 16, wherein performance of said global read-only transactions is synchronized and serialized with commitment of said read-write transactions by issuing a timestamp when each of said global read-write transactions is committed and when each of said global read-only transactions is initiated, and for each global read-only transaction, comparing the timestamp issued to said each global read-only transaction with timestamps issued to global read-write transactions to determine the version of memory state last committed prior to the time that processing of said each global read-write transaction is begun.
19. A digital computer system for processing read-write transactions and read-only transactions, said digital computer system comprising, in combination:
a) means for performing operations of said read-write transactions such that operations of some read-write transactions are performed in accordance with availability of resources of said digital computer system before commitment of other read-write transactions;
b) means for determining an order of conflicts among said read-write transactions; and
c) means for enforcing an order of commitment of selected ones of said read-write transactions in accordance with said order of conflicts, said means for enforcing including means for delaying commitment of selected read-write transactions and means for aborting an abort set of said read-write transactions selected so that committing of a selected one of said read-write transactions before commitment of other of said read-write transactions excluded from said abort set is consistent with said order of conflicts; and
d) means for maintaining a sufficient number of prior committed versions of memory state of said computer system to process said read-only transactions by permitting each read-only transaction to read a version of memory state last committed prior to the time that processing of said each read-only transaction is begun and without delaying commitment of read-write transactions that update memory state to be read by said each read-only transaction, while preventing said read-write transactions from reading any prior versions of memory state having been updated by committed ones of said read-write transactions, and preventing said read-only transactions from reading any results of said read-write transactions before said results of said read-write transactions are committed to said memory state, so that said read-only transactions do not conflict with said read-write transactions.
20. The digital computer system as claimed in claim 19, further including means for delaying the aborting of read-write transactions in said abort set when said means for delaying commitment delays commitment of said selected one of said read-write transactions.
21. The digital computer system as claimed in claim 19, wherein said means for delaying includes means for delaying acknowledgement of completion of preparation of a selected one of said read-write transactions until committing of said specified one of said read-write transactions before committing all other of said read-write transactions not yet committed nor aborted is consistent with said order of conflicts.
22. The digital computer system as claimed in claim 21, further comprising means for terminating said delaying in response to a signal in an atomic commitment protocol.
23. The digital computer system as claimed in claim 19, wherein said means for aborting includes means for aborting all of said read-write transactions that have read data written by aborted read-write transactions.
24. The digital computer system as claimed in claim 19, wherein said means for determining said order of conflicts comprises means for recording in memory of said digital computer system a graph of conflict orders between said read-write transactions, and means for searching said graph for determining said abort set of read-write transactions, said abort set including a read-write transaction on each path in said graph to said selected one of said read-write transactions from a read-write transaction not yet having results aborted or committed to memory state of said digital computer system.
25. The digital computer system as claimed in claim 19, wherein said read-write transactions include global read-write transactions that are distributed across said computer system and that are each committed at more than one of said processors, and local read-write transactions that are not distributed across said computer system and that are each committed at only one of said processors, and wherein said digital computer system includes means employing write-locks for ensuring that local read-write transactions provide consistent results but permitting global transactions to disregard said write-locks, and wherein said means for determining an order of conflicts disregards conflicts between said local read-write transactions but takes into account indirect conflicts between said global read-write transactions caused by said local read-write transactions.
26. The digital computer system as claimed in claim 19, wherein said digital computer system includes a plurality of processors, said read-write transactions include global read-write transactions that are distributed across said processors and that are each processed and committed at more than one of said processors, and said read-only transactions include global read-only transactions that are distributed across said processors and that are each processed at more than one of said processors, and said computer system further includes means for synchronizing the processing of said global read-only transactions with commitment of results of said global read-write transactions at each of said processors so that each global read-only transaction reads results committed at one of said processors that are consistent with results committed, and read by said each global read-only transaction, at another of said processors.
27. The digital computer system as claimed in claim 26, wherein said means for synchronizing includes means for synchronizing transmission of commitment messages from a global update coordinator to said processors with initiation of said each global read-only transaction by a global query coordinator.
28. The digital computer system as claimed in claim 26, wherein said means for synchronizing includes means for serializing processing of said global read-only transactions with commitment of said read-write transactions by issuing a timestamp when each of said global read-write transactions is committed and when each of said global read-only transactions is initiated, and for each global read-only transaction, comparing the timestamp issued to said each global read-only transaction with timestamps issued to global read-write transactions to determine the version of memory state last committed prior to the time that processing of said each global read-write transaction is begun.
29. A method of operating a digital computer to process read-write transactions in a computer system, said method comprising the steps of:
a) beginning preparation of results of said read-write transactions, said results including uncommitted versions of memory state associated with said read-write transactions, and
(i) preparing a first transaction by modifying a last committed version of memory state to create a first uncommitted version of memory state associated with said first transaction, and
(ii) before committing said first uncommitted version to memory state of said computer system, preparing a second transaction by modifying said first uncommitted version of memory state to create a second uncommitted version of memory state associated with said second transaction;
b) determining an order of conflicts among said versions of memory state associated with said read-write transactions, wherein said order of conflicts includes said second uncommitted version of memory state following said first uncommitted version of memory state;
c) committing to memory state of said computer system prepared results of a selected one of said read-write transactions; and
d) aborting an abort set of said uncommitted versions of memory state for which commitment is contrary to said order of conflicts and said committing to memory state of said computer system said prepared results of said selected one of said read-write transactions, and when said first uncommitted version of memory state is aborted, aborting said second uncommitted version.
30. The method as claimed in claim 29, wherein said beginning preparation of results of said read-write transactions includes following a predefined strategy for selecting either said last committed version or one of said uncommitted versions for access by a third transaction for which preparation has begun.
31. The method as claimed in claim 30, wherein said predefined strategy includes initially selecting said last committed version, and from said last committed version preparing a third uncommitted version associated with said third transaction, and after a conflict is detected between said third uncommitted version and said first uncommitted version wherein said third uncommitted version is based upon a resource of said last committed version having been modified in said first uncommitted version, preparing from said first uncommitted version a fourth uncommitted version associated with said third transaction.
32. A method of operating a digital computer to process read-write transactions in a computer system, said method comprising the steps of:
a) beginning preparation of results of said read-write transactions, said results including uncommitted versions of memory state associated with said transactions, including:
(i) preparing a first transaction by modifying a last committed version of memory state to create a first uncommitted version of memory state associated with said first transaction, and
(ii) preparing a second transaction by modifying said last committed version of memory state to create a second uncommitted version of memory state associated with said second transaction;
b) during preparation of said uncommitted versions, detecting conflicts between said uncommitted versions, said conflicts occurring when one uncommitted version includes a modified version of a resource of a prior version and another uncommitted version is based upon said resource of said prior version, wherein said first uncommitted version includes a modified version of a resource in said last committed version upon which said second uncommitted version is based, and said second uncommitted version includes a modified version of a resource in said last committed version upon which said first uncommitted version is based, so that said first uncommitted version and said second uncommitted version are mutually conflicting;
c) committing to memory state of said computer system an uncommitted version of memory state associated with a selected one of said read-write transactions; and
d) aborting an abort set of said uncommitted versions of memory state which are based upon any resources of any prior version having been modified in said version of said memory state committed in step (c), wherein one of said first and second uncommitted versions is aborted when the other of said first and second uncommitted versions is committed in step (c).
33. The method as claimed in claim 32, wherein said beginning preparation of results of said read-write transactions includes following a predefined strategy for selecting either said last committed version or one of said uncommitted versions for access by a third transaction for which preparation has begun.
34. The method as claimed in claim in claim 33, wherein said predefined strategy includes initially selecting said last committed version, and from said last committed version preparing a third uncommitted version associated with said third transaction, and after a conflict is detected between said third uncommitted version and said first uncommitted version wherein said third uncommitted version is based upon a resource of said last committed version having been modified in said first uncommitted version, preparing from said first uncommitted version a fourth uncommitted version associated with said third transaction.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to distributed computing, and more particularly to a transaction processing system in which component operations in related transactions are distributed so that at least one operation in a second transaction is performed before a first transaction having a conflicting operation is committed. The present invention specifically concerns a method and apparatus for scheduling the commitment of conflicting transactions in a multi-version transaction processing system that allows read-only transactions to be executed without blocking read-write transactions or being blocked by the read-write transactions.
2. Description of the Background Art
A desirable feature of a computing system is the ability to recover from partial system failures that interrupt memory write operations. If an application program has a memory write operation in progress at the time of the system failure, it is most likely that the memory record will become erroneous. To enable the recovery of memory records after a partial system failure, it is necessary for the application program to keep backup copies of the records in nonvolatile memory. When the computing system is restarted, the memory records to be recovered are replaced with the backup copies.
To facilitate the making of backup copies and the recovery of memory records, the operating system typically provides an established set of memory management procedures that can be invoked or called from an application program to define a "recovery unit." The recovery unit consists of program statements between a "START" statement and a "COMMIT" statement. All of the statements in the "recovery unit" must be completed before the memory records modified by the statements in the recovery unit are made available for subsequent processing. The "START" statement corresponds to initiating the making of a backup copy in nonvolatile memory, and the "COMMIT" statement corresponds to switching of the backup copy with a modified version. The statements in the "recovery unit" specify operations in a single "transaction." Upon recovering from a partial system error, inspection of the nonvolatile memory will reveal that the operations in the single "transaction" are either all completed, or none of them are completed.
In a distributed computing system, the operations in a single transaction may modify files in different data bases, and the files may be shared by other processes. During the operation of the transaction, the files may be inconsistent for a time, although the files will be consistent upon completion of the transaction. A typical example is a transfer of funds from one account to another, in which a first account is debited, and at a slightly later time, another account is credited. During the interim, the two accounts are inconsistent because the sum of the two accounts does not represent the total funds in the two accounts. Due to inconsistency when files are being modified by a transaction, it is known to prevent other processes from accessing the files until the modification is finished. Atomicity can be assured in this example by performing commitment for both files at the same time and place. By changing a single flag, for example, the backup copies of each file can be replaced at the same time with the modified versions of the files. In many instances, however, it is desirable to distribute the operations in a transaction among multiple processors or processes in a computing system, and to commit the transaction by committing the operations in each process or processor while permitting some variability between the times of commitment. In these instances, an "atomic commitment protocol" is typically used to ensure atomicity. The protocol requires the exchange of information about the state of the transaction between the processors or processes. To identify the transaction being performed, the transaction is typically assigned a unique "transaction identification number."
A widely used atomic commitment protocol is known as the "two-phase commit protocol." In a somewhat elementary example of this protocol, one processor or process in the computing system is assigned the role of a coordinator which initiates the commit process of a transaction. For this purpose, the coordinator sends a prepare command to all of the processors or processes participating in the transaction.
Upon receipt of the "prepare" command, each processor or process participating in the transaction checks whether the operation can be completed successfully, writes an indication of the decision to acknowledge successful completion together with the transaction identification number into permanent memory to remember that it is prepared for the transaction, and then sends an acknowledgement back to the coordinator processor, but does not yet commit its results for the transaction. The coordinator waits for acknowledgements from all of the participants. When the coordinator receives acknowledgements from all of the participants, the coordinator records in permanent memory a list of the participants and a notation that the transaction is now being completed, and then the coordinator sends "commit" commands to all of the participants. The coordinator, however, may receive a message from a participant indicating that it cannot prepare for the transaction, or the coordinator may fail to receive acknowledgements from all of the participants after a predetermined time period, possibly after the coordinator has retransmitted the "prepare" command. In this case the coordinator transmits an "abort" command to all of the participants.
Upon receipt of the "commit" command, each participant checks its permanent memory for the transaction identification number to determine whether the participant has prepared for the transaction, and, if it has, it then performs a "COMMIT" operation to write its results into permanent memory and clear the transaction ID from permanent memory in one "atomic" step. Then the participant sends an acknowledgement back to the coordinator. When the coordinator receives acknowledgments from all of the participants, it erases the list of participants from permanent memory, and the transaction is finished.
Additional complexity is introduced when it is desired to process global transactions concurrently across multiple processors or processes in a distributed computing system. It is well known that global serializability is not guaranteed merely by ensuring that each processor or process achieves local serializability, because local transactions may introduce indirect conflicts between distributed global transactions. It is impractical to permit a processor or process to view a global picture of all the conflicts in all of the other processors or processes. Without a global picture, however, it is difficult for a processor or process to ensure that there is a correlation between its serialability order and the serialability orders of the other processors or processes. Time-stamping of transaction requests and data updates is one method that has been used to address this problem of concurrency control. In general, concurrency control in a distributed computing system has been achieved at the expense of restricted autonomy of the local processors or processes, or by locking.
The problem of global deadlock also has to be addressed whenever global transactions are performed concurrently. One known solution is to provide a global transaction scheduler that decides whether or not to dispatch concurrent global transaction requests. An example is described Y. Breitbart et al., "Reliable Transaction Management in a Multidatabase System", Proc. of the ACM SIGMOD conf. on Management of Data, Atlantic City, N.J., June 1990, pp. 215-224. The global scheduler keeps track of global transaction requests for local locks on data items by using a global lock mechanism. Each global data item has a global lock associated with it. A global transaction that needs only to read a data item requests a global read-lock. Locks are conflicting if they are requested by two different transactions on the same data item and at least one of the requested locks is a write-lock. If two global transactions request conflicting global locks, the scheduler will prevent one of the transactions from proceeding because it knows that the two transactions will cause a conflict at the local site. The scheduler uses strict two-phase locking for allocating global locks to global transactions, and maintains a global "wait for graph." The "global wait for graph" is a directed Graph G=(V,E) whose set of vertices V is a set of global transactions and an edge T.sub.i .fwdarw.T.sub.j belongs to E if and only if Global transaction T.sub.i waits for a global lock allocated to global transaction T.sub.j. If a global transaction waits for a global lock, then the transaction state becomes "blocked" and the transaction is included in the "global wait for graph." The transaction becomes active again only after it can obtain global locks that it was waiting for. To avoid global deadlocks, the "global wait for graph" is always made acyclic. To ensure data consistency in the presence of failures, the scheduler also uses a "commit graph" and a "wait-for-commit graph" to determine when to schedule a commit operation. The commit graph CG=<TS,E> is an undirected bipartite graph whose set of nodes TS consists of a set of global transactions (transaction nodes) and a set of local sites (site nodes). Edges from E may connect only transaction nodes with site nodes. An edge (T.sub.i, S.sub.j) is in E if and only if transaction T.sub.i was executing at site S.sub.j, and the commit operation for T.sub.i has been scheduled for processing. After the commit operation for T.sub.i is completed, T.sub.i is removed from the commit graph along with all edges incidental to T.sub.i. Global database consistency is assured if the commit graph does not contain any loops. The wait-for-commit graph is a directed Graph G=(V,E) whose set of vertices V consists of a set of global transactions. An edge T.sub.i .fwdarw.T.sub.j is in E if and only if T.sub.i has finished its execution, but its commit operation is still pending and T.sub.j is a transaction whose commit operation should be completed or aborted before the commit of T.sub.i can be scheduled. The scheduler uses the following algorithm for constructing the wait-for-commit graph, and in scheduling a commit operation of transaction T.sub.i :
1. For each site S.sub.k in which T.sub.i is executing, temporarily add the edge T.sub.i .fwdarw.S.sub.k to the commit graph.
2. If the augmented commit graph does not contain a cycle, then the global commit operation is submitted for processing, and the temporary edges become permanent.
3. If the augmented commit graph contains a cycle then:
a) The edges T.sub.i .fwdarw.T.sub.i1, . . . T.sub.i .fwdarw.T.sub.im are inserted into the wait-for-commit graph. The set {T.sub.i1, T.sub.i2, . . . T.sub.im } consists of all the transactions which appear in the cycle which was created as a result of adding the new edges to the commit graph.
b) Remove the temporary edges from the commit graph.
The transaction T.sub.i, however, need not necessarily wait for the completion of every transaction T.sub.ik such that T.sub.i .fwdarw.T.sub.ik. It may be ready to be scheduled for a commit operation after some of transactions T.sub.ik such that T.sub.i .fwdarw.T.sub.ik (0<1<r) successfully commit (and in some cases, a successful commit of only one such transaction would be sufficient to schedule the transaction's commit|).
Global serializability can be guaranteed in a distributed transaction processing system by enforcing a "commitment ordering" for all transactions. In Yoav Raz, U.S. patent application Ser. No. 07/703,394, filed May 21, 1991, and entitled "Commitment Ordering For Guaranteeing Serializability Across Distributed Transactions," it was shown that if global atomicity of transactions is achieved via an atomic commitment protocol, then a "commitment ordering" property of transaction histories is a sufficient condition for global serializability. The "commitment ordering" property occurs when the order of commitment is the same as the order of performance of conflicting component operations of transactions. Moreover, it was shown that if all of the local processes were "autonomous," i.e., they do not share any concurrency control information beyond atomic commitment messages, then "commitment ordering" is also a necessary condition for global serializability.
Multi-version (MV) based database systems allow queries (read-only transactions) to be executed without blocking, or being blocked by updaters (read-write transactions). Although various mechanisms for multi-versioning have been proposed, most maintiain several prior versions of a data item or object. In general, there is a considerable storage cost for maintaining the required number of prior versions, and for reading older versions rather than younger versions of the data objects. Attempts to reduce these costs have focused on improving the efficiency of caching or buffering the prior versions, and on scheduling the read-only transactions in order reduce the number of prior versions that are kept in storage. In general, scheduling selectively delays the initiation of read-only transactions, and therefore involves a trade-off of querry response time for reduced storage cost and system overhead. It has also been proposed that consistency for read-only transactions should be relaxed, when appropriate, to permit the reading of younger versions, in order to reduce the number of prior versions. These performance issues are further discussed in P. Bober & M. Carey, "On Mixing Queries and Transactions via Multiversion Locking", in Proc. of the Eighth Int. Conf. on Data Engineering, pp. 535-545, Tempe, February 1992; and P. Bober & M. Carey, "Multiversion Query Locking", in Proc. of the Eighteenth Int. Conf. on Very Large Databases, pp. 497-510, Vancouver, British Columbia, August 1992.
A multi-versioning mechanism is employed in the "Rdb/VMS" (Trademark) and "VAX/DBMS" (Trademark) operating systems sold by Digital Equipment Corporation of Maynard, Mass. A "snapshot" mechanism eliminates the need for read locks and also prevents the blocking of read-only transactions by write locks. The "snapshot" mechanism permits a transaction to obtain, at any time, a consistent version of data existing at the time that the transaction begins. Write locks, however, are placed on records to be accessed by a read-write transaction, and the write locks are not released until the results of the read-write transaction are committed. Recoverability is further ensured by flushing to an "undo" log the "before-images" of records to be updated, and then flushing the updated records to state memory just before a transaction is committed. If a crash occurs, the updated records are replaced with "before images" that are obtained from the "undo log" to undo the effects of the failed transactions.
For ensuring recoverability, a "redo" log may be used instead of, or in addition to, an "undo" log. As described in the Spiro et al., U.S. patent application Ser. No. 07/717,212 filed Jun. 18, 1991, entitled "Recovery Logging in the Presence of Snapshot Files by Ordering of Buffer Pool Flushing," updated records are not flushed to state memory after every transaction. Instead, updated records are written sequentially to an afterimage log, and all of the updated records are flushed to state memory only when certain "check points" occur. The "check points" occur, for example, after a specified number of transactions, or after a predetermined number of bytes have been written to the after-image log after the last checkpoint. The "redo" recovery mechanism therefore allows the updated, committed records to remain in volatile memory. When a system crash occurs, the volatile state memory existing at the end of the last committed transaction is reconstructed by reading from the non-volatile state memory the state memory records existing at the time of the last checkpoint, and re-doing the modifications existing in the after-image log.
In a multi database environment, transactions may span several single-version based database systems, as well as multi-version based systems. The database system may implement various concurrency control techniques. It is required that a globally correct concurrency control is guaranteed at least for read-write transactions in such a system. In general, in the multi database environment, multi-versioning in more costly because of a larger number of outstanding transactions. Blocking by write locks from read-write transactions tends to establish a threshold for system performance, and in interactive querry systems, this threshold may cause troublesome delays in querry processing at times of peak demand. Therefore, there is a need for a method of distributed concurrency control that avoids blocking by write locks and is applicable to multi-version database systems in a heterogenous environment.
SUMMARY OF THE INVENTION
The present invention guarantees serializability in a computing system across distributed read-write and read-only transactions referencing a multi-version database, wherein copies of committed versions (snapshots) are kept in order to facilitate the processing of the read-only transactions. The "snapshots" include copies of "prior" committed versions, where a "prior" committed version is a version committed earlier than the last committed version. The read-only transactions may read the prior committed versions. To ensure serializability, however, a read-write transaction is not permitted to read the prior committed versions. The read-write transactions are selectively aborted to enforce an order of commitment of read-write transactions that is the same as an order of conflicts among the read-write transactions.
In a preferred embodiment, a performance advantage is obtained by preventing the read-only transactions from reading uncommitted versions, and by maintaining a sufficient number of "snapshot" versions so that read-only transactions need not be aborted or delayed. When a read-write transaction accesses a version, for example, a "snapshot" of the version is made at that time for the benefit of any read-only transactions. The new "snapshot" version is pushed on a queue of any previously existing "snapshot" version for the same resource. A "garbage collection" mechanism removes "snapshot" versions that are sufficiently old that they will not be needed by any read-only transactions.
In a preferred embodiment, different serializing mechanisms are used to handle read-write transactions and read-only transactions. Read-write transactions are serialized by maintaining and referencing a graph of conflicts among read-write transactions. Read-only transactions are serialized by employing a timestamp mechanism for selection of the snapshot versions to be read. Each time that a read-write transaction is committed, the read-write transaction is assigned a unique timestamp that is used to timestamp all resources committed by the read-write transaction. Upon starting, each read-only transaction is also assigned a timestamp. Each read-only transaction reads only the latest committed versions of all resources, that are timestamped earlier than the timestamp of the read-only transaction. In a multiprocessing environment employing a single processor, the timestamps are conveniently propagated to the various processes along with commit messages of an atomic commitment protocol. In a multiprocessor environment wherein multiple processors function as global coordinators, it is convenient to select a single one of the processors as a source for all of the time stamps.
In an alternative embodiment, time-stamps are not used, but instead the initiation of each global read-only transaction is synchronized with commitment of any global read-write transactions that consistently update the resources to be read by the global read-only transaction, so that only "update consistency" is achieved for the global read-only transaction.
In a typical transaction processing system, a second read-write transaction can read data written by a first transaction only after the second transaction has been committed. This restriction is a sufficient condition to ensure recoverability of the system. To practice the present invention in this case, when a second read-write transaction performs a read operation before a conflicting write operation of a first read-write transaction is committed at a time when the second read-write transaction has not yet committed, the second read-write transaction is aborted to ensure that the order in which the read-write transactions are committed is not different from the conflict order of the read-write transactions.
The present invention, however, permits the construction of a transaction processing system in which a second read-write transaction may read data written by a write operation of a first read-write transaction before the first read-write transaction is committed. In this case, depending on the respective order in which the two conflicting operations occur, either of the two read-write transactions may be aborted to ensure that the order of commitment is the same as the conflict order of the read-write transactions. Moreover, to insure recoverability, both of the read-write transactions should be aborted in the case of the read operation following the write operation and the read operation being performed before aborting of the write operation. In general, in a transaction processing system in which a second read-write transaction may read data written by a write operation of a read-write transaction, recoverability is enforced by a process of cascading aborts; the aborting of a transaction requires the additional aborting of all other transactions that have read data written by aborted transactions.
In a preferred embodiment in a multi-processing or multiprocessor system, a global transaction commitment order of read-write transactions is enforced by committing a selected global read-write transaction for which a result has been prepared, and aborting an abort set of other read-write transactions for which a result is being prepared or is prepared. The global read-write transaction to commit is selected, for example, by a commitment request from an atomic commitment coordinator. The abort set is selected so that the committing of the selected global read-write transaction is not contrary to the order of conflict with read-write transactions that are not included in the abort set. In a multiprocessor system in which an atomic commitment coordinator communicates with a plurality of transaction processors by way of "prepare" and "commit" commands, acknowledgement that a transaction has been "prepared" is preferably delayed until an "abort set" for the transaction has been minimized.
In a multiprocesssor or multiprocessing system, each process may have global transactions as well as local transactions, and the local transactions may be serialized by any kind of serialization mechanism. In this case, it is preferable to assume that a read-write transaction is global, unless indicated otherwise, because an incorrect assumption that a transaction is global will not cause a serializability violation. It is preferred to maintain a directed graph for each local processor or process. The nodes are all the undecided global read-write transactions being processed by the processor or process, together with all the non-aborted local transactions (i.e., committed and undecided) that lie on paths or possible further paths in the graph between undecided global read-write transactions. Edges in the graph represent the order of performance of conflicting operations of the read-write transactions. In particular, there is an edge from transaction T1 to transaction T2 if the transactions have respective conflicting operations, and the respective operation of T2 has occurred after the respective operation of T1. Each time a global read-write transaction is committed, all paths and possible future paths to it in the graph from all undecided transactions are disconnected by aborting a selected set of transactions on the paths. The aborted transactions, for example, are all the undecided transactions on the paths from undecided global read-write transactions to the committed transactions, which are closest (one on each path separately) to the committed transaction. Additional searching through the graph from the committed transaction could be done to possibly find a more optimal "abort set." The graph is further maintained by removing global decided (both committed and aborted) read-write transactions, and local aborted transactions. A local committed transaction, however, is removed from the graph only when there is no path to it from any undecided transaction. Local transactions are committed upon an explicit request from the local concurrency control mechanism.
In accordance with another aspect of the invention, each time that processing is begun for a read-write transaction, a predefined strategy is followed to select, for the processing of the transaction, either the last committed version of memory state or any existing uncommitted version of memory state. Conflicts with the transaction will depend on the particular version selected, and therefore a predefined strategy is followed to minimize aborting of the results of transactions.
In one preferred embodiment, the last committed version is selected for transaction processing unless selection of the last committed version causes conflict with another active read-write transaction. If there is such conflict, then the uncommitted version of that other active read-write transaction is selected. In general, there may exist in the database a multiplicity of uncommitted versions, each associated with a possible commitment order for transactions following the last committed transaction. The uncommitted versions therefore form a hierarchy depending from the last committed version. When processing is begun for a transaction, an "instance" of the transaction is created that depends from the version selected for processing. The hierarchy, for example, defines a path by which records are fetched for processing of the transaction instance, and conflicts are detected when records are fetched. Moreover, each active transaction may have more than one "instance". Any plural instances of a transaction, however, are mutually conflicting. Therefore, a transaction "instance" must not be a descendant of any other instance of the same transaction, and when any instance of a transaction is committed, all other instances of the transaction are aborted.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the drawings in which:
FIG. 1 is a block diagram of a digital computer configured for transaction processing;
FIG. 2A is a flow chart of a procedure for performing transaction processing in the computer of FIG. 1 by switching between two banks of state memory;
FIG. 2B is an alternative procedure for operating the digital computer of FIG. 1 for transaction processing by saving copies of only the data records of state memory that are modified by a transaction;
FIG. 3 is a flow chart illustrating the operation of a conventional transaction processing system that permits multiple transactions to be processed in such a way that the performance of a second transaction is begun before the results of a first transaction are committed.
FIG. 4A illustrates various scheduling possibilities for conflicting memory access operations of distributed global transactions for the case in which a second transaction can read the write data of a first transaction only after the first transaction is committed;
FIG. 4B illustrates various scheduling possibilities for conflicting memory access operations of distributed global transactions for the case in which a second transaction can read the write data of a first transaction before the first transaction is committed;
FIG. 5A shows a digital computer configured in accordance with the present invention to enforce a global transaction commitment ordering in which distributed global transactions are committed in the order in which conflicting component operations are performed;
FIG. 5B illustrates a distributed computing system including a plurality of the digital computers as shown in FIG. 5A;
FIG. 6 illustrates a scheduling procedure employed by a transaction scheduler component of a digital computer in the system of FIG. 5B;
FIG. 7 illustrates an organization of a transaction list and related pointers which are used by the transaction scheduler for scheduling the performance of component operations of distributed transactions;
FIG. 8 is a schematic diagram illustrating a data structure corresponding to a graph of conflict ordering between distributed transactions having conflicting component operations;
FIG. 9 is a pictorial diagram of the graph corresponding to the data stored in the data structures of FIGS. 7 and 8;
FIG. 10 is a flow chart of a procedure that references the data structure of FIG. 7 to enforce global transaction commitment ordering;
FIG. 11 is a state diagram of the digital computer shown in FIG. 5A when used in a multi-processing system of FIG. 5B for processing both local and global transactions;
FIGS. 12A and 12B (collectively FIG. 12) are a flow chart of a procedure for selecting a transaction to commit and for selectively aborting transactions to enforce global transaction commitment ordering;
FIG. 13 is a flow chart of a "garbage collection" procedure for removing committed local transactions from the graph of conflict ordering shown in FIG. 9;
FIG. 14 is a flow chart of a procedure for committing and aborting transactions in response to signals from a coordinator of a global transaction;
FIG. 15 is a procedure for detecting a conflicting memory access operation during the preparation of a transaction;
FIG. 16 is a modified graph in which write-read conflicts are distinguished from other conflicts;
FIG. 17 is a flow chart of a recursive procedure for insuring recoverability by performing cascading aborts;
FIG. 18 shows a modification to the flow chart of FIG. 12 that should be made for an alternative embodiment of the invention that permits a global transaction to read data written by an undecided transaction;
FIG. 19 is a block diagram showing a global transaction commitment order coordinator employing the present invention inserted in a conventional transaction processing system between a transaction manager and a resource manager;
FIG. 20 is a state diagram of the transaction processing system of FIG. 19 for the processing of global transactions;
FIG. 21 is a state diagram of the transaction processing system of FIG. 19 for the processing of local transactions;
FIG. 22 is a timing diagram that illustrates why snapshots are useful in a transaction processing system;
FIG. 23 is a diagram illustrating a data structure using pointers to link volatile state memory records and volatile snapshot records to a hash table to enable a specified record to be found in volatile memory;
FIGS. 24A and 24B together comprise a flowchart of a procedure for fetching a desired record using the pointers of the data structure of FIG. 4;
FIG. 25 is a flowchart of a procedure for creating snapshot records from state memory records when the state memory records are updated by a transaction;
FIG. 26 is a diagram showing the preferred record organization as a page including variable-length segments; and
FIG. 27 is a flowchart of the preferred procedure for logging state information for ensuring recovery in a digital computer of a multi-version transaction processing system in accordance with the invention;
FIG. 28 is a flowchart illustrating a method of synchronizing the initiation of each global read-only transaction with commitment of any global read-write transactions that consistently update the resources to be read by the global read-only transaction, so that "update consistency" is achieved for the global read-only transaction;
FIG. 29 is a flowchart illustrating a method of serializing global atomic commit messages in order to achieve "strict consistency" for global read-only transactions;
FIG. 30 is a flowchart illustrating a method of processing global read-only transactions to achieve "strict consistency" in a transaction processing system that serializes global atomic commitment messages in accordance with FIG. 29;
FIG. 31 is a block diagram of a "commit history buffer" that can be used to adjust a list of active transactions for a global read-only transaction in order to permit the procedure of FIG. 30 to use snapshot mechanism of FIGS. 22 to 26 without modification in a multi-processor system having significant skews in message transmission time between processors;
FIG. 32 is flowchart of a procedure for writing information into the commit history buffer of FIG. 31 when a read-write transaction is committed in the procedure of FIG. 29;
FIG. 33 is a flowchart of a procedure for reading the commit history buffer and adjusting a list of active transactions for a global read-only transaction when processing of a global read-only subtransaction is initiated in the procedure of FIG. 30;
FIG. 34 is a serializability graph for a case in which both a first read-write transaction and a second read-write transaction write to the same resource in the last committed version in a multi-version transaction processing system;
FIG. 35 is a seriatizability graph for a case in which a first read-write transaction reads a resource in the last committed version and a second read-write transaction writes to the same resource in the last committed version;
FIG. 36 is a serializability graph for a case in which a first read-write transaction writes to a resource in the last committed version and a second read-write transaction reads the same resource in the last committed version;
FIG. 37 is a serializability graph for a case in which a first read-write transaction reads or writes to a resource in an uncommitted but ready version written by a second read-write transaction;
FIG. 38 is a serializability graph for a case in which a second read-write transaction reads or writes to a resource in an uncommitted but ready version written by a first read-write transaction;
FIG. 39 is a serializability graph showing all possible instances of three mutually conflicting read-write transactions in a multi-version transaction processing system, and showing the hierarchical nature of the conflicts resulting when transactions read or write to resources in uncommitted but ready versions written by other transactions;
FIG. 40 is a diagram showing a data structure for a node or "transaction instance" in the serializability graph of FIG. 39;
FIG. 41 is a diagram showing a data structure for a record accessed by a transaction instance in the serializability graph of FIG. 39;
FIG. 42 is a flowchart of a procedure for processing a transaction by referencing the serializability graph of FIG. 39;
FIG. 43 is a flowchart of a procedure for fetching a record by referencing the serializability graph of FIG. 39;
FIG. 44 is a flowchart of a procedure for checking for conflicts and adding new transaction instances to the serializability graph of FIG. 39 when accessing a record;
FIG. 45 is a flowchart of a procedure for referencing the serializability graph of FIG. 39 when committing a transaction; and
FIG. 46 is a flowchart of a procedure for referencing the serializability graph of FIG. 39 when aborting a transaction.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in detail here is and will be described in detail herein. It should be understood, however, that it is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Turning now to FIG. 1, there is shown a block diagram generally designated 20 of a digital computer configured for transaction processing. The computer 20 includes a central processing unit 21 for executing programmed instructions; a volatile random access memory 22 for holding instructions or data; a non-volatile memory 23 such as a hard disk drive, an input/output unit 24, and a real time clock 25. The non-volatile memory 23 includes a program memory 26 in which programs are stored, and a scratch memory area 27 for storing data records.
Typically, the digital computer 20 executes programs which have been transferred from the program memory 26 to the volatile random access memory 22. During the execution of a program, it is often necessary to operate upon an amount of data that exceeds the capacity of the volatile random access memory 22. In this case, data records are alternately stored and retrieved from the scratch memory area 27.
A common problem associated with the digital computer 20 is the likelihood that the execution of instructions by the central processing unit will become disrupted due to a hardware failure, software error or power failure. A power failure, for example, will cause the disappearance of data and programs stored in the volatile random access memory 22. The problem of the loss of data in the volatile random access memory 22 due to a power failure can be solved by storing back-up copies of data in the non-volatile memory 23. The back-up copies, however, must be made in such a way that considers the possibility of failure during a write operation to the non-volatile memory 23. In this case the data record affected by the write operation might have been corrupted and therefore must be discarded.
To deal with the problem of possible failure when writing to non-volatile memory, there has been established a method of programming called "transaction processing" which guarantees that a portion of the non-volatile memory (referred to hereinafter as "state memory") will either be unaffected by a transaction or will be properly updated by results of a transaction, in the presence of the failures. Transaction processing is based upon the technique of making a back-up copy of state memory before the results of a transaction are written to state memory, and also writing in non-volatile memory an indication of either a first processing phase in which the back-up copy is being made, or a second processing phase in which the results of a transaction are being written to state memory, in order to indicate which copy might have been corrupted during a failure. For making a back-up copy of state memory, for example, the non-volatile memory 23 includes two banks of state memory 28 and 29. To provide an indication of which bank of stat memory might have been corrupted by a failure, the non-volatile memory 23 includes a memory location 30 for storing a switch or flag.
When recovering from a failure, it is desirable to know the transaction that was last performed by the central processing unit 21, so that processing can be resumed from the interrupted point without repeating or skipping a transaction. For this purpose, whenever the state memory in either of the memory banks 28 or 29 is updated, a transaction identification code 31, 32 is written into the state memory along with the time 33, 34 at which the results of the transaction were first written (i.e., committed) to state memory.
Turning now to FIG. 2A, there is shown a flow chart of a procedure for guaranteeing that when recovering from a failure, the state memory of the computer 20 shown in FIG. 1 is either unaffected by a transaction or is properly updated by the result of a transaction. Assume, for example, that the computer system is turned on after a power failure. In a first step 51, the central processing unit 21 reads the value of the switch 30 stored in the non-volatile memory 23. This switch indicates which of the two banks of state memory 28, 29 might possibly have been corrupted by the power failure. In step 52, the central processing unit 21 references the value of the switch to read the bank of state memory known not to have been corrupted, and to make a "working copy" of the data in the other bank of state memory. Therefore, after step 52, both bank 28 and bank 29 of state memory have the same contents. Moreover, the transaction identifier 31, 32 and the commit time 33, 34 can be inspected to find the location in a program of the next transaction to be processed.
In step 53 that processing is continued by modifying the data in the working copy of state memory by writing results of the transaction being processed. The end of processing of the transaction is reached in step 54. To commit the results of the transaction to state memory, the value of the switch is changed in step 55, and in step 56 the changed value of the switch is written into the switch location 30 of the non-volatile memory. When viewed by the central processing unit 21 during recovery from a failure, the writing of the changed value of the switch into the non-volatile memory has the effect of insuring that the committed result of the transaction either has no effect upon the state memory, or properly updates the state memory, depending upon whether the failure occurs before or after the value of the switch has been written into the non-volatile memory. Because the value of the switch 30 is a single bit and the switch 30 is stored in a record different from the records of the banks of state memory, any failure occurring during the writing of the this single bit is inconsequential; in this case, neither of the banks of state memory should be corrupted, so the value of the switch does not matter.
The method of committing the result of a transaction as illustrated in FIG. 2A is rather inefficient when the result of a transaction modifies only a small portion of the state memory. In this case, step 52 spends a good deal of time unnecessarily copying data records that have not been modified. This unnecessary copying can be eliminated by the somewhat more complex procedure of FIG. 2B.
In the first step 61 of FIG. 2B, the switch is read from the non-volatile memory. Next, in step 62, the central processing unit checks whether the switch is set. If so, then a failure occurred during the processing phase in which the results of a transaction were being committed to state memory, as further described below. Therefore, in step 63 records saved in the state memory bank 29 are copied to state memory bank 28. Then, in step 64, the switch in non-volatile memory is cleared.
To process a transaction, in step 65, data records are read from the state memory bank 28 and transferred into the scratch memory area 27. Then in step 66 the records in scratch memory are modified in accordance with results of the transaction. When the transaction is finished, as found in step 67, then in step 68, original data of records to be modified are copied from state memory bank 28 to the state memory bank 29. Then in step 69 the switch is set in non-volatile memory. Then in step 70 the results of the transaction are committed by writing the modified data into the state memory bank 28. Finally, in step 64, the switch is cleared in non-volatile memory. Processing of the next transaction begins in step 65.
As described above with respect to FIG. 2A or FIG. 2B, it is assumed that transactions are processed in sequence by the central processing unit 21 of the computer 20 in FIG. 1. In a conventional transaction processing system, however, the processing of transactions are typically distributed in such a way that the performance of a second transaction is begun before the results of a first transaction are committed. Moreover, the preparation and committing of transactions is scheduled in such a way as to ensure consistent results. In other words, the transaction processing system provides a mechanism for enforcing local serializability. The scheduling of operations for the transactions is typically performed by a multi-tasking or multi-processing operating system program that services a transaction queue. In such a system, the transaction at the head of the queue is given priority and is processed unless this transaction at the head of the queue must wait for completion of an input/output operation or a memory access operation to non-volatile memory. In this situation, the transaction having priority may return execution to the operating system, and the operating system will pass execution to the next transaction having priority. Upon completion of the input/output or memory access operation, however, an input/output or memory interrupt will occur, causing execution to be interrupted in favor of an interrupt handler that will return execution to the operating system. The operating system will then transfer execution to transaction at the head of the queue, which was waiting for the completion of the input/output or memory access operation. In this fashion, the resources of the computer 20 are used more effectively. Such multi-tasking and multi-processing operating systems are well known in the art and are available commercially from the major computer manufacturers. A specific example is the "Rdb/VMS" (Trademark) and "VAX DBMS" (Trademark) brand of operating systems manufactured and sold by Digital Equipment Corporation of Maynard, Mass. 01754-1418. A detailed description of Rdb/VMS is given in L. Hobbs et al., Rdb/VMS--A Comprehensive Guide, Digital Press, Digital Equipment Corporation, Maynard, Mass., 1991. The processing of transactions in such a conventional system will now be described below, with reference to FIG. 3.
To ensure ease of recovery in the situation where a second transaction is begun before a first transaction commits, the second transaction is usually precluded from reading any results of the first transaction before the first transaction commits. A transaction places "write locks" on the state memory records to be modified by the transaction, and these "write locks" are removed when the transaction is committed, as further described below with reference to FIG. 3.
To ensure consistency of data read by a transaction, the transaction could place "read locks" on any state memory records that are read by the transaction. The use of memory locks, however, inhibits concurrency between transactions, which causes a decrease in transaction processing speed. Therefore, the "Rdb/VMS" (Trademark) operating system uses a known "snapshot" mechanism to prevent memory locks from blocking read operations by read-only transactions. The "snapshot" mechanism permits a "read-only" transaction to read, at any time, a consistent version of any state memory record existing at the time that the transaction begins. In particular, a lock placed on a record for the benefit of a first transaction need not block the reading of the record by a second "read-only" transaction because a "snapshot" of the locked record is created for the benefit of the "read-only" transactions, as further described below with reference to FIG. 3. The "snapshot" mechanism is further described in L. Hobbs et al., Rdb/VMS--A Comprehensive Guide, cited above, and it is also described in Spiro et al. U.S. patent application Ser. No. 07/717,212 filed Jun. 18, 1991, incorporated herein by reference.
Turning now to FIG. 3, there is shown a flow chart of the operation of the computer 20 for processing transactions when using the "Rdb/VMS" (Trademark) operating system. In this case the operating system uses a conventional "undo" recovery procedure, in contrast to the procedure of FIG. 2B, which is known as a "re-do" procedure. When the computer (20 in FIG. 1) is turned on, for example after a power failure, execution by the central processing unit (21 in FIG. 1) begins in the first step 71. In step 71, the Central processing unit 21 reads the switch from non-volatile memory. If the switch is found in step 72 to be set, then execution branches to step 73, to copy records saved in BANK(1) to BANK(0). In step 73, the transaction-ID recorded in BANK(1) is also copied to BANK(0). Then in step 74 the switch in non-volatile memory is cleared. Steps 71 to 74 in effect "undo" the effects of failed transactions. The BANK(1) save records constitute a so-called "before-image log file" indicating records that were modified by failed transactions (i.e., the transactions that had begun but had not yet committed at the time that the failure interrupted the processing of the transactions). The switch read from non-volatile memory in step 71 is an indication of whether or not the "before-image log file" contains any records that were modified by transactions that have not yet been committed.
Once the non-volatile state memory in BANK(0) has been restored, transaction processing can resume in step 75 by beginning processing for a next transaction T.sub.x selected by the scheduler of the operating system. The scheduler, for example, selects the next transaction T.sub.x from a predefined schedule based on the transaction having been last committed that is indicated by the transaction-ID recorded in BANK(0), and begins a "fetch" phase for the transaction T.sub.x. In step 75, a "lock manager" program is called to check the availability of records to be accessed by the transaction T.sub.x. A multi-processing operating system typically provides such a "lock manager". The lock manager, for example, maintains lock data structures such as a hash index table to a cache of locks. The cache of locks is indexed before a record is fetched in the following step 76, in order to determine whether a record to be accessed by the current transaction is already locked, and to lock a free record to be accessed by a "read-write" transaction. Such a lock manager is desirable in multi-processing systems to simplify scheduling. If a record to be accessed by the current transaction is already locked, then the operating system is invoked to interrupt processing of the current transaction, and to begin or continue processing of another transaction, such as the transaction having locked the record. Otherwise, the record is locked for the transaction T.sub.x.
Once the records to be accessed by the current transaction are locked, in step 76 the records are fetched from BANK(0) and written into volatile memory. In step 77, the records to be modified are copied into BANK(1). In step 78, "snapshot copies" of the records to be modified are also made. This completes the "fetch" phase for the transaction T.sub.x.
Next, in step 79, the records are modified in accordance with results of the transaction. Under the control of the scheduler of the operating system, processing of the transaction T.sub.x may be interrupted in step 79 (for example while waiting for completion of a memory or input/output request), to perform operations of other transactions. Moreover, preparation of results for a transaction T.sub.y may become finished in step 79, as detected by the scheduler in step 80, or the processing of a transaction may be interrupted to begin processing of a new transaction, as detected by the scheduler in step 81. Therefore, a number of "before images" may be logged in the BANK(1) state memory, and processing of a number of transactions may begin, until a transaction T.sub.y is ready to be committed, as found in step 80.
In step 82, a "commit" phase is begun for the transaction T.sub.y, by setting the switch in non-volatile memory. Next, in step 83, the records modified by the transaction T.sub.y are written into BANK(0), into BANK(0), and the transaction ID of the transaction T.sub.y is also recorded in BANK(0). In step 84, the "lock manager" is called to release the locks on the records modified by the transaction Ty. In step 85, the switch in non-volatile memory is cleared. Finally, in step 86, the transaction ID of the transaction T.sub.y is recorded in BANK(1). This completes the "commit phase" of processing of the transaction T.sub.y. Then, as selected by the scheduler in step 81, processing of other transactions continues in step 79 or processing for a new transaction is begun in step 75.
FIG. 3 was described in terms of a multiplicity of transactions having begun before some of the multiplicity of transactions have committed. In this situation the scheduler of the operating system program time-shares execution among the multiplicity of transactions during the transaction processing steps 75 to 81. In step 75, the lock manager places locks on a group of records that must be accessed in a consistent fashion during a "read-write" transaction, in order to prevent other transactions from also writing to them and to prevent other transactions from reading inconsistent records. When a "read-only" transaction desires to read a record, it invokes the "snapshot" mechanism, which accesses the lock data structures to determine whether the desired record is locked, and when the desired record is locked, a "snapshot copy" of the record is read instead of the record in the state memory of BANK(0).
In order to guarantee the serializability of transactions in a distributed environment, each transaction is specified as either a "read-only" transaction or a "read-write" transaction. A "read-only" transaction may read a snapshot record, but a "read-only" transaction may not modify a record. A "read-write" transaction may not read a snapshot record, but it may read and modify a record. As is shown in Appendix II to the specification, the serializability of transactions in a multi-version distributed environment generally requires that a read-write transaction does not read a version earlier than the last committed version (however, it may read later uncommitted versions). Moreover, it is preferred, but not necessary, to preclude read-only transactions from reading modified but uncommitted versions. By precluding read-only transactions from reading modified but uncommitted versions and always maintaining a sufficient number of snapshots, as described below with reference to FIGS. 22 to 23, there is no need to abort or block the read-only transactions, nor is there any advantage to permitting read-only transactions to read modified but uncommitted versions in the usual case.
So that the relatively simple recovery scheme of FIG. 3 will operate in such a distributed transaction environment, the locks imposed by a transaction are not released until step 84 when the transaction is committed. The locks imposed by a transaction are also released whenever a transaction is aborted.
In a conventional transaction processing system operating as shown in FIG. 3, consistency of state memory is ensured by the use of memory locks. In the present invention, however, global transactions need not be subject to such stringent locking procedures. Instead, consistency in the presence of global transactions is assured by committing a selected global transaction and aborting an abort set of global or local transactions selected so that the order of commitment of global transactions is consistent with an order of conflicts among the global transactions, taking into consideration indirect conflicts caused by local transactions. In particular, global serializability is ensured by scheduling the commitment of global transactions so that the commitment order of directly or indirectly conflicting global transactions conforms to the order of the conflicts (as reflected by a serializability graph). When the scheduling of commitment of global transactions has this property of "extended commitment ordering", it can also be shown that in a distributed processing system (as further described below in connection with FIG. 5B), global serializability is guaranteed when only "atomic commitment" is used to coordinate the various processors in the system, so long as local serializability is guaranteed by any kind of mechanism. This is demonstrated by a rather elaborate mathematical proof, which is appended to the present specification. From a practical standpoint, this result means that the advantages of the present invention can be applied to any existing distributed transaction processing system.
As described above with reference to FIG. 3, a conventional transaction processing system insures that a second transaction can read the write data of a first transaction only after the first transaction is committed. This is a sufficient but not necessary condition to insure recoverability. In a first embodiment of the present invention, this condition can also be maintained for global transactions to minimize the amount of non-volatile memory required and to avoid what is known as "cascading aborts" to achieve recoverability. In this first embodiment, for example, memory access by global transactions must respect "write locks" placed on records by other transactions.
FIG. 4A shows three different possibilities for the scheduling of a first global transaction having a write operation and a second global transaction having a conflicting read operation. In general, two operations are conflicting when they are memory access operations that access the same resource and at least one of the operations is a write operation. By inspection it can be seen that of the three scheduling possibilities, the possibility (b) violates the commitment ordering requirement and therefore may cause inconsistency in the state of the state memory. Due to the fact that the write operation W.sub.x does not commute with the read operation R.sub.x, the result for the transaction T.sub.2 for the scheduling possibility (b) may be different from the result for the transaction T.sub.2 for the scheduling possibility (a). To obtain consistent results, the present invention permits conflicting operations of two global transactions to be scheduled in a selected order to most efficiently use resources available to a central processing unit, but insures consistency by enforcing a commitment order of global transactions that is consistent with the order of conflicts among the global transactions. Inconsistent scheduling possibilities, such as the possibility (b) in FIG. 4A, are prohibited by aborting a conflicting transaction when a selected global transaction is committed, or by delaying commitment of a selected global transaction until after the conflicting transaction is committed.
In the example of FIG. 4A, for example, suppose that the first operation scheduled is a read operation R.sub.x of the second global transaction T.sub.2, as shown in possibilities (b) and (c). If the global transaction T.sub.2 is committed before the global transaction T.sub.1 as shown in possibility (c), no inconsistency will result because the scheduling is in conformance with the order of conflicts among the global transactions. If, however, the first transaction T.sub.1 is committed before the second transaction T.sub.2 as shown in possibility (b), then the second transaction T.sub.2 must be aborted because otherwise commitment of the second transaction T.sub.2 would be inconsistent with the order of conflicts and may lead to inconsistent results.
For the present invention, indirectly conflicting global transactions must also be considered. Due to the local transactions, two global transactions T.sub.1 and T.sub.2 may indirectly conflict, for example, when referencing different resources that are also referenced by one or more local transactions. As further described below with reference to FIG. 9, indirect conflicts are detected by maintaining a serializability graph recording the effects of transactions, including committed local transactions. Specifically, two global transactions are indirectly conflicting when there is a directed path including more than one edge between them in the serializability graph.
The present invention further permits the scheduling of operations such that a second global transaction T.sub.2 can read the write data of a first global transaction T.sub.1 before the first transaction T.sub.1 is committed. In this case recoverability can be guaranteed by a process of cascading aborts, as further described below with reference to FIGS. 16 and 17. For the case of a first global transaction T.sub.1 having a write operation W.sub.x and a second global transaction T.sub.2 having a conflicting read operation R.sub.x, there are six scheduling possibilities, denoted in FIG. 4B as (a) to (f). Two of these scheduling possibilities (b) and (d) are inconsistent with the order of conflicts among the global transactions and therefore may lead to inconsistent results. The present invention prevents these scheduling possibilities from occurring by determining the order of conflicts among the global transactions and then delaying commitment of a selected global transaction or aborting a conflicting operation if necessary to enforce global transaction commitment ordering.
Turning now to FIG. 5A, there is shown a block diagram of the programming and data structures used in the digital computer 20 of FIG. 1. for scheduling transactions and enforcing global transaction commitment ordering. Global and local transactions are initiated, for example, by application programs 90. To commit the results of transactions to state memory 28, 29 and to recover from failures, the digital computer is provided with a resource manager (RM) 91 that, for example, performs the operations shown in FIG. 3. The resource manager 91, for example, also manages a transaction list (TL) 93 as further described below with reference to FIG. 6. In general, a resource manager (RM) is a software component that manages state memory resources affected by committing transactions in such a way that the memory state of the resources can be restored before the transaction is committed by effectively undoing all of the changes introduced by the transaction. In other words, the resource manager ensures that the transactions have the property of "atomicity", or "all or nothing" semantics upon its state memory resources. A resource is typically, but not necessarily, a data item or a data object. Examples of resource managers are typically found in data base systems (DSB's), queue managers, and cache managers.
To provide an interface for conducting an atomic commitment protocol for scheduling global transactions, digital computer 20 also includes a transaction manager (TM) 92. Preferably the presence of operations conflicting with global transactions is detected in real time when the transactions are performed, as further described below with reference to FIG. 12. To enforce global transaction commitment ordering, the order in which such conflicting operations are performed is recorded in global transaction commitment ordering serializability graph (GTCO-SG) 94 which is a data structure in memory, and which is described further below with reference to FIGS. 8 and 9. To enforce the global transaction commitment order, global transactions are selected for commitment and transactions are selectively aborted by a global transaction commitment order coordinator (GTCOCO) 95, which is further described below with reference to FIGS. 11 to 18.
The present invention is directed to a multi-processor or multi-processing system in which a plurality of transactions are performed concurrently and component operations of the same "global" transaction are performed concurrently in different processors or processes. A multi-processor system 590 is illustrated in FIG. 5B. In this case, three digital computers 591, 592, 593 are interconnected through a communication channel 94, and the communication is controlled by the transaction managers (TM) 595, 596, 597. In the multi-processor system 590, any one of the transaction managers 595, 596, 597 may assume the role of a coordinator and issue global transactions to the other global transaction managers. These global transactions are coordinated, for example, according to the well-known two phase commit protocol, as was described above with reference to the background art, and as further described below with reference to FIG. 11.
The transaction managers may also exchange state information over the communication channel 594. In particular, transaction processing systems generally fall within two broad categories called database management systems and object oriented systems, depending upon whether or not state memory information is resident in the non-volatile memory files of a particular one of the digital computers 591, 592, 593, or whether the state information is associated with predefined objects which may be passed from one computer to another. The present invention, however, is applicable to both types of systems because the present invention more particularly concerns the scheduling of component operations in the transactions and the enforcement of global transaction commitment ordering, and is not particularly concerned with where the state memory is physically located or maintained in a distributed processing system.
Turning now to FIG. 6, there is shown a flow chart of a procedure followed by a transaction scheduler in the resource manager for real-time scheduling of component operations of transactions in accordance with available computing resources of the digital computer. In particular, the transactions include input/output and memory access of rotating memory such as disk drives, and possibly mathematical computations that are performed by a coprocessor. Without real-time scheduling and interleaving of operations of different transactions, the central processing unit of the digital computer would have to spend a good deal of time waiting for these operations to be completed before performing the component operations of other transactions.
To more effectively use the resources of the digital computer, a transaction may dispatch input/output and memory access requests to the input/output and memory units of the computer, then set an inhibit flag indicating to the scheduler that the processing of the current transaction should be inhibited until completion of the input/output or memory access operation, and finally execute a software interrupt to the transaction scheduler in order to permit the transaction scheduler to transfer execution to another transaction. When the requested input/output or memory access operation is completed, the input/output or memory device issues a completion interrupt which is handled by a device handler interrupt routine that clears the inhibit flag of the transaction that requested the input/output or memory access operation. It should be noted that input/output and memory access completion interrupts and device handlers for such interrupts are well known in the art.
Referring now particularly to the first step 101 in FIG. 6, the transaction scheduler responds to an interrupt by removing the context of the interrupted transaction from the processor stack of the digital computer, and by placing the context in a respective context storage for the interrupted transaction. The context includes the value of the program counter which points to the interrupted memory location in the transaction program, as well as the context of other general purpose registers in the digital computer.
The transaction scheduler may also be entered during initial start-up of the digital computer in step 102. In step 102, the transaction list 93 and other data structures such as the serializability graph (GTCO-SG) are cleared and pointers are initialized.
The transaction scheduler may also be entered at the end of preparation for a transaction. In this case, in step 103 the transaction is marked to indicate that it is ready to be committed, and also the current time indicated by the real time clock (25 in FIG. 1) is saved in a memory location allocated to the transaction to indicate the time at which the transaction became ready. It should be noted, however, that some tasks placed on the transaction list might be so-called background tasks of low priority, which are never completed and use central processor execution time remaining after the servicing of all transactions in the list.
The transaction scheduler may also be entered at the end of a device handler interrupt routine. Step 111, for example, clears the inhibit flag (I in the list of FIG. 7) for the transaction having requested the input/output or memory operation, and then execution continues in step 101 to interrupt the current transaction to possibly reschedule execution back to the transaction having requested the input/output or memory operation.
The transaction scheduler performs three major tasks; it responds to transaction requests by placing the transactions on the transaction list; it schedules the performance of component operations of transactions; and it declares ready transactions. In step 104, for example, the transaction scheduler checks whether a transaction has been requested. A transaction scheduler interrupt, for example, may occur in response to an interrupt signal from the input/output unit indicating that a user or another digital computer has requested the performance of a transaction. In this case, in step 105 the transaction request is placed on the transaction list. Also, in step 107, the lock manager of the resource manager is invoked, as described above with respect to step 75 of FIG. 3, to lock the records to be accessed by the transaction, and thereby ensure local serializability. It is possible that some of these records are already locked by another transaction. In this case, the lock manager, for example, puts a pointer to the requested transaction on a "wait list" for the locked records, and sets the inhibit flag for the requested transaction. When the record eventually is unlocked, as described above with respect to step 84 of FIG. 3, the pointer at the head of the wait list is removed, and the inhibit flag for the transition pointed to by the removed pointer is cleared. In this example, the order of performance of conflicting operations, as well as the order of commitment, becomes the order in which the transactions are requested, so long as the memory locks are not bypassed.
Turning for a moment to FIG. 7, there is shown a specific example of the transaction list 93. The transaction list includes a linked list of transaction identification numbers 106. Associated with each transaction identification number is a pointer to the next entry in the linked list, and values for a number of flags (V, R, I, G, P, C, L). These flags include a valid flag V indicating whether the entry in the list includes valid data, a flag R indicating whether preparation of the transaction has been completed and the transaction is ready to be committed, a flag I indicating whether preparation of the transaction has been inhibited until completion of an input/output or memory access request, a flag G indicating whether the transaction is a local or global transaction, a flag P indicating whether the completion of preparation of a global transaction has been reported to a coordinator, a flag C indicating whether a local transaction has been committed, and a flag L indicating that lock has been placed on the transaction because it is in the "abort set" of another transaction that might be committed. The flags G and P associated with global transactions are further described below with reference to FIGS. 15 and 16.
Also associated with the list 93 are a head pointer 108, a tail pointer 109, and a pointer 110 to the transaction being performed. The head pointer 108, for example, has a negative value when the list is empty, and otherwise has a positive value pointing the list entry for the first (highest priority) transaction. In a similar fashion, the tail pointer 109 has a negative value when the list is empty and otherwise has a positive value pointing to the last entry in the list. The pointer 110 to the transaction being performed is used by the transaction scheduler in step 101 of FIG. 6 when responding to an interrupt. In particular the pointer 110 is used to find the respective context storage location for the interrupted transaction when performing step 101.
Returning now to FIG. 6, in step 112 the transaction scheduler checks whether a transaction is ready to be committed. If so, then in step 100, the transaction scheduler checks the "G" flag for the transaction. If the transaction is local, then in step 115 the resource manager (RM) commits the results of the local transaction to the state memory, and releases any locks imposed by the transaction. Otherwise, in step 113, the transaction scheduler invokes the global transaction commitment order coordinator (95) to select the global transaction to commit, and to enforce global transaction commitment ordering with possible aborts and delay. When the global transaction commitment order coordinator decides not to delay commitment, as tested in step 114, then in step 115, the reference manager (RM) commits the results of the global transaction to the state memory, and releases any locks imposed by the transaction.
Because the global transaction commitment order coordinator enforces global transaction commitment ordering, the global transactions can bypass the memory locks to more efficiently use the available resources of the processor. For the Case 1 embodiment of the invention of FIG. 4A, the global transactions may bypass the read locks to read data. For the Case 2 embodiment of FIG. 4B, the global transactions may bypass the read and write locks to read data. Also, the local transactions may bypass the locks so long as the serializability of the local schedule is not violated. The serializability of the local schedule, for example, could be insured by a combination of write locks and time stamps. Instead of using read locks, a resource would be stamped with the beginning time of the transaction that last read or wrote the resource. Any transaction attempting to write to the resource would first compare its time stamp with any time stamp of the resource, and if the write transaction would have an earlier time stamp, it would be aborted to enforce the serializability of the local schedule. Such a mechanism for ensuring local serializability would not necessarily cause the commitment order of all transactions to be the same as the order of conflicts among all of the transactions.
Finally, in step 116, the transaction scheduler checks the transaction list to determine whether there is an uninhibited transaction that is not yet ready. If so, then in step 117, the transaction scheduler selects one of the uninhibited transactions that is not yet ready. To perform steps 116 and 117, for example, the transaction scheduler first checks whether the transaction list is empty by testing whether the head pointer 108 has a negative value. If the head pointer has a positive value, then the transaction scheduler checks the flags R and I for the transaction at the head of the list to determine whether is not yet ready and is not inhibited. If the first entry is ready or is inhibited, then the transaction scheduler checks the tail pointer 109 to determine whether the end of the list has been reached. If not, then the transaction scheduler checks the pointer to the next entry and performs the same steps until either an uninhibited transaction not yet ready is found or the end of the list has been reached.
When an uninhibited transaction not yet ready has been selected, then in step 118 the context of the selected transaction is placed on the stack. In this regard it should be noted that when a transaction is first placed on the transaction list, then an initial context for the transaction is placed in the respective context storage for the interrupted transaction. The initial context, for example, includes a program counter value pointing to the first instruction in the program for the transaction. After step 118, a return from interrupt is performed in step 119 to begin or continue the execution of instructions in the program for the selected transaction.
Turning now to FIG. 8, there is shown a specific example of a data structure 94 for storing the global transaction commitment order serializability graph (GTCO-SG). As further described below in connection with FIGS. 9-14, the data structure 94 is used in connection with the flags in the transaction list 93. Whenever a particular order of performing conflicting operations in a respective pair of transactions has been established, that order of performance of the conflicting operation is noted in the global transaction commitment order serializability graph. If the memory access operations performed by each transaction and the memory locations of those memory access operations are known at the time that a transaction is placed on the list, then it is possible in Case 1 of FIG. 4A for the order of conflicts to be determined at that time. In this regard, it should be noted that for Case 1 as illustrated in FIG. 4A, write operations are in effect performed at the time of transaction commitment. Aside from this particular case, the order of performance of conflicting operations is determined when a second one of the conflicting operations is scheduled for performance by the transaction scheduler and the memory location accessed by that conflicting operation is determined.
It should be noted that the global transaction commitment order serializability graph may include committed local transactions. When a local transaction is committed in step 115 of FIG. 6, its entry in the transactions list 93 is removed at this time only when it does not have any path in the graph 94 from any undecided transactions. If it does have a path from an undecided transaction, then its I flag and its C flag are set, and it remains in the graph so long as it has a path from an undecided transaction. The graph 94 can be searched for such a path by using a recursive procedure similar to the ABORT(T) procedure listed below.
At the time that presence of a conflict is detected, as further described below with reference to FIG. 14, the order of performance is recorded in the global transaction commitment order serializability graph. The pertinent data in the graph of FIG. 8 and transactions list 93 is presented in pictorial form in FIG. 9. The flags that are set in the data structure of FIG. 8 correspond to edges 131 in the pictorial representation of FIG. 9. The direction of an edge 131 indicates the order of performance of the conflicting operations in the transactions. Once this order of performance is established, a corresponding global transaction commitment order is enforced by delaying commitment of transactions, or aborting transactions.
Enforcement of the global transaction commitment order by aborting transactions is illustrated by steps 141 and 142 in FIG. 10. In step 141 a ready global transaction to be committed is selected. Preferably, the selection is performed by an atomic commitment coordinator according to the well-known atomic commitment protocol introduced above. In this protocol, the atomic commitment coordinator sends a "vote" request to all participating processors. If all of the participating processors respond with a "yes" or "prepared" vote, then the atomic commitment coordinator sends a "commit" command to the participating processors. The preferred atomic commitment protocol is further described below with reference to the state diagram of FIG. 11.
In step 142, the global transaction commitment order is enforced by aborting an abort set so that the commitment order of the committed global transaction is consistent with the order of conflicts among global transactions. For the commitment order illustrated by the graph in FIG. 9, for example, if the transaction T.sub.6 is selected, then transactions T.sub.2 and T.sub.3 are aborted to enforce the global transaction commitment order. In particular, when a global transaction is committed, any and all paths to it in the GTCO-SG from any and all global undecided transactions, and from any and all active transactions (representing possible future paths from global transactions) are disconnected by aborting a set of transactions on the paths. This "abort set" may include global as well as local transactions. In some cases, the abort set is empty, in which case no transactions need to be aborted to enforce the global transaction commitment order. In other cases, the abort set may not be unique, and the abort set can be selected in an expedient fashion, or a fashion optimized to maximize system performance, or some trade-off between selection expediency and overall performance.
The most convenient selection involves choosing the undecided transactions, on the paths from the undecided global transactions to the committed transactions, that are the closest (on each path separately) to the committed transaction. This selection gives a unique abort set. Shown below is pseudo-code for a specific procedure to find this abort set:
______________________________________
ABORT (T) /* returns a "closest neighbor" abort set */
set ABORT:=empty /* inital value is the empty set */
set NODE.sub.-- VISITED:=empty /* set of nodes visited */
BACK.sub.-- FRONT (T)
return ABORT
BACK.sub.-- FRONT (T)
/* a recursive procedure that computes the set
ABORT */
for every edge (T' ,T) in the GTCO-SG do
if T' is not in NODES.sub.-- VISITED then
begin
insert T' into NODE.sub.-- VISITED
if undecided (T') then insert T' into ABORT
else BACK.sub.-- FRONT (T')
end
end.sub.-- BACK.sub.-- FRONT
end.sub.-- ABORT
______________________________________
This closest neighbor abort set, however, is not necessarily optimum. If any of the nearest undecided neighbors is ready, for example, then the next nearest undecided neighbor can be alternatively selected for the abort set. An optimal selection for the abort set would choose the abort set to maximize system performance. To maximize performance, an optimum abort set may include a minimum number of transactions to be aborted, although the performance penalty associated with aborting a transaction may be quite different with each transaction. A transaction, for example, may already have been included in another abort set of a global transaction reserved for commitment, and, in this case, the transaction already in an abort set (i.e., the transaction having its flag L=1) can be included in other abort sets with a minimal performance penalty. Depending on the particular system, it may be desirable to abort local transactions instead of global transactions. Also, a priority could be assigned to each transaction, or computed based upon the order of each transaction in the transaction list, and the priorities of the members in each abort set could be summed to compute an overall performance penalty associated with each possible abort set. Therefore, at the expense of additional search time, other possible abort sets could be found, an overall performance penalty could be estimated for each abort set, and the abort set estimated to have the least performance penalty could be chosen.
Aborting of a transaction involves discarding the results of the transaction. For local transactions, a transaction could be aborted by resetting the contents of its respective context storage to its initial context. The current value of the program counter for the transaction, for example, is reset to the beginning of the program for the transaction. In addition, the transaction list 93 and the global transaction commitment order serializability graph 94 must be updated. For a global transaction, the aborted global transaction is restarted if at all by the atomic commitment coordinator of the global transaction. In this case, the global transaction is entirely removed from the transaction list.
Turning now to FIG. 11, there is shown a state diagram of a processor 145 in a distributed transaction processing system that uses the preferred atomic commitment protocol to process global transactions. The processor also processes local transactions. The local transactions, for example, are issued by a local user 146 such as an application program executed by the processor. Global transactions issued by the local user are coordinated by the transaction manager 147, that functions as the atomic commitment coordinator for these global transactions. Therefore, the processor 145 should know whether a transaction is global or local, depending on the source of the transaction. Existing systems, however, may have to be modified to provide information identifying each transaction as global or local. The information should be made available to the local scheduler as early as possible for use by the local concurrency control mechanism. Otherwise, each transaction should be assumed to be global, but in this case any optimization of the local concurrency control for local transactions is lost. When an optimistic local concurrency control is used, for example, knowledge that a transaction is local can be used at any time before the transaction is decided. For some applications, some transaction types are a-prior known to be local, and hence this information could be used to identify local transactions in systems which do not explicitly identify the source of each transaction.
In any case, the transaction scheduler receives the transaction request and puts the transaction request into an entry of the transaction list. The transaction scheduler eventually transfers execution to the transaction, and the transaction is executed until either it becomes inhibited or it becomes ready. As described above in connection with FIG. 6, a transaction may become inhibited after requesting an input/output operation or memory operation, and, upon completion of the input/output or memory operation, the transaction will become uninhibited. A transaction that is either active, inhibited or ready can be aborted to enforce global transaction commitment ordering.
The transaction scheduler may commit a ready local transaction. To insure global synchronization in a distributed transaction processing system, however, a ready global transaction is committed only after a handshake with the coordinator 147. This handshake insures that a global transaction is not committed unless all of the processors that are processing assigned portions of the global transaction are also ready to commit their assigned portions of the global transaction. Therefore, when the state of a global transaction changes from the "active" to the "ready" state, a "prepared" signal is transmitted to the coordinator 147.
When the coordinator 147 receives "prepared" signals from all of the processors participating in a global transaction, then the coordinator sends a "commit" command back to the processors. If, however, the coordinator fails to receive a "prepared" signal from all of the participating processors, then the coordinator may transmit an "abort" signal to the processors. In FIG. 1, these handshake signals are indicated by dotted lines.
When a local transaction is committed, the transaction scheduler notifies the local user 146 that the transaction has been completed. When a global transaction is committed, the transaction scheduler removes the global transaction from the transaction list and sends a signal to the coordinator 147 indicating that the global transaction has been committed. Moreover, when a global transaction is aborted, the global transaction is removed from the transaction list and the global transaction commitment order serializability graph, and the transaction scheduler sends a signal to the coordinator 147 to confirm the abort. For a local transaction, however, it may be desirable to restart preparation of the transaction, and in this case it is only necessary to reset the initial context of the transaction, clear the transaction from the global transaction commitment order serializability graph, and set the state of the transaction back to "active" by resetting the R and I flags in the transaction list entry of the transaction.
Turning now to FIG. 12, there is shown a flow chart generally designated 150 of a procedure for a global transaction commitment order coordinator working in connection with a transaction manager 151 and a resource manager 152 to selectively abort or delay the commitment of transactions to enforce commitment ordering of global transactions. As described above, the transaction manager 151 acts as an interface for initiating global transactions and conducting an atomic commitment protocol. The resource manager 152 has a transaction scheduler 153 that schedules the preparation of local transactions as well as global transactions T.sub.g identified by a request 154.
The transaction scheduler 153 periodically checks whether a transaction is ready to commit. Preferably, the transaction scheduler also checks whether a global transaction is ready to commit in response to a "vote request" 155 from the atomic commitment coordinator of a global transaction. Although such a "vote request" is not needed for the atomic commit protocol described above with respect to FIG. 11, it permits the commitment of a global transaction to be delayed to possibly reduce the number of members in the global transaction's abort set. In the procedure illustrated by the flow chart 150 of FIG. 12, for example, a "prepared" message for a global transaction ready to commit is sent to the atomic commitment coordinator immediately when the abort set for the global transaction is null; otherwise, a "prepared" message for the global transaction is sent to the coordinator only after receiving a vote request 155 from the atomic commitment coordinator. In an alternative embodiment described below, a vote request is not used, but if the abort set is not null, a "prepared" message is sent to the atomic commitment coordinator only after a predetermined period of time.
When the transaction scheduler 153 finds that a transaction is ready to commit, the global transaction commitment order coordinator checks in step 156 whether a global lock has been placed on the ready transaction. If a global lock has been placed on the ready transaction, then it is not committed, and execution returns to the transaction scheduler to continue processing for another transaction. It is not necessary to use such a global lock, but the use of such a global lock permits some transactions to be committed that would otherwise have to be aborted when chosen to be included in the abort set of a global transaction. Instead of immediately aborting the members of an abort set for a global transaction, a global lock is placed (in step 171) on the members of an abort set, and then (in step 169) a "prepared" message for the global transaction is sent to the atomic commitment coordinator. If the atomic commitment coordinator then decides to abort the global transaction, the global locks for the global transaction are released (in FIG. 14), thereby permitting the members of the abort set to be committed.
Next, in step 157, execution branches to step 158 when the ready transaction is local. If the ready transaction has a path from any undecided transaction in the global transaction commitment order serializability graph, as tested in step 158, then the ready transaction must remain in the graph (even though it will become a committed local transaction). Therefore, in this case, execution branches to the resource manager 152 to commit the ready transaction. Otherwise, in step 159, the ready transaction is removed from the graph. Its removal may permit other committed local transactions to be removed from the graph, as attempted in step 160 by calling a "garbage collection" subroutine shown in the flow chart of FIG. 13. Execution continues to the resource manager 152 to commit the ready transaction.
If the ready transaction is global, as tested in step 157, then in step 161, the global transaction commitment order serializability graph is searched to find an abort set for the ready transaction, as described above. If an abort set cannot be found without any transaction reported to an atomic commitment coordinator as being prepared (i.e., without the flag P=1), then in step 167, execution branches depending on whether the abort set is null. If so, then in step 168, a message is sent to the atomic commitment coordinator indicating that the ready transaction has been prepared to be committed, and in step 169, the P flag for the ready transaction is set. Then execution continues so that the transaction scheduler 153 processes another transaction.
If in step 167 the abort set was not null, then in step 170, execution branches depending on whether the atomic commitment coordinator issued a vote request for the ready transaction. If not, then execution continues so that the transaction scheduler processes another transaction. Otherwise, in step 171, a global lock is placed on each member of the abort set. Next, in step 168, a message is sent to the atomic commitment coordinator for the global transaction indicating that the ready transaction has been prepared to be committed, and in step 169, the P flag for the ready transaction is set. Then execution continues so that the transaction scheduler 153 processes another transaction.
Turning now to FIG. 13, there is shown a flow chart 180 of the garbage collection subroutine that is called in steps 160 and 166 of FIG. 12. In a first step 181, execution returns if there are not any committed local transactions that were on paths from the transaction that was just removed from the global transaction commitment order serializability graph. Otherwise, in step 182, the graph is inspected to determine whether each of these local committed transactions has a path from any undecided transaction in the graph. For each of these committed local transactions which does not have any path from any undecided transaction, in step 183, that committed local transaction is removed from the graph, and, in step 184, the subroutine of FIG. 13 is called recursively to attempt the removal of more committed local transactions that were on paths in the graph from the committed local transaction that was just removed from the graph.
Turning now to FIG. 14, there is shown a flow chart 190 of an interrupt routine for responding to commit and abort requests from an atomic commitment coordinator. These requests are passed to the global transaction commitment order coordinator through the transaction manager 151. In response to a request to commit a specified global transaction, in step 191 the members of the transaction's abort set are each removed from the global transaction commitment order serializability graph by performing steps 165 and 166 of FIG. 12, and aborted by the resource manager. Next, in step 192, the specified global transaction is removed from the graph by performing steps 159 and 160 of FIG. 12, and committed by the resource manager. Then, in step 193, an acknowledgement is sent to the atomic commitment coordinator for the global transaction, and execution returns from the interrupt.
In response to a request to abort a specified global transaction, in step 194, any global locks imposed by the transaction are removed. Associated with each globally-locked transaction, for example, is a list of pointers to all of the global transactions having locks on the locked transaction. Associated with each prepared global transaction is a list of pointers of the locked members of its abort set. Removal of the global locks imposed by the specified transaction in this example entails removing the pointers to the specified transaction from the list associated with each member of the specified transaction's abort set, and when any list associated with each member of the abort set becomes empty, releasing the lock on that member. Then, in step 195, the specified transaction is removed from the global transaction commitment order serializability graph by performing steps 165 and 166 of FIG. 12, and the specified transaction is aborted by the resource manager. Finally, in step 193, an acknowledgement is sent to the atomic commitment coordinator for the global transaction, and execution returns from the interrupt.
Turning now to FIG. 15, there is shown a flow chart 200 of a procedure for determining the order of conflicts among conflicting transactions. The procedure 200 is invoked during the preparation of a memory access operation such as a read or write. In the first step 201, the address of the memory access operation is determined. Next, in step 202 the address is compared to addresses of prior operations that may conflict. This is done by searching a list of addresses of prior operations for each transaction in the transaction list. If the present operation is a read operation, then the read operation may conflict with prior write operations. If the present operation is a write operation, then the write operation may conflict with a prior read (or for Case 2 of FIG. 4B, a prior write operation). When there is an address match as tested in step 203, then in step 204 the present order of the transaction is recorded in the global transaction commitment order serializability graph (94 in FIG. 7). In particular, for Case 1 of FIG. 4A, conflicts only occur between a read operation and a write operation, and the order of operation is read then write. For Case 2 of FIG. 4B, the present order must be for the current transaction to be performed after the previous transaction. In step 205 execution branches back to step 202 if there are additional prior memory access operations to check, or otherwise preparation of the memory access continues in step 206 by adding the address determined in step 171 to a list of addresses for read or write operations of the current transaction. Then, in step 207, the operation is prepared or performed. Execution then returns to the transaction.
Turning now to FIG. 16, there is shown an augmented global transaction commitment order serializability graph in which edges including a particular kind of write read conflict are distinguished from edges of other conflicts. Such an augmented graph can be stored in a data structure similar to the data structure shown in FIG. 8, but each edge is represented by a pair of flags, consisting of a first flag indicating any kind of conflict, and a second flag indicating that there is a write-read conflict between a first transaction that was the last transaction to write to a resource x before being read by a second transaction. The augmented graph of FIG. 16 is used to perform cascading aborts to insure recoverability for a system in which a second transaction can read the write data of a first transaction before the first transaction is committed, as was described above with reference to FIG. 4B. Suppose, for example, that global transaction T.sub.5 is selected as a ready transaction to be committed. To enforce global transaction commitment ordering, then global transactions T.sub.3 and T.sub.4 of FIG. 16 must be aborted. However, assume that the transaction processing system operates in the fashion as described above with reference to FIG. 4B. In this case, when a transaction is aborted to enforce global transaction commitment ordering, then every transaction that has read write data of the aborted transaction must also be aborted. From the augmented graph of FIG. 16, it is seen that when the transaction T.sub.4 is aborted, then the transaction T.sub.7 must also be aborted because of the write read conflict between transactions T.sub.4 and T.sub.7. Moreover, when the transaction T.sub.7 is aborted, then so must the transaction T.sub.8 because the transaction T.sub.8 has read data written by the transaction T.sub.7.
A specific procedure for performing a cascading abort is shown in the flow chart 210 of FIG. 17. In the first step 211 the augmented graph is searched to find all of the transactions T.sub.y such that T.sub.y has read data written by a specified transaction T.sub.x. Then in step 212 the transaction T.sub.x is aborted. In a final step 213, the subroutine 190 of FIG. 17 is recursively called to abort each of the transactions T.sub.y. It is assumed, of course, that during the recursive call, step 212 will not attempt to abort any committed local transaction in the graph. Such an attempt is an error condition, indicating that the transaction scheduler has failed to ensure recoverability of the system. Any such error should be reported to the system manager, because it may indicate that the state memory has been corrupted with inconsistent results.
Preferably, an explicit step is inserted into the scheduling procedure to ensure recoverability in any system intended to operate in accordance with Case 2 of FIG. 4B. As shown in FIG. 18, for example, a scheduler 153' first checks in step 156' whether there is a global lock on a ready transaction before permitting the transaction to be committed, as was shown in FIG. 12. To ensure recoverability, however, an additional step 221' is used which prevents any ready transaction from being committed when it has a write-read conflict with any undecided transaction in the augmented global transaction commitment order serializability graph.
Turning now to FIG. 19, there is shown an embodiment of the present invention wherein a global transaction commitment order coordinator (GTCOCO) 251 is inserted into a conventional transaction processing system having a transaction manager (TM) 252 and a resource manager (RM) 253. Application programs 257 send requests for global transactions to the transaction manager 252 and requests for local transactions to the resource manager 253. As shown, the global transaction commitment order coordinator 251 assumes a subset of the interface 254 between the transaction manager 252 and the resource manager 253. The global transaction commitment order coordinator 251 intercepts a conventional po |