Method and mechanism for dependency tracking at low granularity levels7010529Abstract A method and mechanism for tracking dependencies at low granularity levels in a database system is disclosed. An embodiment utilities commit time values at low granularity levels which are associated with structures in a database system. Those associated commit time values are used to compute dependency values. Claims We claim: Description BACKGROUND OF THE INVENTION
For the purposes of illustration, consider the following sequence of transactions which executes the indicated database statements (in SQL-based pseudocode) against a database table "Emp_Table" having the structure Emp_Table (emp_name, emp_value): At commit time 5—Transaction T1 commits having executed the following statement:
At commit time 10—Transaction T2 commits having executed the following statement:
At commit time 20—Transaction T4 commits having executed the following statement:
At commit time 25—Transaction T5 commits having executed the following statements:
At commit time 5, transaction T1 commits having inserted a row into the table Emp_Table, with the value "Smith" in the emp_name column of that row and the value "X" in the emp_value column of that row. At commit time 10, transaction T2 commits having inserted a row into the table Emp_Table, with the value "Jones" in the emp_name column of that row and the value "Y" in the emp_value column of that row. At commit time 15, transaction T3 commits which modifies the "Smith" row to change the emp_value column of the row to "X+1". Transaction T3 also modifies the "Jones" row to change the emp_value column of the row to "Y+1". At commit time 20, transaction T4 commits which has inserted a row into the table Emp_Table, with the value "Miller" in the emp_name column of that row and the value "Z" in the emp_value column of that row. At commit time 25, transaction T5 commits which modifies the "Smith" row to change the emp_value column of the row to "X+2". Transaction T5 also deletes the "Miller" row that was inserted by transaction T4. With respect to the above example sequence of transactions T1-T5, FIGS. 2 and 3 illustrate the application of this embodiment of the invention. FIG. 2 illustrates changes to the Emp_Table as the sequence of transactions T1-T5 are performed and committed. FIG. 3 describes the calculation of dep_SCN values at row level granularities for each of the transactions T1 through T5. Note that in this example, transaction T1 commits at commit time 5 having executed the following statement:
Since this transaction consists of an INSERT statement, it is not dependent upon any other transaction. Thus, the dep_SCN value for any rows affected by this transaction is set to a value that indicates no dependencies, which is equal to "0" in the present embodiment. The row-based commit time for all rows affected by this transaction is 5, which is the commit time of the transaction. Dependent SCN Table 300 in FIG. 3 is used to demonstrate the tracking of dep_SCN values. Row 302 in the Dependent SCN Table 300 shows, in the dep_SCN column 312, that the dep_SCN for transaction T1 has been set to "0". Table 202 in FIG. 2 illustrates the Emp_Table after transaction T1 has committed. Transaction T1 inserts a new "Smith" row to table Emp_Table, with the value "Smith" in the emp_name column and "X" in the emp_value column of the new row. Since the present embodiment of the invention tracks dependencies at the granularity of the row level, the "Smith" row also includes a row_SCN value of "5", which is the commit time of transaction T1. Structures in FIGS. 2 and 3 are shown for purposes of illustration only, to help illustrate the changes to the dep_SCN value and row_SCN values as transactions T1 through T5 are executed. It is understood, however, that the choice of storage locations and format for these commit time values are subject to many design variables and may be stored in numerous locations or formats; thus, the exact storage locations and format shown for these commit time values do not form a necessary limitation to the invention. In a similar manner to transaction T1, transaction T2 having a commit time of 10 executes the following statement:
Transaction T3 commits at time 15 having executed the following statements:
In a similar manner to transactions T1 and T2, transaction T3 having a commit time of 20 executes the following statement:
Transaction T5 commits at time 25 having executed the following statements:
The dependency ordering of transactions can be established once the dep_SCN for transactions have been determined. FIG. 4 illustrates an ordering schedule for transactions T1-T5 based upon the analysis of dep_SCNs as shown in FIG. 3. In FIG. 4, the ordering schedule proceeds from left to right, with required ordering of transactions indicated by vertical bars between transactions. Transactions that are permitted to run in parallel or earlier than each other (i.e., are not dependent upon each other) are separated by horizontal bars. Since transactions T1, T2, and T4 all have the same dependent SCN of"0", these transactions are not dependent upon any earlier transactions, and can be ordered before or parallel to any other transaction subject to the ordering/dependency constraints of any later transactions. Thus, these three transactions are configured in parallel separated by horizontal bars 402 and 404. Transaction T3 has a dep_SCN value of 10; therefore, T3 must be scheduled to begin after all other transactions having SCN values of 10 or less. Vertical bar 406 representing a SCN of 10 is shown in ordering schedule 400 separating transaction T3 from all prior transactions having a SCN of 10 or less (transactions T1 and T2 in this example have SCN values of 10 or less). Since transaction T4 has a SCN higher than 10, transaction T3 is permitted to run in parallel with T4 (shown separated with horizontal bar 404). Transaction T5 has a dep_SCN value of 20; therefore, T5 must be scheduled to begin after all other transactions having SCN values of 20 or less. Thus, vertical bar 408 representing a SCN of 20 in ordering schedule 400 separates T5 from all prior transactions having a SCN of 20 or less. To rephrase the significance of FIG. 4, the dep_SCN values for each transaction has been used to provide a scheduling order for the application of each transaction T1-T5. Since the dep_SCN values were calculated upon SCN values stored for reach row, the granularity of the dependency tracking used for the scheduling order is on the row level. Based upon analysis of the dep_SCN values, it can be seen that transactions T1, T2, and T3 can each be scheduled independent of the other, and even scheduled in parallel. Transaction T3 must be scheduled after transactions T1 and T2 have committed, but may be scheduled in parallel with transaction T4. And finally, transaction T5 must be scheduled after all other transactions T1-T4 have committed. FIG. 1 maps the tree of dependencies for these transactions, which is an alternate method of representing FIG. 4. Since transaction T3 can be applied only if transactions T1 and T2 have already been committed. Node T3 is shown as a child to parent nodes T1 and T2. Transaction T5 is dependent upon all the other transactions since it is modifying or deleting rows in a manner that relies upon actions that have previously been taken by transactions T1-T4. Transaction T5 is directly dependent upon transaction T4 since transaction T5 is deleting the "Miller" row, which exists only if transaction T4 has been executed and committed. Thus, T5 is shown as a child node to node T4. Transaction T5 is dependent upon transaction T3 since transaction T5 is updating the "Smith" row in a manner that relies upon the prior update made by transaction T3. Node T5 is therefore also shown as a child node to T3. Since transaction T3 is dependent upon transactions T1 and T2, then through T3, transaction T5 is also dependent upon transactions T1 and T2. Transactions T1, T2, and T4 do not rely upon actions taken by any previous transactions. Therefore, these nodes are shown without any parent nodes. FIG. 5 depicts a flowchart of a process for dependency tracking according to an embodiment of the present invention. At 502, identification is made of the particular operation being performed by a database statement. According to an embodiment, dependency tracking can be made at the operation level, with certain kinds of data operations having dependency significance that is different than other kinds of data operations. Consistent with this embodiment, the following are examples of operations along with their dependency significance:
At 504, the type of operation being performed by a statement is considered in determining the dep_SCN calculation. If the operation has no dependencies (e.g., INSERT statement), then a default value is assigned to the dep_SCN indicating no dependencies (508). Otherwise, a dependency calculation is performed to determine the dependent SCN for a transaction (506). The dep_SCN value may be obtained by taking the maximum value for all the row_SCNs affected (either directly or indirectly) by the transaction. The database statement can thereafter be applied (510) and the transaction committed (512). According to an embodiment of the invention, the row_SCN value is stored in a head piece that is added to a table row. This portion of the row can be considered a "hidden" column that stores information that is internally usable by the database system. When the row is not locked by a transaction, this row_SCN holds the commit time of the transaction that last modified the row. When a row is marked as locked, the row_SCN could be stored in a lock structure, and is configured to store the commit time of the last modifier of the row. If there is no row_SCN value in the lock structure, the transaction may be active so the row_SCN in the row may be employed for dep_SCN calculations. When a new row is inserted into a table, the row_SCN of that row is the commit time of the transaction that performed the insert operation and the dep_SCN value is initialized to zero. According to an embodiment, the commit of a transaction does not automatically cause the row_SCN of a row to be updated with the commit time of the transaction. Instead, a delayed logging cleanout operation is performed that does not propagate the commit time to each affected row. According to this embodiment, a lock on a row is not automatically released once a transaction commits. Instead, the lock is cleaned out "on-demand" when a later transaction is interested in taking a lock to that row. Only at that later point in time, when the lock is cleared out for a particular row, is the row_SCN for that row updated with the commit time of the last transaction that has operated upon the row and committed (514). That same commit time is copied to the row_SCN for all rows affected by that previous transaction. In an alternate embodiment, the row_SCN of a row is updated as soon as the transaction operating upon that row commits, without waiting for a later transaction to clean out the locks. This invention can be utilized to implement scalable replication in a database system. Replication is the process of copying and maintaining database objects such that information at a local database is replicated to a remote database. Many advantages exist to having a replicated database system. For example, replication can improve the performance and protect the availability of applications because of the existence of alternate data access sites. An application may normally access a local database rather than a remote server to minimize network traffic and achieve maximum performance. Consider a replication system that operates by propagating all changes at a local database to remote database sites. The changes are then applied at the remote sites to maintain correspondence between the state of the remote database and the state of the local database. As changes are performed to the local database site, the changes are queued for transmission to the remote sites. Under certain circumstances, it may be desirable to transmit these changes out-of-order to the remote sites, possibly to maximize network bandwidth usage by transmitting various changes in parallel. With the present invention, the type of scheduling plan as represented in FIG. 4 can be utilized to determine which changes can be transmitted in parallel without violating dependencies between the changes. If the unit of change utilized is the transaction, then the present invention can be used to determine the order in which transactions can be propagated and applied at remote site in a replicated system. Alternatively, the transactions themselves can be propagated in any contemplated order necessary to minimize network usage, but once the changes have actually been received by the remote sites, a scheduling plan may be used to re-apply the changes in the correct order at the remote sites. Replication is merely one example of a procedure in which the present invention may be applied. The present method and mechanism for dependency tracking can also be utilized for improved optimistic locking. With optimistic locking, data items accessed and retrieved by a first user are not immediately placed in an exclusive lock with respect to other users. This allows greater concurrency in the system since other users are not blocked from simultaneous access to that data item. However, this may also allow other users to independently modify that data item after a version of the data item has been retrieved by the first user, thereby rendering the copy of the data item held by the first user invalid or "stale". According to the present invention, the row_SCN value associated with a row in the database can be utilized to determine whether a data item has been modified by another user. When the first user is ready to act upon a data item (e.g., a "modify" operation), the row_SCN value(s) for that data item is checked to determine whether the data item has been modified after the original retrieval time by the first user. If the row_SCN value(s) is more recent than the SCN corresponding to the original retrieval time, then another user has modified that data item and the first user must obtain the most recent version of the data item before proceeding. If the row_SCN values(s) is equal or less than the original retrieval time by the first user, this indicates that another user did not recently modify the data item and the planned operation can proceed without worry of data inconsistency. As another example, the present invention can also be used for improved data caching to more efficiently identify stale data items in cache. Row_SCN values for data items in cache can be checked to determine whether the data versions in cache accurately reflect the most recent changes to those data items. According to an embodiment, the data cache is associated with a cache SCN value corresponding to the last point in time in which the data cache was updated with the most recent versions of data items maintained in that cache. At a later point in time, the row_SCN values for data items in cache are matched against the cache SCN. If the row_SCN value for a data item is less than or equal to the cache SCN, then that data item did not change and the cache contains the latest version of that data item. If the row_SCN value for a data item is greater than the cache SCN, then the cache does not contain the most recent version of that data item and should be updated to become current. Once the updates for all data items have completed, the cache SCN value is updated to reflect the SCN at which the cache refresh operation occurred. Note that this operation can be performed periodically to ensure that the recent versions of data items are maintained in cache. This operation can also be performed upon access to a particular data item in cache. System Architecture Overview Referring to FIG. 6, in an embodiment, a computer system 620 includes a host computer 622 connected to a plurality of individual user stations 624. In an embodiment, the user stations 624 each comprise suitable data terminals, for example, but not limited to, e.g., personal computers, portable laptop computers, or personal data assistants ("PDAs"), which can store and independently run one or more applications, i.e., programs. For purposes of illustration, some of the user stations 624 are connected to the host computer 622 via a local area network ("LAN") 626. Other user stations 624 are remotely connected to the host computer 622 via a public telephone switched network ("PSTN") 628 and/or a wireless network 630. In an embodiment, the host computer 622 operates in conjunction with a data storage system 631, wherein the data storage system 631 contains a database 632 that is readily accessible by the host computer 622. In alternative embodiments, the database 632 may be resident on the host computer, stored, e.g., in the host computer's ROM, PROM, EPROM, or any other memory chip, and/or its hard disk. In yet alternative embodiments, the database 632 may be read by the host computer 622 from one or more floppy disks, flexible disks, magnetic tapes, any other magnetic medium, CD-ROMs, any other optical medium, punchcards, papertape, or any other physical medium with patterns of holes, or any other medium from which a computer can read. In an alternative embodiment, the host computer 622 can access two or more databases 632, stored in a variety of mediums, as previously discussed. Referring to FIG. 7, in an embodiment, each user station 624 and the host computer 622, each referred to generally as a processing unit, embodies a general architecture 705. A processing unit includes a bus 706 or other communication mechanism for communicating instructions, messages and data, collectively, information, and one or more processors 707 coupled with the bus 706 for processing information. A processing unit also includes a main memory 708, such as a random access memory (RAM) or other dynamic storage device, coupled to the bus 706 for storing dynamic data and instructions to be executed by the processor(s) 707. The main memory 708 also may be used for storing temporary data, i.e., variables, or other intermediate information during execution of instructions by the processor(s) 707. A processing unit may further include a read only memory (ROM) 709 or other static storage device coupled to the bus 706 for storing static data and instructions for the processor(s) 707. A storage device 710, such as a magnetic disk or optical disk, may also be provided and coupled to the bus 706 for storing data and instructions for the processor(s) 707. A processing unit may be coupled via the bus 706 to a display device 711, such as, but not limited to, a cathode ray tube (CRT), for displaying information to a user. An input device 712, including alphanumeric and other keys, is coupled to the bus 706 for communicating information and command selections to the processor(s) 707. Another type of user input device may include a cursor control 713, such as, but not limited to, a mouse, a trackball, a fingerpad, or cursor direction keys, for communicating direction information and command selections to the processor(s) 707 and for controlling cursor movement on the display 711. According to one embodiment of the invention, the individual processing units perform specific operations by their respective processor(s) 707 executing one or more sequences of one or more instructions contained in the main memory 708. Such instructions may be read into the main memory 708 from another computer-usable medium, such as the ROM 709 or the storage device 710. Execution of the sequences of instructions contained in the main memory 708 causes the processor(s) 707 to perform the processes described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and/or software. The term "computer-usable medium," as used herein, refers to any medium that provides information or is usable by the processor(s) 707. Such a medium may take many forms, including, but not limited to, non-volatile, volatile and transmission media. Non-volatile media, i.e., media that can retain information in the absence of power, includes the ROM 709. Volatile media, i.e., media that can not retain information in the absence of power, includes the main memory 708. Transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 706. Transmission media can also take the form of carrier waves; i.e., electromagnetic waves that can be modulated, as in frequency, amplitude or phase, to transmit information signals. Additionally, transmission media can take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications. Common forms of computer-usable media include, for example: a floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punchcards, papertape, any other physical medium with patterns of holes, RAM, ROM, PROM (i.e., programmable read only memory), EPROM (i.e., erasable programmable read only memory), including FLASH-EPROM, any other memory chip or cartridge, carrier waves, or any other medium from which a processor 707 can retrieve information. Various forms of computer-usable media may be involved in providing one or more sequences of one or more instructions to the processor(s) 707 for execution. For example, the instructions may initially be provided on a magnetic disk of a remote computer (not shown). The remote computer may load the instructions into its dynamic memory and then transit them over a telephone line, using a modem. A modem local to the processing unit may receive the instructions on a telephone line and use an infrared transmitter to convert the instruction signals transmitted over the telephone line to corresponding infrared signals. An infrared detector (not shown) coupled to the bus 706 may receive the infrared signals and place the instructions therein on the bus 706. The bus 706 may carry the instructions to the main memory 708, from which the processor(s) 707 thereafter retrieves and executes the instructions. The instructions received by the main memory 708 may optionally be stored on the storage device 710, either before or after their execution by the processor(s) 707. Each processing unit may also include a communication interface 714 coupled to the bus 706. The communication interface 714 provides two-way communication between the respective user stations 624 and the host computer 622. The communication interface 714 of a respective processing unit transmits and receives electrical, electromagnetic or optical signals that include data streams representing various types of information, including instructions, messages and data. A communication link 715 links a respective user station 624 and a host computer 622. The communication link 715 may be a LAN 626, in which case the communication interface 714 may be a LAN card. Alternatively, the communication link 715 may be a PSTN 628, in which case the communication interface 714 may be an integrated services digital network (ISDN) card or a modem. Also, as a further alternative, the communication link 715 may be a wireless network 630. A processing unit may transmit and receive messages, data, and instructions, including program, i.e., application, code, through its respective communication link 715 and communication interface 714. Received program code may be executed by the respective processor(s) 707 as it is received, and/or stored in the storage device 710, or other associated non-volatile media, for later execution. In this manner, a processing unit may receive messages, data and/or program code in the form of a carrier wave. In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention. For example, the reader is to understand that the specific ordering and combination of process actions shown in the process flow diagrams described herein is merely illustrative, and the invention can be performed using different or additional process actions, or a different combination or ordering of process actions. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense.
|
Same subclass Same class Consider this |
||||||||||
