Efficient optimistic concurrency control and lazy queries for B-trees and other database structures5920857Abstract The present invention relates to a system and methods for fine-granularity concurrency control in a parallel database. Very fine granularity (at the level of B-tree records) is implemented in a B-tree. Our method applies to B-trees, B-trees with variable keys, and their applications, such as semantic and object-oriented databases. Our method involves accumulating a transaction and then "optimistically" validating it, while attaining high efficiency with maximum semantic safety. "Lazy queries"--an efficient method for finding the intersection of two large queries--is provided for the system. Claims That which we claim is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Implementation of elementary queries
B-tree
Query Implementation
______________________________________
1.x? ›xI,xI + 1!
2.C? ›CI,CI + 1!
3.xR? ›xR,xR + 1!
4.?Rx ›xR.sub.inv,xR.sub.inv + 1!
5.x?? ›x,x + 1!
6.?Rv ›Rv,Rv + 1!
7.R›v1..v2!? ›Rv1,Rv2 + 1!
______________________________________
For most elementary queries (queries 1, 3, 4, 5, and 6) the number of binary facts is usually small. Some queries (queries 2 and 7), however, may result in a very large number of facts, and it may be inefficient to retrieve the whole query at once. A common operation in databases is to calculate an intersection of two queries. For example, consider a query: "Find all objects from category Student that have the attribute BirthYear 1980". This query can be executed using several scenarios: Scenario 1 a. Retrieve all persons born in 1980. Execute an elementary query "BirthYear 1980?" b. For each person retrieved in the step a verify that the person belongs to the category Student Scenario 2 a. Retrieve all persons born in 1980. Execute an elementary query "BirthYear 1980?" b. Retrieve all students: execute an elementary query "Student?" c. Find an intersection of the objects retrieved in a and b. In Scenario 1 we retrieve all persons from all categories (Person, Instructor, and Student) who were born in 1980, and for each person we execute an additional elementary query to verify that the retrieved person is a student. In this scenario we have to execute a large number of small queries. In Scenario 2 we execute only two elementary queries and then find an intersection of the results. The problem is that the elementary query "Student?" may result in a very large set of binary facts. Not only is this very inefficient in terms of expensive communication between client and server, but also such a big query would be affected by any transaction that inserts or deletes students. Also our query would be aborted more often than the query in the Scenario 1. Thus, Scenario 1 is obviously better in our case. Consider now another query: "Find all instructors born in 1970". The number of persons born in 1970 could be larger or comparable with the total number of instructors. In this case, Scenario 2 would be much more efficient because we need to execute only two elementary queries. Lazy Queries Our technique of lazy elementary query execution greatly reduces the number of disk accesses, the server traffic, and the transaction conflict probability by automatically reducing one scenario to another. For example, the intersection operator gets a close-to-optimal implementation without keeping any data distribution statistics. In our B-tree access method, the actual query execution is deferred until the user actually utilizes the query results. We define the elementary lazy query programmatic interface in a B-tree B as follows: 1. Q:=›l, r!B--define a lazy query ›l, r! but do not execute it yet. Let z be the longest common prefix of the strings l and r. A query result is a set of strings x such that zx.di-elect cons.B and l.gtoreq.zx.gtoreq.r. 2. Let Q.P be a pointer to future results of the query. Initially Q.P :=``, i.e. P points to an empty string. 3. Seek(Q, x)--moves the pointer Q.P, so that Q.P =min{y.vertline.zy.di-elect cons.›l, r!B and zy.gtoreq.x}. Derived from the above are the actual principal operations on the query results: 1. Read(Q):=Q.P --reads the current string pointed by the logical pointer Q.P. This operation results in an error if Q.P=null. 2. Next(Q):=Seek(Q, Read(Q)+0). We use notation s+0 to denote a string derived from the string s by appending a zero byte, i.e. s+0 is lexicographically the lowest string after s. When the Seek operation is executed, the string pointed to by the new logical pointer is fetched from the B-tree, and normally a small number of lexicographically close strings are pre-fetched and placed in a lazy query cache buffer. It is likely that the next Seek operation will request a string which is already in the cache buffer, so only a few Seek operations require actual disk and server access. Many queries can efficiently use the Seek operation. For example, we can very efficiently find the intersection of two lazy queries Q.sub.1 and Q.sub.2 : construct a new lazy query (lazy intersection) Q.sub.3 where the Seek operation uses the following algorithm:
______________________________________
Q3: = Q1 & Q2
Seek(Q3,x):
Seek(Q1,x);
Seek(Q2,x);
while (Q1.P .noteq. null & Q2.P .noteq. null &
Q1.P.LAMBDA. .noteq. Q2.P.LAMBDA.) do
if Q1.P.LAMBDA. > Q2.LAMBDA..P then
Seek(Q2,Q1.P.LAMBDA.)
else
Seek(Q1,Q2.P.LAMBDA.);
od;
if Q1.P = null or Q2.P = null then
Q3.P: = null
else
Q3.P: = Q1.P;
______________________________________
This algorithm gives an efficient solution for the sample queries described in the previous section. For the query "Find all objects from category Student that have the attribute BirthYear 1980" we use three lazy queries: a. Q.sub.1 :=elementary lazy query "BirthYear 1980?" b. Q.sub.2 :=elementary lazy query "Student?" c. Q.sub.3 :=Q.sub.1 & Q.sub.2 Since query Q.sub.3 is not actually executed, our algorithm that finds intersection will not require retrieving of every student from the database: the number of actual disk accesses to retrieve the students in the query Q.sub.2 will be less than or equal to the number of persons born in 1980. Thus, the cost of the lazy query Q.sub.3 will be smaller than the cost of the best solution for elementary queries in Scenario 1 described in the previous section. For the query "Find all instructors born in 1970" we use three similar lazy queries. Since the number of instructors is likely to be small, it is possible that all instructors will be fetched at the first disk access, and the whole query will require a number of server accesses close to 2, which is the optimal number. FIG. 1 shows execution of two lazy queries Q.sub.1 and Q.sub.2. Each query contains 10,000 strings at the server machine. A lazy query execution algorithm requires only 3 requests (Seek operations) to the server of 30 strings each, so that the total number of strings retrieved from the server is 90. Without our optimization, it would be necessary to retrieve both queries with size of 20,000 strings from the B-tree server to find the intersection. Lazy queries can also be used to efficiently subtract a large set of strings Q.sub.2 from a another set Q.sub.1. The algorithm for subtraction is similar: we retrieve a string from Q.sub.1 and use the Seek operation to verify that this string does not belong to Q.sub.2. Lazy queries not only result in a smaller number of server accesses. We will show that lazy queries allow the improvement of the granularity of our concurrency control algorithm and reduce the transaction conflict probability. Parallel B-tree Structure A massively parallel B-tree should perform many queries and transactions simultaneously and its size should scale to hundreds of terabytes even if the underlying computer hardware supports only 32 bit addressing. This is achieved by splitting the B-tree into partitions of about 1 gigabyte in size. The whole B-tree is then a network of computers where each computer holds one or more B-tree partitions. The B-tree partitions themselves are indexed by a partitioning map. Concurrency Control Our concurrency control algorithm is an optimistic algorithm that first accumulates a transaction, then performs it using a 2-phase commit protocol ›Gray-79!, and performs a backward validation ›Haerder-84! to ensure the serializability and external consistency of transactions. Our algorithm benefits from and improves upon the validation technique of the ›Adya&al-95! algorithm for an object-oriented database. Their algorithm uses loosely synchronized physical clocks to achieve global serialization and detects conflicts at the object level granularity. In our algorithm, a finer granularity at the level of strings is attained, and we use logical clocks to achieve global serialization; nevertheless, our algorithm does not require maintaining any extra data per string or per client. Transaction Accumulation In a parallel B-tree, updates and queries made by a client should be verified for conflicts with contemporaneous updates and queries made by other B-tree clients. A transaction is a group of B-tree updates and queries which is guaranteed to be consistent with the queries and updates executed concurrently within other transactions. To create such a group of operations we have several B-tree operations in addition to the lazy queries: 1. Insert String x 2. Delete String x 3. Transaction Begin 4. Transaction End A transaction is the execution of a series of actions between a "Transaction Begin" and "Transaction End". When the Transaction End is executed, all queries and updates made since the Transaction Begin are checked for conflicts with the queries and updates made by concurrent transactions. If there is a conflict, the transaction is aborted and the Transaction End returns an error. The updates made within a transaction do not change the B-tree immediately. Instead, these updates are accumulated at the client machine in a set of inserted strings I and a set of deleted strings D. The B-tree strings remain unaffected. The insert and delete operations work as follows: insert(x)={D:=D-{x}; I:=I.orgate.{x}} delete(x)={I:=I-{x}; D:=D.orgate.{x}} When "Transaction End" is executed, the set D is deleted from the B-tree and the set I is inserted into B-tree: B:=(B-D).orgate.I During the accumulation of a transaction into sets D and I, the client machine also accumulates a set V to be used for backward validation. The set V contains the specification of each subinterval read by a query within the transaction and a timestamp of this reading. A subinterval is a subrange within a query which was physically retrieved from one database partition at one logical moment in time. The logical time at a given database partition is incremented each time when a committed transaction physically changes that partition. The subintervals are stamped with this logical time and a number that identifies the partition in the system. Thus the set V is {(›l.sub.k, r.sub.k !, t.sub.k, p.sub.k).sup.n.sub.k=1 }, where t.sub.k is the timestamp and p.sub.k is the partition number. In our validation technique, when committing a transaction T, the system does not need to remember the results of T's queries; it remembers only query specifications ›l, r!, which are checked against concurrent transactions at T's commit time. The validation is done against transaction queues, normally without any disk access. Lazy queries can be used to further reduce the validation specified by the set V and improve the granularity in conflict detection. Previous examples have shown that the user does not actually retrieve all facts from the lazy query interval. The intersection of lazy queries uses the Seek operation and retrieves only a few strings from the original elementary queries. A lazy query automatically keeps track of those string subranges that have actually been retrieved by the user. This union of subranges can be much smaller than the union of the original elementary query intervals. This results in a finer transaction granularity and smaller conflict probability. At the end of transaction execution, the string subranges from all lazy queries are further optimized by merging intersecting subranges of all lazy queries. This optimization is done at the client side, which allows us to reduce the server workload and the transaction execution time. An accumulated transaction is a triple T(I, D, V) of strings to be inserted I, strings to be deleted D, and string intervals V to be verified. Note that even if no updates were made, a transaction is still necessary to ensure the consistency of queries. Thus, a query can produce an accumulated transaction T(I, D, V) with empty sets D and I. Validation Method A validation is necessary to ensure two important properties of transactions: serializability and external consistency. Serializability means that the committed transactions can be ordered in a such a way that the net result would be the same as if transactions ran sequentially, one at a time. External consistency means that the serialization order is not arbitrary: if transaction S was committed before T began (in real time), S should be ordered before T. When a client commits a transaction, the accumulated transaction T is delivered to one of the database servers. This database server is called the transaction's originator. The transaction originator splits the arriving transaction into subtransactions T.sub.i according to the partitioning map and distributes the subtransactions among the database partitions. A subinterval (›l.sub.k, r.sub.k !, t.sub.k, p.sub.k) in the set V is distributed to the partition p.sub.k (without consulting the partitioning map). This allows the detection of conflicts with system transactions that perform load balancing, which may change the partitioning map. The transaction originator uses the 2-phase commit protocol to update the database. In the first phase, the transaction originator distributes the subtransactions among the database partitions. Each database partition verifies that no conflicts with any other transaction is possible and sends a "ready" or "failed" message to the transaction originator. If the transaction originator receives a "failed" message, it immediately aborts the other subtransactions and notifies the client. When all database partitions return a "ready" message, the transaction originator sends a "commit" message to the participating partitions. In our backward validation protocol, the arriving subtransaction T.sub.i (I.sub.i, D.sub.i, V.sub.i) is checked against all transactions already validated successfully. In our B-tree, each partition maintains a log of recently committed transactions CL and a log of transactions waiting for commit WL. We say that a set of string intervals V intersects a set of strings A iff there exists an interval ›l, r! in V such that ›l, r!A.noteq..O slashed. (i.e. for some x.di-elect cons.A:l.ltoreq.x.ltoreq.r). We also say that two transactions T(I.sub.T, D.sub.T, V.sub.T) and S(I.sub.S, D.sub.S, V.sub.S) intersect if: 1. I.sub.t .andgate.D.sub.S .noteq..O slashed. or I.sub.S .andgate.D.sub.T .noteq..O slashed. or 2. V.sub.S intersects I.sub.T .orgate.D.sub.T or 3. V.sub.T intersects I.sub.S .orgate.D.sub.S When the subtransaction T.sub.i arrives, it is verified that T.sub.i intersects with no transaction S in WL. Additional verification is necessary to ensure that no query in T.sub.i is affected by a recently committed transaction S in CL. We check that each interval (›l.sub.k, r.sub.k !, t.sub.k, n.sub.k) in V.sub.i of T.sub.i does not intersect with the sets I.sub.S and D.sub.S of any transaction S in CL that has greater timestamp than t.sub.k. If the subtransaction is successfully verified, it is appended to the WL and the "ready" message is sent to the transaction originator. Otherwise the "failed" message is sent to the transaction originator. FIG. 2 shows a simple case of transaction accumulation, distribution, and validation when only two B-tree partitions are involved. A client at the Client Machine accumulates a transaction T(I, D, V). When the client decides to commit the transaction, T(I, D, V) is sent via the network to the transaction originator machine. The transaction originator machine splits the transaction into two subtransactions T.sub.1 (I.sub.1, D.sub.1, V.sub.1) and T.sub.2 (I.sub.2, D.sub.2, V.sub.2) and sends them to the corresponding B-tree partitions machines. Partitions 1 and 2 execute the validation protocol by checking the subtransactions against the committed transactions logs and the waiting for commit logs according to our validation method with logical timestamps. When the verification is done, a Ready message is sent to the transaction originator, which immediately sends the Commit message to the B-tree partitions. It can be shown that our concurrency control algorithm satisfies both serializability and external consistency requirements.
|
Same subclass Same class Consider this |
||||||||||
