Continuously available database server having multiple groups of nodes with minimum intersecting sets of database fragment replicas5555404Abstract A database server with a "shared nothing" system architecture has multiple nodes, each having its own central processing unit, primary and secondary memory for storing database tables and other data structures, and communication channels for communication with other ones of the nodes. The nodes are divided into at least two groups that share no resources. Each database table in the system is divided into fragments distributed for storage purposes over all the nodes in the system. To ensure continued data availability after a node failure, a "primary replica" and a "standby replica" of each fragment are each stored on nodes in different ones of the groups. Database transactions are performed using the primary fragment replicas, and the standby replicas are updated using transaction log records. Every node of the system includes a data dictionary that stores information indicating where each primary and standby fragment replica is stored. The fragments are allocated as evenly as possible among the system nodes using a fragment to node assignment equation. A transaction manager on each node responds to database queries by determining which fragment of a database is being accessed by the query and then forwarding the database query to the node processor on which the primary replica of that fragment is stored. Upon failure of any one of the data processors in the system, each node updates the information in its data dictionary accordingly. In addition, the fragment replicas made unavailable by the node failure are regenerated and stored on the remaining available nodes in the same node group as the failed node. Claims What is claimed is: Description The present invention relates generally to database server computer systems using multiple interconnected computers to provide continuous, reliable transactional services.
______________________________________
hash.sub.L (k,n) = <v,r> =
< .left brkt-bot.n(k-L) .div. (U-L+1).right brkt-bot.,
N(k-L)) modulo (U-L+1) + L>
______________________________________
where the value of n is equal to the number of fragments or subfragments over which a table or table fragment is being distributed, and k is a key in the range [L, U]. This function divides the range of key values into equal sized intervals, and assumes that the key values k are homogeneously distributed over the value range L to U. "v" is the result (with a value between 0 and n-1) from the hash function, and r is an independent value for recursive use of the hash function. For example, if key value k for a particular record is equal to 1377, all key values for the table fall in the range [1000, 1999], and n=5 on the first level and n=4 on the second level of hashing, then hash.sub.L (k,n) is evaluated as follows:
______________________________________
hash.sub.L (1377,5)
= <.left brkt-bot.5.multidot.377 .div. 1000.right brkt-bot.,
5.multidot.377 modulo 1000 + 1000>
= <1, 1885>
hash.sub.L (1885,4)
= <.left brkt-bot.3540 .div. 1000.right brkt-bot., 3540 modulo
1000 + 1000>
= <3, 1540>
______________________________________
Thus, the record with key value 1377 is allocated to node 1 at the first fragment level, and to the third node at the second fragmentation level. A "fragment crash" occurs when both the primary and the hot standby replicas of a fragment or subfragment become unavailable, typically due to a node or disk failure. The crashed fragment becomes unavailable, but other fragments of the table remain available. Therefore, a fragment crash results in "omission failures" for only those transactions trying to access data records belonging to the failed fragment. A gradual reduction of data availability is provided for subfragments as well as for fragments. If all replicas of a subfragment, or the fragment replicas containing the subfragment become unavailable, the subfragment is said to be crashed. For instance, referring to FIG. 5, if nodes 1 and 6 failed after the failure of Node 2 and after the resulting refragmentations were completed, then subfragments 2'B and 6'B would become unavailable, but the remainder of the database table would still be available. When a previously failed node is restarted, data is again redistributed to obtain an approximately even distribution of data among the available nodes. The redistribution is performed on a per table basis to preserve the serializability of the redistribution operation, and also to restrict the workload induced by the redistribution activity so as to limit its impact on the timely servicing of database transactions. When data in the restarted node is available (i.e., the disk and file in which the table was stored were not lost), the restarted replica, which is assigned the role of a hot standby replica during the table rebuilding process, is produced using the state of the restarted replica at the point that the replica's node failed, plus the log records accumulated by the primary replica during the time that the failed node was unavailable. When data in the restarted node is not available, or is so old that it does not meet timeliness criteria (e.g., restarting the node within 24 hours of a node failure), the restarted replica is rebuilt from scratch by copying the primary replica, and then using log records accumulated by the primary replica during the coping progress to ensure that the records in the restarted replica are consistent. In other words, because the copying process takes time, all data updates during the time occupied by the copying progress are repeated to ensure that the restarted replica is in a consistent state. In the preferred embodiment, the nodes have sufficient computational power and inter-node communications bandwidth, above that needed for normal transaction processing loads, that all the table subfragmenting initiated by a node failure can be accomplished in approximately fifteen minutes, assuming that each node stores on the order of one to five gigabytes of data. It is important that the self-repair process be completed quickly, preferably in less than an hour, in order to reduce the likelihood that a second node might fail before a prior node failure is repaired. The "excess" computational and communications capabilities provided in order to make fast self-repair possible can be used during normal operations for activities such as computing and comparing the transactional and data storage loads on the system's nodes, and redistributing data among the nodes (by selecting a new hash function and then fragmenting the data tables and distributing the data among the nodes using that new hash function) when transactional or data storage loads are imbalanced. Making the self-repair process non-blocking is accomplished by locking down only the pages of a data table that are currently being replaced while they are being read. Each page is therefore locked by the self-repair process only for a very brief time. Thus, the progress of transactions is not blocked by the self-repair process. A consistent version of each fragment replica is generated by sending to the new fragment replica, along with the copied pages of the data table, a copy of (A) all log records created for that fragment, beginning at the time that the copying process began, and (B) all "undo" log records that may be needed to reverse data table changes made by aborted transactions. Undo records are needed only for long running transactions in progress at the time the process of generating the new fragment replica begins. Data Dictionary Referring to FIG. 6, a copy of the data dictionary 158 is stored on every node of the system. The purpose of the data dictionary is to store all the information necessary to determine the current configuration of nodes in the system and to find any identified record in any identified data table. In particular, the data dictionary 158 comprises a set of "system" tables, the purpose and structure of which are explained next. The SysMachine Table 220 has just one record, which provides the name of the database server system, the number of nodes (i.e., data processors) in the system, and the number of data processor groups, also herein called node groups, in the system. As explained above, node groups are totally independent of each other insofar as hardware failures are concerned. The SysGroup Table 222 has one record 224 for each group of data processors. That record 224 indicates the group's Group ID, its status (e.g., running or unavailable), and a count of the number of nodes in the group. The SysNodes Table 226 has one record 228 for each node in the system. Each record 228 indicates the node's node number, its status (e.g., running, restarting, isolated, or dead), the Group ID of its group, and a list of "pair nodes" in other groups that are preferred for table replication. The SysTable Table 230 has one record 232 for each data table in the system. The record 232 for any particular table lists its Table ID, table name, a "replica count" that indicates the numbers of replicas of the table that exist in the system, the top level "Fragment ID" that corresponds to an entry in the SysFragment Table 236, the "distribution method" for locating the fragment associated with a specified record, the table's level 0 Fragment ID, and a timestamp indicating when the table was created. The "distribution method" is typically (1) one of two or more predefined hash functions, (2) "linear," indicating that the fragment containing a particular record is located by hashing the key value for the record with a linear hash function, or (3) "RoundRobin", indicating that the records in this table are assigned to fragments in a "round robin" fashion. The "RoundRobin" distribution method is used only for low usage tables because a transaction using records in such tables must be sent to all the nodes in order to find the one in which the queried record is located. Note that the Replica Count field for each table is a value assigned to each data table either by the table's creator or by an operating system policy, such as a policy that assigns every data table two replicas unless that assignment is explicitly overridden by a system operator. Another table in data dictionary, called the Data Table Schemas table 234, stores the column definitions for each data table, often called table schemas. The SysFragment Table 236 has a separate record 238 for every fragment and subfragment of every data table in the system. The record 238 for each table fragment contains the fragment's Fragment ID, an ordered list of the fragment's Replica IDs, a count of the number of subfragments at the next level of subfragmentation of the table, an ordered list of the Fragment IDs for those subfragments. Since table "fragments" are identified in the preferred embodiment as a value between 0 and n-1, where n is the number of fragments in any particular set, "an ordered list" in this context means that the Fragment IDs in the subfragment list are ordered so that the first Fragment ID in the list is for Fragment 0, the second Fragment ID in the list if for Fragment 1, and so on. Similarly, if a particular data table has M replicas (as specified in the Replica Count field of the SysTable Table 230), then the Replica ID list will contain M Replica IDs for Fragment 0, followed by M Replica IDs for Fragment 1, and so on. Furthermore, in the preferred embodiment, the first Replica ID in the Replica ID list for a particular fragment is the Replica ID for the primary replica, the next Replica ID in the list is for the hot standby replica, and any additional Replica IDs in the list are for additional read only replicas. The purpose of the SysReplica Table 240 is to store data representing the location and status of each data table fragment replica. For each fragment replica there is a separate record 242 in table 240 that indicates (A) the fragment replica's Replica ID and Fragment ID, (B) the node on which the fragment replica is stored and the file ID for the file in which it is stored, (C) the role of the fragment replica, such as "primary," "hot standby," or "additional read only copy," and (D) the status of the fragment replica, such as "available," "void," or "refragmenting". The data dictionary 158 may also include other tables not relevant herein, such as tables for storing security or data access control information, tables for indicating the "views" and indices used in conjunction with each data table, and so on. Whenever a database query is received by a node in the database server system, a transaction manager 162 in the database management system (DBMS) software 154 searches the SysTable Table 230 to find the Fragment ID for the applicable Database and the record Distribution Method for that table. The key value for the record being accessed by the query is hashed or otherwise reduced to a fragment number in accordance with the Distribution Method, and then the transaction manager 162 searches the SysFragment and SysReplica Tables 236 and 240 to find the primary fragment replica corresponding to the record being accessed and the Node number on which the primary fragment replica is located. Finally, if the node on which the primary fragment replica is located is not the node that received the query, the transaction manager 162 forwards the query to the appropriate node via the hypercubic communication network. The DBMS software 252 at the node that receives the forwarded query (or at the original node if the query did not need to be forwarded) executes the query, creates log records indicating data table records changed by executing the query, and forwards a copy of each log record to the data processor(s) on which is stored the standby (or other additional) replica of the effected database table fragment. The DBMS software 252 at the node that receives the log record copy updates the standby replica of the effected database table fragment in accordance with the information in the received log record copy. In practice, the initial entries in the data dictionary 158 are made at the time the system is first put into service. Typically, very few new data tables are created after the system is first put into service, and the number of nodes in the system is changed infrequently. As a result, the only tables in the data dictionary that undergo changes on a regular ongoing basis are the SysFragment and SysReplica Tables 236 and 240. If the data dictionary 158 includes tables with data access control information and tables with information regarding the various procedures used to access data, those data dictionary tables may also undergo frequent changes, but those tables and their operation are not relevant here. When all the nodes of the database server system are operating normally, and there have been no node failures in the recent past, each data table has just two levels of entries in the SysFragment Table 236: a top level fragment entry for the entire table, and one entry for each fragment of the table. The top level (called level 0) entry lists all the subfragment IDs for the table fragments, and each of the next level entries indicates that it has zero subfragments and lists only the replica IDs for the corresponding level 1 table fragment. In reality, a heavily used system with sixteen or more data processors that are in use 24 hours per day will typically suffer node faults in random fashion. For instance, a particular system might average one node failure per week, but only once every thirty years or so will two or more nodes fail within a few (e.g., fifteen) minutes of each other, and occasionally there may be a power supply failure that causes four nodes to fail simultaneously. Node failures are detected by neighboring nodes using a signalling protocol (Node Status Monitoring Software 250 in FIG. 6) that requires that each node to send its neighbors predefined signals, sometimes called "I'm alive" signals, on a periodic basis (e.g., once per millisecond). Each node is connected to several neighboring nodes, for example in accordance with the hypercubic interconnection scheme of the preferred embodiment. When a node fails to receive the expected periodic signals from one of its neighbors, a predefined status verification procedure is executed that attempts to communicate with the neighboring node and then declares the neighboring node to be unavailable if its attempts are unsuccessful. Such procedures are well known to those skilled in the art. One added feature of the node status monitoring procedure 250 that is useful in the present invention is that the status checking procedure checks the status of all nodes in a group, for the purpose of detecting group failures, any time that two or more nodes from the same node group are detected to have failed. When a node determines that one its neighboring nodes is not available, it sends a node-failure message to all its neighbors, which in turn retransmit the node-failure message to their neighbors until the "wavefront" of messages reaches all the nodes in the system. The functional nodes in the system which receive the node-failure message send acknowledgement messages back to node which originated the node-failure message. After collecting all such acknowledgements, the originating node then sends out a "new configuration" message to all the nodes in the system indicating the set of functional nodes in the system. Each node that is still operational responds to the new configuration message by invoking the fragmentation control software 160 (present on every node), which then causes the following sequence of steps to be performed. First, the node inspects its own data dictionary to determine which data tables are affected by the node failure(s) and which new table subfragments will have to be created and stored at that node. At each operational node, fragment replicas status values are updated in the data dictionary. For example, for each primary fragment replica that resided on the failed node, the fragment status is changed to "refragmenting", and the corresponding hot standby fragment replica is given the role of "primary". Each node then creates the files necessary for storing the new subfragments assigned to that node (in accordance with the system's predefined refragmentation procedures in the fragmentation control software 160), sends messages with that information to the other nodes so that all the nodes can update their data dictionaries accordingly, and then goes through the process of creating each new subfragment replica that is to be stored at that node. As the process of generating each new subfragment replica is completed, its status is changed to "available" and a message to that effect is sent to all the other nodes. This process continues until all the new subfragment replicas have been built. In general, each node modifies its own data dictionary entries in accordance with the status messages received from other nodes. Because every node has the same fragmentation control software 160 for data table refragmentation and for data table rebuilding (after a failed node comes back on line), there is no need to coordinate activities between nodes other than the transmission of status messages. After the system has responded to a node failure, the process of accessing data records is impacted by the changed fragment and replica records in the data dictionary. In particular, more levels of the SysFragment and SysReplica Tables may have to be searched to find the node to which a transaction should be sent. In addition, additional levels of these system tables may need to be searched when determining the nodes to which copies of the log records for a transaction should be sent (for the purpose of keeping hot standby and other replicas up to date). When a failed node is repaired, the above described fragmentation process is reversed. In particular, the repaired node goes through a process of rebuilding its data table fragment replicas, and sends status messages to all the other nodes as each recovered table fragment replica is ready to resume its normal role. The other nodes update their data dictionary entries accordingly, and also delete subfragment replica files no longer needed. When an entire group of nodes fails simultaneously, such as when a power supply failure occurs, the subfragmentation procedure is essentially unchanged, except that the number of target nodes for the new subfragments is reduced. In the preferred embodiment, to ensure that no single group failure can cause data to be unavailable, each half of the system has at least two separate groups of nodes that share no resources with the other node groups. As a result, when a group of nodes fails, at least one copy of every table fragment can be found elsewhere in the system, enabling a new replica thereof to be generated. Minimum Intersecting Sets Declustering The following portion of this document describes database fragmenting and fragment replica regeneration schemes used in three preferred embodiments of the database fragmentation control software 160. The general concept of Minimum Intersecting Sets Declustering (MISD) is as follows. Relations (i.e., database tables) are partitioned into a high number of fragments. Each fragment is initially created with one replica for each site. A "site" is herein defined to mean a set of nodes treated as a group that are remotely locate a from other sites and that share no resources, including power supply and cooling system with the nodes at the other sites. Inside a site, fragment replicas are assigned evenly over the nodes. The fragments assigned to a node form a "fragment set." The sets should be assigned such that the maximum cardinality of the intersection between any pair of fragment sets (i.e, one set from each of two distinct sites) is minimized. In case of a node failure the lost fragment replicas are moved to other fragment sets on the same site. In the preferred embodiment, the number of fragments is at least two times the number of nodes at each site on which the fragments are to be stored. The intersection of two fragment sets is defined for the purposes of this document to be the set of fragments the two sets have in common. If a node fails, the nodes having intersecting fragment sets have to take over all work on the common fragments. A smaller intersection means less common fragments and therefore also less added load in a failure situation. By using a fragment replica location assignment scheme that requires a "minimum largest intersection cardinality," the worst case added load to any node is minimized, thereby reducing the overcapacity required to mask errors. The terms "dedicated spare" and "spare node" are used in this document to mean a node which is unused for fragment replica storage when all nodes at the site of the spare node are functioning properly, and which is used to store replacement fragment replicas when a non-spare node at the site fails. The term "distributed sparing" is used in this document to mean the storage of replacement fragment replicas on nodes normally used to store other fragment replicas when the node on which the corresponding fragment replicas fails. When using this technique, all the nodes at each site have the capacity to store and service at least one more database fragment replica than the number normally stored and services at those nodes. In FIGS. 7A-7E, database fragments shown with no cross hatching are primary replicas, while fragments shown in diagonal cross hatching a hot standby replicas. In these Figures the term F.sup.x.sub.y refers to a replica of fragment y at site x. Thus the terms F.sup.0.sub.2 and F.sup.1.sub.2 refer to two replicas of the same database table fragment stored at sites 0 and 1, respectively. Similarly, the term N.sup.s.sub.z refers to node z at site s. For example, the term N.sup.0.sub.3 refers to node 3 at site 0. The configuration shown FIG. 7A satisfies the "largest minimum intersection cardinality" requirement. FIG. 7A shows twenty fragments distributed over ten nodes and two sites. No nodes on site S.sub.0 have more than one fragment in common with nodes on site S.sub.1. The distribution of fragment replicas shown in FIG. 7A is in accordance with the "rotated column" fragment replica assignment method that is discussed in more detail below. FIG. 7B shows that when node N.sup.1.sub.1 fails, fragment replicas F.sup.0.sub.5 and F.sup.0.sub.18 that originally were hot stand-by replicas have to take over as primary replicas, which adds to the computational and I/O load handled by the nodes on which those replicas are located. FIG. 7C shows that during the self repair process, when no spare nodes are available (i.e., using distributed sparing), replicas at site S.sub.0 nodes are copied so as to generate new replicas of the database fragment replicas on failed node N.sup.1.sub.1. The new replicas are distributed over the remaining working nodes at site S.sub.1. FIG. 7D shows that during the self repair process, when a spare node is available (i.e., using a dedicated spare node), replicas at site S.sub.0 nodes are copied so as to generate new replicas of the database fragment replicas on failed node N.sup.1.sub.1. The new replicas are created on the spare node N.sup.1.sub.5 at site S.sub.1. FIG. 7E shows that when failed node N.sup.1.sub.1 is repaired or replaced with a working functioning node, the replaced or repaired node becomes the new spare node. Even before failed node N.sup.1.sub.1 is repaired or replaced, the load on the system is rebalanced by making fragments F.sup.1.sub.5 and F.sup.1.sub.18 primary replicas and returning fragments F.sup.0.sub.5 and F.sup.0.sub.18 back to hot standby status. When spare nodes are not available, after the failed node is N.sup.1.sub.1 is repaired or replaced, the load on the system is rebalanced by copying the newly generated replicas back to their original location and by making fragments F.sup.1.sub.5 and F.sup.1.sub.18 primary replicas and returning fragments F.sup.0.sub.5 and F.sup.0.sub.18 back to hot standby status. The fragment replica declustering methodology of the present invention keeps one and only one replica of each database fragment on each site, ensuring that each site can take over service of that database fragment alone. Reproduction of lost replicas inside the same site ensures that this condition holds also after repair of a failed node is completed. The definition of MISD above is general, but not sufficient for practical use. For a given number of nodes and sites, it is not trivial to find an assignment of fragment replica locations that satisfies the requirements of MISD, and if one is found there is no guarantee that this is the optimal assignment of fragment replica locations. In order for the MISD approach to be useable, a systematic approach for assignment of primary and hot standby fragments to nodes is needed. Transposed Matrix (TM) fragment assignment method Referring to Table 1, the result of applying the transposed matrix assignment method to a database fragmented into twenty-five fragments and distributed over five nodes at each of two sites (i.e., a total of ten nodes) is shown. For simplicity, in this example and in all the other examples discussed below, the same number of nodes N.sub.s (site) are used at each site. In Table 1, the numbers shown in the main part of the table are fragment numbers. Thus, in the first column under the "Transposed Matrix" heading, the numbers 0, 1, 2, 3, and 4 refer to database table fragments F.sup.0.sub.0, F.sup.0.sub.1, F.sup.0.sub.2, F.sup.0.sub.3, and F.sup.0.sub.4. Furthermore in Table 1 primary replicas are represented by bold numbers, and hot standby replicas are represented by numbers in normal, non-bolded type. Using the transposed matrix assignment method, it is preferred that the database to be stored in the system be fragmented into F fragments, where F is any number between kN.sub.s (N.sub.s -1) and kN.sub.s.sup.2, where N.sub.s is the number of nodes at each site and k is a positive integer. The preferred number of fragments is N.sub.s.sup.2. Fragments are assigned to nodes in the transposed matrix assignment method as follows. At a first site (site S.sub.0 in the example shown in Table 1), the fragments are assigned to nodes in a round-robin fashion, which each successive fragment being assigned to a sequentially next node. Then, all the fragments stored on each individual node at the first site are assigned to distinct ones of the nodes at the second site. Thus, if replicas of fragments A, B, C, D and E are stored on a single node at the first site, replicas of fragments A, B, C, D and E are each stored on different nodes at the second site. When viewing the fragment assignments as a matrix, as shown in Table 1, the fragment assignments for the second site are derived from the fragment assignments for the first site by transposing the matrix of fragment assignments for the first site. The transposed matrix assignment method is an optimal assignment scheme when just two sites are being used to store a relation, in that the intersection between any pair of fragment sets (i.e, one set from each of two distinct sites) is never greater than one fragment. However, the transposed matrix assignment method is applicable only to two site systems, and does not provide fragment assignments, optimal or otherwise, for additional sites. As shown in Table 1, primary and hot standbys are assigned in a checkerboard pattern. This is possible due to the transposition symmetry.
TABLE 1
__________________________________________________________________________
Fragment Assignment Schemes
Site
Node
Transposed Matrix
Rotated Columns
Empty Diagonal
__________________________________________________________________________
S.sub.0
N.sub.0.sup.0
0 5 10
15
20
0 5 10
15
20
0 5 10
15
N.sub.1.sup.0
1 6 11
16
21
1 6 11
16
21
1 6 11
16
N.sub.2.sup.0
2 7 12
17
22
2 7 12
17
22
2 7 12
17
N.sub.3.sup.0
3 8 13
18
23
3 8 13
18
23
3 8 13
18
N.sub.4.sup.0
4 9 14
19
24
4 9 14
19
24
4 9 14
19
S.sub.1
N.sub.0.sup.1
0 1 2 3 4 0 9 13
17
21 4 8 12
16
N.sub.1.sup.1
5 6 7 8 9 1 5 14
18
22
0 9 13
17
N.sub.2.sup.1
10
11
12
13
14
2 6 10
19
23
1 5 14
18
N.sub.3.sup.1
15
16
17
18
19
3 7 11
15
24
2 6 10 19
N.sub.4.sup.1
20
21
22
23
24
4 8 12
16
20
3 7 11
15
__________________________________________________________________________
Rotated Columns (RC) fragment assignment method Referring to Table 1, the rotated columns fragment assignment method assigns fragments to the nodes at a first site (e.g., S.sub.0 in Table 1) in a round-robin fashion. Then the rotated columns assignment method assigns fragments to nodes at each additional site in a round-robin fashion similar to the fragment location assignments for site S.sub.0, except that the starting node is shifted by Q nodes each time a column fills up, where Q is an integer "rotation quotient." The rotation quotient is set equal to 0 for site S.sub.0 and is set equal to 1 for site S.sub.1 in the example in Table 1. Thus, in Table 1 the table for fragment assignments for the second site S.sub.1 visually looks like each column is rotated one step down relative to the preceding column. While, in this example the same number of nodes N.sub.s are used at each site, the rotated column fragment assignment methodology of the present invention is equally applicable to a system with different numbers of nodes at each site, since the fragment assignment methodology for assigning fragments to nodes at any site is independent the fragment assignments on any other site. In the rotated columns methodology, each site is assigned a rotation quotient (Q). Each site must be assigned a different rotation quotient than the other sites. Furthermore, each rotation quotient must be an integer having a value between 0 and N.sub.s -1. The database fragments are assigned numerical indices ranging from 0 to F-1, where F is the number of fragments into which the database has been fragmented. While there are no restrictions on the selection of the value of F, in the preferred embodiment F is set equal to at least twice the number of nodes (2.multidot.N.sub.s) at the site with the smallest number of nodes. Database fragments are assigned to the nodes at a site in round robin order. At each site S, the first N.sub.s database fragments F.sup.s.sub.x, for x=0 to N.sub.s -1, are assigned sequentially to each of the nodes. Then, the assignment of the next N.sub.s fragments starts at a node shifted by a number of steps governed by the rotation quotient. For instance, as shown in the part of Table 1 entitled "Rotated Columns," twenty-five fragments are assigned to five nodes in site S.sub.0 with an assigned rotation quotient of 0, and twenty-five fragments are assigned to five nodes in site S.sub.1 with an assigned rotation quotient of 1. Primary fragment replicas and hot standby fragment replicas are preferably assigned in alternating columns, as shown in Table 1. Numerous other schemes for assigning primary and hot standby status will produce an equally even load distribution over the nodes. The rotated columns fragment assignment method is also applicable to multiple site database server systems that use a "read any write all" methodology (often called the read one write all methodology) instead of the primary/hot standby model of the preferred embodiment. In a "read any write all" system, each database fragment is still stored on nodes at two or more sites, but none of the database fragments is considered to be the primary replica. Rather, the server can access any of the replicas of a fragment for read access to a particular tuple or record in the database. When a tuple is updated, all replicas of the associated fragment must be updated (which is also true in the primary/standby systems). "Read any write all" systems thereby distribute the load associated with read accesses over all the nodes and database fragment replicas, which can be advantageous in some systems, especially systems with light write access loads and heavy read access loads and sites that are located far (e.g., over a thousand kilometers) from each other. Table 2 is a pseudocode representation of a procedure, called the MapNode function, for determining the node on which a fragment replica should be stored, including the node to which the fragment replica should be assigned when there is a failure of the node on which the fragment replica was previously stored. The MapNode function is designed for use in "distributed spare" systems or sites which do not have a dedicated spare node.
TABLE 2
______________________________________
Function MapNode (FragNo, S)
/* Pseudocode of procedure for mapping a fragment replica to
a node when a "distributed spare" is used (i.e., a dedicated
spare node is not used) */
/* Q(S) = rotation quotient for site S
For example, Q(s) might be equal to 0 for site
S.sub.0 and equal to 1 for site S.sub.1
state(n) = state of node, "up" or "down"
avail = number of nodes believed to be "up"
nodemap = array of nodes believed to be "up"
N.sub.S = the number of non-spare nodes at a particular
site S on which fragment replicas are
to be stored
*/
avail = N.sub.S
For vnode := 0 to avail-1
{
nodemap(vnode) := vnode
}
Do Forever
{
vnode := (FragNo + (FragNo div N.sub.S).multidot.Q(S))
modulo avail
If state(nodemap(vnode)) = up
Return nodemap(vnode)
/* node failure detected: */
avail := avail -1
if avail = 0 /* If no nodes are available,
abort procedure */
Return -1
/* remap nodes */
For i := vnode to avail-1
{
nodemap(i) = nodemap(i+1)
}
}
}
______________________________________
The MapNode function is called separately for every fragment replica. Thus a failure of node X on site Y does not affect the node assignment of fragment replicas on other nodes and sites. In other words, the MapNode function makes an initial node assignment for a specified fragment (FragNo) on a specified site (Site), and then leaves that assignment unchanged unless the node to which the fragment was originally assigned fails. In that case, the node indices for the nodes at the site of the failed node are "remapped" for purposes of providing a contiguous set of node indices (i.e., 0 to avail-1, where avail is the number of available node), and then the fragment previously stored on the failed node is assigned to another node using the same "fragment to node assignment" function used to make the initial node assignment, except that the assignment will now be different because the number of available nodes has changed and the node indices have been remapped onto the available nodes. More specifically, the MapNode function initially assigns each fragment F.sup.s.sub.x to a data processor y at site S in accordance with the following fragment to node assignment equation: y=(x+(x div N.sub.s).multidot.Q(S) ) modulo avail where y is the node index of the node on which fragment F.sup.s.sub.x is to be stored, N.sub.s is the number of data processors at site S.sub.1 used for storing database fragments, Q(S) is an integer "rotation quotient" between 0 and N.sub.s -1 where Q(S) is a distinct value for each said site, and avail is the number of the N.sub.s data processors that have not failed. Assuming that all processors are initially working, avail is initially equal to N.sub.s. When a node fails, the MapNode function will first remap the node indices for the non-failed nodes at the site of the failed node into a new contiguous set ranging from 0 to avail-1, where avail is the number of data processors that have not failed. Then the modulo function shown above is re-executed to determine a new node assignment for the fragment previously stored on the failed node. The fragment is then regenerated on the node associated with its new node assignment. The MapNode function can also be used for the initial assignment of node locations in systems or sites having a dedicated spare node. However, when a dedicated spare is available, new copies of the fragment replicas lost on the failed node are simply assigned to and created on the dedicate spare node at the same site as the failed node. Empty Diagonal (ED) fragment assignment method The empty diagonal fragment assignment method is a specialization of the rotated column method. The empty diagonal method assumes that the database is split into F=kN.sub.s (N.sub.s -1) fragments. The fragments are assigned in round-robin fashion at site S.sub.0. At site S.sub.1, the fragments are assigned in round-robin fashion, but for each N.sub.s 'th fragment one node is skipped, starting with the first node, as shown in Table 1. By studying Table 1 it can be seen that the rotated column and empty diagonal methods assign fragments identically except for the node enumeration at site S.sub.1. For example, the fragments assigned to node N.sup.1.sub.0 in the rotated columns method are the same as the fragments assigned to node N.sup.1.sub.1 in the empty diagonal method. Primary and hot standby status are assigned to the database fragments in the same way as for the rotated column method. When a node fails k(N.sub.s -1) fragment replicas are lost. Using "distributed sparing" each remaining node at the site of the failed node is assigned k new fragments, increasing the load on each such node by a factor of 1/(N.sub.s -1 ). This ensures an even load redistribution. When there is a second failure at a site, the reassigned fragments will usually result in a less than perfectly even load redistribution. While the present invention has been described with reference to a few specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
