System for merging virtual partitions of a distributed database4853843Abstract An object-oriented, distributed data base system separates into a plurality of virtual partitions following communication failure between sites accessing the data base. Each partition accesses a separate copy of an initial data base and independently updates groups of data objects included in the data base to add new versions of data objects to the data base. Each virtual partition maintains a copy of all previous versions of data objects and maintains a change list describing all group updates that it executes. Following restoration of communication between sites, each virtual partition merges the data bases maintained by separate partitions to form a consistent merged data base permitting versions of data objects and collections of data objects created by any one of the separate virtual partitions to be identified and accessed in the merged data base. Claims I claim: Description BACKGROUND OF THE INVENTION
TABLE I
______________________________________
A valid object base name is of the following
form:
"Entity --Name"
refers to the representative
that is the current version
of the entity.
"Entity --Name[time]"
refers to the version of the
entity that was current at
the specified time. A valid
Entity --Name is of the
following forms.
"Object --Name"
refers to the representative
that is the current version on
the principal path of the
object.
"Alternate --Path --Name"
refers to the representative
that is the current version of
an alternate path.
"Object --Name(i)"
where i is the implicit
alias, refers to the current
version of the ith alternate
path created in the object
named.
A valid Object --Name or Alternate --Path --Name is
an ordered concatenation of from one to three
name parts separated by the symbol " ".
"user --defined --name"
the strings of symbols,
excluding " ", as specified
by a user.
"user --name" the name of the user who
specified the string.
"site --name"
the name of the site on which
the string was specified.
______________________________________
Object names and path names are flat space names decomposable into three parts: a user-defined name, a user name, and a site name. The goal of a flat name space is to ensure that the user-defined portion of a name is unique. A user works under the assumption that the user-defined name is unique and can ignore the existence of the suffixed portions of the name. BASIC ACCESS CAPABILITIES IDHSS provides basic access to the IDHSS database through a "checkout/checkin" paradigm. Execution of a "checkout" gives the client local copies of a set of instantiations. The end-user is then free to modify each copy in an appropriate manner. (The tool program and the data model are responsible for monitoring and maintaining the semantics of the user's individual actions on the design data.) Once the end-user's copy of the design has again reached a consistent state (as deemed by the data model), the client may "checkin" the updated design. Checkin adds new current instances to the appropriate set of paths in a set of IDHSS objects. Checkout is a "multi-reader" request. Multireader is implemented by not setting a lock at the time of the checkout and providing each requesting client with a separate copy of the data. Checkin is a "multi-writer" request. Multi-writer means that every successful checkout may be followed by a checkin that will be accepted and saved by the storage system. If multiple checkouts have been performed on an object, the first checkin executed will result in a successful update of the object. Subsequent checkins will result in the creation of alternate but related versions of the object. FIG. 6A shows two clients performiong a duplicate checkout of the current version of two IDHSS objects, FIG. 6B shows the independent modification of the checked out data, and FIG. 6C shows the results produced by the checkin of the new versions. Versions included in the principal path of an object are shaded. For a user, the checkout-checkin sequence represents one user transaction. At the storage layer, two transactions are executed, first a checkout (read) and then a checkin (write). A transaction instance is a unit of locking and recovery. The storage system does not hold locks on data while it is checked out, thus no storage system transaction is in progress during that time. IDHSS supports the atomicity of storage layer transactions only. That is, either all results of the transaction will be in effect or none of the effects of the transaction will prevail. Support for the atomicity of user level transactions is handled by the layers above the storage system through the use of inverse storage system requests. DATA MANIPULATION IN IDHSS IDHSS supports a number of client requests for creating and manipulating IDHSS objects. We assume a session interaction model where the client opens and closes existing federations, and all intervening requests pertain to data in the object set of the open federation. Hereinbelow are described the requests of interest from the perspective of building a robust IDHSS. The message format for each request type is given using a standard notation for repetition where * means zero or more repetitions and + means one or more repetitions; groupings are enclosed in parentheses (); and optional terms are enclosed in square brackets. create object.sub.-- name data.sub.-- value A single IDHSS object is created by executing a create request. The request must specify a name for the object and an initial data value for the object. The proposed object name must be unique within the open federation. The name server site for the federation will approve or reject the name based on the grounds of uniqueness. If the object name is approved, the storage server copies the initial data value to stable storage on the local site and add new entries for this object to the system directory. The requesting user will be the owner of the object. The newly created object consists of a single instance that belongs to two paths, the principal path and the path having implicit alias one. At creation time the new object is assigned an immutable token that is unique system wide. In IDHSS the tokens consist of two parts the name of the site generating the token and a number unique to that site. The local site will be the primary site for the object. The primary site is the control site for the synchronization of future actions on the object. checkout existing.sub.-- name* Execution of a checkout provides the client with a readable copy of the requested data. The checkout request mut specify a set of valid names in the name space of the open federation. IDHSS maps the set of names to a set of data instances and retrieves a readable copy of each of those instances. Information on who requested the checkout and what data was involved is maintained by IDHSS. When data is acquired by a checkout, IDHSS assumes that the data will be copied and modified by the client, and eventually stored in the database via an update request. update (existing.sub.-- name new.sub.-- data.sub.-- value [replication.sub.-- factor])* The update request attempts to add a new temporal version to each specified path. Each update request must have been preceded by a checkout request for the set of paths to be updated. The set of new instance values is copied by the storage system to local stable storage. For each instance in the associated checkout, the primary site for the object containing that instance sets a lock on the object. This allows the updates to be serialized so that the first update based on instance X adds a new instance X' to the path and all subsequent updates based on instance X cause the creation of a new path with instance X as the root of that path. All late updates result in the implicit derivation of an alternate path followed by an update to the new alternate path. The client is notified of all late updates and the implicity alias numbers generated for the new paths. An update is rejected if any instance in the specified path was not checked out or if the paths being updated were deleted or erased after the data was checked out. A failed update request has no effect on the user's local data copy. The local data may be used to create a new object. delete object.sub.-- name An entire IDHSS object can be removed by executing a delete request. The name of the entity specified in the delete request must be the name of an object rather than the name of a path. The name of a deleted object may not be immediately available for reuse as a unique name. Site or communication failures may require that the physical deletion of all references to the object be delayed. In any case, the object is logically deleted at the completion of the delete request. erase name option The erase request removes instances from paths. The two options for erasing are erase one and erase all. The semantics of an erase one request is to remove the current instance in a path reverting to the previous instance as the current. Erase one is implemented as an update of the instance being removed, updating it to be the previous version in that path. This approach insures that there are no time gaps between the temporal versions in a path. Thus, all references to old versions dating back to the time of the creation of the path are well defined. The erase all request removes all instances in a single alternate path. The principal path cannot be erased. The name of the entity specified in the erase all request must be the name of a path. The name of the erased path may not be immediately available for reuse as a unique name due to postponed physical deletion. The implicit alias number corresponding to the erased path will never be reused. assign object.sub.-- name path.sub.-- name Execution of the assign request alters the mapping of a IDHSS object name in a manner contrary to the semantics of update. The assign request will force an object name to be mapped to the current instance of a specific path in the tree of instances. Semantically the assign allows the client to select a related alternate version of an object as the principal version of that object. TRANSACTION PROCESSING IN IDHSS The storage server is multi-threaded, that is, multiple requests are in a state of execution simultaneously. An interleaved processing of requests is serializable if the results produced are equivalent to some sequential processing of the requests. IDHSS produces a serializable execution of client write requests by employing primary site concurrency control. Each IDHSS object has an associated primary site. The primary site sequences the processing of requests on an object by functioning as a lock server for that object. If a request operates on object O, the requester must obtain a lock on O before the request can be serviced. FIG. 7 diagrams the steps required in processing a typical IDHSS write request. The steps are as follows: 1. Receive request from local client. 2. Request object lock from the primary site. 3. Lock is granted. 4. Process and commit the request locally. 5 Multicast the result of the request (one-phase commit). 6. Request that the object lock be released. 7. Lock release acknowledged. A client's local site must obtain all required locks from the appropriate primary sites before the request can be processed. Once the locks are obtained, the local site's directory contains accurate information on the locked entities. The local site processes the request using the information in the local directory. When the processing is completed the results of the request must be incorporated into the directory of each site in the federation. This is accomplished by sending a multicast message to the subset of sites on the network that belong to the federation of interest. When the remote sites have acknowledged that their processing is complete, the initiating site will release all locks held for the processing of this request. All IDHSS requests are processed atomically. This means that the results produced by a single transaction are committed to the database before they are seen by any other transaction. Processing of a multicast message is also performed atomically. Atomicity of execution is achieved by establishing for each transaction a separate task that holds all results until the transaction reaches its commit point. Once the commit point has been reached, the task force all results to the disk in one synchronous write. A synchronous write is a blocking event for the entire storage system; thus no storage system processing can proceed before the write completes. The normal processing protocol illustrated by FIG. 7 maintains the internal consistency and mutual consistency of all sites in the federation. Given that communication or site failures may occur at any step in the processing sequence, this simple protocol is augmented to maintain internal consistency and mutual consistency in the faulty environment. GROUP TRANSACTIONS Providing a checkout and checkin facility for single objects is not sufficient support for the upper layers of a CAE database system. The application tool layer and the representation layer work with complex objects which may be decomposed into many sub-objects. Each sub-object is stored as a single IDHSS object. These upper layers will request data in terms of complex objects. The configuration management layer will decompose a complex object request into its many subcomponents in order to provide the higher layer with a configuration of the complex object. A configuration is a version of a complex object and is constructed by combining a version of each subcomponent of the complex object. A "configuration specification" is stored as an object which refers the sub-objects. A configuration specification may make fixed references or floating references to the sub-objects. Fixed references always map to the same version of a sub-object. Floating references map to the version that is current at the time the reference is processed. When a configuration specification contains floating references, the storage system must provide a checkout and checkin facility for groups of objects in order to provide the user with a consistent version of the complex object being operated on. Storage system requests that operate atomically on a group of IDHSS objects are called "group transactions". CONSISTENCY IN GROUP CHECKOUT In accordance with the invention, the checkout and update transactions of IDHSS operate on a group of data items from one or more IDHSS objects. The group of instances are the subcomponents in a configuration as referenced by a configuration specification. When a group checkout is executed, IDHSS records the status of each item as either "current of a principal path", "current of an alternate path", or "non-current". Use of a group checkout insures the client of configuration read consistency. A set of instances is "configuration read consistent" if: 1. the set of instances is specified as a configuration by the configuration management layer; and 2. in the configuration, all instances named by floating reference will be simultaneously current versions. If a group checkout requesting current instances of a set of objects were carried out as a sequence of singleton checkouts, property 2 of configuration read consistency could be violated. If the local site processes an update affecting those instances yet to be read, the final result of the sequence of reads will not produce simultaneously current versions of the instances. Thus, group checkout is necessary to provide the client with a consistent set of data values. Execution of a group checkout is complicated in that a local site may have to request data copies from several distinct remote sites. Neither local locking nor remote locking is required in the processing of a group checkout because all transactions are executed atomically. This ensures that the local site successfully maps the names in a group checkout to a configuration consistent set of instances. CONSISTENCY IN GROUP UPDATE In accordance with the invention, a client may execute a group update only on a set of instances obtained by a group checkout. A group update must maintain the configuration write consistency of the instances being updated. Hereinbelow are defined a set of rules for processing group updates such that adherence to these rules will maintain configuration write consistency among the items in a group update. The intention of the update rules is to form an alternate configuration object when a principal configuration cannot be updated consistently. A modified definition of configuration write consistency and a modified set of update rules may be desired for maintaining configuration write consistency when modeling other applications such as computer-aided software engineering. For a CAE application, a set of instances is considered configuration write consistent if: 1. the set of instances is specified as a configuration by the configuration management layer; 2. every instance updated will be current at the same time; and 3. every temporal version of the configuration is completely contained in temporal versions of principal paths or is completely contained in temporal versions of alternate paths. The following set of group update rules based on this definition maintain configuration write consistency: 1. If all instances are the current version of a principal path, apply the updates to the principal paths producing new temporal versions of each object in the group. 2. If all instances are the current version of an alternate path, apply the updates to the alternate paths producing new temporal version of each alternate path in the group. 3. If some subgroup of the instances are the current version of an alternate path and the other instances are non-current versions of their respective paths, for each of the non-current instances create and substitute a new alternate path rooted at that instance and carry out the updates according to rule 2. 4. If some, but not all, of the instances are the current versions of a principal path, for each instance that is the current version of a principal path, create and substitute a new alternate path rooted at that instance and carry out the updates according to rule 3. LOCKING DURING GROUP UPDATE Execution of a group update requires that a lock be obtained from the appropriate primary site for each entity in the group. Thus, steps two and three of FIG. 7 must be repeated until all necessary locks have been obtained. Similarly, steps six and seven in FIG. 7, must be repeated until all the acquired locks have been released. When multiple locks are requested by multiple transactions, deadlock may occur. If a lock is currently held by a transaction Ti and transaction Tj requests the lock, transaction Tj enters a transaction.sub.-- wait.sub.-- for relationship with transaction Ti. Transaction Ti in turn may be in a transaction.sub.-- wait.sub.-- for relationship with another transaction. If a cycle exists in the transaction.sub.-- wait.sub.-- for relationship a deadlock has occurred. To deal with this problem IDHSS employs a deadlock avoidance protocol. The two requirements for using deadlock avoidance are satisfied by IDHSS. 1. Every request declares its read set prior to executing any reads and the write set is contained in the read set or introduces completely new data items. 2. Each IDHSS object has an associated immutable token. The tokens are totally ordered so that the set of IDHSS objects is totally ordered PROCESSING A GROUP UPDATE All IDHSS transactions request locks on IDHSS objects in increasing order of the tokens associated with those objects. When all locks have been granted for a group update, the initiating site compares the status of the entities at the time of the checkout with their current status and processes them according to the four group update rules given above. If new alternate paths must be derived, the local site does so by requesting a new implicit alias number from the primary site for the IDHSS object and adding a new path to the tree. The entire tree of versions are held by the lock. Thus, no new locks need to be obtained in order to derive new alternate paths. When all the required new paths have been derived, the updates are performed on the new set of updatable paths. Once the update has been committed locally, the results of the update are multicast to all the sites in the effected federation. A lock release request is sent to the appropriate primary site for each IDHSS object involved in the update. After all the processing steps of FIG. 7 have been completed, the requesting user can be informed of the outcome of the request. ROBUST DATA ACCESS An "optimistic robust distributed storage system" is defined as a system that, even when system components have failed, will successfully and optimistically process transactions, will preserve internal consistency at every site, will preserve mutual consistency among groups of communicating sites, and will preserve configuration write consistency among groups of communicating sites. IDHSS supports optimistic robustness through: 1. a set of rules and mechanisms to support maximum access to user data, and 2. a set of recovery protocols for maintaining internal consistency, mutual consistency, and configuration write consistency among groups of communicating sites. In order to maximize user access when sites or communications have failed, IDHSS supports the following "Optimistic Access Rule" for the processing of transactions: given a running site and an authorized user, an entity may be checked out if a copy of all parts of the entity can be acquired. Any entity that has been previously checked out may be updated. The requests assign, name, delete, erase, derive, set.sub.-- permission, grant, or deny may be processed if they affect IDHSS objects that are known to exist. Requests to create new IDHSS objects are always processed. Under the optimistic access rule almost all requests submitted at running sites during the failure of other components of the distributed system are executed. Requests that cannot be processed include: a checkout request for data that is not accessible due to a failure; and an update request on data that cannot be checked out. Requests such as assign, name, delete, erase, derive, set.sub.-- permission, grant and deny can be performed on an object O only if the site processing the request knows the object O exists. If object O does exist, information about O is contained in the site's system directory. IDHSS can support robust optimistic access control because under normal processing conditions IDHSS ignores the possibility of update conflicts until they actually occur. If a failure partitions the network, IDHSS may unknowingly process conflicting updates in separate partitions. In some sense the system ignores these conflicts until the failure is repaired and the partitions are brought back together. When the partitions are merged, the conflicting updates occur but can be managed, as they would be in a failure free environment, by implicitly deriving alternate versions. The general philosophy is that the results of conflicting requests will be managed by merging the results of the request or by allowing one result to prevail over another. If IDHSS is to support the optimistic access rule, the system must continue to process requests under most every possible circumstance. Processing a IDHSS request may require a combination of four types of service: 1. approval of a name by the federation's name server site, 2. locking of a IDHSS object by the primary site associated with that object, 3. generation of a unique implicit alias number by the primary site associated with the affected IDHSS object, and 4. providing another site with a copy of user data stored by the local site. To continue processing requests when failures occur, the services provided by sites other than the local site must be performed by a replacement site. Service support must be capable of migrating among the sites belonging to a federation. Service responsibilities are migrated from one site to another by the use of pseudo name server sites, pseudo primary sites, and multiple copy sites. A "pseudo name server" site is one site that acts as a temporary replacement for the federation's true name server site. A "pseudo primary" site for a IDHSS object is one site that acts as a temporary replacement for the object's true primary site. If a site holds a copy of a data item, that site is a copy site for the data item. To provide checkout capability during a failure, each data item will have multiple copy sites. Placing multiple copies of each instance of data on distinct sites is done whether or not the distributed system has suffered any component failures. In contrast, designating and using a pseudo name server site or a pseudo primary site is an action that is taken only when a failure is noted by some site in the system. This implies a need for a mechanism for keeping track of failures noted by active sites. The mechanism utilized by IDHSS is "virtual partitioning". A VIRTUAL PARTITION PROTOCOL A virtual partition is a group of sites in one federation that have agreed that they will talk only to each other for the processing of transactions. Sites belong to different virtual partitions for each federation to which the site belongs. If failures did not occur and no new sites ever enrolled in the federation, the virtual partition would consist of all sites belong to the federation. A management protocol allows the list of sites belonging to a virtual partition to be modified. Detection of a possible failure or detection of the correction of a possible failure signals a need to modify the list of sites in the virtual partition. The virtual partition loses sites from the group when a failure occurs and adds sites to the group when a failure is corrected. Every virtual partition has a unique name, the set of all partition names generated forms an ordered set, and the names are generated in an increasing sequence. In IDHSS, virtual partition names are decomposable into two parts. The first part is a count that reflects the number of times the virtual partition membership has been modified. We refer to this count as the "level number" for the virtual partition. The second part is the name of the site that initiated the virtual partition. When a federation is defined by a client on a site S, the name of the first virtual partition for that federation is 1S and site S is the only member of partition 1S. When a request is processed by the sites in a virtual partition, the results written by that request are tagged with the name of the partition. This tag represents the fact that the sites belonging to the virtual partition named must have been informed of the result. A virtual partition is an attempt to form mutually exclusive groups of sites that have two-way communication with each other. A list of the sites belonging to a virtual partition may not be correct in that sites which are not reachable may be in the list and sites that are reachable may not be in the list. The site list for a virtual partition may require modification if any site in the virtual partition receives a message from a site not in the virtual partition, or if any site in the virtual partition fails to receive a response from a site belonging to the virtual partition. If a site S believes that the list of sites belonging to the virtual partition is incorrect, the site S must initiate a new virtual partition. Formation of a new virtual partition is achieved by a two-phase protocol. Appendix 1 presents an algorithm for initiating a new virtual partition. In phase one, the initiator site S first constructs a new virtual partition name by adding two to the current virtual partition's level number or the largest level number proposed in this federation, and appending the site name S. Next, site X sends a message to all known sites in the federation inviting them to join a new virtual partition. The message contains the new proposed virtual partition name and a list of all the sites believed to be members of the federation. The sites receiving the invitation examine the proposed virtual partition name and the list of sites. Appendix 2 presents an algorithm for processing virtual partition invitations. A receiving site accepts the invitation if the proposed virtual partition name is greater than the name of the partition it currently belongs to or of a partition name it has accepted, and if the list of sites contained in the invitation includes all sites believed to be members of the federation. If the invitation is acceptable, the receiving site sends an acceptance message to the initiating site. An acceptance message contains the name of the site and the virtual partition history of that site. A virtual partition history is a list of the names of all of the virtual partitions a site has been a member of. A receiving site rejects the invitation if the proposed virtual partition name is too small or if any sites that are known to be members of the federation are missing from the list of sites contained in the invitation. If the invitation is rejected and the receiving site has not already accepted a better invitation, the receiving site must initiate of a new virtual partition. The initiating site associates a timeout interval with each invitation sent. If the time interval elapses without receiving a reply from the site, that site will not be included in the new virtual partition. When each invitation has either timed out or been replied to, the initiating site enters phase two of the protocol and sends a commit message to those sites that replied. The commit message contains the name of the new virtual partition, each unique virtual partition history, and the list of sites that reported that partition history. Each site that received an invitation sets a timeout on the commit message associated with that invitation. If the commit message does not arrive within the time limit, the receiving site will attempt to initiate a new virtual partition. If the commit message arrives within the time limit, the receiving site enters phase two of the partition protocol. If a site is running, it must be in one of four states: normal, proposal, acceptance, or recovery. Transitions among these states are controlled by the virtual partition negotiation protocol. FIG. 8 is a state diagram showing the possible transitions and the conditions that cause those transitions. With reference to FIG. 8, the following conditions cause the transitions shown: A. Receive a message from a site outside of current virtual partition. B. Fail to receive a replay or response form a remote site within a specified time limit. C. Accept your local site proposal. D. Receive a better proposal. E. Receive the commit message. F. Fail to receive the commit message within a specified time limit. G. Receive a proposal with an unacceptable proposed partition name but announces the existence of new sites in the federation. H. The recovery is completed and the site commits to a new virtual partition. I. Fail to receive a message from a committed site within a specified time limit. The negotiation protocol will generally converge and return all sites to the normal state. There are environments in which the protocol will cause all sites to cycle between the proposal and acceptance states or the proposal, acceptance, and recovery states. When the sites eventually enter the recovery state, each site determines what type of recovery is taking place and executes an appropriate consistency protocol that ensures that the local site is mutually consistent with the other sites in the new virtual partition. Once mutual consistency has been achieved, each site officially commits itself to being a member of the new virtual partition. Committing to membership in a virtual partition means that the site remembers the name of the new virtual partition, remembers the list of sites belonging to the new virtual partition, and uses the new virtual partition name as the tag value for the results of each write request. Two recovery protocols are employed to ensure the mutual consistency of the sites in a new virtual partition. If all of the sites in the new virtual partition reported the same virtual partition history, the virtual partition will perform a divergence recovery. If any site reports a disparate virtual partition history, the virtual partition will perform a merge recovery. PSEUDO NAME SERVER SITES To support optimistic access, INDHSS uses pseudo name server sites as a means of providing name server functionality in each virtual partition. A pseudo name server site is a substitute for the true federation name server site. When a federation is defined, the site at which the define request was submitted becomes the true name server site for the federation. When processing transactions, communication is limited to those sites in the current virtual partition. If the true name server site is not in the current virtual partition, a pseudo name server site must be selected. The selection process requires the existence of a total ordering on the set of sites. At least two natural total orderings exist: order the sites alphabetically by site name; or order the sites by the network address associated with each site. If the true name server site is not in the current virtual partition, the smallest site, according to the ordering on sites, will become the pseudo name server for this virtual partition. This protocol ensures that each virtual partition contains exactly one site acting as the name server for the partition. Each name server site maintains a flat name space within its respective virtual partition by preventing sites within that virtual partition from creating duplicate object base names. Duplicate object base names are avoided by appending a suffix to each duplicate name, the suffix including first the name of the user that created the object base name, and second (if necessary) the name of the site on which the creating request was submitted. When an object base name is created, the name of the creating user and the name of the site on which the creating request was submitted are saved with the user-defined portion of the name. When requests are processed by the storage server, each object base name specified in the request is looked up in the system directory. If multiple directory entries are found for the user-defined portion of the name, the user receives a reply that the name is ambiguous. The user may then query the storage system using the name.sub.-- match request to obtain a list of fully modified names with matching prefixes. The user may repeat the original command using a modified name which is unique. If the user-defined portion of a name is unique over the set of known object base names, the normal reference using only the user-defined portion of the object base name is sufficient. PSEUDO PRIMARY SITES To support optimistic access, IDHSS uses pseudo primary sites as a means of providing primary site functionality for each accessible object in each virtual partition. Primary sites are responsible for granting locks on objects, thus guaranteeing sequenced updates to those objects. The true primary site is the site on which the IDHSS object was created. When a write request is processed by the storage system, each IDHSS object affected by the request must be locked. Request processing is always limited to those sites in the current virtual partition. If one or more of the required primary sites are not in the current virtual partition, each of the missing primary sites must be replaced by a pseudo primary site. A pseudo primary site substitutes for the true primary site by granting locks on behalf of the true primary site. When a site other than the true primary site receives a request for a lock, that site must check to see that it has been elected as a pseudo primary site for the data item being locked. If the site is not the pseudo primary site, an appropriate reply is returned to the site that requested the lock. The requesting site may seek to become the pseudo primary site. The process of selecting a pseudo primary site is based on a two phase selection protocol. If a client, local to a site S, requests a write operation on a data item X, site S must obtain a lock from the primary site for X before processing the request. If the true primary site for X is not in the current virtual partition of site S, site S must determine whether or not a pseudo primary site for X exists in the current virtual partition. A pseudo primary site for X exists if the local site believes that a pseudo primary site has been elected, no merge recoveries have been performed since the election completed, and the elected pseudo primary site is in the current virtual partition. If neither the true primary nor a pseudo primary site for X is in the current virtual partition, site S will initiate a two phase election of a pseudo primary site. Appendix 3 presents an algorithm for nominating a pseudo primary site. In phase one, site S will nominate itself as the pseudo primary site and all sites in the current partition will cast a vote. If elected, in phase two site S will send a commit message to all sites in the current partition, receive an ackowledgment from each site in the current partition, and then consider itself to be an elected pseudo primary site. This two phase protocol is similar to the asynchronous-timeout-based negotiation protocol for forming a virtual partition. Appendix 4 presents an algorithm used by the voting sites. When a site receives an election proposal, that site must send a reply accepting or rejecting the nomination. A site will reply with a rejection only if it has previously received a nomination for a site that is larger than the newly proposed site according to the total order on sites or it has already committed, during the current virtual partition, to a different pseudo primary site. After accepting one nomination, a site not yet committed to that nomination may accept a better nomination. Eventually the largest site that desires to be the pseudo primary site for data item X will be elected. If a failure or timeout occurs during the election, a new virtual partition must be formed. Recovery and pseudo primary site election are interwoven in that recovery will terminate pseudo primary site elections. When the recovery is completed, the request that required the lock must re-evaluate the accessibility of a primary site in the new current virtual partition. If a primary site is not a member of the new virtual partition, the election process will be initiated again. Once a pseudo primary site is elected, that site is used as the primary site until the pseudo primary site is no longer in the current virtual partition or a merge recovery is performed. When a merge recovery is performed, all pseudo primary sites must relinquish their position. Elected pseudo primary sites continue to function in any new virtual partition formed by a divergence recovery. In a divergence recovery, we are ensured that the true primary site could not have rejoined our virtual partition. If all sites in the larger partition agreed on site S as the pseudo primary site for X, all sites in the new reduced partition hold the same agreement; thus, we may continue to use the elected pseudo primary site if that site remains in our virtual partition. When a merge recovery is performed, the sites in each of the merging partitions may have different pseudo primary sites for X; therefore, pseudo primary sites must relinquish their status following a merge recovery. REPLICATION OF DATA To support optimistic access in IDHSS, multiple copies of each item of user data are maintained. In most systems, data replication enhances only read access. In IDHSS, data replication enhances both read access and write access because read access is the only necessary condition for write access. The create request and the update request cause new data to be stored in the database. In each of these commands the user may optionally specify a replication factor for the new data item. A replication factor of one may be specified. In this case, the system will make only one copy of the data item even though this greatly increases the probability that the data could become inaccessible for reading and consequently inaccessible for updating. If no replication factor is specified, a default replication factor is used. Each federation is assigned a default replication factor when the federation is defined; either the user defining the federation specifies a factor to be used as the federation wide default or the IDHSS system wide default factor (currently two) is used. When a create request or an update request is processed, the first copy created is stored on the site at which the request was issued and processed. When the results of the request are multicast to all sites in the current virtual partition, the multicast message also contains a list of the sites that should store a redundant copy of the data. When such a multicast message is received and processed, the site that issued the message is recorded as holding a copy of the data item; all other sites listed are recorded as designated but not yet confirmed copy sites. If a site finds its name among the list of designated copy sites, that site must attempt to obtain a copy of the data from some site that is known to have a copy. Once a site has obtained, a redundant copy of a data item, that site will multicast this fact to all sites in the current partition, thus altering its status from a "designated" copy site to a "confirmed" copy site. An algorithm for constructing the list of N designated copy sites is presented in detail in Appendix 5. The main factors to be considered in selecting placement sites are which sites are designated as fileserver sites for the federation, which sites are in the current virtual partition, which sites store previous versions of the data item, which users are permitted access to the data, and how much storage space has already been contributed by each site. If at least (N-1) filserver sites are in the current virtual partition, select (N-1) of the fileserver sites as designated copy sites giving preference to those fileserver sites that stored the previous temporal version of this data item or have contributed the least storage space to storing IDHSS objects. If the current partition does hot contain sufficient fileserver sites, select other sites giving preference to those sites that stored the previous temporal version of this data item or have contributed the least storage space to storing IDHSS objects. The selection algorithm may be forced to select sites that are not members of the current virtual partition. When a merge recovery is performed, these sites will learn of their selection as designated copy sites. After the recovery is completed, these designated sites request a data copy from a site that is known to store a copy. Once a copy is obtained, the site will multicast this fact to all sites in the current virtual partition, thus changing its status from a designated copy site to a confirmed copy site. In the case of a create request or an update request that adds a new path, the selection algorithm gives preference to fileserver sites and those sites that have contributed the least amount of storage space. For an update to an existing path, preference is given to sites that store the previous temporal versions of that data item. These sites are preferred because we can make use of the ancestral relationship of data instances to reduce the storage space required to store multiple versions of an object. If a site stores two temporal versions of a data item, the most recent version will be stored as a full copy and the older version can be stored as a backward difference based on the newer version. A difference is conceptually an errata list specifying the differences between an old version and a new version of a data item. A backward difference is a list of changes that must be applied to the newer version to produce the older version. If a site acquires a full copy of a new data instance I and the site has a full copy of a previous temporal version O of this same instance, the old full copy O is replaced with a backward difference with respect to the new copy I. The instances O and I participate in a differenced.sub.-- from relationship for the local site; instance O is differenced.sub.-- from instance I. Backward differences are used rather than forward differences so that we may store full copies of all current instances. This saves time when processing checkouts because most checkout requests are for current instances which are always stored as full copies. If a user requests a checkout of an older version and that version is stored as a difference on site S, the storage server on site S must reconstruct the desired version. The storage system directory contains information on the differenced.sub.-- from relationship for the local site so we know how to reconstruct older versions. Reconstruction of an older version is accomplished by following the differenced.sub.-- from relationship forward to locate a full copy and then backwards, applying the changes stored in each difference in the order specified by the reverse of the differenced.sub.-- from relationship. FIG. 9 shows three consecutive temporal versions, instances 1, 2 and 3, from one path of a IDHSS object. A possible placement of those versions on three sites S1, S2, and S3 assuming a replication factor of two is as follows: Instance 1: sites S.sub.1 and S.sub.2 Instance 2: sites S.sub.1 and S.sub.3 Instance 3: sites S.sub.2 and S.sub.3. An update creates a new path instantiated by instance 1 and places a copy of instance 1 on sites S1 and S2. A checkout and update are performed on instance 1 to create instance 2 and copies are placed on sites S1 and S3. Another checkout and update are performed on instance 2 to create instance 3 and copies are placed on sites S2 and S3. The local representation of each instance and the differenced.sub.-- from relationship are recorded in the system directory at each site. If the storage server at site S1 is asked to provide a copy of instance 1, the directory at S1 states that instance 1 is stored as a difference and that it was differenced from instance 2. Since instance 2 is stored as a full copy at S1, the storage server applies the changes for instance 1 to the full copy of instance 2 producing a full copy of instance 1. The process is summarized in tabular form below:
______________________________________
Create Version 1:
version format differenced from
______________________________________
Site S.sub.1 1 full null
Site S.sub.2 1 full null
Site S.sub.3 1 none null
______________________________________
Update Produces Version 2:
version format differenced from
______________________________________
Site S.sub.1 1 diff 2
2 full null
Site S.sub.2 1 full null
2 none null
Site S.sub.3 1 none null
2 full null
______________________________________
Update Produces Version 3
version format differenced from
______________________________________
Site S.sub.1 1 diff 2
2 full null
3 none null
Site S.sub.2 1 diff 3
2 none null
3 full null
Site S.sub.3 1 none null
2 diff 3
3 full null
______________________________________
MAINTAINING CONSISTENCY A requirement for achieving robustness is a set of recovery protocols augmenting normal transaction processing to maintain internal consistency, mutual consistency, and configuration write consistency among groups of communicating sites. A virtual partition represents a group of communicating sites, thus we must preserve consistency within each virtual partition. In a failure free environment, normal request processing is sufficient for maintaining internal consistency, mutual consistency, and configuration write consistency. A properly functioning virtual partition is a failure free environment. When membership in one virtual partition must be abandoned for membership in another virtual partition, the assumption of a failure free environment has been violated. The normal request processing protocol is insufficient for maintaining consistency when transitioning from one virtual partition to another. Virtual partitions reconfigure when sites cannot be reached or when sites considered to be unreachable begin communicating. One virtual partition splits into two or more virtual partitions when a site crashes or a communication failure occurs. First the failure must be detected. Next, a new virtual partition is formed, and finally any inconsistencies among the sites in the new virtual partition must be corrected. A site A detects a failure if any site B in the current virtual partition fails to reply to a service request within some time limit, or fails to acknowledge a multicast message within some time limit. If a site fails to reply to a service request, we must reconfigure the virtual partition so that a pseudo service site can be selected and the client's request can be processed. If a site fails to acknowledge the processing of a multicast message, a mutual inconsistency may exist between the site that sent the multicast message and the receiving site. Any site that believes it has detected a failure must initiate a new virtual partition. A failed site may cause a current inconsistency among the surviving sites. The source of this inconsistency is seen by studying the processing of requests on a site. A site processes requests by obtaining locks from primary sites, performing local processing of the request, and multicasting the results of the request to all other sites in the virtual partition. Few network facilities provide true multicast communications that are reliable. True multicast communication is the sending of a single message to a selected set of sites by placing the message on the network exactly once. Reliable multicast would ensure that either all of the selected sites received the message or none of the selected sites received the message. In general, multicast communication is simulated by point-to-point communication between the sender and each receiver of the message. This means that the single message must be sent repeatedly, once to each destination site. If a site was in the process of multicasting a message by a sequence of point-to-point messages and the site crashes, some of the destination sites will receive the message and some will not receive the message. This results in an inconsistency among the surviving sites. Clearly, part of the solution to this problem is to require the sites that did receive the failed site's last message to propagate that message to those sites that did not receive the last message. Such a propagation strategy is implemented by placing restrictions on the sending of multicast messages, by logging each multicast message, and by requiring some synchronization during the processing of a request. Sending of multicast messages is restricted such that each site may send at most one multicast message at any time. Requests that have completed their local processing must queue for the multicasting of their results. Each site will log the most recent multicast message received from every site including itself. Each multicast message contains a sequence number. Multicast message sequence numbers are generated in increasing order by each site for the multicast messages they send. The multicast message sequence number will be logged by each receiving site as part of the multicast message. These sequence numbers are part of the information which will be passed during the two phase negotiation of a new virtual partition. Table II shows the information that must be exchanged by the virtual partition initiator and the members of the virtual partition during the two-phase virtual partition negotiation.
TABLE II
______________________________________
1. The Virtual Partition Initiator sends an
Invitation containing:
Proposed Virtual Partition Name
List of all sites known to be members of the
federation
2. A Site receives the invitation, and sends a
reply containing:
Local site name
The virtual partition membership history for
the local site
For each site in the current virtual
partition, one pair containing:
site name
sequence number of the last multicast
message logged for this site
3. The Virtual Partition Initiator sends a Commit
message containing:
Name of the new virtual partition
For each unique virtual partition history
reported, a list containing:
The virtual partition history
A list of the sites that reported this
history
For each site reported by a site with
this history, one triple containing:
site name
multicast message sequence number
name of a holding site that reported
this sequence number
______________________________________
During virtual partition negotiations, each site receiving an invitation will include, in the acceptance message, multicast sequence number information for each site in its current virtual partition. The information will be a list of pairs: <site name, sequence number contained in the last multicast message received from this site>. The virtual partition initiator will process the pairs. If more than one sequence number is reported for a site S by two sites in the same virtual partition, an inconsistency exists. Because a site can be in the process of sending at most one multicast message at any time, and because remote sites must acknowledge processing each multicast message, there may be at most two distinct sequence numbers reported by sites that were members of the same virtual partition. That is, a site may miss at most one multicast message before a virtual partition negotiation and recovery is executed. For each site reported in a pair, the initiator will add an ordered triple to the commit message of the new virtual partition. Each ordered triple contains: <site name, largest sequence number reported, name of site reporting this sequence number>. The virtual partition initiator will multicast the commit message to those sites that agreed to join the new partition. When the commit message is received, each site will compare the sequence number in each triple with the appropriate message in the multicast message log. If the sequence number in the log is smaller than the sequence number in specified in the commit message, a copy of the missed multicast message must be acquired from the site specified in the triple. When the multicast message is received, it is logged in the multicast message log and processed to completion. This process is performed for each triple in the multicast message. If multiple multicast messages were missed and had to be acquired, they may be processed in any order because they represent the results of non-conflicting requests. When all multicast sequence numbers have been compared and all missing messages obtained and processed, the site commits to membership in the new virtual partition. Thus, divergence recovery is a three-phase protocol: the two-phase virtual partition negotiation and the propagation of missed multicast messages. Appendix 6 presents an algorithm used by the virtual partition initiator to process replies of all joining sites. Appendix 7 presents an algorithm for message propagation as performed by all members of the newly formed partition. This message propagation protocol is effective in that all sites remaining in a virtual partition formed by a divergence recovery will eventually process the same set of requests. PREEMPTION OF REQUESTS When a divergence recovery is completed, the sites remaining in the newly formed partition will process new requests in the failure free environment of the new virtual partition. Normal request processing in a failure free environment is sufficient to maintain internal, mutual, and configuration write consistency. These results assume that the system has been failure free from time zero when the storage system was empty and zero requests had been processed. In particular, it assumes that no requests are being processed when the failure free environment begins. This assumption does not hold for the environment at time zero of the new virtual partition. Requests that are being executed at the time recovery processing begins are suspended. These suspended requests will either be resumed or backed out and restarted. Divergence recovery does not require that all requests in progress at the initiation of recovery be backed out and restarted. IDHSS selectively performs backout and restart on requests that may cause a loss of mutual consistency in the new partition and requests that may cause hung requests in the new partition. A hung request is one that is blocked by a request that is no longer active in the current virtual partition. AVOIDING HUNG REQUESTS When a divergence recovery is completed, the remaining sites should be able to process almost all requests. The only restriction is that the sites must know of the existence of the data and a copy of the data must reside on at least one site in the new partition. The ability to continue processing after a recovery may be hindered by side effects of requests that were being processed by sites excluded from the new virtual partition. In particular, requests that are partially processed by an excluded site may have set locks at primary sites. The primary sites may be members of the new virtual partition. Because the requesting site has been excluded from the new partition, these resources will not be released by the requesting site. If a new request requires a lock that is held by a partially processed request on an excluded site, the new request would be hung. A request R is hung if R is blocked by a sequence of requests and one of the requests in the sequence is no longer active in the current virtual path. To avoid hung requests, every site in the new virtual partition must release locks that are held by requests being executed on sites which are now excluded from the new partition. As part of divergence recovery, each primary site must scan the local lock table discarding all entries requested by sites that are not members of the new partition. If a request R has its locks released for this reason, from the perspective of the newly formed partition, R has been preempted. A request R is "preempted" if all locks currently held by R are released, R is backed out, and R is restarted. The locks granted to request R are released by one partition and the backout and restart of request R is performed in another partition that has as a member the site which initiated R. Pending name reservations at a name server site are managed in the same manner. If the site that requested the pre-reservation of a name is not a member of a new partition formed by a divergence recovery, the name server site will release the name reservation. The pre-reserved name may be reused by any site remaining in the partition following the recovery. PREPARING FOR FUTURE MERGE RECOVERIES A failed site may cause future inconsistencies among the sites in a virtual partition. If the failed site was detected by one or more sites sending a multicast message of their latest results, these results would have an associated virtual partition tag value that was the name of the pre-failure virtual partition. The purpose of the tag value is to signify that all sites belonging to that virtual partition have knowledge of those results. This constraint may be violated when a site fails. In particular, the failed site will not have knowledge of the results specified in a multicast message that caused the discovery of the site failure. These erroneous tag values will cause a future inconsistency, because there is no way to discover that the failed site does not have knowledge of these particular results. The solution to this problem is to retag the results specified in each of the logged multicast messages, using a future virtual partition name as the tag value. This has the effect of pushing the results forward in partitioned time. This establishes the state required for merge recovery to locate and process all information that may be unknown to a site which is attempting to join a virtual partition. Each site participating in the divergence recovery must scan the multicast message log and retag the results specified in each multicast message. The results specified by a multicast message should be retagged only once, thus each multicast log entry has an associated status regarding retagging. When a multicast message is written to the log, its status is retaggable. If the results specified by a multicast message are retagged due to a divergence recovery, the status is altered to not retaggable. The virtual partition name to be used for retagging is formed from the name of the new virtual partition. If the new virtual partition name consists of level number N and site name S, the retag partition name consists of level number (N-1) and site name S. A retag partition name exists in the gap between the old partition name and the new partition name. All sites that complete the divergence recovery must record the new virtual partition name as part of their partition history; membership in the retag partition is implied by membership in the new virtual partition. Appendix 8 provides an algorithm for committing a divergence recovery. The protocol for divergence recovery as presented is sufficient to maintain mutual consistency among the sites in the newly formed virtual partition. MERGE RECOVERY Two or more virtual partitions merge into one virtual partition when a site recovers from a crash, or a communication failure is corrected, or a new site enrolls in a federation. First, the correction of the failure or the enrollment must be detected. Next, a new virtual partition is formed, and finally any inconsistencies and conflicts among the sites in the new virtual partition must be corrected. If a site wishes to enroll in a federation that is currently unknown at that site, the site is easily enrolled by a merge recovery. The merge recovery brings the new site into the federation, propagates all of the directory information to the new site, and merges the site into a virtual partition for that federation. An enrolling site E causes a merge recovery to take place by acting like a crashed site that was in a single site partition name OE. If the merge recovery is necessitated by the restart of a crashed site, the failed site must perform a local recovery to stabilize itself prior to merging with other sites in the federation. CRASH SITE RECOVERY When a crashed site is restarted, that site must bring the local IDHSS system directory to a stable state prior to communicating with other sites. The purpose of the stabilization process is three fold: first, we must test for media failure due to the site crash; second, we must ensure that each request was executed atomically; third, we must ensure that results committed by this site are already known by other sites or will be propagated to all other sites by future merge recoveries. One side effect of a site crash is that all locks and name reservations recorded at that site will be lost. As part of the stabilization process, the recovering site will form a new virtual partition containing only the local site. Throughout the stabilization process, the recovering site will discard all messages received over the network and will not initiate any sessions with local clients. Once the site has stabilized as a single site partition, a "finder" will attempt to communicate with other known sites in the federation to force a merge recovery. MEDIA FAILURE IN IDHSS Testing for media failure must be the first step in crash site recovery. A stable storage device may experience failure in varying degrees. If any portion of the directory is deemed to be unreadable, the storage system has experienced a media failure. A severe failure would require that the storage device itself be replaced. In such a case the storage system cannot make an automatic recovery becuase all information has been removed. Recovery may be achieved by the intervention of a local client executing an enroll request for each federation in which the site was formerly a member. If the failure renders only portions of the storage device unusuable, the storage system may detect this by attempting to scan the entire IDHSS system directory. System restart always begins with a full scan of the local system directory. The storage system can recover from a partial media failure by building a new system directory based on re-enrollment in each federation in which the site was formerly a member. To carry out this recovery automatically, the storage server must be able to read all of the directory records describing federations and the site and users belonging to those federations. Without this information, the storage server will not know which federations to re-enroll in, which sites to contact, and which users to enroll. The readability of this information is not guaranteed but the probability that it can be read is greatly increased by duplicating the federation, site, and user information in the local system directory. With this approach we must designate one copy to be the primary copy for use during non-failure mode processing. Maintenance of the duplicate copies is not a problem, as the extra copies need to be modified only when a user or site secedes from a federation. If the storage system determines that a partial media failure has occurred, the system will read as many federation, site, and user records as can be located in the system directory. The storage system will then build a new system directory by re-enrolling the local site in each federation found in the old directory. The users found in the old directory will be re-enrolled in their respective federations. ATOMIC EXECUTION OF REQUESTS To ensure atomicity in the event of a site crash the storage system will maintain a write ahead undo/redo log associated with each request currently being processed by the local site. The log for a request will contain the original request message and a report of the new and modified values the request intends to write to the system directory. Log information must be written to stable storage before any of the modifications it reports are written to stable storage. When all locks, name approvals, and implicit alias numbers have been obtained, the local processing of a request consists of nine steps. 1. Calculate all the new and modified values that will be written to the IDHSS system directory. 2. Write the original request message and all intended modifications to an undo/redo log for this request. 3. Commit the undo/redo log information to stable storage by a forced write. 4. Carry out the modifications on the IDHSS system directory. 5. Queue for the ability to multicast the results of this request. 6. When the multicast service has been seized, then ensure that the system directory modifications have been committed to stable storage by a forced write. 7. Initiate a multicast of the results of this request. 8. Notify the requesting client of the outcome of the request. 9. Free the undo/redo log space associated with this request for reuse by new requests. When recovering from a site crash, the storage system must undo and redo requests that have not been committed and multicast to other sites. If the processing of a request has not reached the multicasting step (i.e. step 7), the results of the request should be undone and the request should be re-executed. Once a request has reached the multicast stage, its results are known outside of the local site and must not be undone. If the processing of a request has been committed locally, the results have been multicast to the other sites in the virtual partition, and the requester has received notification of the outcome of the request, the undo/redo log information associated with this request may be deleted from the log. This means that only a minimal amount of stable storage space must be dedicated to the undo/redo log. Undo is accomplished by scanning the current undo/redo log entries. If a request must be undone, the undo information is used to remove each reported modification from the local system directory. Each request whose results are undone is marked for redos. Requests marked for redo will be processed by the system at a later time. Once all of the necessary undo operations have been performed, the storage server will perform a consistency test on the local directory. The consistency test is used in an attempt to detect write failures that occurred at the time of the site crash but did not render any records unreadable. The test requires a logical scan of the IDHSS system directory. Precisely how to perform the scan and what to check for during the scan are determined by the logical organization and the physical organization used to implement the system directory. By physical organization is meant any linkage between directory records that represents solely physical organization concerns. For example, records may be organized into buckets and the buckets may be chained together. The bucket chains determine a physical organization. By logical organization is meant the logical links between records in the directory. For example, each directory record containing information about a IDHSS object must reference another record in the directory specifying that record to contain the information on the principal path for this object. If the record referenced does not contain information about a path, an inconsistency has occurred. Similarly, each record describing a path in a IDHSS object will contain a reference to a record that describes the current instance in that path. The consistency test begins by scanning all physical linkage in the system directory. If the physical linkage is not traversable, the storage server will claim that an undetected write failure has occurred. If the physical linkage is traversable, the logical linkage is traversed and the record types are verified. If a logical link is broken or a record type invalid, the storage server will claim that an undetected write failure has occurred. The logical linkage is on a federation-by-federation basis. If the physical linkage is on a federation-by-federation basis, recovery from an undetected write failure can be achieved by performing a media failure recovery for an individual federation. The storage server caches a list of local users who are members of the ailing federation, free all system directory entries for the federation, and re-enroll the local site and the local users in the ailing federation. If the physical linkage is not on a federation by federation basis, inability to traverse the physical linkage will result in a full media failure recovery. FUTURE INCONSISTENCIES AND ESTABLISHING COMMUNICATIONS Before attempting to establish communications with other sites, the recovering site must ensure that the results recorded in the multicast message log entries are known by the other sites in the federation or will be propagated to the other sites by future merge recoveries. This is accomplished by retagging the results of those requests in the multicast message log with a post-failure virtual partition name. A retag partition name must be constructed for this purpose. The recovering site must know the name of the virtual partition it was a member of at the time of the crash. If this previous partition contained only the crashed site, no new virtual partition should be formed. The retag partition name is formed by subtracting one from the level number of the previous partition level number. If the previous partition was not a singleton, the site should construct a new virtual partition name based on the previous partition name. The retag partition name is formed by subtracting one from the level number of the new partition level number. The retag partition name should be used to retag the results specified by each multicast log entry that has not been retagged previously. Once all necessary undo operations have been performed, all necessary multicast results have been retagged, and the site has committed to a singleton virtual partition, a finder process is started. The finder will cycle through the federations known to the local site and attempt to communicate with sites belonging to those federations. For each federation the finder will send a "liveness" message to one of the member sites, set a timeout for receiving a virtual partition invitation for that federation, and wait. If the invitation arrives within the specified time limit, a merge recovery will be executed. If the time limit expires before an invitation is received, the finder will attempt to communicate with a different site belonging to the federation. If all sites in a federation fail to respond, the recovering site will begin processing as a singleton virtual partition. The first step in processing is to redo all those requests that were marked for redo during the undo phase of crash recovery. Once those requests have been redone, the system will become available for new requests. The finder will periodically send a liveness message to the non-communicative sites in the federation. Appendix 9 presents a complete algorithm for crash site recovery. DETECTION OF FAILURE CORRECTION Two or more virtual partitions may merge only after the recovery of a crashed site or the correction of a communication failure has been discovered by some site. The sending and receiving of liveness messages will be the general mechanism used for detecting the correction of such failures. Continually sending liveness messages among all sites would cause a proliferation of messages on the network. IDHSS sends a liveness message only to those sites that believed to be unable to communicate with. The service request messages and the multicast messages are really a form of liveness message sent exclusively to those sites we believe we can communicate with. The receiving of a liveness message serves as a notification that a failure situation has been corrected. If a site receives a liveness message from another site, the receiving site must attempt to initiate a new virtual partition. Liveness messages are sent and received by a finder process of the storage system. The purpose of the finder is to communicate with sites not in the current virtual partition of a federation. Appendix 10 presents the algorithm used by the finder to detect failure correction. The finder executes periodically. If a federation is in normal processing mode, the finder will try to communicate with sites outside of the current partition. When a federation is performing partition negotiation and recovery, the finder is dormant for that federation. At each execution, the finder cycles through the federations known by the local site, and for each federation the finder determines the set of excluded sites. The excluded sites are those sites that are members of the federation are are not members of the current virtual partition for that federation. A liveness message will be sent to one of the excluded sites. Upon the next execution of the finder, if the set of excluded sites for a particular federation is the same, the finder determines the next site in some ordering of the set and selects it as the destination for a liveness message. It is sufficient to send only one liveness message per federation per execution of the finder because if a merge recovery is initiated, the invitation to join a new virtual partition is always sent to all known sites. Thus, all the sites that are able to merge will be located by using only one liveness message. MERGING VIRTUAL PARTITIONS When two or more virtual partitions merge, each virtual partition may have processed requests that the other partitions have not processed. The requests processed by disparate virtual partitions will be classified as conflicting requests or missing requests. Two requests R1 and R2 "conflict" if they are executed in separate virtual partitions and the intersection of the set of objects and paths altered by R1 with the set of objects and paths altered by R2 is non-empty. For example, two update requests conflict if they update the same instance of user data; two assign requests conflict if they assign the principal path of one IDHSS object to be different paths in that object. All non-conflicting requests are missing requests. The goal of merge recovery is to achieve mutual consistency among the merging sites. This requires the resolution of conflicts presented by conflicting requests and the propagation of the results of all missing requests. To accomplish this, information from the system directory must be exchanged between each pair of unique virtual partitions. This means that there must be one system directory that represents the view held by all sites within a single virtual partition. The partition initiator selects one site from each of the merging virtual partitions to be the representative system directory for that partition. The list of representative sites will be included in the commit message sent by the partition initiator. Table III shows the contents of a virtual partition commit message sent by the virtual partition initiator.
TABLE III
______________________________________
Name of the new virtual partition
For each unique virtual partition history
reported,
a list containing:
The virtual partition history
A list of the sites that reported this
history
The representative site for this
partition history
For each site reported by a site with
this history, one triple containing:
site name
multicast message sequence number
name of a holding site that reported
this sequence number
______________________________________
After the two phase negotiation of a new virtual partition, the third phase of merge recovery will exchange information among the merging sites. Each representative site provides information that represents a mutually consistent view held by all of the sites in the representative's partition. each of the merging virtual partitions can attain mutual consistency with respect to requests that have been multicast by performing the third phase of a divergence recovery among the sites within the partition. Consistency with respect to those requests that have been multicast is not sufficient. The divergence recovery brings each partition to mutual consistency on the level of multicast messages. Each site may have partially processed requests which the other sites known nothing about. This presents a problem for a merge recovery. In a merge recovery, the information provided by a representative site is in the form of copies of records from the representative's system directory. This information is not at the same level of abstraction as multicast messages. All of the merging sites must process the exchanged information at this lowest level of abstraction. To process this information correctly, each site must be internally consistent. That is, no system directory may contain the partial results of requests that have not been multicast yet. During normal request processing, we used mechanisms to achieve the atomic execution of one request with respect to all other requests. A merge recovery is not a request; therefore it circumvents these atomicity mechanisms by looking directly at the suspended state of a system directory. This suspended state must be an internally consistent state. Internal consistency can be achieved by undoing all suspended requests that have not been multicast. Each undone request is marked for redo. A merge recovery may force a set of requests to be undone and redone at a later time. These requests may have obtained locks or name reservations prior to the time merge recovery was initiated. These locks must be released by the merge recovery process; otherwise the request will become hung when it is redone. To avoid hung requests, every site participating in the merge recovery must release, for the federation undergoing recovery, all locally held locks and name reservations. Once each merging virtual partition has achieved mutual consistency and each merging site has achieved internal consistency, the representative sites will scan their local system directory to locate all information that must be exchanged among the merging sites. Every site will send a message to each virtual partition representative requesting the information being provided by each virtual partition. This information is a change list, a list of all the changes that are known in a source partition and are unknown in one or more of the merging partitions. Once all change list information has been processed, each site will commit to the new virtual partition. Once committed to the new partition, the requests marked for redo in the undo/redo log are re-initiated, and new requests are accepted and processed. Appendix 11 presents an algorithm for processing a partition commit message. CONFLICT DETECTION AND RESOLUTION To achieve mutual consistency among a set of merging partitions, the system must detect, propagate, and incorporate missing results and must detect, resolve, and incorporate conflicting results. We will now discuss in detail the algorithm employed by merge recovery to achieve a mutually consistent view of the data stored by the system. MERGING PARTITION HISTORIES In a merge recovery, the representative site for each partition must construct a list of information that will be useful in achieving mutual consistency among the merged sites. The representative site cannot provide a list of all operations executed within the partition represented because the storage system does not maintain an audit trail of system operation. However, an audit trail of system operation is not required and may even be undesirable. When multiple partitions are merging, it is not essential to have knowledge of every operation executed in a virtual partition. What is essential is knowledge of which items have changed value one or more times within a partition, and what the current value of each such item is. Consider the following example. In one partition a client performs an assign operation on an object O. The assign request alters the name mapping of object O to path p1. A short time later in that same partition, object O is assigned to path p2. When a merge is performed, knowledge that object O was assigned to p1 is of little or no use. The important information is that the mapping of object O to its principal path was altered and O currently maps to p2. Using the knowledge that O was assigned to p1 and then p2 would require that each merging site process both of these actions, even though the results of the first are no longer visible. Further, the actions must be performed in order so that all merging sites conclude that object O currently maps to path p2. Therefore, a representative site need only determine which items have changed value unbeknown to other partitions, and what the newest value is. Determining the current value of an item is simple. The current value is stored in the system directory of the representative site and is consistent with the other sites in the represented partition due to the execution of a divergence recovery. Determining which items have changed value unbeknown to the other partitions requires the maintenance of additional information. A tag is associated with each changeable item and the tag value is the name of a virtual partition. When an executing request changes the value of an item, the value of the partition tag associated with that item will also be changed to the name of the current virtual partition. This tag value will be used by the storage system to determine which items have changed without the knowledge of one of the merging partitions. FIG. 10 shows a federation of four sites A, B, C, and D, the partitioning behavior of those sites, and the virtual partition history (VPH) of the three virtual partitions 3A, 5C, and 5D. We wish to investigate a merger of the three partitions. By examining the virtual partition histories, we observe that all of the partitions claim membership in virtual partition 1A. From this we conclude that every item that has not been changed since partition 1A has a value which is known by all of the merging sites. Sites A and B claim membership in virtual partitions 2A and 3A sites C and D do not claim membership in partitions 2A and 3A. From this we conclude that every item that was changed by the site in partitions 2A and 3A has a value which is unknown by sites C and D. Similarly, every item that was changed by the sites in partitions 2D and 3D has a value which is unknown by sites A and B. Site C is the only site that claims membership in partitions 4C and 5C. Thus, each value changed by the site C during the existence of partitions 4C and 5C has a value that is unknown by sites A, B, and D. Similarly, each item changed by site D during the existence of partitions 4D and 5D has a value that is unknown by sites A, B, and C. For the merge recovery which forms partition 7C, we select site B as the representative for partition 3A, site C as the representative for partition 5C, and site D as the representative for partition 5D. Site B will provide information on all items having an associated partition tag value of 2A or 3A. Site C will provide information on all items having an associated partition tag value of 4C or 5C. Site D will provide information on all items having an associated partition tag value of 2D, 3D, 4D, or 5D. Even though the sites in partition 5C and the sites in partition 5D were members of 2D and 3D, the changes made during partitions 2D and 3D must be included in the merge information because these changes are not known by the sites in partit | ||||||
