Global file system and data storage device locks6493804Abstract A system includes shared Small Computer System Interface (SCSI) storage devices for processing clients coupled by a fiber channel interface. The storage devices include storage blocks, and locks controlling their use by clients. Clients issue actions to the storage devices for performing operations on the locks. A client may exclude other clients from using storage blocks using a state element to acquire the lock for shared or exclusive use. If the client modified the data, a version counter in the lock is updated when the lock is released. If an activity bit is set, the version counter is updated upon both reading and writing. Other clients can forcibly release a lock owned by a failed client by monitoring its version counter. Expiration timers associated with the locks allow acquired locks to be released by timing out. A serverless global file system (GFS) manages use of the shared storage resources, and allows remapping of the locks to the storage blocks, for example, based on activity of the locks. Claims What is claimed is: Description COPYRIGHT NOTICE
TABLE 1
Steps For Accessing Files Using NFS
Step Where performed
NFS Client
XDR/RPC Client
TCP/IP Client
Ethernet Network
TCP/IP Server
NFS Server
VFS Server
SCSI Driver Server
SCSI Connection Server
Disk Access Server
GFS File System Structure FIG. 5 is a conceptual diagram illustrating generally, by way of example, but not by way of limitation, one embodiment of a GFS distributed file system mapping structure according to one aspect of the present invention. In FIG. 5, storage capacity in the various subpools of Network Storage Pool 400 is partitioned into several resource groups (RGs), such as resource groups 500-511. According to one aspect of the invention, resource groups 500-511 are designed to distribute file system resources across the aggregation of storage subpools in Network Storage Pool 400. Each storage device 115 can include a single one of resource groups 500-511 or a plurality of the resource groups 500-511. In one embodiment, one or more resource groups 500-511 is distributed (e.g., using striping techniques) across different storage devices 115 or subpools 400A-N within Network Storage Pool 400. FIG. 5 also illustrates a hierarchical or other organizational arrangement such as, by way of example, but not by way of limitation, directory tree 520. File data and metadata may span multiple ones of resource groups 500-511 and/or subpools 400A-N. Directory tree 520 illustrates one example of how files are mapped from a UNIX directory tree 520 to ones of resource groups 500-511 located on different ones of subpools 400A-N. A single file, such as file16, may include portions located on different ones of resource groups 500-511. In the illustrated example, a first portion of file16 (i.e., file16.1) resides on resource group 506 and a second portion of file16 (i.e., file16.2) resides on resource group 507. According to one aspect of the invention, at least one of resource groups 500-511 contains information similar to that of a conventional superblock (e.g., number of blocks in the file system, how many blocks are used for meta data, how many blocks are used for data, bitmaps that indicate which blocks are in use, and the time that the file system was created). In one embodiment, the GFS distributed file system includes a superblock that contains certain information that is not distributed across resource groups 500-511, but is instead stored upon a particular one of resource groups 500-511. Information carried by the superblock includes, for example, the number of clients 105, storage devices 115, or other nodes mounted on the GFS file system. Such information also includes, for example, bitmaps for calculating unique identifiers associated with each node and identifying the particular one of subpools 400A-N on which the GFS file system is mounted. The information carried by the superblock also includes a static resource group index. The resource group index describes the location of each of resource groups 500-511 as well as their attributes and configuration (e.g., their hardware characteristics). According to one aspect of the invention, the data storage capacity of one or a group of storage devices 115 of system 100 is aggregated into shared Network Storage Pool 400. Network Storage Pool 400 is partitioned into resource groups (e.g., illustrated by resource groups 500-511). Resource groups 500-511 are divided into a plurality of data storage capacity units referred to as data blocks, storage blocks, or data storage blocks. The size of the data storage capacity units is userconfigurable (e.g., 512 bytes, 4 kilobytes, 32 kilobytes, etc.). FIG. 6 illustrates generally, by way of example, but not by way of limitation, one embodiment of certain GFS file system constructs in which resource groups 500-511 essentially provide mini-file systems. For example, each of resource groups 500-511 includes at least one information block, at least one data bitmap, and optionally include at least one GFS dinode (disk index node) providing a unique pointer to a particular data file. The sharing of a single storage block 605 by multiple diodes 600 may be inefficient in the distributed GFS file system. As a result, in one embodiment of the present invention, only one dinode 600 resides on a particular GFS file system storage block 605. Dinode 600, however, does not necessarily occupy the entire storage block 605. Dinode 600 includes a header section 610 and a data section 615. Header section 610 includes information about the particular data file. Data section 615 either includes real data (e.g., if the entire file can be fit into the storage block 605 carrying dinode 600) or a set of pointers (categorized as metadata) to other storage blocks 605. According to another aspect of the present invention, a uniform maximum indirection depth is associated with each dinode 600. By contrast, conventional UNIX inodes (index nodes) have a maximum indirection depth that can vary between inodes. For example, FIG. 6 illustrates a dinode 600 having a single level of indirection. In this example, reading data from the file includes reading the dinode 600 and reading the pointers carried in its data section 615. The pointers carried in data section 615 of dinode 600 point to indirect storage blocks 620, which contain pointers to real data storage blocks 625. Thus, only one level of indirect storage blocks 620 must be traversed in order to access real data storage blocks 625. As a result, dinode 600 is referred to as having a single level of maximum indirection. A different level of maximum indirection is also included within the present invention (e.g., 2 or more levels of maximum indirection depth). According to one aspect of the present invention, however, the actual level of maximum indirection associated with each dinode 600 is the same. However, particular dinodes 600 need not utilize the maximum indirection depth. For example, a dinode 600 associated with a very small file may include all of the file's real data within its data section 615, without requiring any reference at all to other storage blocks 605, such as indirect storage blocks 620, and real data storage blocks 625. Storing a file in the data section 615 of dinode 600 is referred to as dinode stuffing. Dinode stuffing compensates for internal fragmentation of data storage resulting from allocating only a single dinode 600 per storage block 605. Another advantage of dinode stuffing is that it allows stuffed files to be transferred with a single storage block request. Directory lookups often benefit from dinode stuffing because each pathname traversal requires at least one directory file read. When directory information is stuffed into a single dinode 600, the number of storage block requests associated with a directory lookup can be decreased by as much as one-half. In one example, the unit size of a storage block 605 is 32 kilobytes, the dinode header 610 requires 128 bytes, and the dinode 600 has 1 level of maximum indirection. In this example, reading a 1 byte file requires a total of 64 kilobytes of storage capacity. This example also requires at least 2 block transfers of data from the disk in order to read the dinode 600 and associated indirect real data storage block 625. By contrast, if the same file is stuffed in the dinode 600, only 32 kilobytes of storage capacity and a single block transfer of data is required. For a 32 kilobyte block size and 128 byte dinode header 610, up to 32,640 bytes can be stuffed in the dinode 600. If the file size increases beyond this, the GFS file system unstuffs the dinode 600. According to one aspect of the invention, the GFS file system assigns UNIX inode numbers based on the disk address of the storage device 115 to which a dinode 600 is assigned. Directories contain file names and accompanying UNIX inode numbers. A directory lookup by the GFS file system matches a file name to an inode number. Then, the GFS file system locates the dinode 600 using the corresponding UNIX inode number. In one embodiment, by assigning disk addresses to UNIX inode numbers, GFS dynamically allocates dinodes 600 from a pool of free storage blocks 605. In summary, UNIX inodes are associated with tree structures with real data storage blocks at the leaves. For a particular UNIX inode, the leaves may be different in height (i.e., there is no uniformity in indirection depth either within the tree associated with a particular UNIX inode, or between trees associated with different UNIX inodes). Similarly, unstuffed GFS dinodes 600 are also associated with tree structures having real data storage blocks 625 at the leaves, as illustrated in FIG. 6. According to one aspect of the present invention, however, all the leaves are uniform in height (e.g., indirection depth of the GFS dinode is relatively uniform both within the tree associated with the particular GFS dinode 600, and between trees associated with different UNIX dinodes). The tree associated with the GFS dinode 600 of FIG. 6 advantageously offers more uniformity in accessing real data. For any file offset, the GFS file system uses the same number of pointer indirections through metadata to reach real data. Uniform tree heights simplify the implementation of the GFS file system and provide regularity in data access times. Moreover, further data access speed is obtainable for small files through dinode stuffing, as explained above. Device Lock Overview FIG. 7 is a block diagram illustrating generally, by way of example, but not by way of limitation, one embodiment of a storage device in which locks are provided for controlling access to data storage blocks. Storage device 115 includes storage media 700 and controller 705. Storage media 700 includes a plurality of N storage blocks 605, which are partitioned into M resource groups 701 in this example. Controller 705 includes a microprocessor or similar circuit configured for operating storage device 115 and communicating with fibre channel interface 200 or other network 110. Controller 705 also includes a controller memory 710, providing at least one data storage device lock mechanism 715 (also referred to as a "lock" or a "device lock") controlling the use of a corresponding storage block 605 by different clients 105. In one embodiment, controller memory 710, carrying locks 715, is faster to access than storage media 700, carrying storage blocks 605. In another embodiment, controller memory 710, carrying locks 715, provides volatile data storage (e.g., RAM) and storage media 700, carrying storage blocks 605, provides nonvolatile data storage. Locks 715 are "acquired" by a particular client 105, thereby usually excluding other clients from accessing the corresponding storage block 605 controlled by the lock 715 until the lock 715 is "released" by the particular client. Locks 715 provide a mechanism for mutual exclusion of storage capacity, such as provided by storage device 115, that is shared by multiple clients 105. As explained below, locks 715 offer a highly parallel mutual exclusion mechanism effected by a plurality of clients 105, rather than by a single centralized server. As also explained below, locks 715 require little overhead storage capacity on the storage device 115. Moreover, locks 715 provide graceful recovery from failures, as explained below. In the embodiments of FIGS. 6 and 7, a lock 715 is assigned to each storage block 605 and a lock 715 is also assigned to each resource group 701. However, there are many other arrangements of assigning locks 715 to storage blocks 605. Such arrangements will be based, in part, on the amount of controller memory 710 that is available for implementing the locks 715. In one embodiment, each lock 715 requires as little as 1 to 4 bytes of controller memory 710, such that controller memory 710 supports thousands of locks 715 (e.g., between 1024 locks and 16384 locks). In such a configuration, the finite number of locks 715 may require that individual locks 715 not be assigned to each file, the number of which may exceed the number of available locks 715. In this case, individual locks 715 may instead be assigned to each storage block 605. In another example, a lock 715 is assigned to each resource group 701 (i.e., locks 715 are not dedicated to individual storage blocks 605). This requires even less controller memory 710. In this way, storage blocks 605 can be grouped together and assigned to a particular lock 715. If the GFS file system includes a superblock, having file system information and carried on one of the storage devices 115, a separate lock 715 can be assigned to the superblock. The GFS distributed file system uses the locks 715 to maintain access control of real data or file system metadata located on shared resources such as shared storage devices 115. Locks 715 are also capable of providing coherence between data that is cached in more than one location. Locks 715 are implemented on the storage devices 115 and accessed by clients 105 with an SCSI command for locking one or more storage blocks or resource groups. This SCSI command is referred to as the DLock command. According to one aspect of the invention, the DLock command is independent of all other commands in the SCSI command set, such that the storage devices 115 supporting the locks 715 have no awareness of the nature of the resource or data that is locked. In other words, it is the GFS distributed file system and not the storage device 115 that maps the locks 715 to the resource controlled. Example 1: Lock Structure FIG. 8A is a block diagram illustrating generally, by way of example, but not by way of limitation, one configuration of an array of locks 715. In this embodiment, the array of locks 715 are indexed by a lock identification (ID) number for identifying particular locks 715 in the array. Each lock 715 includes fields such as Lock[ID].activity, providing an activity element 800, Lock[ID].state, providing a state element 805, and Lock[ID].version, providing a version counter 810. In one embodiment, activity element 800 and state element 805 are each implemented as single bits, referred to as an activity bit and a state bit, respectively. Version counter 810, which is also referred to as a "counter" or "clock", typically includes a plurality of bits (e.g., 22 bits, 30 bits, or any other number of bits). According to one aspect of the invention, each lock 715 controls access to corresponding storage capacity, such as one or more storage blocks 605 or resource groups 701 of storage blocks 605. Example 1: Device Lock Actions According to one aspect of the invention, the SCSI DLock command is initiated by one of the clients 105, and communicated to the storage device 115 over the fibre channel interface 200 or other network 110. Input parameters of the DLock command include a lock identification (e.g., a lock number referred to as ID) and an action. The ID parameter identifies a particular lock 715, in a storage device's 115 array of locks 715, on which to perform an action. The DLock command's action parameter, which is selected from an action set described below, determines the action to be performed on the lock 715. Table 3 describes, by way of example, but not by way of limitation, one action set for performing various actions on the lock 715. In one embodiment, the Lock action checks the state element 805 of the particular lock 715 identified by the lock ID. If the state element 805 is set (i.e., contains a "1" hex), the identified lock 715 has been acquired by one of the clients 105. Other clients 105 are excluded from accessing the one or more storage blocks 605 corresponding to the identified lock 715 if the state element 805 is set. A return parameter (i.e., Return.Result of "0" hex) informs the initiating client 105 that the identified lock 715 has already been acquired. Otherwise, if the state element 805 is
TABLE 3
Example 1 of One Embodiment Of DLock Actions
Action Description of Action
Lock Test-and-Set Action
if (Lock[ID].state = 1)
then
Return.result .rarw. 0
else
Return.result .rarw. 1
Lock[ID].state .rarw. 1
Unlock Clear Action
Return.result .rarw. 1
Lock[ID].state .rarw. 0
if (Lock[ID].activity = 1)
then
Increment Lock[ID].version
Unlock Increment Clear Action
Return.result .rarw. 1
Lock[ID].state .rarw. 0
Increment Lock[ID].version
Reset Lock Conditional Clear Action
if (Lock[ID].version = (input version value))
then
Return.result .rarw. 1
Lock[ID].state .rarw. 0
Increment Lock[ID].version
else
Return.result .rarw. 0
Activity On Turn On Activity Monitor
Lock[ID].activity .rarw. 1
Return.result .rarw. 1
Activity Off Turn Off Activity Monitor
Lock[ID].activity .rarw. 0
Increment Lock[ID].version
Return.result .rarw. 1
No Operation No Operation
Return.result .rarw. 1
Included with each of After Each Action
Lock, Unlock, Unlock Return.state .rarw. Lock[ID].state
Increment, Reset Lock, Return.activity .rarw. Lock[ID].activity
Activity On, Activity Off, Return.version .rarw. Lock[ID].version
and No Operation actions
not set (i.e., contains a "0" hex), the identified lock 715 has not currently been acquired by one of the clients 105. The one or more storage blocks 605 corresponding to the identified lock 715 is available for use by the initiating client 105. In this case, the Lock action sets the state element 805 of the identified lock 715 (i.e., sets Lock[ID].state to a "1" hex). This acquires the identified lock 715 for the initiating client 105, and excluding other clients 105 from accessing the one or more storage blocks 605 controlled by the identified lock 715. A return parameter (i.e., Return.Result of "1" hex) informs the initiating client 105 that it has successfully acquired the identified lock 715, and may access the corresponding one or more storage blocks 605. In one embodiment, the Unlock action is used for releasing an identified, previously acquired lock 715 and, if the activity element 800 is set, updating the version counter 810 of the identified lock 715 (e.g., incrementing Lock[ID].version). The identified lock 715 is released by clearing its state element 805 (i.e., by setting Lock[ID].state to "0" hex). This releases the identified lock 715 and allows access to the corresponding one or more storage blocks 605 by any client 105. If the activity element 800 is set, the version counter 810 is updated (i.e., Lock[ID].version is incremented). A return parameter (i.e., Return.Result of "1" hex) informs the initiating client 105 that it has successfully released the identified lock 715. In one embodiment, the Unlock Increment action is also used for releasing an identified, previously acquired lock 715 and, regardless of whether the activity element 800 is set, updating the version counter 810 of the identified lock 715 (i.e., incrementing Lock[ID].version). The identified lock 715 is released by clearing its state element 805 (i.e., by setting Lock[ID].state to "0" hex). This releases the identified lock 715 and allows access to the corresponding one or more storage blocks 605 by any client 105. A return parameter (i.e., Return.Result of "1" hex) informs the initiating client 105 that it has successfully released the identified lock 715. In one embodiment, the Reset Lock action performs a conditional clear action. The initiating client 105 provides the storage device 115 with two input parameters: an ID and an input version counter value. If the value of the version counter 810 associated with the identified lock 715 matches the input version counter value (i.e., Lock[ID].version=(input version value)), then the identified lock 715 is released by clearing its state element 805 (i.e., by setting Lock[ID].state to "0" hex). This releases the identified lock 715 and allows access to the corresponding one or more storage blocks 605 by any client 105. A return parameter (i.e., Return.Result of "1" hex) informs the initiating client 105 that it has successfully released the identified lock 715. If the value of version counter 810 does not match the input version counter value (i.e., Lock[ID].version.noteq.(input version value)), then the identified lock 715 is not released, and a return parameter (i.e., Return.Result of "0" hex) informs the initiating client 105 that the identified lock 715 was not released. In one embodiment, the Activity On action turns on activity monitoring. The activity element 800 (Lock[ID].activity) of a particular lock 715 is normally not set. According to one aspect of the invention, if the initiating client 105 has not modified its local copy of the data stored on the one or more storage blocks 605 corresponding to the identified lock 715, it uses the Unlock action when it releases the identified lock 715. Since the activity element 800 is not set, the version counter 810 is not incremented. However, if the initiating client 105 has modified its local copy of the data stored on the one or more storage blocks 605 controlled by the identified lock 715, it first writes back the modified data to the corresponding one of more storage blocks 605. Then, the client 105 releases the identified lock 715 using the Unlock Increment action. The Unlock Increment action increments the version counter 810, even though the activity element 800 is not set. In this way, a change in the state of the version counter 810 indicates to other clients 105 that modified data has been stored in the one or more storage blocks 605 controlled by the identified lock 715. Activity monitoring is invoked by the Activity On action, which sets the activity element 800 of the identified lock 715 (i.e., Lock[ID].activity is set to "1" hex) A return parameter (i.e., Return.Result of "1" hex) informs the initiating client 105 that it has successfully invoked activity monitoring. During activity monitoring, version counter 810 of the identified lock 715 is incremented upon execution of either of the Unlock or Unlock Increment actions. A client 105 that repeatedly tries and fails to acquire an already-acquired identified lock 715 can turn on activity monitoring for a predetermined period of time to determine whether: (1) the identified lock 715 is being acquired and released by one or more other clients 105 without data modification; or (2) the identified lock 715 has been acquired by a particular client 105 that has failed, been taken off-line, or has otherwise become indisposed, without first releasing the identified lock 715. In first case, the version counter 810 of the identified lock 715 will show activity (i.e., the version counter will be updated since the corresponding activity element 800 is set). In the second case, the version counter 810 of the identified lock 715 will not have been updated over the monitoring time period. In this case, the client 105 that turned on activity monitoring can then forcibly release the identified lock 715, such as by using the Release Lock action. In one embodiment, the Activity Off action turns off activity monitoring by updating the version counter 810 (e.g., incrementing Lock[ID].version) and clearing the activity element 800 (i.e., setting Lock[ID].activity to "0"). A return parameter (i.e., Return.result of "1") informs the initiating client 105 that the activity element 800 has been successfully cleared. In one embodiment, the No Operation action does not perform any operation on the identified lock 715, but provides a return parameter (i.e., Return.result of "1") to the initiating client upon execution. Moreover, after each of the Lock, Unlock, Unlock Increment, Reset Lock, Activity On, Activity Off, and No Operation actions, return parameters in addition to those described above are provided to the initiating client 105. These additional return parameters include: Return.state, corresponding to the value of the state element 805 of the identified lock 715 (i.e., Lock[ID].state), Return.activity, corresponding to the value of the activity element 800 of the identified lock 715 (i.e., Lock[ID].activity), and Return.version, corresponding to the value of the version counter 810 element of the identified lock 715 (i.e., Lock[ID].version). According to one aspect of the invention, the initiating client 105 is capable of saving these return parameter values, such as for maintaining data coherence between data stored on the storage device 115 and copies of that data stored in a memory cache at the initiating client 105. According to another aspect of the invention, these return parameter values are saved for subsequently providing an input version counter value during activity monitoring. Activity monitoring is particularly useful for prioritizing access to storage capacity or for providing failure recovery, such as when a client 105 has acquired a particular lock 715, but its associated version counter 810 has not been updated for an extended period of time. According to one aspect of the invention, the version counter 810 values will periodically "roll-over" from a maximum value to zero because of a finite number of bits of version counter 810. In one example, offered for illustration, if the number of bits of version counter 810 is between 7 and 16, it is possible for an often-accessed lock 715 to roll-over more than once per second. In another example, also offered for illustration, if the DLock command requires 1 millisecond to execute a Lock action and 1 ms to execute an Unlock action, then at least 2 milliseconds is needed to increment the value of version counter 810. In this case, if version counter 810 includes 22 bits, then 2.sup.22 such 2 millisecond intervals, or 2.33 hours, is the known minimum time between rollover occurrences of version counter 810. On the other hand, if version counter 810 includes 32 bits, then 2.sup.32 such 2 millisecond intervals, or 12 days, is the known minimum time between rollover occurrences of version counter 810. According to one aspect of the invention, such long durations between rollover occurrences ensures that rollover occurrences are not difficult to detect. In one embodiment, clients 105 do not assume that the version counter 810 value is slowly growing and that roll-over has not occurred. Instead, clients 105 determine whether roll-over has occurred by timing the access of each lock 715. If, upon accessing an identified lock 715, its version counter 810 value differs from its previous value during a prior access by an amount that is less than a difference in version counter 810 values that is known to correspond to a minimum roll-over time, then it is known that roll-over did not occur during the interim time period between accesses. According to another aspect of the invention, these return parameter values are saved for measuring activity of particular ones of the locks 715 for load balancing, such as between different shared storage devices 115. For example, if the locks 715 of a particular shared storage device 115 are more active than the locks 715 of a different shared storage device 115, the stored data can be redistributed across the storage devices 115 to balance or optimize use of the shared storage devices 115. Similarly, load balancing can be carried out on a single shared storage device by redistributing data across storage blocks 605, or redistributing storage blocks 605 across the locks 715. This is useful when some locks 715 are accessed more often than other locks 715. According to one aspect of the invention, remapping of the locks 715 to the storage blocks 605 is performed by the GFS file system, such as for load balancing based on the activity of particular locks 715 with respect to activity of other locks 715. Example 1: Device Lock Operation According to one aspect of the invention, the GFS distributed file system uses the locks 715 to maintain consistency between data stored in the storage devices 115 and copies of the data cached in the main memories of the clients 105. For example, particular ones of locks 715 are assigned to storage blocks 605 that store file system resources such as metadata. In another example, locks 715 are assigned to storage blocks 605 storing entire data files. Before a client 105 is allowed to access the data stored in one or more storage blocks 605 assigned to a particular lock 715, the client 105 identifies and acquires that lock 715 using the Lock action, as described above. When the client 105 is finished accessing this data, the client 105 releases the identified lock 715. In releasing the identified lock 715, the Unlock action is typically used if the client has not written back modified data to the storage device 115. Otherwise, the Unlock Increment action is used in releasing the identified lock 715, such that its associated version counter 810 is updated, thereby signifying to other clients 105 that the data was modified. According to another aspect of the invention, after releasing the identified lock 715, the client 105 caches the data, such as in its system memory. In this case, it may not be known whether the data cached at the client 105 is consistent with the data residing on the storage device 115. Even if the client 105 has not modified its cached data, other clients 105 may have subsequently modified the data residing on the storage device 115 after the client 105 released the identified lock 715. One way in which the client 105 determines the consistency of its cached data with that on the storage device 115 is upon again acquiring the identified lock 715 using the Lock action. If, upon reacquisition of the identified lock 715, its version counter 810 value is the same as during the previous acquisition of the identified lock 715 by the same client 105, then the data stored in the corresponding one or more storage blocks 605 has not been modified during the intervening time period. If the version counter 810 value of the identified lock 715 upon reacquisition is different than the version counter 810 value during the previous acquisition of the identified lock 715 by the same client 105, then the data stored in the corresponding one or more storage blocks 605 may have been modified during the intervening time period. According to one aspect of the invention, the client 105 rereads the data from the storage device 115 if the version counter 810 value, upon reacquisition of the identified lock 715, differs from the version counter 810 value saved by the client 105 during its previous acquisition of the identified lock 715. FIG. 9 is a table illustrating generally, by way of example, but not by way of limitation, one possible sequence of events undertaken by a first client 105A and a second client 105B in accessing shared data. The shared data is stored in one or more storage blocks 605 corresponding to a particular identified lock 715 residing on a particular storage device 115. Actions by each of first client 105A and second client 105B are listed. Such actions initiate operations at the storage device 115 executed in response to DLock commands. These actions are communicated by the first client 105A and second client 105B to the storage device 115 for execution by its controller 705. Other listed actions indicate whether the shared data was modified. Also listed are return parameters that are communicated back to the initiating one of the first client 105A and second client 105B upon execution of any of the Dlock actions. First client 105A and second client 105B are capable of storing these return parameters in local memory. These return parameters include Return.state and Return.version, corresponding to the values of the state element 805 and version counter 810, respectively, of the identified lock 715. FIG. 9 also indicates whether the data cached locally at each of the first client 105A and second client 105B is consistent with the data stored at the storage device 115 in the one or more storage blocks 605 controlled by the identified lock 715. FIG. 9 also indicates the value of the state element 805 and version counter element 810 of the identified lock 715. In this example described below, the activity element 800 of the identified lock 715 is set to "0" hex (i.e., activity monitoring is turned off). Time 900 sets forth an initial state in which the state element 805 is "0" hex, indicating that the lock 715 is available for acquisition by any client 105. The version counter 810 is also "0." Each of first client 105A and second client 105B assumes the version counter 810 to have rolled over, as described above, and as indicated by the Return. state of "X." At time 901, first client 105A acquires the lock 715, setting its state element 805 to "1," and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Return.version="0" hex) to the first client 105A. Since Return.version at time 901 is different than its previous value at time 900, the data at first client 105A is deemed not consistent with the data at storage device 115. According to one aspect of the invention, first client 105A optionally rereads from storage device 115 and updates its local copy of the data to be consistent with the data at storage device 115. At time 902, first client 105A does not modify the shared data stored in the one or more storage blocks 605 corresponding to the identified lock 715. Since the shared data was not modified, first client 105A releases the identified lock 715 at time 903 using the Unlock action, which does not increment version counter 810, but which resets the state element 805 to "0" hex. Execution of the Unlock action at time 903 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="0" hex) to the first client 105A. The released identified lock 715 is available for acquisition by other clients 105. At time 904, second client 105B acquires the lock 715, setting its state element 805 to "1" hex, and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Retum.version="0" hex) to the second client 105B. Since, at time 904, Return.version of second client 105B is different than its previous value at time 903, the data at second client 105B is deemed not consistent with the data at storage device 115. According to one aspect of the invention, second client 105B optionally rereads from storage device 115 and updates its local copy of the data to be consistent with the data at storage device 115. At time 905, second client 105B does not modify the shared data stored in the one or more storage blocks 605 controlled by the identified lock 715. Since the shared data was not modified, second client 105B releases the identified lock 715 at time 906 using the Unlock action, which does not increment version counter 810, but which resets the state element 805 to "0" hex. Execution of the Unlock action at time 906 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="0" hex) to the second client 105B. The released identified lock 715 is available for acquisition by other clients 105. At time 907, second client 105B again acquires the lock 715, setting its state element 805 to "1," and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Return.version="0" hex) to the second client 105B. Since Return.version at time 907 is the same as its previous value at time 906, the data at second client 105B is deemed consistent with the data at storage device 115. According to one aspect of the invention, second client 105B does not update its local copy of the data since it is already consistent with the data at storage device 115. At time 908, second client 105B modifies the shared data stored in the one or more storage blocks 605 corresponding to the identified lock 715. Since the shared data was modified, second client 105B releases the identified lock 715 at time 909 using the Unlock Increment action, which increments version counter 810 and resets the state element 805 to "0" hex. Execution of the Unlock Increment action at time 906 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="1" hex) to the second client 105B. The released identified lock 715 is available for acquisition by other clients 105. At time 910, first client 105A acquires the lock 715, setting its state element 805 to "1" hex, and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Return.version="1" hex) to the first client 105A. Since Return.version at time 910 is different than its previous value at time 909, the data at first client 105A is deemed not consistent with the data at storage device 115. According to one aspect of the invention, first client 105A optionally rereads from storage device 115 and updates its local copy of the data to be consistent with the data at storage device 115. At time 911, first client 105A modifies the shared data stored in the one or more storage blocks 605 controlled by the identified lock 715. Since the shared data was modified, first client 105A releases the identified lock 715 at time 912 using the Unlock Increment action, which increments version counter 810 and resets the state element 805 to "0" hex. Execution of the Unlock Increment action at time 912 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="2" hex) to the first client 105A. The released identified lock 715 is available for acquisition by other clients 105. At time 913, second client 105B acquires the lock 715, setting its state element 805 to "1" hex and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Return.version="2" hex) to the second client 105B. Since Return.version at time 913 is different than its previous value at time 912, the data at second client 105B is deemed not consistent with the data at storage device 115. According to one aspect of the invention, second client 105B optionally rereads from storage device 115 and updates its local copy of the data to be consistent with the data at storage device 115. At time 914, second client 105B does not modify the shared data stored in the one or more storage blocks 605 corresponding to the identified lock 715. Since the shared data was not modified, second client 105B releases the identified lock 715 at time 915 using the Unlock action, which does not increment version counter 810, but which resets the state element 805 to "0" hex. Execution of the Unlock action at time 915 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="2" hex) to the second client 105B. The released identified lock 715 is available for acquisition by other clients 105. At time 916, first client 105A acquires the lock 715, setting its state element 805 to "1" hex, and returning the present value of each of the state element 805 (i.e., Return.state="1" hex) and the version counter 810 (i.e., Return.version="2" hex) to the first client 105A. Since Return.version at time 916 is the same as its previous value at time 915, the data at first client 105A is deemed consistent with the data at storage device 115. According to one aspect of the invention, first client 105A does not update its local copy of the data since it is already consistent with the data at storage device 115. At time 917, first client 105A does not modify the shared data stored in the one or more storage blocks 605 controlled the identified lock 715. Since the shared data was not modified, first client 105A releases the identified lock 715 at time 918 using the Unlock action, which does not increment version counter 810, but which resets the state element 805 to "0" hex. Execution of the Unlock action at time 918 also returns the updated present value of each of the state element 805 (i.e., Return.state="0" hex) and the version counter 810 (i.e., Return.version="2" hex) to the first client 105A. The released identified lock 715 is available for acquisition by other clients 105. Example 1: Failure Recovery Using Device Locks As set forth above, device locks 715 are distributed across various storage devices 115 instead of at a centralized server. As a result, the present invention avoids problems associated with failures of a centralized server. Failure of particular storage devices 115 or clients 105, however, is still possible. Failure or power-on of the storage device 115 will clear the activity element 800, state element 805, and version counter element 810 of each lock 715 in the volatile controller memory 710 on the failing storage device 115. By contrast, a SCSI Reset command received by the storage device 115 will not affect the activity element 800, state element 805, and version counter element 810 of the locks 715 on that storage device 115. After failure or power-on of storage device 115, the storage device 115 sends a SCSI Unit Attention status to notify clients 105 or other nodes that the locks 715 have been cleared. When a client 105 or other node receives a SCSI Unit Attention status, the client 105 checks previously acquired locks 715 on storage devices 115 to see if they are still valid (i.e., that state element 805 is still set and the version counter 810 state is unchanged). The client 105 will re-acquire any locks 715 that may have been lost, such as by failure or power-on of the storage device 115. Similarly, a client 105 or other node may fail, be taken off-line, or have its power cycled. A client 105 that has already acquired a lock 715, but then fails, could potentially leave the lock 715 in a locked state indefinitely. In order to avoid this problem, the present invention allows other clients 105 to forcibly release such locks 715, such as by using the Reset Lock action described above. A client 105, attempting to acquire a particular lock 715 that has already been acquired by a failed client 105, can determine the status of the particular lock 715 by turning activity monitoring on using the Activity On action, described above. With activity monitoring turned on, the version counter 810 is updated for both Unlock and Unlock Increment actions. If, after waiting for a predetermined extended period of time, the version counter 810 value is unchanged (i.e., shows no activity), then the client 105 deems the corresponding lock 715 as being owned by a failed client 105. In this case, the client 105 turns off activity monitoring using the Activity Off action, described above. The client 105 also forcibly releases the lock 715 using the Reset Lock action, described above. A client 105 should exercise care (e.g., by selecting an appropriately long predetermined activity monitoring time period) when forcibly releasing a lock 715 owned by another client 105. The other client 105 may have failed, or may be in a hung state from which it will eventually return, believing it still owns the lock 715. According to one aspect of the invention, the forcible release of a lock 715 using the Reset Lock action compares the present value of version counter 810 with an input version counter value provided by the client 105 that is forcibly releasing the lock 715. The lock 715 will be cleared only if the client 105 can identify the present value of the version counter 810. This resolves a situation in which two clients 105 are each trying to forcibly release the same lock 715. One of these clients 105 will manage to forcibly release the lock 715 first, thereby updating its version counter 810. This will prevent the other client 105 from also releasing the lock. Since the version counter 810 value has changed, it will no longer match the input version counter value provided by the subsequent client 105. Thus, the subsequent client's Reset Lock action is ignored by the storage device 115. Example 1: Device Lock Command According to one aspect of the invention, clients 105 access particular locks 715 on the storage devices 115 using a new Device Lock (DLock) command that is added to the standard SCSI command set. The SCSI DLock command has several input parameters that indicate to the storage device 115 what action to perform, including actions performed on the activity element 800, state element 805, and version counter element 810 of the locks 715. The DLock command includes a sequence of several bytes defining these input parameters. In one embodiment, the first byte of the DLock command identifies it as the DLock command, distinguishing this new SCSI command from existing SCSI commands. Table 4 lists an example of hexadecimal codes used to identify the SCSI DLock and Mode Page commands. Table 5 lists one embodiment, by way of example, but not by way of limitation, of a sequence of bytes comprising a DLock command.
TABLE 4
Codes Defining SCSI Commands
Function Hexadecimal Code Value
DLock A0
Mode Page 29
In Table 5, a hexadecimal (hex) value A0 in the first byte in the DLock Command (Byte 0) identifies it as the DLock command. Table 6 lists one embodiment, by way of example, but not by way of limitation, of hexadecimal codes used in the lower nibble of Byte 1 to specify which of the DLock actions of Table 3 should be executed by the storage device 115. These actions include Lock, Unlock, Unlock Increment, Reset Lock, Activity On, Activity Off, and No Operation. In Table 5, Bytes 2-5 specify the Lock Identification (ID) number of the particular lock 715 on which the DLock action operates. In Table 5, Bytes 6-9 specify an input version counter value parameter, such as needed for the Reset Lock action described above. Byte 10 of the DLock command specifies an allocation length (e.g., the maximum number of data bytes to be returned to the initiating client 105).
TABLE 6
Action Codes for defining a particular DLock command of Example 1
Action Hexadecimal Code Value
No Operation 0
Lock 1
Unlock 2
Unlock Increment 3
Reset Lock 4
Activity On 5
Activity Off 6
Reserved 7 through F
The SCSI command set allows the storage device 115 to return SCSI-defined Sense Data information. The Sense Data information is returned either in response to the SCSI Request Sense command, or upon occurrence of certain conditions resulting from execution of other SCSI commands, such as the DLock command. If the DLock command contains an error, e.g., an illegal action in the lower nibble of Byte 1, or Bytes 2-5 identify an unsupported lock 715, the storage device 115 provides Sense Data information indicating the nature of the problem with the DLock command. Table 7 illustrates generally, by way of example, but not by way of limitation, one embodiment of returned data provided to client 105 by storage device 115 upon execution of one of the actions listed in Table 6 and described more particularly in Table 3. In the first byte (Byte 0) of the returned data, Bit 7 carries the Return.Result return parameter, which is set if execution of the action was successful. Bit 6 of Byte 0 carries the Return.State return parameter, which indicates whether the lock 715 is in an acquired state after execution of the action. Bit 5 of Byte 0 carries the Return.Activity return parameter, which is set if activity monitoring is turned on after execution of the action. Bytes 1-4 carry the Return.version return parameter, which indicates the value of the version counter 810 after execution of the action.
TABLE 7
Returned Data
Byte Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
0 Result State Activ- Reserved
ity
1 Returned Version Value
2 (4 bytes)
3
4
5 Reserved
6 Reserved
7 Reserved
The SCSI Mode Sense and Mode Select commands allow access to and modification of a SCSI-defined Device Locks mode page on the storage device 115. Table 8 illustrates one embodiment, by way of example, but not by way of limitation, of a sequence of bytes comprising a Device Locks mode page. According to one aspect of the invention, the Mode Sense and Mode Select commands are used for configuring the device locks 715 on a storage device 115. Controller memory 710 of each storage device 115 typically includes several SCSI-defined "pages" of configuration data. One such page is a Device Locks mode page accessed and updated by the Mode Sense and Mode select commands. The Device Locks mode page allows storage of configuration data unique to a particular type of storage device 115. According to one aspect of the invention, the Device Locks mode page is also used for configuring the device locks 715 on a storage device 115. The Device Locks mode page is identified by a hexadecimal code value of 29 in Byte 0 of the Device Locks mode page, as illustrated in Table 8. Bit 7 of Byte 0 also includes a Parameter Savable (PS) bit. The PS bit is set to indicate this page may be saved on storage device 115. A Mode Sense command is used to read the current Device Locks mode page on storage device 115. A Mode Select command is used to write the Device Locks mode page to storage device 115.
TABLE 8
SCSI Device Locks mode page (Hexadecimal values).
Byte Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
0 PS Vendor Unique Page Code=29
1 Page Length =06
2 Lock Size (Encoded)
3 Supported Lock Sizes (Encoded)
4 Number of Supported Locks
5 (4 bytes)
6
7
In Table 8, Byte 1 defines the length of the configuration data in the Device Locks mode page. Byte 2 represents an encoded value that defines the presently selected Lock Size (i.e., defining the number of bits in its version counter 810 of each lock 715 on the storage device 115). One example encoding of the Lock Size is illustrated in Table 9, in which a hexadecimal value of 01 represents a 22 bit version counter 810, while a hexadecimal value of 02 represents a 30 bit version counter 810. In Table 8, Byte 3 represents an encoded value that defines other Supported Lock Sizes available on the storage device 115. One example of encoding the Supported Lock Sizes is illustrated in Table 10, in which Bit 0 is set for a 24 bit version counter 810, and Bit 1 is set for a 32 bit version counter 810, and "x" represents a "don't care" for the corresponding bit position. In Table 8, bytes 4-7 describe the number of locks 715 supported by the target storage device 115. The number of supported locks 715 depends on the Lock Size and the storage capacity available in the controller memory 710 of the target storage device 115.
TABLE 9
Example of Encoding Lock Size in Device Locks mode page
Version
Activity Counter
Binary State Element Element 800 Element 810 Total Size of
Encoding 805 Size Size Size Lock 715
00000001 1 bit 1 bit 22 bits 24 bits
00000010 1 bit 1 bit 30 bits 32 bits
Treatment of power cycles, resets, SCSI Mode Select commands, and other conditions is described below. In response to a SCSI Mode Select command identifying the Device Locks mode page, the storage device 115 issues a SCSI Unit Attention. The storage device 115 also clears the state element 805 and zeros the version counter 810 of each of its locks 715. In response to a power cycle or power-on of the storage device 115, the storage device 115 issues a SCSI Unit Attention indicating a power-on condition. The storage device 115 also zeros the version counter 810 of each of its locks 715. In response to a SCSI Reset, the storage device 115 issues a SCSI Unit Attention indicating a power-on condition. The values of the activity element 800, state element 805, and version counter 810 of each of the locks 715 on the storage device 115 are not affected. In response to a SCSI Bus Device Reset, Task Management, or Target Reset commands, the values of the activity element 800, state element 805, and version counter 810 of each of the locks 715 on the storage device 115 are not affected. Example 1: Test System The GFS file system was tested using a system 100 having two clients 105 and two shared storage devices 115. One of the clients 105 included a POWER ONYX computer, sold by Silicon Graphics, Inc. (SGI), of Mountain View, Calif., having four 194 Megahertz MIPS R10000 processors. The other client 105 included a POWER CHALLENGE computer, also sold by SGI, having eighteen 75 Megahertz MIPS R8000 processors. Each of the clients 105 included 2 Gigabytes of 8-Way interleaved memory and a HIO-FC dual-ported host adapter, sold by Prisa Networks of San Diego, Calif. The storage devices 115 included two RF 7000 series disk arrays, sold by Ciprico, Inc. of Plymouth, Minn. Each storage device 115 disk array used a RAID-3 configuration with nine (8 data+1 parity) BARRACUDA 9 (ST19171 WC) disks, sold by Seagate Technology, Inc. of Scotts Valley, Calif. Two different fibre channel interface 200 configurations were used to implement the network 110 connection. The clients 105 and storage devices 115 were first connected using a single arbitrated loop fibre channel interface 200 having a maximum transfer rate of 100 Megabytes/second. The second configuration provided two arbitrated loop fibre channel interfaces 200. Each storage device 115 was coupled to its own fibre-channel interface 200. Each client 105 was coupled to both fibre-channel interfaces 200. This scheme allows the clients 105 access to each storage device 115 with a total network bandwidth of 200 Megabytes/second. In one test, the single arbitrated loop configuration obtained data transfer rates of approximately 60 Megabytes/second, and the configuration using two arbitrated loops obtained data transfer rates of approximately 108 Megabytes/second. These data rates compare favorably with more costly conventional systems. Example 2: Lock Structure FIG. 8B illustrates generally, by way of example, but not by way of limitation, another embodiment of portions of the present system including device locks described in this Example 2. The device locks of Example 2 have lock structure similar to that described above with respect to Example 1. In Example 2, however, each lock includes a state element 805 (Lock[ID].state) that can be: Unlocked (also referred to as "U," i.e., having a value of "0" hex), Locked Shared (also referred to as "S," i.e., having a value of "1" hex), or Locked Exclusive (also referred to as "E," i.e., having a value of "2" hex). Moreover, in Example 2, each lock 715 includes an associated expiration timer 820 and an expiration status element (Lock[ID].expired) 825. The expiration status element 825 indicates whether the expiration timer 820 has expired and, if so, whether it has expired from an exclusive or shared locked acquisition. Furthermore, in Example 2, each lock 715 includes a field indicating the number of current lock holders (Lock[ID].holders), and a World Wide Name (WWN) list 815 identifying all clients 105 sharing use of the lock 715. Also, the returned data of Table 7 includes a Return Data Expired field and a WWN Data field. These aspects of Example 2 are described below. Example 2: Device Lock Actions The device lock actions of Example 2 include Lock Exclusive, Lock Shared, Force Lock Exclusive, Touch Lock, and Report Expired actions, one embodiment of which is described below, by way of example, but not by way of limitation, in Table 11. In one embodiment, the Lock Shared action checks the state element 805 of the particular lock 715 identified by the lock ID. If the state element 805 is set to "E," the identified lock 715 has been exclusively acquired by one of the clients 105. If, upon checking the WWN list 815 for the identified lock 715, it is held exclusively by a client 105 that is different from the initiating client 105, then a return parameter (i.e., Return.Result of "0" hex) informs the initiating client 105 that the identified lock 715 has already been acquired exclusively by another client 105. 10 Otherwise, if checking the WWN list 815 reveals that the identified lock 715 is held exclusively by the initiating client 105, state element 805 is set to "S," and the expiration timer 820 associated with the identified lock 715 is reset. Upon execution of the Lock Shared action, if the state element 805 is "S" and if the number of clients 105 sharing use of the lock 715 is less than a maximum allowable number (MaxHolders), then the initiating client 105 acquires identified lock 715 for shared use by adding the WWN of client 105 to the WWN list 815 of the identified lock 715. The expiration timer 820 for the identified lock 715 is reset. If the number of clients 105 sharing use of the lock 715 equals the maximum allowable number (MaxHolders), then a return parameter (i.e., Return.Result of "0" hex) informs the initiating client 105 that the identified lock 715 is already shared by the maximum allowable number of clients 105. Upon execution of the Lock Shared action, if the state element 805 is "U," then the number of lock holders is set to 1, the WWN of the initiating client 105 is added to the WWN list 815 of the identified lock 715, the expiration timer 820 of the identified lock 715 is reset, and a return parameter (i.e., Return.Result of "1" hex) is returned to the initiating client 105. If the expiration timer 820 of the identified lock 715 has expired from an exclusive acquisition, then state element 805 of the identified lock 715 is set to "E," otherwise the state element 805 of the identified lock is set to "S." In one embodiment, the Lock Exclusive action checks the state element 805 of the particular lock 715 identified by the lock ID. If the state element 805 is set to "E," the identified lock 715 has been exclusively acquired by one of the clients 105. Other clients 105 are excluded from accessing the one or more storage blocks 605 corresponding to the identified lock 715 if the state element 805 is set to "E." A return parameter (i.e., Return.Result of "0" hex) informs the initiating client 105 that the identified lock 715 has already been acquired exclusively by another client 105. Upon execution of the Lock Exclusive action, if the state element 805 is "S," and the identified lock 715 indicates only one current lock holder which, from the WWN list 815 of the identified lock 715, is identified as the initiating client 105, then the state element 805 of the identified lock 715 is set to "E," the expiration timer of the identified lock 715 is reset, and a return parameter (i.e., Return.result="1" hex) informs the initiating client 105 that the identified lock 715 has been acquired exclusively. Otherwise a return parameter (i.e., Return.result="0" hex) informs the initiating client 105 that the identified lock 715 could not be exclusively acquired. Upon execution of the Lock Exclusive action, if the state element 805 is "U," then the state element 805 of the identified lock 715 is set to "E," the number of lock holders of the identified lock 715 is set to 1, the WWN of the initating client 105 is added to the WWN list 815 of the identified lock 715, the expiration timer 820 of the identified lock 715 is reset, and a return parameter (i.e., Return.result="1" hex) informs the initiating client 105 that the identified lock 715 was exclusively acquired. In one embodiment, the Force Lock Exclusive action includes, as an input parameter, a previously-read value of the version counter 810 of a particular lock 715. In one example, the Force Lock Exclusive action is executed after waiting for a predetermined period of time following the earlier reading of the version counter 810 of the lock 715. If the state element 805 of the identified lock 715 is set to "S"
TABLE 11
Example 2 of One Embodiment of DLock Actions
Action Description of Action
Lock Shared if (Lock[ID].state = U = 0 hex) then
If Lock[ID].expired = ExpiredFromLock
Exclusive
then
Lock[ID].state .rarw. (E = 2 hex)
else
Lock[ID].state .rarw. (S = 1 hex);
Return.result .rarw. 1
Lock[ID].holders .rarw. 1
Add WWN to list
Reset expiration timer;
if (Lock[ID].state = S = 1 hex) then
if (Lock[ID].holders < MaxHolders) then
Return.result .rarw. 1
Increment Lock[ID].holders
Add WWN to list
Reset Expiration Timer;
else
Return.result .rarw. 0;
if (Lock[ID].state = E = 2 hex) then
if Lock[ID].wwn[0] = wwn then
Return.result .rarw. 1
Lock[ID].state .rarw. S = 1 hex
Reset expiration timer
else
Return.result .rarw. 0;
Lock Exclusive if (Lock[ID].state = U = 0 hex) then
Lock[ID].state .rarw. (E = 2 hex)
Return.result .rarw. 1
Lock[ID].holders .rarw. 1
Add WWN to list
Reset expiration timer;
if (Lock[ID].state = S = 1 hex) then
if (Lock[ID].holders = 1 and Lock[ID].wwn[0]
=wwn | ||||||
