Parallel file system and method with byte range API locking5999976Abstract A computer system having a shared disk file system running on multiple computers each having their own instance of an operating system and being coupled for parallel data sharing access to files residing on network attached shared disks. Access to a file by a processor node is controlled by tokens transferred to the node from a token manager. To prevent another processor node from removing a token after the token has been received, but before it performs the operation on the file, each process can lock the token after it has been received. A node with a token can lock a byte range of a file, which byte range may include all or only some of byte range cornered by the token. Claims What is claimed is: Description FIELD OF THE INVENTION
TABLE 5
______________________________________
ro ww xw
______________________________________
ro **
ww ** **
xw ** ** **
______________________________________
For the metanode, we devised the following algorithm: when a node opens a file for the first time, it tries to acquire the metanode token in mode "ww". The token manager TM grants the token in "ww" if it can, i.e., if no other node holds the token in "ww" or "xw". If this happens, the node becomes the metanode manager However, if another node holds the token in "ww", then the TM grants the token in "ro". Then the node knows that another node is the metanode. It can query the TM find out who the metanode for this file is. There are situations when a node must become a metanode. In this case, asking for a "ww" token will not help since the old metanode will not downgrade its token. Here the node that wishes to become the metanode asks for an "xw" token. This will cause a revoke message to be sent to the existing metanode. The old metanode will then downgrade its token to "ro" and the TM will return a "ww" token to the new metanode. If a node asks for an "xw" token and no other nodes hold this token at all, then TM will grant the token in that mode. If a node holds the token in "xw", then it is the metanode for this file, but in addition, no other node has this file open. In this case, if a node tries to acquire the token in "ww", a revoke message is sent to the metanode. As a result, the node downgrades its "xw" token to "ww", and the TM is thus able to grant a "ro" token to the new node. Using Enhanced Token Modes for Controlling the File Size PO997074-po8970068 The relevant file system standards require that the correct file size be available on demand; however the maintenance of file size in parallel at all nodes in the presence of multiple applications appending data to the file is complicated and costly in terms of performance. The next of this series of features describes our way of maintaining file size so it is available when needed without constant overhead. In doing so a parallel file system where all disks that make up the file system can independently be accessed by multiple processors can be exploited with a file shared by multiple processors for both reading and writing without a constant overhead. Read & write sharing of files involve accessing the file's size. Every read and write needs to check if the operation's offset is beyond the current file size, and return an EOF (end-of-file) if it is. Every write needs to check if the operation's offset is beyond the current EOF, and if it is, it should extend it. When there are several readers and writers, all this has to be consistent. Thus, if one node writes at offset 1000, a read by any node at that location should not return an EOF. One way of keeping a consistent state is to serialize the accesses to the file's size. This, however, will present a major bottleneck for parallel writers, since each write (and read) will need to get the current file size before each operation. In our preferred embodiment we keep a local copy of the file size within each node. Also, together with each copy, a lock mode is kept. A lock manager assures that lock modes that conflict do not co-exist. An appropriate lock mode for each read and write operation assures that the locally cached file size is accurate enough for a correct result of this operation. The different modes are: "rw" for operations that Read and Write within the locally cached file size "rf" for operations that Read beyond the locally cached File size "wf" for operations that Write beyond the locally cached File size "wa" for Write operations that Append to the file "xw" for operations that reduce the file size (like truncate), and thus need an exclusive Write lock. The conflict table of the file size's lock modes is:
TABLE 6
______________________________________
rw rf wf wa xw
______________________________________
rw **
rf ** ** **
wf ** ** **
wa ** ** ** **
xw ** ** ** ** **
______________________________________
Whenever a node upgrades its lock mode, it reads the new file size from a special node that keeps track of the file size (the metadata node, or metanode for short). Whenever a node downgrades its lock mode, it sends its file size to the metanode. The metanode itself keeps a file size which is a maximum of all the file sizes that it received (except when a node locks the file size in the "xw" mode, which allows reducing the file size). Some modes only allow reading the file size (rw rf). Some modes (wf, wa) allow increasing the file size. One mode (xw) allows to decrease the file size. The true file size is the maximum of all the local copies of the file sizes that the nodes hold. Operations that read or write within the locally cached copy of the file size, need an "rw" lock on the file size. Operations the read beyond the locally cached copy of the file size, need to ensure that the file size did not increase since they last read the file size. Thus, they need to acquire an "rf" lock (which conflicts with modes that increase the file size). Operations that increase the file size acquire either a "wf" or "wa" lock. A "wf" lock is needed if the writer knows the new absolute file size. A "wa" lock is needed for APPEND operations. An APPEND operation writes at the current EOF. Thus, several APPEND operation will write one at the end of the other. Thus, "wa" conflicts with itself since one APPEND operation should wait for other APPEND operations. The only mode that allows decreasing the file size is "xw". This is an exclusive mode which will cause all other nodes to relinquish their locks and thus lose the locally cached file size. Thus, after the node that acquired the "xw" finishes its operation (for example, a file truncate), all the nodes will have to get the new file size from the metanode. We are not aware of a system where different file sizes are cached at different nodes so that parallel write sharing of the file is maximized, and yet the system presents a consistent view of the file for all users. The solution allows users on different nodes to extend the file and thus to achieve a very high degree of write sharing. Write operations do not need to be serialized even if the users extend the file size. Smart Caching of Byte Range Tokens Using File Access Patterns PO997063-po8970070 The next of our parallel write developments addresses the locking used for all accesses; parallel and non-parallel. Locking only the portion of the file that is required immediately is expensive and would require calls to the lock manager with every application call. This algorithm attempts to anticipate the requirements of the application considering what else is going on in the system and to minimize the number of token manager calls. For parallel reading and writing to the same file, in order to serialize accesses to the same regions in a file, a distributed lock mechanism is used. However, getting such a lock usually requires that a token will be acquired first, and this is considered an expensive operation. Thus, it would be beneficial to cache tokens at a node by anticipating the access patterns of the file. On the other hand, acquiring a token that is not needed might reduce performance since this token would be needed by another node. This disclosure describes the algorithm by which a node acquires a token so as to maximize performance by anticipating the file's access patterns. Serializing accesses to different regions in a file to which processes on different nodes write in parallel is done by distributed byte range locks. When a process needs to lock a byte range, it first needs to acquire an appropriate byte range token. The byte range token represents the node's access rights to a portion of a file. Thus, if a node holds a byte range token for file X for range (100, 200) in read mode, it means that the node may safely read that portion of the file. However, to prevent stealing the token, the node must lock the token before the actual read, since if another node needs to write the same portion, it might steal the token. Locking the token prevents the steal. After the read has completed, the token is unlocked. One can view tokens as a way of "caching" locks. When a node needs to lock a portion of a file, it needs to lock the token. At first, it will acquire a token and lock it. Once the operation is finished and the token is unlocked, it is still resident at the node. Thus, subsequent operations on the same region would not need to access the token authority. Only when the token is stolen will a new request for the token be needed. Given this, it may be of benefit to request a larger token than needed to be locked. For example, if a process reads a file sequentially, and it reads from range 1000 to 2000, then although the next lock will be of range 1000 to 2000, it can request a larger token, for example, from 1000 to 10000. However, this may create excessive token traffic on other nodes. If another node is in the process of writing from 5000 to 6000, the token acquisition may delay the operation. The idea is to give two ranges when acquiring a byte range token: a required range (which is the minimum range that is needed for the operation) and the desired range (which is the maximum range that is expected to be of any use). The token manager is guaranteed to grant a token that covers the required range but is not larger than the desired range. Two algorithms need to be specified: (1) how to compute the desired and required range for each operation; this is on the requesting side; (2) how to compute the granted range; this is on nodes which hold conflicting tokens. For the above algorithms, we differentiate between two file access patterns: random and sequential. With random accesses, the starting offset of the next operation cannot be predicted. Sequential operations are assumed to start where the previous operation finished. Each file may be open multiple times on each node, and each such instance may present a different access pattern. We prefer the following algorithm. The main goal is to minimize token traffic. When trying to lock a byte range, we first query the token manager and see if a compatible token exists on the node. The range that is probed is the minimum range that is required by the operation. If the token is available locally, it is locked and no further token activity takes place. However, if the token is not available, then a token is requested. The required range is computed based on the offset and length of the file operation. The desired range is based of the access pattern of the file. If the file is accessed randomly, then the desired range will be equal to the required range, since there is probably no advantage in stealing tokens (that would probably not be needed) from other nodes. If, however, the file is accessed sequentially, the desired range starts from the required range's start, but ends at infinity (there's a special value to represent infinity). This is in attempt to minimize future token requests, since we can predict the future locks that will be needed. When a node holds a token that conflicts with a request for a token on another node, it gets a revoke request. The request contains the requesting node's required and desired ranges. Here, the node has to make a decision what range it can relinquish. If the required range is equal to the desired range, the decision is easy, and the granted range is the required (and desired) range. However, if the desired range is different than the required range, that means that the requesting node is accessing the file sequentially, and it wishes to have a token that starts at the required range's start but ends at infinity. The node then makes a pass over all its active processes that access the file, and checks whether they access the file sequentially or randomly. If all of them access the file randomly, then the node grants the desired range. However, if one or more of the processes access the file sequentially, it would be a waste to relinquish the desired range, since with high probability we know what token will be requested soon. In this case, the file pointers (i.e., the anticipated location of the next operation) of all the sequential operations are examined, and the minimum offset is calculated. It is anticipated that these operations will not access file regions which are below this minimum, since they are sequential. Thus, the granted range is stretched to that calculated minimum, if it is higher than the required range. We are not aware of a system where byte range tokens are requested based on the file's access pattern. The solution allows caching of tokens with regard to the file access pattern. This saves acquisition of tokens which is a costly operation and thus improves the overall performance of the system. Any parallel processing system which has the need to allow parallel write sharing of files and needs to serialize accesses to the same regions in the file. Byte Range Token Interface PO997073-po8970067 Referring now to FIG. 2, this parallel write improvement provides for the management of information describing tokens using a byte range lock algorithm with a byte range token interface. Our parallel file system where all disks that make up the file system can independently be accessed by multiple processors when exploited requires that a file should be shared by multiple processors for both reading and writing. To enable parallel write operation while ensuring file consistency, a locking mechanism for regions in files is required. In a distributed environment, tokens are sometimes used. This token represents the access rights of a node to an object. However, a node might run several processes which try to access the same region of a file; thus, a local lock mechanism is needed on the tokens. In addition, another node might need to access the same region and thus may try to revoke the token from this node; thus, a revoke should not proceed as long as a local process locks the token. Thus, some kind of locking algorithms should be used for these tokens, which are managed by our Token Manager (TM), which is our improvement over U.S. Pat. No. 5,343,108 assigned to International Business Machines Corporation. To get access to a region in a file, a node first has to get the appropriate token, then lock it, perform the operation, and unlock the token. There are several problems associated with locking the tokens; first, a token may already be cached in the node. In this case we do not need to acquire it again. Second, we must ensure that locks within the same node do not conflict; third, we must handle revoke requests from other nodes that need a token that conflicts with a token that we currently hold. Our locking algorithm presented here solves these problems efficiently. Our locking algorithm is presented as a set of APIs. Two APIs are used for locking and unlocking a byte range. A third API is a callback function called by the Token Manager. The Token Manager is assumed to provide three APIs as well. One API is needed to acquire a byte range token ("Acquire"). A second API is needed to test whether a byte range token is already cached in the node ("Test"). A third API is needed when relinquishing a token as a response of a revoke ("Relinquish"). For the purpose of accessing regions in files, each token contains a range (start, end) of the region of the file which it can access. We now elaborate on the Token Manager APIs which are an assumption. An acquire function of the form Acquire(byte.sub.-- range) which is called to acquire a range token And a revoke callback function of the form Revoke(byte.sub.-- range) which the TM calls whenever another node needs that token. As a result, the node should call Relinquish(byte.sub.-- range) The algorithm that we implemented is also based on a fourth interface that has to be provided by the TM: Test(byte.sub.-- range) which queries the TM for the existence of the token on the node. To simplify the implementation, we do not keep track of the tokens that we hold; we leave that to the token manager, and we use the Test interface to query whether a token needs to be acquired. Usually, there are actions to be performed when a token is acquired. Thus, it is desirable to know if a token is already held so that these actions may be spared. The algorithm is based on a lock table (Range lock table, or RLT), which holds all the existing locks. The table is protected by a mutex to enable atomic insertions and deletions of locks. Three main functions are exposed: LOCK, which locks a byte range; UNLOCK, which unlocks a previously locked range; and REVOKE, which handles a revoke request. We present the pseudo code for these interfaces:
______________________________________
LOCK(range)
retry:
old.sub.-- revokes = nrevokes;
if (not Test(byte.sub.-- range)) {
// the token does not exist on this node
acquire.sub.-- mutex;
i.sub.-- am.sub.-- fetching = true;
fetch.sub.-- is.sub.-- pending = true;
release.sub.-- mutex;
Acquire(byte.sub.-- range);
get.sub.-- data.sub.-- associated.sub.-- with byte.sub.-- range;
goto retry;
} else {
// we have the token locally - check that it was not stolen
acquire.sub.-- mutex;
if (old.sub.-- revokes != nrevokes)
release.sub.-- mutex;
goto retry;
}
// make sure there are no pending acquires; if there are
// make sure they are finished first
if (not i.sub.-- am.sub.-- fetching) {
if (fetch.sub.-- is.sub.-- pending) {
sleep();
goto retry;
}
}
// if we acquired the token before the Test, we need to
// release other threads we hold the mutex, so no revokes
// can interfere here
if i.sub.-- am.sub.-- fetching) {
i.sub.-- am.sub.-- fetching = false;
fetch.sub.-- is.sub.-- pending = false;
wakeup();
}
}
err = insert.sub.-- range.sub.-- into.sub.-- lock.sub.-- table;
if (err == E.sub.-- CONFLICT) {
sleep(); // wait for someone to release the lock
goto retry;
}
exit:
if (i.sub.-- am.sub.-- fetching) {
fetch.sub.-- is.sub.-- pending = false;
i.sub.-- am.sub.-- fetching = false;
}
release.sub.-- mutex;
}
UNLOCK(range)
{
acquire.sub.-- mutex;
delete.sub.-- range.sub.-- from.sub.-- lock.sub.-- table;
wakeup;
release.sub.-- mutex;
}
REVOKE (range)
{
retry:
acquire.sub.-- mutex;
err = insert.sub.-- range.sub.-- into.sub.-- lock.sub.-- table;
if (err == E.sub.-- CONFLICT) {
sleep();
goto retry;
}
nrevokes++;
release.sub.-- mutex;
put.sub.-- data.sub.-- associated.sub.-- with.sub.-- byte.sub.-- range;
Relinquish(range);
acquire.sub.-- mutex;
delete.sub.-- range.sub.-- from.sub.-- lock.sub.-- table;
wakeup ;
release.sub.-- mutex;
}
______________________________________
We have thus described a byte range lock. While we are not aware of of any algorithms for byte range locks, we would note that previous solutions for non-byte range locks would keep a copy of the token states outside of the token manager. Here we would remark that our distributed token manager provides interfaces (Acquire, Revoke, Relinquish, and Test) for the locking of ranges (i.e., byte ranges of a file). A given range can be requested in either shared-read or an exclusive-write mode. One of the features of our invention is that we examine a token request for a specified byte range for comparing the request with the existing conflicting ranges in the entire multinode system and granting the largest possible byte range which does not require a toke revoke from another computer. This reduces the probability that the next operation on the requesting node would require another token request. Counters and non-blocking lock calls are used to acquire tokens while holding other locks. This technique allows more efficient serialization for multiple requests within a single node allowing the required multiple node serialization. So we provide that the Acquire interface of the token manager takes as input a mode, as well as two ranges, a "required" range, and a "desired" range. The desired range must be a superset of the required range. An application calling the Acquire interface is guaranteed that, at a minimum, it will be granted the required range. The token manager will determine if any conflicting ranges (i.e., ranges that overlap the required range in a conflicting mode) have been granted to other nodes. If any conflicting ranges are found, then the token manager will request that each node that has a conflicting range downgrade the overlapping range to a non-conflicting mode. We further provide that when any conflicts with the required range have been resolved, the Acquire interface will determine the largest, contiguous range which totally covers the required range, which is also a subset of the desired range; This is the range which the Acquire interface will return to the calling application. In effect, the token manager will grant the largest range possible (bounded by the desired range parameter) that does not require additional revoke processing to be preformed. The Revoke interface of the token manager is used to communicate to an application information about a conflicting range request from another node. When an Acquire request detects conflicting ranges that have been granted to other nodes, it will request that the application running on each of conflicting nodes downgrade the ranges that they've been granted. The information passed through the Revoke interface includes the mode, as well as the required/desired ranges that were specified on the Acquire call. Upon receipt of a revoke request, an application will invoke the Relinquish interface to downgrade any conflicting ranges it's been granted to a non-conflicting mode. At a minimum, the application is required to downgrade any ranges that conflict with the "required" range to a non-conflicting mode, but may downgrade a larger range if it desires. The token manager also provides a Test interface that will determine if a given range has been granted to the local node. This can be used by an application to determine if an Acquire request for a given range will require a communication request to the token server node. By processing with the use of sequence numbers for a given byte range we provide correct processing of acquires and revokes on the same byte ranges. The token manager Acquire interface takes as an argument, a sequence number. For each token, the token manager maintains a sequence number for each node that has been granted a range. The token manager updates the field containing a nodes sequence number at the completion of an Acquire operation, with the value specified in the Acquire interface. When a subsequent Acquire must revoke ranges from conflicting nodes, the token manager will pass the sequence number of the last successful acquire from that node via the token manager Revoke interface. In view of the interfaces to the distributed token manager (Acquire, Revoke, Relinquish, Test), we have provided an improved method for implementing local byte range locks in the code used. Several potential complications are elegantly solved by these program methods or algorithms, while enabling some sophisticated features: We process multiple token acquires and revokes in parallel using the locking techniques described below with the pseudo code in the original disclosure. We allow for several token acquires to be processed in parallel. This can happen, for example, if several file system operations try to access different sections of a file in parallel. And we allow for a token revoke for one part of a file to happen concurrently with an acquire, as long as the two do not conflict. It will be appreciated that we do not need to keep a copy of the local token state within the byte range lock code. We eliminate a livelock situation where, just after it is acquired, but before it is locked, a token is revoked by another node. The other node acquires the token and before being locked, it is stolen again. This ping-pong effect stops progress. Now, a result of our not needing to keep a copy of the local token state within the byte range lock code is a reduction of the memory needs of our program since this information is already stored in the TM. An API queries the TM to find out if the token is already cached. After locking the byte range, a special mechanism is provided to make sure that a revoke didn't happen after testing for token existance but before locking it. It is possible that the token was revoked in between. In this case, we acquire the token and try again. The same byte range lock code that is used by the file system operations is also used by the revoke callback function. However, a special flag signified that this is a lock-for-revoke. This makes the code more compact and allows the use of the same lock tables. The API for locking a byte range supports various options that enhance its operation:Non-blocking; Local-lock; Test; and Sequential. The non-blocking option allows for a non-blocking operation; if we don't have the token or a conflicting lock is being held, the lock code returns immediatly with an appropriate return code. The local-lock option allows for a non-distributed operation; if we do not need to lock globally but only within the node, we can use this option. The test option allows seeing if we could lock the byte range, but without really locking. The sequential option provides a hint that we lock a byte range for a reading (or writing) a file that is accessed sequentially. This hint is used if a token is needed. In this case, a token that is larger that the one which is really needed is desired (but not required). Special provisions are made for keeping track of the various locks that are held by the threads. A debugging utility dumps the existing byte range locks and the thread numbers that are holding them. Also, statistics are kept for understanding the patterns of file access and lock behaviour. By returning a handle for each successful lock operation, an unlock operation is speedy and does not require a search or a lookup. By keeping counters of the various existing lock modes, the operation which checks if a conflicting lock exists is fast. For example, if we keep a counter for the number of active shared-read locks, and active exclusive-write locks, we can often know if we need to check for range overlap. E.g., if there are no exclusive-write locks, and we need a shared-read lock, we know that there is no conflict and we just need to find an empty slot in the lock table. The lock code provides support for an unlimited number of byte range lock requests. In case the lock table gets full, or a conflicting lock is requested, the thread that is asking for the look is put to sleep and is woken up when a lock is unlocked. Our solution does not duplicate token information, and thus is compact and efficient. While we have described our preferred embodiments of our invention, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow. These claims should be construed to maintain the proper protection for the invention first disclosed.
|
Same subclass Same class Consider this |
||||||||||
