Single transaction technique for a journaling file system of a computer operating system
6021414
Abstract A single transaction technique for a journaling file system of a computer operating system in which a single file system transaction is opened for accumulating a plurality of current synchronous file system operations. The plurality of current synchronous file system operations are then performed and the single file system transaction closed upon completion of the last of the file system operations. The single file system operation is then committed to a computer mass storage device in a single write operation without the necessity of committing each of the separate synchronous file system operations with individual writes to the storage device thereby significantly increasing overall system performance. The technique disclosed is of especial utility in conjunction with UNIX System V based or other journaling operating systems. Claims What is claimed is: Description CROSS REFERENCE TO RELATED APPLICATIONS
______________________________________
mdNN -t master log [-n]
mdNN    A metadevice name that will represent the metatrans device.
master  The master device; a metadevice or ordinary disk device.
log     The log device; a metadevice or ordinary disk device. The same
        log may be used in multiple metatrans devices, in which case it
        is shared among them.
______________________________________
Metastat may also be extended to display the status of metatrans devices, with the following format:
______________________________________
mdXX: metatrans device
Master device:mdYY
Logging device:mdZZ
<state information>
mdYY: metamirror, master device for mdXX
<usual status>
mdZZ: metamirror, logging device for mdXX
<usual status>
______________________________________
Fsck decides whether to check file systems based on the state of the clean flag. The specific implementation of the present invention described herein defines a new clean flag value, FSLOG. If the clean flag is FSLOG and the metatrans device 32 is not in an exception state, "fsck -m" exits with 0 and checking is skipped. Otherwise, the clean flag is handled in a conventional manner. Fsck checks the state of the metatrans device 32 with a project-private ioctl request. After successfully repairing a file system, fsck will issue a project-private ioctl request that takes the metatrans device 32 out of the exception state. If the clean flag is FSLOG and the metatrans device 32 is not in an exception state, then quotacheck skips the file system. Otherwise, quotacheck rebuilds the quotafile in a conventional manner. Quotacheck checks the state of the metatrans device 32 with a project-private ioctl request. After successfully repairing a file system, quotacheck will issue a project-private ioctl request that resets metatrans device 32's exception state. The ufs_mount program may accept a pair of new options to control whether or not to use delayed directory updates. Header Files
______________________________________
<sys/fs/ufs_inode.h>
struct ufsvfs may contain a pointer to struct metatrans to identify the
metatrans device.
i_doff is added to struct inode.
<sys/fs/ufs_quota.h>
struct dquot may have the new field dq_doff.
<sys/fs/ufs_fs.h>
The new clean flag value FSLOG is defined here. fs_sparecon[53] is
renamed fs_reclaim.
<sys/fs/ufs_trans.h>
<sys/md_trans.h>
______________________________________
These are new header files that define project-private interfaces, e.g., metatrans ioctl commands and data structures. Kernel Interfaces common/fs/ufs/*.c The VOP and VFS interfaces to UFS need not change unless a flag is added to the directory VOP calls to distinguish local and remote access. Calls to the metatrans logging interface are added to numerous internal UFS functions. common/vm/page_lock.c The following functions allow conditional access to a page: page_io_lock(), page_io_unlock(), page_io_trylock(), and page_io_assert(). common/vm/vm_pvn.c The following function allows release of the pages acquired using the preceding functions: pvn_io_done. common/os/bio.c A new function, trygetblk(), is added to bio.c. This function checks whether a buffer exists for the specified device and block number and is immediately available for writing. If these conditions are satisfied, it returns a pointer to the buffer header; otherwise it returns NULL. Thread-specific data ("TSD") may be utilized for testing. Each delta 43 in a file system operation will be associated with the thread that is causing the delta 43. UFS mount stores the value returned by ufs_trans_get() in the ufsvfs field vfs_trans. A NULL value means that the file system is not mounted from a metatrans device 32. UFS functions as usual in this case. A non-NULL value means the file system is mounted from a metatrans device. In this case: a) The on-disk clean flag is set to FSLOG and further clean flag processing is disabled by setting the in-core clean flag to FSBAD. Disabling clean flag processing saves CPU overhead. b) The DIO flag is set unless the "nosyncdir" mount option is specified. Local directory updates will be recorded with a delayed write. A crash could lose these operations. Remote directory operations remain synchronous. 
Directory operations are considered remote when T_DONTPEND is set in curthread->t_flag. c) An exception routine is registered with the metatrans device 32 at mount time. The metatrans driver calls this routine when an exception condition occurs. Exception conditions include device errors and detected inconsistencies in the driver's state. The UFS exception routine will begin a kernel thread that hard locks the affected file systems. Each UFS Vnode or VFS operation may generate one or more transactions. Transactions may be nested, that is, a transaction may contain subtransactions that are contained entirely within it. Nested transactions occur when an operation triggers other operations. Typically, each UFS operation has one transaction (plus any nested transactions) associated with it. However, certain operations such as VOP_WRITE and VFS_SYNC are divided into multiple transactions when a single transaction would exceed the total size of the logging device 46. Others, such as VOP_CMP and VOP_ADDMAP, do not generate any transactions because they never change the file system state. Some operations that do not directly alter the file system may generate transactions as a result of side effects. For example, VOP_LOOKUP may replace an entry in the dnlc or inode cache, causing in-core inodes to become inactive and the pages associated with them to be written to disk. Transactions begin with a call to TRANS_BEGIN(). The transaction terminates when TRANS_END() is called. A transaction is composed of deltas 43, which are updates to the file system's metadata. Metadata is the superblock, summary information, cylinder groups, inodes, allocation blocks, and directories. UFS identifies the deltas 43 for the metatrans device 32 by calling TRANS_DELTA(). This call identifies a range of bytes within a buffer that should be logged. These bytes are logged when the buffer is written. 
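The separation between declaring a delta and logging it can be pictured with a small user-space model. The following is only a sketch under stated assumptions: the names delta_set, delta_declare and delta_bytes are hypothetical and do not appear in the implementation described herein, and merging overlapping byte ranges is one plausible way to picture how repeated declarations for the same metadata could reduce to a single logged range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: the deltas declared against one buffer,
 * kept as byte ranges. Not the metatrans driver's data structure. */
#define MAX_DELTAS 16

struct delta { long off; long len; };

struct delta_set {
    struct delta d[MAX_DELTAS];
    size_t n;
};

/* Declare that bytes [off, off+len) must be logged when the buffer is
 * written. A range that overlaps or touches existing ranges absorbs them,
 * so repeated updates to the same bytes leave a single range behind. */
static int delta_declare(struct delta_set *s, long off, long len)
{
    long lo = off, hi = off + len;
    size_t i = 0;

    while (i < s->n) {
        long dlo = s->d[i].off, dhi = dlo + s->d[i].len;
        if (dlo <= hi && lo <= dhi) {       /* overlap or adjacency */
            if (dlo < lo) lo = dlo;
            if (dhi > hi) hi = dhi;
            s->d[i] = s->d[--s->n];         /* drop it and rescan */
            i = 0;
        } else {
            i++;
        }
    }
    if (s->n == MAX_DELTAS)
        return -1;                          /* no room in this toy table */
    s->d[s->n].off = lo;
    s->d[s->n].len = hi - lo;
    s->n++;
    return 0;
}

/* Number of bytes that would be logged when the buffer is finally written. */
static long delta_bytes(const struct delta_set *s)
{
    long total = 0;
    for (size_t i = 0; i < s->n; i++)
        total += s->d[i].len;
    return total;
}
```

In this model, declaring bytes 0..8 and then 4..12 of the same buffer leaves one 12-byte range to log rather than two separate records.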
UFS often alters the same metadata many times for a single operation. Separating the declaration of the delta 43 from the logging of the delta 43 collapses multiple updates into one delta 43. UFS obtains disk blocks for user data and allocation blocks from the same free pool. As a result, user data may occupy locations on disk that contained metadata at some earlier time. The log design must ensure that during recovery, the user data is not incorrectly updated with deltas 43 to the former metadata. UFS prevents this by calling TRANS_CANCEL() whenever a block is allocated for user data. Writes to the raw or block metatrans device 32 can invalidate information recorded in the log. To avoid inconsistencies, the driver transacts these writes. The logging device 46 increases synchronous write performance by batching synchronous writes together and by writing the batched data to the logging device 46 sequentially. The data is written asynchronously to the master device 44 at the same time. The synchronous write data recorded in the log is not organized into transactions. The metatrans device 32 transparently logs synchronous write data without intervention at the file system level. Synchronously written user data is not logged when there is not sufficient free space in the log. In this case, an ordinary synchronous write to the master device 44 is done. When synchronous write data is logged, any earlier log records for the same disk location must be canceled to avoid making incorrect changes to the data during recovery or roll-forward. When the asynchronous write of the data to the master device 44 has finished, the metatrans driver's done routine places a cancel record on a list of items to be logged. Subsequent synchronous writes to the same disk location are followed by a synchronous commit that flushes this record to the log and cancels the previous write. 
Subsequent asynchronous writes to the same location will disappear at reboot unless they are followed by a sync(), fsync() or further synchronous update. The correctness of this scheme depends on the fact that UFS will not start a new write to a disk location while a preceding one is still in progress. The master device 44 is periodically updated with the committed changes in the log. Changes recorded at the head of the log are rolled first. Three performance measures reduce the overhead of rolling the log. First, the driver avoids reading the log when the required data is available, either in the buffer cache or in the page cache. Two new routines, trygetblk() and ufs_trypage(), return a buffer header or a page without sleeping or they return NULL. Second, overlapping deltas 43 are canceled. If the log contains multiple updates for the same data, only the minimum set required is read from the log and applied. The third measure involves the untransacted synchronous write data. This data is written synchronously to the logging device 46 and asynchronously to the master device 44. The roll logic simply waits for the asynchronous write to complete. Rolling is initiated by the metatrans driver. When the logging device 46 fills, the metatrans driver immediately rolls the log in the context of the current thread. Otherwise, the metatrans driver heuristically determines when rolling would be efficient and it starts a kernel thread. An obvious heuristic for this case is when the metatrans driver has been idle for several seconds. The log is not rolled forward at fsync(), sync() or unmount but is rolled when the metatrans device 32 is cleared by the metaclear(1m) utility. The metatrans device 32 puts itself into an exception state if an error occurs that may cause loss of data. In this state, the metatrans device 32 returns EIO on each read or write after calling all registered "callback-on-exception" routines for the device. 
UFS registers a callback-on-exception routine at mount time. The UFS routine starts a kernel thread that hard locks the affected UFS file systems, allowing manual recovery. The usual procedure is to unmount the file system, fix the error, and run fsck. Fsck takes the device out of the exception state after it repairs the file system. The file system can then be mounted, and the file system functions as normal. If the file system is unmounted and then mounted again without running fsck, any write to the device returns EIO but reads will proceed if the requested data can be accessed. UFS must not exhaust log space and, if the metatrans driver cannot commit a transaction because of insufficient log space, it treats the condition as a fatal exception. UFS avoids this situation by splitting certain operations into multiple transactions when necessary. The UFS flush routines create a transaction for every ufs_syncip() or VOP_PUTPAGE call. The flush routines are ufs_flushi(), ufs_iflush(), and ufs_flush_icache(). The affected UFS operations are VFS_SYNC and VFS_UNMOUNT and the UFS ioctls FIOLFS, FIOFFS, and FIODIO. A VOP_WRITE operation is split into multiple rwip() calls in ufs_write(). Freeing a file in ufs_iinactive() cannot be split into multiple transactions because of deadlock problems with transaction collisions and recursive UFS operations; instead, freeing of the file is delayed until there is no chance of deadlock. The metatrans driver does not recover the resources held by open, deleted files at boot. Instead, UFS manages this problem. A kernel thread created at mount time scans for deleted files if: a) The file system is on a metatrans device 32, or b) The superblock says there are deleted files. A bit in a previously unused spare in the superblock indicates whether any such files are present. 
The metatrans device 32 driver handles three classes of errors: "device errors", "database errors", and "internal errors". Device errors are errors in reading or writing the logging or master devices 46, 44. Database errors are errors reported by MDD's database routines. Internal errors are detected inconsistencies in internal structures, including structures written onto the logging device 46. A mounted metatrans device 32 responds to errors in one of two ways. The metatrans driver passes errors that do not compromise data integrity up to the caller without any other action. For instance, this type of error can occur while reading unlogged data from the master device 44. The metatrans device 32 puts itself into an exception state whenever an error could result in lost or corrupted data, for example, an error reading or writing the logging device 46 or an error from MDD's database routines. A metatrans device 32 puts itself into an exception state by: a) Recording the exception in MDD's database, when possible. b) Calling any registered "callback-on-exception" routines. These routines are registered with the device at mount time. UFS registers a routine that starts a kernel thread that hard locks the affected UFS file systems. These file systems can be unmounted and then remounted after the exception condition has been corrected. c) Returning EIO for every read or write call while the metatrans device 32 is mounted. After the metatrans device 32 is released by UFS at unmount with ufs_trans_put(), reads return EIO when the requested data cannot be accessed and writes always return EIO. This behavior persists even after the metatrans device 32 is mounted again. When fsck repairs the file system, it takes the metatrans device 32 out of its exception state. Fsck first issues a project-private ioctl that rolls the log up to the first error, discards the rest of the log, and makes the device writable. 
After repairing the file system, fsck issues a project-private ioctl that takes the device out of its exception state. At boot time, the logging device 46 is scanned and the metatrans device 32's internal state is rebuilt. A device error during the scan puts the metatrans device 32 in the exception state. The scan continues if possible. An unreadable sector resulting from an interrupted write is repaired by rewriting it. The metatrans device 32 is not put into an exception state. Roll forward operations may happen while scanning the logging device 46 and rebuilding the internal state. Roll forward operations happen because map memory may exceed its recommended allocation. Errors during these roll forward operations put the metatrans device 32 into an exception state and the scan continues if possible. It is recognized that delayed recording of local directory updates can improve performance. Two mechanisms for differentiating local and remote (NFS) directory operations may be implemented: a) UFS can examine the p_as member of the proc structure (if it is null then the caller is a system process, presumably NFS; otherwise the operation has been initiated by a user-level process and is taken to be local); or b) add a new flag to the Vnode operations for directories that specifies whether or not the operation must be synchronous (or add a new flag to the thread structure). Resources associated with open but deleted files must be reclaimed after a system crash and the present invention includes a kernel thread for this purpose. However, a thread that always searches the entire file system for such files has two disadvantages: the overhead of searching and the possibly noticeable delay until space is found and recovered. An alternative is to use a spare field in the superblock to optimize the case where there are no such files, which would likely be a fairly common occurrence. 
The FIOSDIO ioctl puts the UFS file system into delayed IO mode, which means that local directory updates are written to disk with delayed writes. Remote directory updates remain synchronous, as required by the NFS protocol. This mode makes directory operations very fast but without the present invention it is unsafe and repairing a file system in DIO mode will usually require user intervention. The logging mechanism of the present invention ameliorates the danger. To improve directory update performance, file systems may be placed into delayed IO mode unless the "nosyncdir" mount option is specified. However, the implementation of delayed IO mode changes considerably and a solution is to avoid use of the FIOSDIO flag and instead use a different, specific flag. This specific flag might be administered by a new utility and a project-private UFS ioctl. The new flag could be stored in the superblock or could be stored in MDD's database. The FIOSDIO ioctl would then have no effect on a file system in accordance with the present invention. UFS Interface to Metatrans Device
______________________________________
A metatrans device 32 records itself with UFS when the metatrans
device 32 is created or is recreated at boot:
struct ufstrans *
ufs_trans_set(
    dev_t dev,
    struct ufstransops *ops,
    void *data)
______________________________________
dev is the metatrans device number. data is the address of a metatrans-private structure. ops is the address of the branch table:
______________________________________
struct ufstransops {
    int  (*trans_begin)(struct ufstrans *, top_t, u_long, u_long);
    void (*trans_end)(struct ufstrans *, top_t, u_long, u_long);
    void (*trans_delta)(struct ufstrans *, off_t, off_t, delta_t,
             int (*)(), u_long);
    void (*trans_cancel)(struct ufstrans *, off_t, off_t, delta_t);
    int  (*trans_log)(struct ufstrans *, char *, off_t, off_t);
    void (*trans_mount)(struct ufstrans *, struct fs *);
    void (*trans_unmount)(struct ufstrans *, struct fs *);
    void (*trans_remount)(struct ufstrans *, struct fs *);
    void (*trans_iget)(struct ufstrans *, struct inode *);
    void (*trans_free_iblk)(struct ufstrans *, struct inode *,
             daddr_t);
    void (*trans_free)(struct ufstrans *, struct inode *,
             daddr_t, u_long);
    void (*trans_alloc)(struct ufstrans *, struct inode *,
             daddr_t, u_long, int);
};
______________________________________
ufs_trans_set stores the above information in a singly linked list of:
______________________________________
struct ufstrans {
    struct ufstrans     *ut_next;           /* next item in list */
    dev_t                ut_dev;            /* metatrans device no. */
    struct ufstransops  *ut_ops;            /* metatrans ops */
    struct vfs          *ut_vfsp;           /* XXX for inode pushes */
    void                *ut_data;           /* private data (?) */
    void               (*ut_onerror)();     /* callback ufs on error */
    int                  ut_onerror_state;  /* fs specific state */
};
ufs_trans_reset() unlinks and frees the ufstrans structure.
ufs_trans_reset() is called when a metatrans device is cleared.
______________________________________
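The registration, lookup, and reset behavior described for the singly linked list can be illustrated in user space. This is a minimal sketch, not the kernel code: the sketch_ function names, the dev_t_sketch typedef, and the reduced argument lists are assumptions made for illustration (the lookup described herein also takes a vfs pointer and an error callback, which are omitted here).

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long dev_t_sketch;     /* stand-in for the kernel dev_t */

struct ufstransops;                      /* branch table, opaque in this sketch */

/* Reduced version of the list node: only the fields the sketch needs. */
struct ufstrans {
    struct ufstrans    *ut_next;        /* next item in list */
    dev_t_sketch        ut_dev;         /* metatrans device no. */
    struct ufstransops *ut_ops;         /* metatrans ops */
    void               *ut_data;        /* private data */
};

static struct ufstrans *ufstrans_list;  /* head of the registry */

/* Register a metatrans device: allocate a node and push it on the list. */
static struct ufstrans *
sketch_trans_set(dev_t_sketch dev, struct ufstransops *ops, void *data)
{
    struct ufstrans *ut = malloc(sizeof (*ut));
    if (ut == NULL)
        return NULL;
    ut->ut_dev = dev;
    ut->ut_ops = ops;
    ut->ut_data = data;
    ut->ut_next = ufstrans_list;
    ufstrans_list = ut;
    return ut;
}

/* Look up a device; NULL means "not a metatrans device". */
static struct ufstrans *
sketch_trans_get(dev_t_sketch dev)
{
    struct ufstrans *ut;
    for (ut = ufstrans_list; ut != NULL; ut = ut->ut_next)
        if (ut->ut_dev == dev)
            return ut;
    return NULL;
}

/* Unlink and free the node for a device that has been cleared. */
static void
sketch_trans_reset(dev_t_sketch dev)
{
    struct ufstrans **pp = &ufstrans_list;
    while (*pp != NULL) {
        if ((*pp)->ut_dev == dev) {
            struct ufstrans *dead = *pp;
            *pp = dead->ut_next;
            free(dead);
            return;
        }
        pp = &(*pp)->ut_next;
    }
}
```

A real kernel implementation would also need locking around the list; that is left out to keep the sketch short.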
At mount time, UFS stores the address of a ufstrans structure in the vfs_trans field of a struct ufsvfs: ufsvfsp->vfs_trans = ufs_trans_get(dev, vfsp, ufs_trans_onerror, ufs_trans_onerror_state); ufs_trans_get returns NULL when the file system is not on a metatrans device 32. ufs_trans_onerror is called by the metatrans device 32 when a fatal device error occurs. ufs_trans_onerror_state is stored as part of the metatrans device 32's error state. This error state is queried and reset by fsck and quotacheck. UFS calls the metatrans device via the ufstransops table. These calls are buried inside of the following macros:
__________________________________________________________________________
/*
 * vfs_trans == NULL means no metatrans device
 */
#define TRANS_ISTRANS(ufsvfsp) (ufsvfsp->vfs_trans)
/*
 * begin a transaction
 */
#define TRANS_BEGIN(ufsvfsp, vid, vsize, flag) \
    ((TRANS_ISTRANS(ufsvfsp)) ? \
        (*ufsvfsp->vfs_trans->ut_ops->trans_begin) \
            (ufsvfsp->vfs_trans, vid, vsize, flag) : 0)
/*
 * end a transaction
 */
#define TRANS_END(ufsvfsp, vid, vsize, flag) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_end) \
            (ufsvfsp->vfs_trans, vid, vsize, flag)
/*
 * record a delta
 */
#define TRANS_DELTA(ufsvfsp, mof, nb, dtyp, func, arg) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_delta) \
            (ufsvfsp->vfs_trans, mof, nb, dtyp, func, arg)
/*
 * cancel a delta
 */
#define TRANS_CANCEL(ufsvfsp, mof, nb, dtyp) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_cancel) \
            (ufsvfsp->vfs_trans, mof, nb, dtyp)
/*
 * log a delta
 */
#define TRANS_LOG(ufsvfsp, va, mof, nb) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_log) \
            (ufsvfsp->vfs_trans, va, mof, nb)
/*
 * The following macros provide a more readable interface to TRANS_DELTA
 */
#define TRANS_BUF(ufsvfsp, vof, nb, bp, type) \
    TRANS_DELTA(ufsvfsp, \
        dbtob(bp->b_blkno) + vof, nb, type, \
        ufs_trans_push_buf, bp->b_blkno)
#define TRANS_BUF_ITEM(ufsvfsp, item, base, bp, type) \
    TRANS_BUF(ufsvfsp, \
        (caddr_t)&(item) - (caddr_t)(base), \
        sizeof (item), bp, type)
#define TRANS_INODE(ufsvfsp, vof, nb, ip) \
    TRANS_DELTA(ufsvfsp, ip->i_doff + vof, \
        nb, DT_INODE, ufs_trans_push_inode, ip)
#define TRANS_INODE_ITEM(ufsvfsp, item, ip) \
    TRANS_INODE(ufsvfsp, (caddr_t)&(item) - (caddr_t)&ip->i_ic, \
        sizeof (item), ip)
#define TRANS_SI(ufsvfsp, fs, cg) \
    TRANS_DELTA(ufsvfsp, \
        dbtob(fsbtodb(fs, fs->fs_csaddr)) + \
        ((caddr_t)&fs->fs_cs(fs, cg) - (caddr_t)fs->fs_csp[0]), \
        sizeof (struct csum), DT_SI, ufs_trans_push_si, cg)
#define TRANS_SB(ufsvfsp, item, fs) \
    TRANS_DELTA(ufsvfsp, \
        dbtob(SBLOCK) + ((caddr_t)&(item) - (caddr_t)(fs)), \
        sizeof (item), DT_SB, ufs_trans_push_sb, 0)
/*
 * These functions "wrap" functions that are not VOP or VFS
 * entry points but must still use the TRANS_BEGIN/TRANS_END
 * protocol
 */
#define TRANS_SBUPDATE(ufsvfsp, vfsp, topid) \
    ufs_trans_sbupdate(ufsvfsp, vfsp, topid)
#define TRANS_SYNCIP(ip, bflags, iflag, topid) \
    ufs_trans_syncip(ip, bflags, iflag, topid)
#define TRANS_SBWRITE(ufsvfsp, topid) \
    ufs_trans_sbwrite(ufsvfsp, topid)
#define TRANS_IUPDAT(ip, waitfor) \
    ufs_trans_iupdat(ip, waitfor)
#define TRANS_PUTPAGES(vp, off, len, flags, cred) \
    ufs_trans_putpages(vp, off, len, flags, cred)
/*
 * Test/Debug ops
 * The following ops maintain the metadata map.
 */
#define TRANS_IGET(ufsvfsp, ip) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_iget) \
            (ufsvfsp->vfs_trans, ip)
#define TRANS_FREE_IBLK(ufsvfsp, ip, bn) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_free_iblk) \
            (ufsvfsp->vfs_trans, ip, bn)
#define TRANS_FREE(ufsvfsp, ip, bno, size) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_free) \
            (ufsvfsp->vfs_trans, ip, bno, size)
#define TRANS_ALLOC(ufsvfsp, ip, bno, size, zero) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_alloc) \
            (ufsvfsp->vfs_trans, ip, bno, size, zero)
#define TRANS_MOUNT(ufsvfsp, fsp) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_mount) \
            (ufsvfsp->vfs_trans, fsp)
#define TRANS_UMOUNT(ufsvfsp, fsp) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_unmount) \
            (ufsvfsp->vfs_trans, fsp)
#define TRANS_REMOUNT(ufsvfsp, fsp) \
    if (TRANS_ISTRANS(ufsvfsp)) \
        (*ufsvfsp->vfs_trans->ut_ops->trans_remount) \
            (ufsvfsp->vfs_trans, fsp)
__________________________________________________________________________
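The macro pattern above, where a NULL vfs_trans turns every logging call into a no-op and a non-NULL value dispatches through the branch table, can be exercised with a mock driver. The following is a self-contained sketch under stated assumptions: the structures are cut down to two operations, plain int and unsigned long stand in for top_t and u_long, the begun/ended counters and the mock_ and sketch_ names are hypothetical, and TRANS_END is wrapped in do/while(0) here purely as a defensive choice for the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct ufstrans;

/* Two-entry stand-in for the branch table. */
struct ufstransops {
    int  (*trans_begin)(struct ufstrans *, int vid,
             unsigned long vsize, unsigned long flag);
    void (*trans_end)(struct ufstrans *, int vid,
             unsigned long vsize, unsigned long flag);
};

struct ufstrans {
    struct ufstransops *ut_ops;
    int begun;                      /* hypothetical counters for the mock */
    int ended;
};

struct ufsvfs {
    struct ufstrans *vfs_trans;     /* NULL means no metatrans device */
};

#define TRANS_ISTRANS(ufsvfsp) ((ufsvfsp)->vfs_trans != NULL)

#define TRANS_BEGIN(ufsvfsp, vid, vsize, flag) \
    (TRANS_ISTRANS(ufsvfsp) ? \
        (*(ufsvfsp)->vfs_trans->ut_ops->trans_begin) \
            ((ufsvfsp)->vfs_trans, vid, vsize, flag) : 0)

#define TRANS_END(ufsvfsp, vid, vsize, flag) \
    do { \
        if (TRANS_ISTRANS(ufsvfsp)) \
            (*(ufsvfsp)->vfs_trans->ut_ops->trans_end) \
                ((ufsvfsp)->vfs_trans, vid, vsize, flag); \
    } while (0)

/* Mock driver entry points: they only count calls. */
static int
mock_begin(struct ufstrans *ut, int vid, unsigned long vsize, unsigned long flag)
{
    (void)vid; (void)vsize; (void)flag;
    ut->begun++;
    return 0;
}

static void
mock_end(struct ufstrans *ut, int vid, unsigned long vsize, unsigned long flag)
{
    (void)vid; (void)vsize; (void)flag;
    ut->ended++;
}

static struct ufstransops mock_ops = { mock_begin, mock_end };

/* A file system operation bracketed by the transaction macros. */
static int
sketch_op(struct ufsvfs *ufsvfsp)
{
    int error = TRANS_BEGIN(ufsvfsp, 1, 512UL, 0UL);
    /* ... metadata updates and TRANS_DELTA calls would go here ... */
    TRANS_END(ufsvfsp, 1, 512UL, 0UL);
    return error;
}
```

The same sketch_op() runs unchanged on a file system without a metatrans device, which is the point of routing every call through TRANS_ISTRANS.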
Besides the vfs_trans field in the ufsvfs struct, a new field, off_t i_doff, is added to the *in-core* inode, struct inode. i_doff is set in ufs_iget(). i_doff is the device offset for the inode's dinode. i_doff reduces the amount of code for the TRANS_INODE() and TRANS_INODE_ITEM() macros. Similarly, the field dq_doff is added to the "in-core" quota structure, struct dquot. The protocol between ufs_iinactive() and ufs_iget() is changed because the system deadlocks if an operation on fs A causes a transaction on fs B. This happens in ufs_iinactive() when it frees an inode or when it calls ufs_syncip(). This happens in ufs_iget() when it calls ufs_syncip() on an inode from the free list. In the implementation of the present invention, a thread cleans and moves idle inodes from its idle queue to a new `really-free` list. The inodes on the `really-free` list are truly free and contain no state. In fact, they are merely portions of memory that happen to be the right size for an inode. ufs_iget() uses inodes off this list or allocates new inodes with kmem_alloc(). The thread runs when the number of inodes on its queue exceeds 25% of ufs_ninode. ufs_ninode is the user-suggested maximum number of inodes in the inode cache. Note that ufs_ninode does not limit the size of the inode cache. The number of active inodes and the number of idle inodes with pages may remain unbounded. The thread will clean inodes until its queue length is less than 12.5% of ufs_ninode. Some new counters may be added to inode stats structure:
______________________________________
/* Statistics on inodes */
struct instats {
    int in_hits;                /* Cache hits */
    int in_misses;              /* Cache misses */
    int in_malloc;              /* kmem_allocated */
    int in_mfree;               /* kmem_free'd */
    int in_maxsize;             /* Largest size reached by cache */
    int in_frfront;             /* put at front of freelist */
    int in_frback;              /* put at back of freelist */
    int in_dnlclook;            /* examined in dnlc */
    int in_dnlcpurge;           /* purged from dnlc */
    int in_inactive;            /* inactive calls */
    int in_inactive_nop;        /* inactive calls that nop'ed */
    int in_inactive_null;       /* inactive call with null vfsp */
    int in_inactive_delay_free; /* inactive delayed free's */
    int in_inactive_free;       /* inactive q's to free thread */
    int in_inactive_idle;       /* inactive q's to idle thread */
    int in_inactive_wakeups;    /* wakeups */
    int in_scan;                /* calls to scan */
    int in_scan_scan;           /* inodes found */
    int in_scan_rwfail;         /* inode rw_tryenter's that failed */
};
______________________________________
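The start and stop thresholds given earlier for the idle-inode thread (start above 25% of ufs_ninode, clean until below 12.5%) form a simple hysteresis. The following sketch shows only that arithmetic; the idle_q structure and the idle_ function names are hypothetical, and cleaning an inode is reduced to decrementing a counter.

```c
#include <assert.h>

/* Hypothetical model of the idle queue: just a length and the tunable. */
struct idle_q {
    long ne;        /* inodes currently on the idle queue */
    long ninode;    /* stands in for ufs_ninode */
};

/* The thread starts when the queue exceeds 25% of ninode. */
static int idle_should_start(const struct idle_q *q)
{
    return q->ne * 4 > q->ninode;
}

/* Once running, it keeps going until the queue is below 12.5% of ninode. */
static int idle_should_continue(const struct idle_q *q)
{
    return q->ne * 8 >= q->ninode;
}

/* One activation of the thread; returns how many inodes were cleaned. */
static long idle_run(struct idle_q *q)
{
    long cleaned = 0;

    if (!idle_should_start(q))
        return 0;
    while (q->ne > 0 && idle_should_continue(q)) {
        q->ne--;                /* clean one inode, move it to really-free */
        cleaned++;
    }
    return cleaned;
}
```

The gap between the two thresholds keeps the thread from waking for every single inode that becomes idle near the limit.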
ufs_iinactive() frees the on-disk resources held by deleted files. Freeing inodes in ufs_iinactive() can deadlock the system as described above, and the same solution may be used, that is, deleted files are processed by a thread. The thread's queue is limited to ufs_ninode entries. ufs_rmdir() and ufs_remove() enforce the limit. The system deadlocks if a thread holds the inode cache's lock when it is suspended while entering a transaction. A thread suspends entering a transaction if there isn't sufficient log space at that time. The inode scan functions ufs_flushi, ufs_iflush, and ufs_flush_inodes use a single scan-inode-hash function that doesn't hold the inode cache lock:
__________________________________________________________________________
/*
 * scan the hash of inodes and call func with the inode locked
 */
int
ufs_scan_inodes(int rwtry, int (*func)(struct inode *, void *), void *arg)
{
    struct inode *ip, *lip;
    struct vnode *vp;
    union ihead *ih;
    int error;
    int saverror = 0;
    extern krwlock_t icache_lock;

    ins.in_scan++;
    rw_enter(&icache_lock, RW_READER);
    for (ih = ihead; ih < &ihead[INOHSZ]; ih++) {
        for (ip = ih->ih_chain[0], lip = NULL;
            ip != (struct inode *)ih;
            ip = lip->i_forw) {
            ins.in_scan_scan++;
            /* hold the vnode so the inode survives dropping the cache lock */
            vp = ITOV(ip);
            VN_HOLD(vp);
            rw_exit(&icache_lock);
            if (lip)
                VN_RELE(ITOV(lip));
            lip = ip;
            /*
             * Acquire the contents lock to make sure that the
             * inode has been initialized in the cache.
             */
            if (rwtry) {
                if (!rw_tryenter(&ip->i_contents, RW_WRITER)) {
                    ins.in_scan_rwfail++;
                    rw_enter(&icache_lock, RW_READER);
                    continue;
                }
            } else
                rw_enter(&ip->i_contents, RW_WRITER);
            rw_exit(&ip->i_contents);
            /*
             * i_number == 0 means bad initialization; ignore
             */
            if (ip->i_number)
                if ((error = (*func)(ip, arg)) != 0)
                    saverror = error;
            rw_enter(&icache_lock, RW_READER);
        }
        if (lip) {
            rw_exit(&icache_lock);
            VN_RELE(ITOV(lip));
            rw_enter(&icache_lock, RW_READER);
        }
    }
    rw_exit(&icache_lock);
    return (saverror);
}
__________________________________________________________________________
ufs_iget() uses the same protocol. This protocol is possible because the new iget/iinactive protocol obviates the problems inherent in attempting to reuse a cached inode. The lockfs flush routine, ufs_flush_inodes, is altered to effectuate the present invention. ufs_flush_inodes previously hid inodes while flushing them. The inodes were hidden by taking them out of the inode cache, flushing them, and then putting them back into the cache. However, hidden inodes cannot be found at the end of transactions. ufs_flush_inodes now uses the new inode hash scan function to flush inodes. ufs_unmount() is modified to use the lockfs protocol and the new inode hash scan function. ufs_unmount() also manages the UFS threads. All of the threads are created, controlled, and destroyed by a common set of routines in ufs_thread.c. Each thread is represented by the structure:
______________________________________
/*
 * each ufs thread is managed by this struct (ufs_thread.c)
 */
struct ufs_q {
    void        *uq_head;   /* first entry on q */
    void        *uq_tail;   /* last entry on q */
    long         uq_ne;     /* # of entries */
    long         uq_maxne;  /* thread runs when ne==maxne */
    u_short      uq_nt;     /* # of threads serving this q */
    u_short      uq_nf;     /* # of flushes requested */
    u_short      uq_flags;  /* flags */
    kcondvar_t   uq_cv;     /* for sleep/wakeup */
    kmutex_t     uq_mutex;  /* protects this struct */
};
______________________________________
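The field comments in the structure above suggest how such a queue might behave: entries accumulate, the serving thread runs once the entry count reaches its threshold, and a caller can request a flush. The following is a minimal user-space sketch of that idea using POSIX threads rather than kernel primitives. It is not the ufs_thread.c code; every name in it (work_q, work_q_add, work_q_flush, work_q_service, work_q_stop, the exiting and processed fields) is hypothetical, and draining under the lock is a simplification chosen for brevity.

```c
#include <pthread.h>
#include <stdlib.h>

struct node { struct node *next; int value; };

/* Hypothetical user-space analog of the queue fields described above. */
struct work_q {
    struct node    *head;      /* first entry on q */
    struct node    *tail;      /* last entry on q */
    long            ne;        /* # of entries */
    long            maxne;     /* service thread runs when ne >= maxne */
    unsigned short  nf;        /* # of flushes requested */
    unsigned short  exiting;   /* assumption: set to stop the service thread */
    long            processed; /* assumption: entries handled so far */
    pthread_cond_t  cv;        /* for sleep/wakeup */
    pthread_mutex_t mutex;     /* protects this struct */
};

void work_q_init(struct work_q *q, long maxne)
{
    q->head = q->tail = NULL;
    q->ne = 0;
    q->maxne = maxne < 1 ? 1 : maxne;
    q->nf = 0;
    q->exiting = 0;
    q->processed = 0;
    pthread_cond_init(&q->cv, NULL);
    pthread_mutex_init(&q->mutex, NULL);
}

/* Append an entry; wake the service thread once the batch is full. */
void work_q_add(struct work_q *q, int value)
{
    struct node *n = malloc(sizeof(*n));
    if (n == NULL)
        abort();
    n->next = NULL;
    n->value = value;
    pthread_mutex_lock(&q->mutex);
    if (q->tail != NULL)
        q->tail->next = n;
    else
        q->head = n;
    q->tail = n;
    if (++q->ne >= q->maxne)
        pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mutex);
}

/* Request a flush and wait until the queue has been emptied. */
void work_q_flush(struct work_q *q)
{
    pthread_mutex_lock(&q->mutex);
    q->nf++;
    pthread_cond_broadcast(&q->cv);
    while (q->ne > 0)
        pthread_cond_wait(&q->cv, &q->mutex);
    q->nf--;
    pthread_mutex_unlock(&q->mutex);
}

/* Ask the service thread to drain what is left and return. */
void work_q_stop(struct work_q *q)
{
    pthread_mutex_lock(&q->mutex);
    q->exiting = 1;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mutex);
}

/* Service thread body: sleep until there is a reason to drain the queue. */
void *work_q_service(void *arg)
{
    struct work_q *q = arg;

    pthread_mutex_lock(&q->mutex);
    for (;;) {
        while (!q->exiting && q->ne < q->maxne && !(q->nf > 0 && q->ne > 0))
            pthread_cond_wait(&q->cv, &q->mutex);
        if (q->ne == 0 && q->exiting)
            break;
        while (q->head != NULL) {       /* drain every queued entry */
            struct node *n = q->head;
            q->head = n->next;
            free(n);
            q->ne--;
            q->processed++;
        }
        q->tail = NULL;
        pthread_cond_broadcast(&q->cv); /* wake any waiting flushers */
    }
    pthread_mutex_unlock(&q->mutex);
    return NULL;
}
```

A caller would start work_q_service in its own thread with pthread_create, add entries with work_q_add, and call work_q_flush when it needs the queue empty before continuing.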
With reference to the following pseudocode listing, the single transaction technique for a journaling file system of a computer operating system may be further understood.
______________________________________
SINGLE_TRANSACTION:
    If single transaction is closed
        wait for next single transaction to open
    Enter transaction
    Perform the synchronous operation
    Close this single transaction
    Wait for all current sync operations to finish
    Commit all sync operations with single disk write
    Open next single transaction
    Leave transaction
______________________________________
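The pseudocode above can be sketched in user space with POSIX threads. This is only an illustration under stated assumptions, not the patented implementation: the commit is simulated by a counter instead of a disk write, the last operation to finish is assumed to be the one that performs the commit and opens the next transaction, and all names (single_txn, txn_init, sync_op, demo_group_commit and the rendezvous helper) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct single_txn {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    int             open;        /* 1 while the transaction accepts entries */
    int             active;      /* operations still running inside it */
    unsigned long   id;          /* bumped each time a transaction commits */
    unsigned long   disk_writes; /* simulated commit writes */
};

void txn_init(struct single_txn *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->cv, NULL);
    t->open = 1;
    t->active = 0;
    t->id = 0;
    t->disk_writes = 0;
}

/* Run one synchronous operation inside the shared transaction. */
void sync_op(struct single_txn *t, void (*op)(void *), void *arg)
{
    unsigned long my_id;

    pthread_mutex_lock(&t->lock);
    while (!t->open)                  /* closed: wait for the next one to open */
        pthread_cond_wait(&t->cv, &t->lock);
    my_id = t->id;                    /* enter transaction */
    t->active++;
    pthread_mutex_unlock(&t->lock);

    if (op != NULL)
        op(arg);                      /* perform the synchronous operation */

    pthread_mutex_lock(&t->lock);
    t->open = 0;                      /* close this single transaction */
    assert(t->active > 0);
    if (--t->active == 0) {
        t->disk_writes++;             /* commit everything with one write */
        t->id++;
        t->open = 1;                  /* open next single transaction */
        pthread_cond_broadcast(&t->cv);
    } else {
        while (t->id == my_id)        /* wait for the shared commit */
            pthread_cond_wait(&t->cv, &t->lock);
    }
    pthread_mutex_unlock(&t->lock);   /* leave transaction */
}

/* Demo helper: an operation that blocks until all workers are inside. */
struct rendezvous { pthread_mutex_t lock; pthread_cond_t cv; int arrived, expected; };
struct demo_ctx { struct single_txn *txn; struct rendezvous *rv; };

static void rendezvous_op(void *arg)
{
    struct rendezvous *r = arg;
    pthread_mutex_lock(&r->lock);
    if (++r->arrived == r->expected)
        pthread_cond_broadcast(&r->cv);
    while (r->arrived < r->expected)
        pthread_cond_wait(&r->cv, &r->lock);
    pthread_mutex_unlock(&r->lock);
}

static void *demo_worker(void *arg)
{
    struct demo_ctx *c = arg;
    sync_op(c->txn, rendezvous_op, c->rv);
    return NULL;
}

/* Run nthreads overlapping operations; return the number of commit writes. */
unsigned long demo_group_commit(int nthreads)
{
    enum { MAX_THREADS = 16 };
    pthread_t tid[MAX_THREADS];
    struct single_txn t;
    struct rendezvous r;
    struct demo_ctx c = { &t, &r };
    int i;

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    txn_init(&t);
    pthread_mutex_init(&r.lock, NULL);
    pthread_cond_init(&r.cv, NULL);
    r.arrived = 0;
    r.expected = nthreads;
    for (i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, demo_worker, &c) != 0)
            abort();
    for (i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return t.disk_writes;
}
```

In this sketch, operations that overlap in time share one simulated write, while operations issued strictly one after another each get their own. Waiting on the transaction id rather than on the open flag keeps a waiter from missing the commit if the next transaction has already been closed again by the time it wakes.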
UFS tells the metatrans device when transactions begin and end with the macros TRANS_BEGIN(ufsvfsp, vop_id, vop_size, &vop_flag) and TRANS_END(ufsvfsp, vop_id, vop_size, &vop_flag). vop_id identifies the operation, for example VA_MOUNT for mount() and VA_READ for read(). vop_size is an upper bound on the amount of log space this transaction will need. vop_flag tells the metatrans driver whether this thread must wait for the transaction to be committed, and whether this thread can sleep. Table 1 (hereinafter) illustrates "commit" and "NFS commit" assertions for various system calls. Fundamentally, using the technique of the present invention, transacted operations will not cause synchronous writes if they do not require a commit, and those transacted operations that do require a commit will generate fewer synchronous writes. As can be seen in Table 1, some transacted operations do not require a commit unless they originate on an NFS client. Nevertheless, even the NFS-only-commit operations require a commit if the file system is mounted with the -syncdir option. The operations that do not require a commit can be lost if the system goes down. These operations are "committed" along with the next committed operation, for example at the next sync. Concurrent file system operations are combined into a single transaction. The file system operations needing a commit will not return until all of the combined file system operations are complete. The file system operations that do not require a commit will return immediately. A file system operation may be suspended if its log space needs cannot be met. UFS may split writes, and likewise truncations, into multiple transactions if the log is too small.
TABLE 1
______________________________________
System Call Commit NFS Commit
______________________________________
TOP_OPEN
TOP_CLOSE
TOP_READ
TOP_WRITE Y
TOP_WRITE_SYNC Y Y
TOP_GETATTR
TOP_SETATTR Y
TOP_SETATTR_TRUNC Y
TOP_ACCESS
TOP_LOOKUP
TOP_CREATE Y
TOP_REMOVE Y
TOP_LINK Y
TOP_RENAME Y
TOP_MKDIR Y
TOP_RMDIR Y
TOP_READDIR
TOP_SYMLINK Y
TOP_READLINK
TOP_FSYNC Y
TOP_INACTIVE
TOP_FID
TOP_GETPAGE
TOP_PUTPAGE
TOP_MAP
TOP_FRLOCK
TOP_SPACE Y
TOP_PATHCONF
TOP_VGET
TOP_SBUPDATE_FLUSH
TOP_SBUPDATE_UPDATE
TOP_SBUPDATE_MOUNTROOT
TOP_SBUPDATE_UNMOUNT
TOP_SYNCIP_CLOSEDQ
TOP_SYNCIP_TRYPAGE
TOP_SYNCIP_FLUSHI
TOP_SYNCIP_HLOCK
TOP_SYNCIP_SYNC
TOP_SYNCIP_FREE
TOP_SYNCIP_FSYNC Y
TOP_SBWRITE_FIOSDIO
TOP_SBWRITE_CHECKCLEAN
TOP_SBWRITE_RECLAIM Y Y
TOP_SBWRITE_T_RECLAIM Y Y
TOP_SBWRITE_NOTCLEAN Y Y
TOP_IFREE
TOP_IUPDAT
TOP_MOUNT
TOP_COMMIT_FLUSH
TOP_COMMIT_UPDATE
TOP_COMMIT_UNMOUNT
______________________________________
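The commit rules stated with Table 1 can be expressed as a small decision function. The sketch below is an illustration only: the flag names, the op_rule structure and needs_commit are hypothetical, and the rule table is a deliberately tiny subset. Of its entries, only TOP_READ (no marks) and TOP_WRITE_SYNC (both marks) are taken from Table 1; EXAMPLE_NFS_ONLY is an invented placeholder standing for any operation marked in the NFS Commit column alone.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits mirroring the two columns of Table 1. */
enum {
    COMMIT_ALWAYS = 1 << 0,  /* "Commit" column */
    COMMIT_NFS    = 1 << 1   /* "NFS Commit" column */
};

struct op_rule {
    const char *name;
    unsigned    flags;
};

/* Illustrative subset; the last entry is a made-up placeholder. */
static const struct op_rule rules[] = {
    { "TOP_READ",         0 },
    { "TOP_WRITE_SYNC",   COMMIT_ALWAYS | COMMIT_NFS },
    { "EXAMPLE_NFS_ONLY", COMMIT_NFS },
};

/* Look up the flags for an operation name; unknown names get no flags. */
unsigned op_flags(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
        if (strcmp(rules[i].name, name) == 0)
            return rules[i].flags;
    return 0;
}

/*
 * An operation must wait for a commit if it is always committed, or if it
 * is NFS-commit and either came from an NFS client or the file system is
 * mounted with the -syncdir option.
 */
bool needs_commit(unsigned flags, bool from_nfs, bool syncdir)
{
    if (flags & COMMIT_ALWAYS)
        return true;
    if (flags & COMMIT_NFS)
        return from_nfs || syncdir;
    return false;
}
```

For instance, needs_commit(op_flags("EXAMPLE_NFS_ONLY"), false, false) is false for a local caller on a default mount, and becomes true when either from_nfs or syncdir is set.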
While there have been described above the principles of the present invention in conjunction with specific computer operating systems, the foregoing description is made only by way of example and not as a limitation to the scope of the invention.
