File version reconciliation using hash codes6098079Abstract A file reconciliation process in a distributed file system uses a set of rnal or log files to track the history of file modification at each of different sites, or sets of directories, in a computer system. During reconciliation, sequences of version entries associated with each file in each journal are updated and compared to determine whether (1) a conflict exists for any of the files involved in the reconciliation, and (2) if not, which version of the file is the current version. The version entries contain a hash code or digest that to a high probability uniquely identifies the contents of a file. Sequences of hash codes are used to identify the sequence of file versions. Masks and site indicator fields are included in the journal files and used to track which journal files have copies of version entries for the purpose of deleting version entries when they become obsolete. Claims What is claimed is: Description CROSS REFERENCE TO RELATED APPLICATIONS
______________________________________
<verb> An action performed:
+ Created or Modified
- Deleted
<date> Date action performed
<time> Time action performed
<name> Name of file acted on
<t> Type of file acted on:
(blank) Ordinary file
/ Subdirectory
@ Symbolic Link
?sites Bit field indicating which sites do NOT
know about this Version; omitted if
Version known at all sites. Bit-to-site
mapping defined in the mask field of Site
entries.
<digest> Unique hash code or digest of the contents
of the file that resulted from the action.
"dt" prefix indicates file is a text file;
"db" indicates binary file.
<remarks> Remark whose use depends on context. Two
special remarks are:
!was <name> (saved file)
!deleted <name> (deleted directory)
______________________________________
As shown in FIG. 6, a Site entry includes several fields, these being described as follows:
______________________________________
<date> Date site last involved in reconciliation
<time> Time site last involved in reconciliation
<sitename> Name of site
?mask Bit mask to be used for this site in site
field of Version entries
______________________________________
Based on the above description, the following shows exemplary journal files for the sites shown in FIGS. 1 and 2:
______________________________________
Journal of SITE1
$ <date> <time> SITE2 ?1
$ <date> <time> SITE1 ?2
+ f1date f1time file1.xxx dt = aaaaa
+ f2date f2time file2.xxx dt = bbbbb
+ f3date f3time file3.xxx dt = aaaaa
+ s1date s1time sub1.dir/
Journal of SITE1/sub1.dir
+ f5date f5time file5.xxx dt = eeeee
+ f6date f6time file6.xxx dt = fffff
Journal of SITE2
$ <date> <time> SITE2 ?1
$ <date> <time> SITE1 ?2
+ f1date f1time file1.xxx dt = aaaaa
+ f2date f2time file2.xxx dt = bbbbb
+ f4date f4time file4.xxx dt = ddddd
+ s1date s1time sub1.dir/
Journal of SITE2/sub1.dir
+ f5date f5time file5.xxx dt = eeeee
+ f6date f6time file6.xxx dt = fffff
______________________________________
FIG. 7 illustrates the reconciliation process. The process reads the existing journal files 10 and directories 11 associated with each site involved in the reconciliation. At step 12, the Version entries in the journals for each site are updated to reflect the current versions of the files and directories at the respective sites. First the actual contents of the sites involved in the reconciliation are determined by reading the site directories and sub-directories. New "+" Version entries are created for those files and directories that either (1) have no corresponding Version entries (and are therefore assumed to be newly created), or (2) have a date and time different from the date and time included in the last Version entry for that file or directory in the journal file (and are therefore assumed to have been changed). The method by which "+" Version entries are created is described in greater detail below with reference to FIG. 8. In a file has a corresponding Version entry and the timestamps match, then the version of the file existing at the site is consistent with the last Version entry for that file. In this case, a new Version entry is not created. In this manner, the unnecessary re-calculation of digests is avoided. Because the calculation of digests is compute-intensive, this feature of creating new digests for only new or modified files enhances the performance of the reconciliation process. It is possible that the latest "version" of a file is actually its deletion. A pass is made through the journal to determine if any files or directories named in existing Version entries have been deleted from the file system. For any such files, new "-" Version entries including the names of the deleted files or directories are created, indicating that the last action taken was the deletion of the file at the corresponding site. As the journals are read in step 12, their Site entries are merged into a single master list of known sites, including both those sites which are participating in the current reconciliation and also other sites mentioned in the journals. The master list also contains the mask bits to be used in the new journals, and a date and time of the last known reconciliation for each site. The Site entries are then updated in step 13 as follows: First, the entry for each site involved in the current reconciliation is updated to contain the current time. Then, obsolete sites (which have not been heard from for a long time, such as one month) are purged. The resulting list of sites, including those not participating in the current reconciliation, will eventually be included in all the new journals for the participating sites. It should be noted here that the assignment of mask bits to sites is meaningful only within a particular journal. When journals are merged as described above, the mask bits in both the Site and the Version entries are re-mapped appropriately to maintain the associations between versions and sites. In the illustrated embodiment, the mask bits are assigned to sites as follows: The first site mentioned in a journal is given mask value 1, the second site is given mask 2, the third is given mask 4, the fourth is given mask 8, and so on. This assignment is arbitrary, and may be done in other ways in alternative embodiments. When a site is abandoned, its corresponding mask bit is freed for use by another site. Later sites automatically move up to fill in the gap created. The reconciliation process then proceeds to step 14. First, the sequences of Version entries for each filename in the journals are compared. This process employs an algorithm known as the "maximum common subsequence" or MCS algorithm. The MCS algorithm finds a subsequence of "common" Version entries for each filename, i.e., Version entries that are contained in all of the journal files for the sites being reconciled, if such a subsequence exists. This common subsequence forms the basis for further action by the reconciliation process. The next step is to identify the last Version entry appearing in any journal file after the last common entry, if such an entry exists. If no journal file has a version entry for a data file after the last common version entry, then the current version of the file already exists at each site. In this case, no further reconciliation action needs to be taken for that file. Otherwise, the next step is to check for conflicts. A conflict exists when either (1) no common subsequence exists for the filename in the journals of the sites being reconciled, or (2) different Version entries exist in two or more journals after the last common Version entry. In either case it is not possible for the reconciliation process to determine from the hash codes which version is the most up to date. In this case, one of the conflicting versions is renamed using a unique and distinctive name, thus eliminating the conflict. The choice of which version to rename is arbitrary; one simple way to choose is to pick the version having an earlier timestamp. After this renaming, both conflicting versions are replicated to other sites as necessary, and the user is notified so that the two files may be compared and appropriate remedial action taken. If no conflict is found for a given filename, then the current version of the file, which exists at a site whose journal file has a Version entry subsequent to the last common Version entry, is copied to the other sites. Often the current version exists at only one site. However, it is possible for the current version to exist at more than one site before any copying is done. In such a case, the version is copied from any of the sites where it exists, and is copied to only those sites where the current version does not exist. As the copying takes place, new "+" Version entries are appended to the journals for the sites receiving the current version of the file. It may be that the file is being copied between two different types of systems, for example from a UNIX system to a Windows system. These systems use different characters to indicate the end of a line of text in text files. In such a case, the end-of-line characters are modified during the file copying process as necessary to ensure proper compatibility with the target system. As noted below, these minor modifications to text files do not affect the ability of the hash code to uniquely identify the file, and so the hash code can be copied unmodified. If the last Version entry after the latest Version entry in the MCS is a "-" Version entry, indicating that the file has been deleted, the file is deleted from those sites where it still exists, and "-" Version entries are appended to the journals accordingly. In step 16 the journals are inspected again to purge obsolete Version entries, in order to prevent the journal files from growing indefinitely. A Version entry becomes obsolete when it either (1) precedes any Version entry common to all journals, or (2) is older than some reasonable age, for example one month. This latter action is taken to handle old deletions, or "-" entries, which are typically the last entries for files and so do not precede other Version entries for those files. After the obsolete Version entries have been purged, the updated journals are written back out as updated journal files 18 for use in a subsequent reconciliation. It should be noted that in the foregoing description the last Version entry in the MCS is especially important, because it represents the most recent time that all of the sites saw a given version of the file. Further, the most recent Version entries in the journals are also especially important, because they represent what versions are currently stored at the sites. Thus a version of the MCS algorithm is used that reflects the weight of these entries, giving preference to matching recent and currently existing Versions. This weighting is sensible for the reconciliation process as described, which attempts to bring all sites up to date. However, other weightings of the Version entries are possible, and may be preferred in alternative embodiments of the reconciliation process. The creation of the "+" Version entries shown in FIG. 5 is now described with reference to FIG. 8. At the time a Version entry is created, the values to be included in the date, time, name, and type fields are known, so these are simply inserted in their respective fields. The site indicator is created as shown in step 20. When a Version entry is first created, its mask is set for all sites, except for the site at which the version is created, indicating that it is unknown at all sites but that one site. Successful reconciliations of this Version with other sites result in resetting corresponding mask bits for the Version, indicating that the Version is known at the additional sites. The masks are preserved from reconciliation to reconciliation. When all of its mask bits have been reset, the Version is known to have been propagated everywhere. Once a Version is known everywhere, all previous Version entries for the same file are obsolete and may be safely discarded. The hash code or digest is created in step 22. A procedure known as Message Digest version 5 (MD5) is run using the contents of the file as its input. Based on this input, MD5 computes a 16-byte (128-bit) digest that has an extremely high probability of uniquely identifying the file among all possible files, including earlier and later versions of the same file. The ability to uniquely identify a file is due in part to the large number of possible codes, which is on the order of 10.sup.40 or roughly one million to the one-millionth power. There are also other ways in which a hash code could be generated. It is desirable to use an algorithm that yields an acceptably low rate of false matches. For text files, end-of-line characters are ignored in the computation of the digest. This feature enables the transparent modification of these characters when files are being copied between different types of systems, as discussed above. This feature is an optimization; it may be useful in alternative embodiments to include these characters in the digest computation. EXAMPLES Examples are given below to illustrate the presently-disclosed reconciliation process and its results. Example 1 is the normal, no-conflict case. Example 2 shows a conflict. Examples 3 and 4 illustrate the creation of Site entries and the purging of obsolete Version entries. FIGS. 9-11 represent the sequences of modifications and copying that yield the scenarios below for file1, file2 and file3 respectively. The vertical arrows indicate modification and the horizontal arrows indicate copying. The file extensions have been removed to reduce extraneous detail, and 5-bit alphanumeric values are used to represent hash codes calculated from different versions of the files. In practice the hash codes are much longer strings, as discussed above. Example 1 No Conflict 1. Existing journal files (from some previous reconciliation):
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z + file1 jj39z
+ file2 r9t4w + file2 r9t4w
+ file3 pq9zr + file3 pq9zr
______________________________________
2. Current site directories, showing modification of file2 at site 1 and deletion of file3 at site 2 since the previous reconciliation:
______________________________________
SITE1 SITE2
______________________________________
file1 jj39z file1 jj39z
file2 kpn33 file2 r9t4w
file3 pq9zr
______________________________________
3. Results of initial update of journals, reflecting the current contents of the sites. New Version entries for file2 at site 1 and file3 at site 2 have been added.
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z + file1 jj39z
+ file2 r9t4w + file2 r9t4w
+ file2 kpn33
+ file3 pq9zr + file3 pq9zr
- file 3
______________________________________
4. Result of merging and conflict checking. Matches between most recent versions of files are indicated by dashed lines. The new version of file2 and the deletion of file3 are detected because the corresponding Version entries appear after the most recent common entries for those files.
______________________________________
SITE1: SITE2:
______________________________________
+ file1 jj39z
+ file1 jj39z
+ file2 r9t4w
+ file2 r9t4w
+ file2 kpn33
+ file3 pq9zr
+ file3 pq9zr
- file3
______________________________________
5. Result of copying file2, deleting file3, and updating the journals accordingly:
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z + file1 jj39z
+ file2 r9t4w + file2 r9t4w
+ file2 kpn33 + file2 kpn33
+ file3 pq9zr + file3 pq9zr
- file3 - file3
______________________________________
6. Corresponding updated site contents:
______________________________________
SITE1 SITE2
______________________________________
file1 jj39z file1 jj39z
file2 kpn33 file2 kpn33
______________________________________
7. Result of purging old versions from the journals, assuming no other sites exist:
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z + file1 jj39z
+ file2 kpn33 + file2 kpn33
- file3 - file3
______________________________________
Example 2 Reconciliation with a Conflict Continuing from the site contents and journals in 6 and 7 above, suppose the versions of file1 at both sites are updated inconsistently. 8. New site contents after conflicting updates to file 1:
______________________________________
SITE1 SITE2
______________________________________
file1 d9qlj file1 92w3a
file2 kpn33 file2 kpn33
______________________________________
9. Result of updating journals to reflect new site contents, followed by merging and conflict detection:
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z
+ file1 jj39z
+ file1 d9qlj + file1 92w3a
+ file2 kpn33
+ file2 kpn33
- file3
- file3
______________________________________
A conflict is detected for file1 because the last common version is followed by non-matching versions at both sites. 10. Site contents as result of renaming one of the conflicting versions:
______________________________________
SITE1 SITE2
______________________________________
file1 d9qlj file1#1 92w3a
file2 kpn33 file2 kpn33
______________________________________
11. Corresponding updated journals:
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z
+ file1 jj39z
+ file1 d9qlj + file1#1 92w3a
+ file2 kpn33
+ file2 kpn33
- file3
- file3
______________________________________
"file1#1" is a new, unique file name assigned by the reconciliation program. Now there is a new file at each site. 12. Result of copying the new versions to make both sites consistent:
______________________________________
SITE1 SITE2
______________________________________
file1 d9qlj file1 d9qlj
file1#1 92w3a file1#1 92w3a
file2 kpn33 file2 kpn33
______________________________________
13. Resulting journals:
______________________________________
SITE1 SITE2
______________________________________
+ file1 jj39z + file1 jj39z
+ file1 d9qlj + file1 d9qlj
+ file1#1 92w3a + file1#1 92w3a
+ file2 kpn33 + file2 kpn33
- file3 - file3
______________________________________
14. Assuming that no other sites exist, obsolete versions can be purged, resulting in:
______________________________________
SITE1 SITE2
______________________________________
+ file1 d9qlj + file1 d9qlj
+ file1#1 92w3a + file1#1 92w3a
+ file2 kpn33 + file2 kpn33
- file3 - file3
______________________________________
Example 3 Creation of Site Entries 1. Assuming the above reconciliation between sites SITE1 and SITE2 as a beginning point, the Site entries in each journal file are as follows:
______________________________________
site1.jnl:
$ date1 time1 SITE1 ?01
$ date1 time1 SITE2 ?02
site2.jnl:
$ date1 time1 SITE1 ?01
$ date1 time1 SITE2 ?02
______________________________________
2. Subsequently, a reconciliation is performed between SITE1 and a new site, SITE3. The new Site entries are as follows:
______________________________________
site1.jnl:
$ date2 time2 SITE1 ?01
$ date1 time1 SITE2 ?02
$ date2 time2 SITE3 ?04
site2.jnl (unchanged):
$ date1 time1 SITE1 ?01
$ date1 time1 SITE2 ?01
site3.jnl:
$ date2 time2 SITE1 ?01
$ date1 time1 SITE2 ?02
$ date2 time2 SITE3 ?04
______________________________________
Example 4 Managing Site Indicators and Purging Old Versions 1. Examples 1 and 2 assumed that only SITE1 and SITE2 existed. If there had been another site, say SITE3 with site mask ?4, the older journal entries would not have been purged and the journals would have contained:
______________________________________
SITE1 (?1) SITE2 (?2) SITE3 (?4)
______________________________________
+ file1 jj39z + file1 jj39z + file1 jj93z
+ file1 d9qlj ?4
+ file1 d9qlj ?4
+ file1#1 92w3a ?4
+ file1#1 92w3a ?4
+ file2 r9t4w + file2 r9t4w + file2 r9t4w
+ file2 kpn33 ?4
+ file2 kpn33 ?4
+ file3 pq9zr + file3 pq9zr + file3 pq9zr
- file3 ?4 - file3 ?4
______________________________________
The obsolete entry for file1 (jj39z), for example, can not be purged because the entry that follows it (d9qlj) is not yet known at all sites. 2. Now if a reconciliation is performed between SITE2 and SITE3 (but not SITE1), the appropriate files would be copied to SITE3 and their journals would be updated to reflect this:
______________________________________
SITE1 SITE2 SITE 3 (?4)
______________________________________
+ file1 jj39z + file1 jj39z + file1 jj39z
+ file1 d9qlj ?4
+ file1 d9qlj + file1 d9qlj
+ file1#1 92w3a ?4
+ file1#1 92w3a
+ file1#1 92w3a
+ file2 r9t4w + file2 r9t4w + file2 r9t4w
+ file2 kpn33 ?4
+ file2 kpn33 + file2 kpn33
+ file3 pq9zr + file3 pq9zr + file3 pq9zr
- file3 ?4 - file3 - file3
______________________________________
3. Obsolete entries can now be purged from the journals at SITE2 and SITE3 since the new entries are known to be present at all sites:
______________________________________
SITE1 (?1) SITE2 (?2) SITE3 (?4)
______________________________________
+ file1 jj39z
+ file1 d9qlj ?4
+ file1 d9qlj + file1 d9qlj
+ file1#1 92w3a ?4
+ file1#1 92w3a
+ file1#1 92w3a
+ file2 r9t4w
+ file2 kpn33 ?4
+ file2 kpn33 + file2 kpn33
+ file3 pq9zr
- file3 ?4 - file3 - file3
______________________________________
The journal at SITE1 will be purged the next time it is reconciled with either SITE1 or SITE3. An improved file reconciliation process has been described having application to a variety of distributed file systems. The disclosed reconciliation process is susceptive of modification in many ways. For example, it would be possible when updating Version entries to create new Version entries for all files and directories found at a site, without regard to whether their contents have changed as indicated by the timestamps. This change would simplify the process somewhat, but at the cost of degraded performance, due to the compute-intensive and unnecessary re-calculation of digests for unchanged files. And as mentioned above, the specific digest algorithm could be any of a number of algorithms that yield suitably unique digests. Also, the Version entries in the journal files could receive different weighting with respect to their position in the sequence of Version entries. For example, some Version entries, as specified by a user, could be ignored entirely, based either on position in the sequence or timestamp. Such an approach might be useful, for example, if sites are to be reconciled only up to some earlier date or up to some earlier existing version. As described the disclosed reconciliation process relies upon the ability of the processor on which the process is running to directly access the files and directories at each site. Other mechanisms may be employed to carry out file operations. For example, the reconciliation process could be run as an independent process on each computer, and a signalling and file-exchange protocol used between the independent processes to carry out the reading and writing of directory, data and journal files. It will be apparent to those skilled in the art that other modifications to and variations of the above-described methods and apparatus are possible without departing from the inventive concepts disclosed herein. Accordingly, the invention should be viewed as limited solely by the scope and spirit of the appended claims.
|
Same subclass Same class Consider this |
||||||||||
