Method and apparatus for tracking and viewing changes on the web
6366933
Abstract
A system for accessing documents contained in a remote repository, which change in content from version-to-version. The system allows users to specify lists of documents of interest. Based on the lists, the system maintains an archive, which contains a copy of one version of each listed document, and material from which the other versions can be reconstructed. The system periodically compares the archive with current versions of the documents located in the repository, and updates the archive, thereby maintaining the ability to reconstruct current versions. The system also monitors access to the versions by each user. When a user calls for a current version, the system presents the current version, and indicates what parts of the current version have not been previously accessed by the user.
Claims
We claim:
Description
REFERENCE TO A MICROFICHE APPENDIX
COMPUTER PROGRAM LISTING
The program listing is divided into three sections.
1. HTMLDIFF, comprising:
-- html_diff.sml (5 pages),
-- diff.sml (3 pages),
-- mlweb.sml (4 pages), and
-- html.lex (one page).
2. W3NEWER (17 pages).
3. NOHANDS, comprising:
-- nohandsBE (11 pages),
-- no-hands.cgi (3 pages),
-- rcsdiff.cgi (4 pages), and
-- snapshot.cgi (3 pages).
NOHANDS is an overall program set which utilizes W3NEWER and
HTMLDIFF.
A set of tools that detect when World-Wide-Web pages have been modified and present the modifications visually to the user through marked-up HTML. The tools consist of three components: w3newer, which detects changes to pages; snapshot, which permits a user to store a copy of an arbitrary Web page and to compare any subsequent version of a page with the saved version; and htmldiff, which marks up HTML text to indicate how it has changed from a previous version. The tools are referred to collectively as the Network-Oriented HTML Archival, Notification, and Differencing System (No HANDS). Presented are several aspects of NO HANDS, with an emphasis on systems issues such as scalability, security, and error conditions. Use of the World-Wide-Web (W.sup.3) has increased dramatically over the past couple of years, both in the volume of traffic and the variety of users and content providers. The W.sup.3 has become an information distribution medium for academic environments (its original motivation), commercial ones, and virtual communities of people who share interests in a wide variety of topics. Information that used to be sent out over electronic mail or USENET, both active media that go to users who have subscribed to mailing lists or newsgroups, can now be posted on a W.sup.3 page. Users interested in that data then visit the page to get the new information. The URLs of pages of interest to a user can be saved in a "hotlist" (known as a bookmark file in Netscape.TM.), so they can be visited conveniently. How does a user find out when pages have changed? If users know that pages contain up-to-the-minute data (such as stock quotes), or are frequently changed by their owners, they may visit the pages often. Other pages may be ignored, or browsed by the user only to find they have not changed. In recent months, several tools have become available to address the problem of determining when a page has changed. 
One example of such a tool is webwatch, a product for Windows.TM. that uses the HTTP HEAD command to find out whether a page has been modified since it was last viewed by a user's web browser, and generates a report in HTML that allows the user to go directly to those updated pages. Another example is w3new, by Brooks Cutter, a public-domain perl script that runs on UNIX.RTM. as shown in "B. B. Cutter III. w3new. http://www.stuff.com/bcutter/programs/w3new/w3new.html". Each of these tools suffers from a significant deficiency: while they provide the user with the knowledge that the page has changed, they do not show how the page has changed. Although a few pages are edited by their maintainers to highlight the most recent changes, often the modifications are not prominent, especially if the pages are large. Even pages with special highlighting of recent changes are problematic: if a user visits a page frequently, what is "new" to the maintainer may not be "new" to the user. Alternatively, a user who visits a page infrequently may miss changes that the maintainer deems to be old. A system has been developed that efficiently tracks when pages change, compactly stores versions on a per-user basis, and automatically compares and presents the differences between pages. NO HANDS (Network-Oriented HTML Archival, Notification, and Differencing System) provides "personalized" views of versions of W.sup.3 pages with three tools. The first, w3newer, is a more scalable version of Cutter's w3new modification tracking tool that periodically accesses the W.sup.3 to find when pages on a user's hotlist have changed. The second, snapshot, allows a user to save versions of a page and later use a third tool, htmldiff, to see how it has changed. Htmldiff automatically compares two HTML pages and creates a "merged" page to show the differences with special HTML markups.
While NO HANDS can help arbitrary users track pages of interest, it can be of particular use in a collaborative environment. Consider a software development project that is geographically distributed across several locations. The W.sup.3 can be used to collect requirements, meeting notes, code, documentation, bug reports, and so on, so that everyone involved with the project has a consistent and up-to-date view of the project. As documents change, each project member will want to know what's "new" in their world, without having to waste time browsing documents. The w3newer component of NO HANDS automatically provides this information. Furthermore, what is "new" to one project member will be "old" to another, so that the notion of a document version must be "personalized" rather than global to the entire project. NO HANDS supports personalized versioning of documents with snapshot and uses htmldiff to provide a personalized version of "what's new" in a document. There has been a great deal of interest lately in finding out when pages on the W.sup.3 have changed. Discussed below is related work, issues of scalability and cache consistency, and how to handle possible error conditions. Two tools, webwatch for Windows and w3new for UNIX, were mentioned above. Another similar tool is shown in "M. Newbery. Katipo. http://www.vuw.ac.nz./newbery/Katipo.html", which runs on the Macintosh.TM., and yet another, URL-minder as shown in "Url-minder, http://www.netmind.com/URL-minder/URL-minder.html", which runs as a service on the W.sup.3 itself and sends email when a page changes. Those that run on the user's host use the "hotlist" to determine which URLs to check, while URL-minder acts on URLs provided explicitly by a user via an HTML form. There are two basic strategies for deciding when a page has changed. 
Most tools use the HTTP HEAD command to retrieve the Last-Modified field from a W.sup.3 document, either returning a sorted list of all modification times or just those times that are different from the browser's history (the timestamp of the version the user presumably last saw). URL-minder uses a checksum of the content of a page, so it can detect changes in pages that do not provide a Last-Modified date, such as output from Common Gateway Interface (CGI) scripts. W3new (and therefore w3newer) requests the Last-Modified date if available; otherwise, it retrieves and checksums the whole page. Changes are generally reported to the user in the form of an HTML page with links to each of the pages being tracked, although it can also be done via email as with URL-minder. These tools also vary with respect to frequency of checking and where the checks are performed. Most of the tools automatically run periodically from the user's machine. All URLs are checked each time the tools run, with the possible exception of URL-minder, which runs on an Internet server and checks pages with an arbitrary frequency that is guaranteed to be at least as often as some threshold, such as a week (URL-minder's implementation is hidden behind a CGI interface). The tools described above poll every URL with the same frequency. W3new was modified to make it more scalable, as well as to integrate it with the other components of NO HANDS. W3newer runs on the user's machine, but it omits checks of pages already known to be modified since the user last saw the page, and pages that have been viewed by the user within some threshold. The time when the user has viewed the page comes from the W.sup.3 browser's history. The "known modification date" comes from a variety of sources: a cached modification date from previous runs of w3newer; a modification date stored in a proxy-caching server's cache; or the HEAD information provided by httpd (the HTTP server) for the URL.
If either of the first two sources of the modification date indicates that the page has not been visited since it was modified, then HTTP is used only if the time the modification information was obtained was long enough ago to be considered "stale" (currently, the threshold is one week). In addition, there is a threshold associated with each page to determine the maximum frequency of direct HEAD requests. If the page was visited within the threshold, or the modification date obtained from the proxy-caching server is current with respect to the threshold, the page is not checked. The threshold can vary depending on the URL, with perl pattern matching used to determine what threshold to apply. The first matching pattern is used. Table 1 gives an example of a w3newer_thresholds configuration file. Thresholds are specified as combinations of days (d) and hours (h), with "0" indicating that a page should be checked on every run of w3newer and "never" indicating that it should never be checked.
TABLE 1
An example of the thresholds specified to w3newer.
# Comments start with a sharp sign.
# perl syntax requires that "." be escaped
# Default is equivalent to ending the file with ".*"
Default                                                                 2d
file:.*                                                                 0
http://www\.yahoo\.com/.*                                               7d
http://www\.research\.att\.com/.*                                       0
http://.*\.att\.com/.*                                                  1h
http://home\.mcom\.com/home/whatsnew/whats_new\.html                    12h
http://www\.ncsa\.uiuc\.edu/SDG/Software/Mosaic/Docs/whats-new\.html    12h
http://snapple\.cs\.washington\.edu:600/mobile/                         1d
# rarely modified
http://www\.cs\.duke\.edu/pk/HomePage\.html                             7d
# this is in my hotlist but will be different every day
http://www\.unitedmedia\.com/comics/dilbert/                            never
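The first-match lookup that Table 1 relies on can be illustrated with a short sketch. This is not the w3newer perl code: it is a Python approximation in which the function names (parse_threshold, load_rules, threshold_for) are hypothetical, Python's re module stands in for perl pattern matching, patterns are assumed to be anchored at the start of the URL, a combined threshold is assumed to be written like "1d12h", and a two-day fallback is assumed when no Default line is present.

```python
import re
from datetime import timedelta

# A few entries in the style of Table 1 (raw string keeps the backslashes).
SAMPLE = r"""
# Comments start with a sharp sign.
Default 2d
file:.* 0
http://www\.yahoo\.com/.* 7d
http://www\.research\.att\.com/.* 0
http://.*\.att\.com/.* 1h
http://www\.unitedmedia\.com/comics/dilbert/ never
"""

def parse_threshold(text):
    """Turn '2d', '12h', '1d12h', '0' or 'never' into a timedelta (None = never check)."""
    if text == "never":
        return None
    if text == "0":
        return timedelta(0)
    parts = re.fullmatch(r"(?:(\d+)d)?(?:(\d+)h)?", text)
    if not text or not parts:
        raise ValueError(f"bad threshold: {text!r}")
    days, hours = (int(g) if g else 0 for g in parts.groups())
    return timedelta(days=days, hours=hours)

def load_rules(config):
    """Return (rules, default); rules is an ordered list of (compiled pattern, threshold)."""
    rules, default = [], timedelta(days=2)  # assumed fallback if no Default line
    for line in config.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, value = line.rsplit(None, 1)  # threshold is the last field
        if pattern == "Default":
            default = parse_threshold(value)
        else:
            rules.append((re.compile(pattern), parse_threshold(value)))
    return rules, default

def threshold_for(url, rules, default):
    """The first matching pattern wins; otherwise the default applies."""
    for pattern, threshold in rules:
        if pattern.match(url):  # assumption: match is anchored at the start of the URL
            return threshold
    return default
```

Because rules are tried in file order, the more specific www\.research\.att\.com entry must precede the general .*\.att\.com entry, exactly as in Table 1; reversing them would make the specific entry unreachable.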
Determining when HTTP pages have changed is analogous to caching a file in a distributed file system and determining when the file has been modified. While file systems such as the Andrew File System in "J. Howard et al. Scale and performance in a distributed file system. ACM Transactions on Computer Systems, 6(1):51-81, February 1988" and Sprite in "M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite network file system. ACM Transactions on Computer Systems, 6(1):134-154, February 1988" provide guarantees of cache consistency by issuing call-backs to hosts with invalid copies, HTTP access is closer to the traditional NFS approach as shown in "R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and implementation of the Sun network filesystem. In Proceedings of the USENIX 1985 Summer Conference, pages 119-130, June 1985", in which clients check back with servers periodically for each file they access. Netscape can be configured to check the modification date of a cached page each time it is visited, once each session, or not at all. Caching servers check when a client forces a full reload, or after a time-to-live value expires. Here the problem is complicated by the target environment: one wishes to know not only when a currently viewed page has changed, but also when a page that has not been seen in a while has changed. Fortunately, unlike with file systems, HTTP data can usually tolerate some inconsistency. In the case of pages that are of interest to a user but have not been seen recently, finding out within some reasonable period of time, such as a day or a week, will usually suffice. Even if servers had a mechanism to notify all interested parties when a page has changed, immediate notification might not be worth the overhead. Instead, one could envision using something like the Harvest replication and caching services as shown in "C. Mic Bowman et al. Harvest: A scalable, customizable discovery and access system.
Technical Report CU-CS-732-94, Dept. of Computer Science, University of Colorado--Boulder, March 1995", to notify interested parties in a lazy fashion. A user who expresses an interest in a page, or a browser that is currently caching a page, could register an interest in the page with its local caching service. The caching service would in turn register an interest with an Internet-wide, distributed service that would make a best effort to notify the caching service of changes in a timely fashion. (This service could potentially archive versions of HTTP pages as well.) Pages would already be replicated, with server load distributed, and the mechanism for discovering when a page changes could be left to a negotiation between the distributed repository and the content provider: either the content provider notifies the repository of changes, or the repository polls it periodically. Either way, there would not be a large number of clients polling each interesting HTTP server. Moving intelligence about HTTP caching to the server has been proposed by James S. Gwertzman and Margo Seltzer in "The case for geographical push-caching. In Proceedings of the Fifth Workshop on Hot Topics in Operating Systems (HotOS-V), pages 51-55, Orcas Island, Wash., May 1995. IEEE" and others. One could also envision integrating the functionality of NO HANDS into file systems. Tools that can take actions when arbitrary files change are not widely available, though they do exist as in "Sun Microsystems. The HotJava Browser: A White Paper. Available as http://java.sun.com/1.0alpha3/doc/overview/hotjava/browser.whitepapers.ps". Users might like to have a unified report of new files and W.sup.3 pages, and w3newer supports the "file:" specification and can find out if a local file has changed. However, snapshot has no way to access a file on the user's (remote) file system.
Moving functionality into the browser would allow individual users to take snapshots of files that are not already under the control of a versioning system such as the Revision Control System (RCS) as shown in "W. Tichy. RCS: a system for version control. Software-Practice & Experience. 15(7):637-654, July 1985"; this might be an appropriate use of a browser with client-side execution, such as HotJava in "Sun Microsystems. The HotJava Browser: A White Paper. Available as http://java.sun.com/1.0alpha3/doc/overview/hotjava/browser.whitepapers.ps". When a periodic task checks the status of a large number of URLs, a number of things can go wrong. Local problems such as network connectivity or the status of a proxy-caching server can cause all HTTP requests to fail. Proxy-caching servers are sometimes overloaded to the point of timing out large numbers of requests, and a background task that retrieves many URLs in a short time can aggravate their condition. W3newer should therefore be able to detect cases when it should abort and try again later (preferably in time for the user to see an updated report). At the same time, a number of errors can arise with individual URLs. Pages can move, with or without leaving a forwarding pointer. The server for a URL can be deactivated or renamed. Servers may disallow retrieval by "robots," meaning that any program that follows the robot exclusion protocol, as shown in "A standard for robot exclusion. http://web.nexor.co.uk/mak/doc/robots/norobots.html", will not retrieve their pages. Since the cost of retrieving modification dates is small in comparison to the cost of retrieving robots.txt (part of the exclusion protocol), it may well be appropriate to ignore the robot exclusion protocol for this task, or to check robots.txt only occasionally on each host. Observing the protocol will still be advisable for hosts on which many URLs are checked, especially if the pages' contents are retrieved each time.
Finally, automatic detection of modifications based on information such as modification date and checksum can lead to the generation of "junk mail" as "noisy" modifications trigger change notifications. For instance, pages that report the number of times they have been accessed, or embed the current time, will look different every time they are retrieved. W3newer attempts to address these issues as follows. If a URL is inaccessible to robots, that fact is cached so the page is not accessed again unless a special flag is set when the script is invoked. Another flag can tell w3newer to treat error conditions as a successful check as far as the URL's timestamp goes. For instance, if w3newer runs daily and checks a particular URL every four days, normally an error accessing the page on Monday will cause it to be checked again on Tuesday. With this flag, it would be checked again on Friday. In general, it seems that errors are likely to be transient, and checking the next time w3newer is run would be reasonable. When a URL is inaccessible, an error message appears in the status report, so the user can take action to remove a URL that no longer exists or repeatedly hits errors. In addition, w3newer could be modified to keep a running counter of the number of times an error is encountered for a particular URL, or to skip subsequent URLs for a host if a host or network error (such as "timeout" or "network unreachable") has already occurred. Addressing the problem of "noisy" modifications will require heuristics to examine the differences at a semantic level. In addition to providing a mechanism for determining when W.sup.3 pages have been modified, there must be a way to access multiple versions of a page for the purposes of comparison.
There are three possible approaches for providing versioning of W.sup.3 pages: making each content provider keep a history of all versions, making each user keep this history, or storing the version histories on an external server.
Server-side Support
Each server could store a history of its pages and provide a mechanism to use that history to produce marked-up pages that highlight changes. This method requires arbitrary content providers to provide versioning and differencing, so it is not practical, although it is desirable to support this feature when the content provider is willing.
Client-side Support
Each user could run a program that would store items in the hotlist locally, and run htmldiff against a locally saved copy. This method requires that every page of interest be saved by every user, which is unattractive as the number of pages in the average user's hotlist increases, and it also requires the ability to run htmldiff on every platform that runs a W.sup.3 browser. Storing the pages referenced by the hotlist may not be too unreasonable, since programs like Netscape may cache pages locally anyway. There are other external tools, such as warmlist as shown in "Warmlist, http://glimpse.cs.arizona.edu:1994/paul/warmlist/", that provide this functionality.
External Service
The approach taken is to run a service that is separate from both the content provider and the client. Pages can be registered with the service via an HTML form, and differences can be retrieved in the same fashion. Once a page is stored with the service, subsequent requests to remember the state of the page result in an RCS "check-in" operation that saves only the differences between the page and its previously checked-in version. Thus, except for pages that change in many respects at once, the storage overhead is minimal beyond the need to save a copy of the page in the first place.
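The storage idea behind the check-in operation, one full copy plus the differences needed to rebuild other versions, can be illustrated with a small sketch. It does not reproduce RCS: the Archive class and its method names are hypothetical, the repository is held in memory, and forward deltas from the first version are used for brevity, whereas RCS itself keeps the newest revision whole and stores reverse deltas. Python's difflib.SequenceMatcher supplies the line comparison.

```python
from difflib import SequenceMatcher

def make_delta(old_lines, new_lines):
    """Encode new_lines as copy/add instructions relative to old_lines."""
    delta = []
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))           # reuse old lines by index
        elif tag in ("replace", "insert"):
            delta.append(("add", new_lines[j1:j2]))  # only new text is stored
        # "delete" needs no instruction: those old lines are simply not copied
    return delta

def apply_delta(old_lines, delta):
    """Rebuild the newer version from the older one plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out

class Archive:
    """Hypothetical in-memory stand-in for the version repository."""

    def __init__(self):
        self.base = {}    # url -> lines of the first version, stored whole
        self.deltas = {}  # url -> list of deltas, one per later version

    def check_in(self, url, text):
        """Store text if it differs from the latest version; return the latest version number."""
        lines = text.splitlines()
        if url not in self.base:
            self.base[url] = lines
            self.deltas[url] = []
            return 1
        latest = self.check_out(url, len(self.deltas[url]) + 1)
        if lines != latest:
            self.deltas[url].append(make_delta(latest, lines))
        return len(self.deltas[url]) + 1

    def check_out(self, url, version):
        """Reconstruct the requested version by replaying deltas from the base."""
        lines = self.base[url]
        for delta in self.deltas[url][: version - 1]:
            lines = apply_delta(lines, delta)
        return lines
```

Checking in an unchanged page returns the existing version number without storing anything, which mirrors the observation that storage overhead stays small unless a page changes in many respects at once.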
Drawbacks to the "external service" approach are that the service must remember the state of every page that anyone who uses the service has indicated an interest in and must know which user has seen which version of each page. The first issue is primarily one of resource allocation, and is not expected to be a significant issue unless the service is used by a great many clients on a number of large pages. The second issue is addressed by using RCS's support for datestamps and requesting a page as it existed at a particular time. Alternatively, a version number could be retained for each <user, URL> combination. Relative links become a problem when a page is moved away from the machine that originally provided it. If the source were passed along unmodified, then the W.sup.3 browser would consider links to be relative to the CGI directory containing the snapshot script. HTML supports a BASE directive that makes relative links relative to a different URL, which mostly addresses this problem; however, Netscape 1.1 N treats internal links within such a document to be relative to the new BASE as well, which can cause the browser to jump between the htmldiff output and the original document unexpectedly. The snapshot facility must address four important issues: use of CGI, synchronization, resource utilization, and security/privacy. CGI is a problem because there is no way for snapshot to interact with the user and the user's browser, other than by sending HTML output. When a CGI script is invoked, httpd sets up a default timeout, and if the script does not generate output for a full timeout interval, httpd will return an error to the browser. This was a problem for snapshot because the script might have to retrieve a page over the Internet and then do a time-consuming comparison against an archived version. The server does not tell snapshot what a reasonable timeout interval might be for any subsequent retrievals; instead this is hard-coded into the script. 
In order to keep the HTTP connection alive, snapshot forks a child process that generates one space character (ignored by the W.sup.3 browser) every several seconds while the parent is retrieving a page or executing htmldiff. Synchronization between simultaneous users of the facility is complicated by the use of multiple files for bookkeeping. The system must synchronize access to the RCS repository, the locally cached copy of the HTML document, and the control files that record which version of each page a user has seen. Currently this is done by using UNIX file locking on both a per-URL lock file and the per-user control file. Ideally the locks could be queued such that if multiple users request the same page simultaneously, the second snapshot process would just wait for the page and then return, rather than repeating the work. This is not so important for making snapshots, in which case a proxy-caching server can respond to the second request quickly and RCS can easily determine that nothing has changed, but there is no reason to run htmldiff twice on the same data. The latter point relates to the general issue of resource utilization. Snapshot has the potential to use large amounts of both processing and disk space. The need to execute htmldiff on the server can result in high processor loads if the facility is heavily used. These loads can be alleviated by caching the output of htmldiff for a while, so many users who have seen version N and N+1 of a page could retrieve htmldiff(page.sub.N,page.sub.N+1) with a single invocation of htmldiff. The facility could also impose a limit on the number of simultaneous users, or replicate itself among multiple computers, as many W.sup.3 services do. Disk space is potentially a problem if the repository can grow without bound and with no cost to its users. 
In fact, before a service like this could be placed on the Internet, it would have to authenticate each user and limit the user to a fixed number of URLs and/or disk blocks. Most likely, one would use an Internet commerce facility to charge a fee in exchange for permission to store a collection of URLs: this fee could easily offset the cost of the storage medium since it would also be paying for the differencing service. Lastly, security and privacy are important. Because the CGI scripts run with minimal privileges, from an account to which many people have access, the data in the repository is vulnerable to any CGI script and any user with access to the CGI area. Data in this repository can be browsed, altered, or deleted. In order to use the facility one must give an identifier (currently one's email address, which anyone can specify) that is used subsequently to compare version numbers. Browsing the repository can therefore indicate which user has an interest in which page, how often the user has saved a new checkpoint, and so on. By moving to an authenticated system on a secure machine, one could break some of these connections and obscure individuals' activities while providing better security. The repository would associate impersonal account identifiers with a set of URLs and version numbers, and passwords would be needed to access one of these accounts. Whoever administers this facility, however, will still have information about which user accesses which pages, unless the account creation can be done anonymously. So far, only a small fraction of pages on the W.sup.3 contain information that allows users to ascertain how the pages have changed-examples include icons that highlight recent additions, a link to a "changelog", or a special "what's new" page. As was mentioned in the introduction, these approaches suffer from deficiencies. 
They are intended to be viewed by all users, but users will visit the pages at different intervals and have different ideas of "what's new". In addition, the maintainer must explicitly generate the list of recent changes, usually by manually marking up the HTML. Automatic comparison of HTML pages and generation of marked-up pages frees the HTML provider from having to determine what's new and creating new or modified HTML pages to point to the differences. There are many ways to compare documents and many ways to present the results. HTML separates content (raw text) from markups. While many markups (such as <P>, <I>, and <HR>) simply change the formatting and presentation of the raw text, certain markups such as images (<IMG src= . . . >) and hypertext references (<A href = . . . >) are "content-defining." Whitespace in a document does not provide any content (except perhaps inside a <PRE>), and should not impact comparison. At one extreme, one can view an HTML document as merely a sequence of words and "content-defining" markups. Markups that are not "content-defining" as well as whitespace are ignored for the purposes of comparison. The fact that the text inside <P> . . . </P> is logically grouped together as a paragraph is lost. As a result, if one took the text of a paragraph comprised of four sentences and turned it into a list (<UL>) of four sentences (each starting with <LI>), no difference would be flagged because the content matches exactly. At the other extreme, one can view HTML as a hierarchical document and compare the parse tree or abstract syntax tree representations of the documents, using sub-tree equality (or some weaker measure) as a basis for comparison. In this case, a subtree representing a paragraph (<P> . . . </P>) might be incomparable with a subtree representing a list (<UL> . . . </UL>). The example of replacing a paragraph with a list would be flagged as both a content and format change. 
An HTML document is viewed as a sequence of sentences and "sentence-breaking" markups (such as <P>, <HR>, <LI>, or <H1>) where a "sentence" is a sequence of words and certain (non-sentence-breaking) markups (such as <B> or <A>). A "sentence" contains at most one English sentence, but may be a fragment of an English sentence. All markups are represented and are compared, regardless of whether or not those markups are "content-defining." In the paragraph-to-list example, the comparison would show no change to content, but a change to the formatting. Hirschberg's solution to the longest common subsequence (LCS) problem, as shown in "D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341-343, June 1975" and in "D. S. Hirschberg. Algorithms for the longest common subsequence problem. Journal of the ACM, 24(4):664-675, October 1977", is applied (with several speed optimizations) to compare HTML documents. This is the well-known comparison algorithm used by the Unix diff utility in "J. W. Hunt and M. D. McIlroy. An algorithm for differential file comparison. Technical Report Computing Science TR#41, Bell Laboratories, Murray Hill, N.J., 1975". The LCS problem is to find a (not necessarily contiguous) common subsequence of two sequences of tokens that has the longest length (or greatest weight). Tokens not in the LCS represent changes. In Unix diff a token is a textual line and each line has weight equal to 1. In htmldiff a token is either a sentence-breaking markup or a sentence, which consists of a sequence of words and non-sentence-breaking markups. Note that the definition of sentence is not recursive; sentences cannot contain sentences. A simple lexical analysis of an HTML document creates the token sequence and converts the case of the markup name and associated (variable,value) pairs to upper-case; parsing is not required.
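The lexical analysis described above can be approximated in a few lines. The sketch below is not taken from html.lex or html_diff.sml: it is a Python illustration in which the names (SENTENCE_BREAKING, markup_name, tokenize) are hypothetical, the set of sentence-breaking markups contains only the four named in the text, and a sentence is assumed to end at a word that finishes with ".", "!" or "?".

```python
import re

# Only the markups the text names as sentence-breaking; a real set would be larger.
SENTENCE_BREAKING = {"P", "HR", "LI", "H1"}

# A piece is either a whole tag or a run of non-space text outside tags.
PIECE = re.compile(r"<[^>]*>|[^<\s]+")

def markup_name(tag):
    """Return the upper-cased element name of a tag such as '</p>' or '<a href=x>'."""
    found = re.match(r"</?\s*([A-Za-z][A-Za-z0-9]*)", tag)
    return found.group(1).upper() if found else ""

def tokenize(html):
    """Split HTML into ('break', markup) and ('sentence', items) tokens."""
    tokens, sentence = [], []

    def flush():
        # Close the sentence collected so far, if any.
        if sentence:
            tokens.append(("sentence", tuple(sentence)))
            sentence.clear()

    for piece in PIECE.findall(html):
        if piece.startswith("<"):
            # Upper-case the markup and collapse internal whitespace.
            tag = " ".join(piece.upper().split())
            if markup_name(piece) in SENTENCE_BREAKING:
                flush()
                tokens.append(("break", tag))
            else:
                sentence.append(tag)  # e.g. <B> or <A ...> stays inside the sentence
        else:
            sentence.append(piece)
            if piece[-1] in ".!?":    # assumed end-of-sentence heuristic
                flush()
    flush()
    return tokens
```

For example, tokenize("<p>Hello <b>big</b> world. Second one</p><hr>") yields a <P> break, a five-item sentence, the fragment ("Second", "one"), a </P> break, and an <HR> break. Whitespace never becomes a token, so reflowing the source text does not register as a change.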
It is now described how the weighted LCS algorithm compares two tokens and computes a non-negative weight reflecting the degree to which they match (a weight of 0 denotes no match). Sentence-breaking markups can only match sentence-breaking markups. They must be identical (modulo whitespace, case, and reordering of (variable,value) pairs) in order to match (see section 4.3 for a discussion of the ramifications of this). A match has weight equal to 1. Sentences can match only sentences, but sentences need not be identical to match one another. Two steps are used to determine whether or not two sentences match. The first step uses sentence length as a comparison metric. Sentence length is defined to be the number of words and "content-defining" markups such as <IMG> or <A> in a sentence. Markups such as <B> or <I> are not counted. If the lengths of two sentences are not "sufficiently close," then they do not match. Otherwise, the second step computes the LCS of the two sentences (where words matching exactly against words are assigned weight 1, and markups match exactly against markups, as before). Let W be the number of words and content-defining markups in the LCS of the two sentences and let L be the sum of the lengths of the two sentences. If the percentage (2*W)/L is sufficiently large, then the sentences match with weight W. Otherwise, they do not match. The comparison algorithm outlined above yields a mapping from the tokens of the old document to the tokens of the new document. Tokens that have a mapping are termed "common"; tokens that are in the old (new) document but have no counterpart in the new (old) are "old" ("new"). "Old" and "new" tokens are referred to as "differences".
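The two-step sentence test can be made concrete with a sketch. Several things here are assumptions rather than details from the text: the function names are hypothetical, the cut-offs max_length_ratio and min_similarity are placeholders for "sufficiently close" and "sufficiently large" (which the text does not quantify), closing tags are not counted toward length, and a plain quadratic-space dynamic program replaces Hirschberg's linear-space algorithm. The LCS is taken over all items and W is counted afterwards, which is a simplification.

```python
import re

CONTENT_DEFINING = {"A", "IMG"}  # the two examples named in the text

def counts(item):
    """True for words and content-defining markups, False for markups like <B>."""
    if not item.startswith("<"):
        return True
    found = re.match(r"<([A-Za-z0-9]+)", item)  # closing tags do not match
    return bool(found) and found.group(1).upper() in CONTENT_DEFINING

def length(sentence):
    """Sentence length: number of words plus content-defining markups."""
    return sum(1 for item in sentence if counts(item))

def lcs(a, b):
    """Return one longest common subsequence of a and b (quadratic space)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

def match_weight(s1, s2, max_length_ratio=2.0, min_similarity=0.5):
    """Return W if the sentences match, or 0 if they do not."""
    len1, len2 = length(s1), length(s2)
    if len1 == 0 or len2 == 0:
        return 0
    if max(len1, len2) / min(len1, len2) > max_length_ratio:
        return 0  # step 1: lengths are not sufficiently close
    w = length(lcs(s1, s2))  # step 2: W counts words and content-defining markups
    return w if (2 * w) / (len1 + len2) >= min_similarity else 0
```

With these placeholder cut-offs, the sentences "The quick <B>brown</B> fox jumps." and "The quick red fox jumps high." have lengths 5 and 6 and share the three words "The", "quick" and "fox", so (2*W)/L = 6/11 and they match with weight 3.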
Below are listed and described the three basic ways to present the differences by creating HTML documents that highlight the differences with a variety of markup techniques:
Side-by-Side
A side-by-side presentation of the documents with common text vertically synchronized is a very popular and pleasing way to display the differences between documents (see, for example, Unix sdiff or SGI's graphical diff tool gdiff). Unfortunately, there is no good mechanism in place with current HTML and browser technology that allows such synchronization (although it might be possible to make a document that contained a table with a document per column in which rows of the table were used to achieve synchronization).
Only Differences
Show only differences (old and new) and eliminate the common part (as done in Unix diff). This optimizes for the "common" case, where there is much in common between the documents. This is especially useful for very large documents but can be confusing because of the loss of surrounding common context. Another problem with this approach is that an HTML document comprised of an interleaving of old and new fragments might be syntactically incorrect.
Merged-page
Create an HTML page that summarizes all of the common, new, and old material. This has the advantage that the common material is displayed just once (unlike the side-by-side presentation). However, incorporating two pages into one again raises the danger of creating syntactically or semantically incorrect HTML (consider converting a list of items into a table, for example).
The preference is to present the differences in the merged-page format to provide context and use internal hypertext references to link the differences together in a chain so the user can quickly jump from difference to difference.
The syntactic/semantic problem of merging is currently dealt with by eliminating all old markups from the merged page (note that this doesn't mean all markups in the older document, just the ones classified as "old" by the comparison algorithm). As a result, old hypertext references and images do not appear in the merged page (of course, since they were deleted they may not be accessible anyway). However, by reversing the sense of "old" and "new" one can create a merged page with the old markups intact and the new deleted. A more Draconian option would be to leave out all old material. In this case, there are no syntactic problems given that the most recent page is syntactically correct to begin with; the merged page is simply the most recent page plus some markups to point to the new material. Other ways to create a merged page are being explored. An example of htmldiff's merged-page output appears in FIG. 1. Markups are used to highlight old and new material as follows. Two small arrow images are used to point to areas in the document that have changed. A red arrow points to old content and a green arrow points to new content. The arrows are also internal hypertext references to one another, linked in a chain to allow quick traversal of the differences. A banner at the front of the document contains a link to the first difference. Old text is displayed in "struck-out" font using <STRIKE>, which is rarely used in HTML found on the W.sup.3. Unfortunately, there is no ideal font for showing "new" text. Currently <STRONG><I> is used. Ideally, it would be desirable to color code the text or text background to highlight old and new text, but this capability is not provided by current browsers. Another approach would be to choose a font that is not active at the point of the difference. Note that not all changes in the documents are highlighted. For example, new markups that are not "content-defining" (such as <P>) are not marked up.
However, markups such as anchors are highlighted. Consider the example of changing the URL in an anchor but not the content surrounded by <A> . . . </A>. In this case, an arrow will point to the text of the anchor, but the text itself will be in its original font, signifying a change to just the URL. Since htmldiff can parse an HTML document and rectify certain syntactic problems, such as mismatched or missing markups, the only real problem it is likely to encounter is a set of changes that are so pervasive as to make the resulting merged HTML unreadable. For instance, if every other line were changed, then the mixture of unrelated struck-out and emphasized text would be muddled. Methods are being explored for varying the degree to which old and new text can be interspersed, as well as thresholds to specify when the changes are too numerous to display meaningfully. Currently, htmldiff is neither "version-aware" nor "web-aware". That is, htmldiff only compares the text of two HTML pages. It does not compare versions of the entities that the pages refer to, access them, or invoke itself recursively on other referenced pages. This has a number of consequences. The good news is that htmldiff does not incur the overhead of pulling versions from a repository or sending requests over the W.sup.3 for information. This cost is borne by w3newer and snapshot. The bad news is that some differences may be ignored. For example, if the contents of an image file are changed but the URL of the file is not, then the URL in the page will not be flagged as changed. To support such comparison would require some sort of versioning of referenced entities and would also require htmldiff to have access to the version repositories. Full versioning of all entities would allow interesting comparisons to be done, but would dramatically increase storage requirements.
A cheaper alternative would be to store a checksum of each entity and use the checksums to determine if something has changed. How to perform such "smarter" comparisons efficiently is being explored. There are two entry points to NO HANDS, one through w3newer and one through snapshot. Currently, w3newer is invoked directly by the user, probably by a crontab entry, and generates an HTML document indicating which pages have changed. If specified, w3newer will associate three links with each document in the hotlist:
Remember: Send the URL to the snapshot facility, to save a copy of the page. Though the page is retrieved, the RCS ci command ensures that it is not saved if it is unchanged from the previous time it was stored away.
Diff: Have the snapshot facility invoke htmldiff to display the changes in a page since it was last saved away by the user.
History: Have snapshot display a full log of versions of this page, with the ability to run htmldiff on any pair of versions or to view a particular version directly. (See FIG. 2.)
Thus, each page that is reported as "new" can immediately be passed to htmldiff, and any page in the list can be "remembered" for future use. An example of w3newer's output appears in FIG. 3. A user may also choose to enter snapshot directly to check in pages, or view the current page or the version history. FIG. 4 shows the interface to NO HANDS through snapshot. If the user selects the history link, the page shown in FIG. 2 is presented. Finally, selecting two pages to compare invokes htmldiff as in FIG. 1. One disadvantage of the current approach is that there is no direct interaction between w3newer, snapshot, and the W.sup.3 browser. Viewing a page with htmldiff does not cause the browser to record that the page has just been seen; instead, the browser records the URL that was used to invoke htmldiff in the first place.
Subsequently, w3newer uses the obsolete datestamp from the browser and continues to report that the page has been modified more recently than the browser has seen it. As a result, the user must view a page directly as well as via htmldiff in order to both remove it from the list of modified pages and see the actual differences. This section describes some possible extensions to the work already presented. Section 6.1 discusses an interface between RCS and htmldiff that is already implemented, while Sections 6.2 and 6.3 present unimplemented extensions to integrate tracking modifications into the server and to invoke scripts via the HTTP POST protocol. The tools described above do not require any changes to arbitrary servers or clients on the W.sup.3. Existing GET and POST protocols are used to communicate with specific servers that save versions of documents and provide marked-up versions showing how they have changed. However, if a server runs htmldiff and some perl scripts, it can provide a direct version-control interface and avoid the need to store copies of its HTML documents elsewhere. The perl scripts written so far provide an interface to RCS as shown in "W. Tichy. RCS: a system for version control. Software-Practice & Experience. 15(7):637-654, July 1985". A CGI script (/cgi-bin/rlog) converts the output of rlog into HTML, showing the user a history of the document with links to view any specific version or to see the differences between two versions. Another script (/cgi-bin/co) displays a version of a document under RCS control, while still another (/cgi-bin/rcsdiff) displays the differences. If the file's name ends in html then htmldiff is used to display the differences, rather than the rcsdiff program. As an example, one might set up a Last-Modified field at the bottom of an HTML document to be a link to the rlog script, with the document name specified as a parameter.
After clicking on this unobtrusive field, the user would be able to see the history of the document. Currently, w3newer runs on the user's machine, so multiple instantiations of the script may perform the same work. Although it runs a related daemon on the same machine as an AT&T-wide proxy-caching server, which returns information about pages that are currently cached on the server and may eliminate some accesses over the Internet, there is insufficient locality in that cache for it to eliminate a significant fraction of requests. Alternatively, w3newer could be run on the set of pages that have been saved by the snapshot daemon. Regardless of how many users have registered an interest in a page, it need only be checked once: if changed, the new version could be saved automatically. Then a user could request a list of all pages that have been saved away, and get an indication of which pages have changed since they were saved by the user. Adding this functionality would be useful, since it would offer economies of scale. It would have the disadvantage of being decoupled from a given user's W.sup.3 browser history; i.e., if a user views a page directly, the snapshot facility would have no indication of this and might present the page as having been modified. Because NO HANDS can handle arbitrary URLs, it can interact with CGI scripts that use the GET protocol by passing arguments to the script as part of the URL. However, services that use POST cannot be accessed, because the input to the services is not stored. Both w3newer and snapshot would have to be modified to support the POST protocol, in order to invoke a service and see if the result has changed, and then to store away the result and display the changes if it has. The interface to NO HANDS to support POST is unclear, however. A user could manually save the source to an HTML form and change the URL the form invokes to be something provided by NO HANDS. 
It, in turn, would have to make a copy of its input to pass along to the actual service. The result would be an HTTP equivalent of a UNIX pipe, interposing an extra service between the browser and the service the user is trying to invoke. Instead, the browser could be modified to have better support for forms: It should store the filled-out version of a form in its bookmark file, so the user could jump directly to the output of a CGI script. It should be able to pass a form directly to NO HANDS, along with the URL specified in the FORM tag, so that the output could be stored under RCS. NO HANDS combines notification, archiving, and differencing of W.sup.3 pages into a single cohesive tool. It achieves economies of scale by avoiding unnecessary HTTP accesses, saving pages at most once each time they are modified (regardless of the number of users who track them), and using RCS as the underlying versioning system. Automatic generation of differences within the HTML framework provides users with the ability to see both insertions and deletions in a convenient fashion. In the general setting of the W.sup.3 and document retrieval, NO HANDS benefits two communities: users of the W.sup.3 no longer have to browse to find pages of interest that have changed; HTML providers no longer have to create suitably marked-up pages to show "what's new". While such automation is clearly helpful in this general context, it is expected that NO HANDS will be a critical part of more focused uses of the W.sup.3, especially in areas involving collaborative and distributed work. Several issues still need to be addressed. In particular, many of the complications of NO HANDS could be avoided by better integration with W.sup.3 browsers and servers. For instance, viewing the difference between an older version of a page and its current version should update the browser's notion of when the page was last visited.
Finally, the increasing availability of distributed, hierarchical HTTP repositories such as shown in "C. Mic Bowman et al. Harvest: A scalable, customizable discovery and access system. Technical Report CU-CS-732-94, Dept. of Computer Science, University of Colorado--Boulder, March 1995", will be both an opportunity and a challenge for scalable notification mechanisms and version archives. Numerous substitutions and modifications can be undertaken without departing from the true spirit and scope of the invention. What is desired to be secured by Letters Patent is the invention as defined in the following claims.
