Electronic document processing system and method for merging source documents on a node-by-node basis to generate a target document6772165Abstract A target document (25) is generated by merging four source documents (2-5). There are three merge operations, an operation (10) for two source documents (2, 13), an operation (13) for an intermediate target document and a source document (4), and an operation (21) for a second intermediate target document and a final source document (5). In each merge operation one source document inherits from the other. An inheriting instruction is embedded within the inheriting document. Merging is performed by parsing a document (6, 8, 11, 20) into a hierarchical tree if it is not already in that form, and merging the trees. Matching nodes are identified and are combined or replaced according to a policy. Claims What is claimed is: Description FIELD OF THE INVENTION
M0 M0 precedes M4. M4 is a descendent of F2, so M2
must precede F2
M4 Matches F9
M5 is a child of M4
M6-M8 are children of M5
M11, M12, M13 precede M18, and M18 is a descendent of F16 - so
M14-M17 M11-17 must precede F16
M2 matches F16
M3 child of M2
M9 child of M2
M10 child of M9
M19 M19 is child of M18, M18 matches F13, so M19 is a
child of the M18/F13 composite node and is placed
after the children inherited from F13.
M20 child of M19
M21-23 children of M20
M24 child of M18
The document merging method 1 permits source documents containing separate strands of content to be combined to generate a target document. The word "strand" means content which is associated, for example, an image and an associated text paragraph. The source documents and the target document may each observe different syntax rules. For example, the inheriting source document may contain xHTML, the inherited source document XML and the target document could be HTML. The process can be applied to files of any block-structured language and the resulting file can be translated into another language. There is thus excellent versatility for combining content. The method is not symmetrical, i.e. a document A merged with a document B produces a different result from document B merged with document A. The merging step is achieved by logically adding nodes of the documents in the movable role into the structure of the document in the fixed role, at appropriate location. This process preserves the order of elements of the document in the fixed role but may not preserve the order of elements from the document in the movable role. A document that inherits in the fixed role will retain its overall structure. Inheriting in the fixed role would be useful for example in the case of a HTML page which supports multiple languages. Such a page could inherit from a different parent depending on the language requested. Each language supported can be defined in a separate inherited document and merged in the movable role to generate a final document in the user's choice of language. Regarding combining of matching nodes, the source tree root nodes are always considered to be matching nodes. Other matching nodes may be identified by a number of means. In this embodiment element identifiers (names) indicate nodes to be merged. Matching nodes may alternatively be identified as such by reference to grammar or structure rules stored externally to the document, or syntax within the documents. Another method for identifying matching nodes is to match on known elements in the structure of a document. For example a HTML document will always have a html element and usually have a head and a body element. It is known in advance that both documents being merged are html documents then the system can treat these nodes as matching nodes. Another method is based on paths to the nodes, possibly using a tool based on the XPATH W3C standard. There may also be multi-matching, in which one fixed tree node matches multiple movable tree nodes. This may be indicated by wild-card characters ("patterns"). An application of this mechanism is treating the result of a database query as a movable role tree. The composite node is created from a mechanism appropriate to the nodes being processed. For example in HTML and XML a node comprises a tag and a list of named attributes or a string of text content. For example: <table id="myTable" border="2"> the word "table" is die tag and the attributes are "id" with a value of "myTable" and "border" with a value of "2". The process of creating a composite node can favour either the node from the inheriting tree or the node from the inherited tree. One mechanism for merging HTML/XML tags is that the composite node takes the tag from the node of the inherited tree and adds the lists of attributes together. Where the node of the inherited tree and the node of the inheriting tree both have the same attributes, the value of the attribute from the inherited tree node is used. This mechanism is said to favour the inherited tree. An alternative mechanism favours the inheriting tree--i.e. it uses the attribute value from the inheriting tree and where both have the same attribute. Another mechanism is to use an attribute in one or other node to indicate which node is to be favoured. For example, and attribute inherit_tag="yes" indicates that the composite node is to be constructed by favouring the inherited document node. A refinement of this mechanism allows the system to favour the tag of the inherited node and the attributes of the inheriting node or vice-versa. This can be achieved using a second attribute, for example an "overwrite_attributes"="yes" or "no". This attribute value can indicate if tag attributes are to be overwritten with values from the inherited element. Alternatively, attribute values may be concatenated. The default behaviour is that a composite element inherits the tag of the inherited node, but does not overwrite the attributes of the inheriting node i.e. inherit_tag="yes", overwrite_attributes="no". In the following description the term "fixed node" refers to a node from the tree that is in the fixed role for the merge operation. The term "movable node" refers to the node from the tree that is in the movable role. The term "fixed children" means the child nodes of the "fixed node". The term "movable children" means the child nodes of the "movable node". The following are the rules for the merge steps. Trees are merged by identifying the matching nodes. The root nodes of the two trees are always treated as matching nodes. The root node of the target tree is a composite of the root nodes of the two source trees. The order relationship of all nodes of the fixed tree is preserved. A pair of matching nodes causes a single composite node to be added to the target tree in the location of the matching node from the fixed tree. The composite node may be constructed from information in the movable node, the fixed node, or both. A matching node (and the sub-tree of which it is the root) from the movable tree is merged into the location of its matching node from the fixed tree. If the fixed tree contains a placeholder node, then the movable children of that node are placed in the position of the placeholder. A placeholder node is described below. A non-matching node from the movable tree is added to the target tree as a child of the node that represents its parent from the movable tree. If it is a child of the root node it is placed as a child of the root in the target tree. If its parent is a matching node, then it is placed after the child nodes from the matching node of the fixed tree, unless this order is modified by a placeholder node or another matching node. Where a node has more than one ancestor that is a matching node, it is placed relative to the nearest matching ancestor in the target tree. The order of non-matching nodes from the movable tree is preserved unless modified by the presence of a matching node or a placeholder. Where there are no matching nodes or placeholders affecting the order of children, the list of children of a composite node will normally be the list of children from the fixed node and the list of children from the movable node concatenated so that the fixed children precede the movable children. A "placeholder" node in the fixed tree is a dummy node which indicates that the nodes preceding it are to precede the nodes from the movable tree in the resulting list, whilst the nodes succeeding it are to succeed the nodes from the movable tree. Effectively, the children of the movable node replace the placeholder node. If a placeholder node has a name that matches one of its ancestor nodes (i.e. one of the nodes above it in the tree structure), then this indicates that the placeholder node is to be replaced by the children nodes of the named ancestors matching nodes of the movable tree. If the placeholder has no name, default behaviour may be defined. An example of possible default behaviour is to treat it as if it has the name of its immediate parent. In the preferred XML embodiment of this invention, the placeholder element is the "<extend/>" element. A placeholder node may indicate which movable role document it relates to. This can be achieved by means of an additional control setting for the node. This can be an attribute of the XML tag, such as: <extend src="parent list">. The default behaviour in the absence of this attribute would be that the placeholder relates to all documents inherited in the movable role. In some applications it may be more appropriate that elements are inherited only where they do not exist in the inheriting document, or that matching elements are completely replaced by the element from the inherited document rather than merged. These alternative ways of performing inheritance are referred to as "inheritance policies". Eight inheritance policies are defined: prefer_inherited; prefer_inheriting; merge; merge_only; lookup_policy, prefer_fixed, prefer_movable, and ignore. The word "Prefer" indicates that when a matching node is encountered, a copy of the node (and sub-tree) only from the preferred tree is added to the target tree. "Merge" indicates that the nodes and sub-trees are to be merged as described above. The "merge_only" policy indicates that a node is inherited and merged only if it occurs in both the inherited and inheriting documents. In the lookup_policy the movable role tree is treated as a resource from which the fixed role tree picks matching nodes and merges these and their children into its structure. Elements which do no match are not merged. Support for different merge policies is important because different types of content may require different handling. For example, executable script content should not be merged, whereas other HTML layout content may be merged. The system makes no assumptions about the content of the documents being merged. For example: the file "buttons.html" shown in FIG. 3(a) produces a single tree (FIG. 3(b)). This is acceptable for the purposes of this invention. However, the buttons.html content is not a valid HTML file because it does not contain a topmost html element. If the syntax of the language of a document does not specifically define a single top-most element for a document then this invention may still be applied if a top-level element is synthesised in the parsing step. As shown in FIG. 1, a document may inherit from a list of documents called its "inherit list". Inheritance is in the following order: Recursively perform inheritance on each child of the current node as a separate tree. If the current node inherits from other documents, inherit from each document in the inherit list in turn. This is achieved by creating a composite node fiom the current node and the root node of the document that it is inheriting from. The composite node becomes the root of the target tree. The sub-tree that starts at the current node and the tree represented by the document being inherited from are then merged using the indicated or default inheritance policy, to create the target tree. The target tree is then used as the document that inherits from the next inherited document in the list. Because each child node is processed as a separate tree, its inheritance does not affect its siblings or other nodes in the tree. This is a recursive operation that builds the final document in a bottom-up order. A document may include a number of separate trees (elements). This may be applied if one tree can be identified as the main tree in a document. The other trees (i.e. the ones that are not the main tree) can serve as trees for inheritance directly or indirectly, by elements within the main tree. This arrangement permits a single document to contain reusable elements within it. The merge process may be enhanced by use of parameters. For example, attributes of the inheriting node may serve as parameters for inheritance. Using parameterised inheritance, each instance of inheritance of a document can result in a different target document. This is a particularly useful feature when combined with a script execution engine. In some cases it will be more appropriate to nest compound elements rather than merging them. As an example, frameset elements in HTML should be nested beneath rather than merged into the inheriting element. Element nesting can be invoked by an appropriate syntax in the block-structured language. In HTML or XML this might be by means for the "nestable" attribute added to the open tag of an element. For example: <frameset id="dxe_html_frameset" nestable="yes"> Such settings may be defined externally to the document. Where an element indicates that it inherits from a list of other elements, the element is merged with the first element. The resulting tree is then merged with the second element listed, and so on until the list of elements is exhausted. Another embodiment of this invention could use a different sequence to implement "multiple inheritance". Each inherited document is inherited in a separate merge operation. The inheritance operation is controlled by information that may be stored in the inherited and inheriting documents, or elsewhere. This control information is referred to as the "inheritance attributes" for an element. The inheritance attributes are stored in the documents where the language supports it. If the language cannot support these requirements then it is possible to store the necessary information in separate control documents, files of a database, or to use some static rules to specify and control the inheritance process. The inherit list can be indicated by an attribute to the opening tag of an element such as: <table extends="commonTable.html"> The value of the attribute can be static as described above, or, using an appropriate syntax, can be a formula, expression or script. For example, the following syntax might indicate that the attribute value is to be treated as a script: <table extends="{isNetscape( )?`netscapeTable.html`:`explorerTable.html`}"> The above script is given as an example of a script language statement only. It is intended to be interpreted as "if the browser is Netscape.TM. then inherit from the file netscapeTable.html otherwise inherit from the file explorerTable.html". Multiple files to be specified as "inherited files" for the element. For example, a comma separated list: <table extends="commonTable.html,mainTable.html"> The syntax also permits a document to specify whether inheritance is to be performed in the fixed role or in the movable role. This can be achieved by, for example, using a different attribute to indicate inheritance in each role: for example the "extends" attribute indicates inheritance in the movable role whereas the "use" attribute indicates inheritance in the fixed role. Alternatively there may be a syntax that indicates the role in which inheritance is to be performed. For example, enclosing a document name in parenthesis might indicate that the document is to be in the movable role in the inheritance. There may be a mechanism for a document to indicate which role it is to fulfil in every inheritance. For example, a html document which is used to define the native language of a page may be most useful when it is inherited in the movable role. This can be indicated by means of an inherit_role="movable" attribute in the opening tag of the html element (the outermost element in a html document). Alternatively an element within the head element of the html document may indicate the same thing. Attempting to inherit such a file in the fixed role causes a warning. The inherited file can be identified by a relative or absolute file name, or by means of any other naming convention, for example Internet URLs. The documents indicated in inherit list may be stored on the same computer or on a different computer, or may be retrieved from or delivered by some other type of device, for example, via HTTP. An alternative method to indicate the inherited documents of a document is to use a link tag: for example: <link href="parent url" type="text/html"/> The inherited file does not necessarily have to be a complete document. Because inheritance operates on an element-by-element basis, it is possible to inherit from an individual element in a document, rather than from the complete document. This can be specified by a suitable syntax in the inherit list. In the case of a URL, a suitable syntax already exists--the anchor in a URL follows the file path and is separated from it by the hash character `#`. This is used to indicate the name of an element in a document. For example: <table extends="commonTable.html#ctable"> It is envisaged that this functionality can be extended to allow for replication of elements to provide for merging of tabular information such as the results from database queries. It is also envisaged that use of a pattern matching mechanism as the anchor of an inherited document will provide additional functionality in this context. The commonly known Xpath language is suitable mechanism here because it is designed to describe elements in XML documents. Also, inheritance can be extended by employing patterns (wildcard characters) in a node identifier. As with any URL, a document may reference itself by containing only an anchor, for example "#MYTABLE". The following is an example where the information that controls inheritance (the inheritance attributes) is stored in separate elements.
<table>
<nodeName>myTable</nodeName>
<extends>commonTable.html</extends>
<extends> mainTable.html< /extends>
<inheritance_policy>merge< /inheritance_policy>
</table>
Alternatively, the same information may be expressed as attributes: <table id="mytable" extends="commonTable.html, mainTable.html" merge_policy="merge"> The inheritance attributes used are shown in Table 1 below. Each of these attributes can be defined for every element:
TABLE 1
Inheritance attributes
Attribute name Values
inherit_tag yes or no
overwrite_attributes yes or no
id name for the element
extends list of parents to inherit from
inherit_role fixed or movable
inheritance_policy policy name
nestable yes or no
Where pages are generated in response to queries, such as in an http server for the World Wide Web, information supplied in the query may be used to generate the inherit list. This can be achieved by providing, in the text of the query, the names of the inherited pages. Alternatively, part of the query could be used by a script or other programmed agent to determine the appropriate list by reference to other information available to it from an external source such as a configuration file, a database, or a session object i.e. an object containing information about the user and his/her activity. Where a document has a known structure, then certain default behaviour may be built into the merging process. For example, HTML files have a structure consisting of an html element containing a head element and a body element. A possible shortcut rule would be to assume that the head elements are matching nodes and that the body elements are matching nodes whenever html documents are merged. This would remove the need for the html author to place matching id values in each of the elements. The inheritance operation illustrated in FIGS. 2(a) to 7(c) avails of some additional matching information that is not explicit in the text of the documents. This information is based on the fact that html documents follow a known structure being that the topmost element is an html element and a html element contains a head and a body element. Because of this knowledge the merge procedure treats the html, head and body elements in the child and parent documents as matching nodes. If this knowledge was not available to the merge procedure then the same result could be achieved by putting matching "id" attributes in each tag--for example: <html id="dxe_html"> <head id="dxe_html_head"> <body id="dxe_html_body"> As described above, the complete hierarchical structure of a document is parsed into a tree. In an alternative application only a portion of the document is parsed and lower levels of the hierarchy are ignored. Again using a HTML document as an example, the system might parse a html document to a maximum of three levels--i.e. to the level of individual tags within the head of body element--each element at that level being treated as a leaf of the structure tree. Strands (associated content) defined in separate documents may contain links to other information. These links may be either absolute or relative links. Relative links use as a starting point, the location of the link. If the relative link is copied to a different location--such as happens when inheritance takes place, the relative link will be invalid. Therefore, the invention provides for mapping relative links to absolute links before they are inherited into a new document. A mechanism to disable this mapping is also provided for. Converting a link from a relative link to an absolute one is a simple operation. For example, if a file stored at location www.docland.com/index.html with the value "products.html". This results in the absolute url: www.docland.com/products.html. It is envisaged that the present invention may include added functionality. It is envisaged that for example, content such as JavaScript elements could be pre-processed to locate errors and ensure function and variable names do not clash, and generally to implement more reliable support for the extended functionality provided by this invention. This would further reduce development time for Web sites. It is envisaged that, by using the present invention, software reliability will be improved because the functionality of each strand can be tested discretely. Strands of content can be reused in many documents. Also, a strand once implemented may be re-designed to improve performance or functionality. Errors arising from dependencies between the various parts of a document can be minimised because these interdependent elements can be packaged together and inherited together. It will be appreciated that, where the invention is implemented in a browser (client-side implementation), it can substantially reduce the communications bandwidth required to transmit pages from the server to the client. This benefit arises from the fact that inherited documents can be transferred from the server to the client once and cached on the client machine. The client can reuse this cached copy for each subsequent inheriting page. This benefit will be particularly useful for devices connected by low-bandwidth channels as mobile wireless devices. A major advantage is that the invention provides a way meaningfully combined separate strands of information from different source documents can be used to produce a target document, thus supporting a "component" based system for generating documents. The present invention is not limited to HTML but can be applied to a wide range of block-structured languages for example: WML, XML, XHTML, DHTML or other SGML derivatives. The present invention can also be applied to other block structured languages such as user interface resource files, the programming languages Pascal and C, and also to languages used to define 2 or 3 dimensional layouts and structures or integrated circuits. The invention also applies to languages and data structures used to store scientific, engineering or business information, games, cryptography and other areas of information technology. The present invention provides for a way to compress documents and collections of documents by eliminating duplication of content. It will be further appreciated that when block structured language is discussed the presentation of information in these files encompasses all electronic information. It will also be appreciated that the invention may be applied to documents without actually constructing a complete tree for both documents. An alternative embodiment of this invention may apply the same merging process without constructing a complete hierarchical representation of the documents. It will be further appreciated that the representation of the hierarchical data structure for a document may be stored in a variety of ways such as in random access computer memory, in relational, object oriented, or hierarchical database management systems or in disk files. The invention is not limited to the embodiments described but may be varied in construction and detail.
|
Same subclass Same class Consider this |
||||||||||
