Edit, composition, or storage control

Method and apparatus for efficient management of XML documents

6941510

Abstract

An in-memory storage manager represents XML-compliant documents as a collection of objects in memory. The storage manager allows real-time access to the objects by separate processes operating in different contexts. The data in the objects is stored in memory local to each process and the local memories are synchronized by means of a distributed memory system that stores the data in the same data region, but maps the data region to the address space of each process. Data corruption in the data region is prevented by a locking mechanism that prevents the processes from simultaneously modifying same data.


Claims

1. Apparatus for representing and managing an XML-compliant document in a memory, the XML-compliant document being updated concurrently by a first process having a first address space in the memory and second process having a second address space in the memory, the apparatus comprising:

a first storage manager controlled by the first process that constructs, from class code in the first address space, at least one document object including first data representing a part of the XML-compliant document and stored in a region mapped into the first address space;

a second storage manager controlled by the second process that constructs, from class code in the second address space, at least one document object including second data representing a part of the XML-compliant document and stored in the same region as the first data, but mapped into the second address space;

a synchronization mechanism that locks the region data when the first process is changing the region data in the first address space.

2. Apparatus as recited in claim 1 wherein the second process comprises a mechanism for requesting a copy of the region data from the first address space if the second address space does not have the most recent copy of the region data.

3. Apparatus as recited in claim 1 wherein the second process comprises methods for requesting that the synchronization manager lock the region data when the second process is changing the region data in the second address space.

4. Apparatus as recited in claim 1 wherein the first process can perform read and write operations on the region and wherein the apparatus further comprises a mechanism for grouping a plurality of the read and write operations into a transaction.

5. Apparatus as recited in claim 4 wherein the first process comprises methods for requesting that the synchronization manager lock the region data during the processing of all read and write operations in a transaction.

6. Apparatus as recited in claim 5 further comprising a logging system that periodically writes recovery log entries to a persistent database during the processing of all read and write operations in a transaction.

7. Apparatus as recited in claim 1 wherein the first process comprises a storage mechanism for storing a copy of the region data in a non-volatile store.

8. Apparatus as recited in claim 7 wherein the non-volatile store comprises an object store.

9. Apparatus as recited in claim 7 wherein the non-volatile store comprises a file system.

10. Apparatus as recited in claim 1 wherein the synchronization mechanism comprises a distributed memory system.

11. Apparatus as recited in claim 1 wherein both the first and second address spaces contain equivalent program code for manipulating the first and second document objects.

12. Apparatus as recited in claim 1 wherein the first and second storage manager each construct a cross-process synchronization object that is used to synchronize the first and second processes.

13. A method for representing and managing an XML-compliant document in a memory, the XML-compliant document being updated concurrently by a first process having a first address space in the memory and second process having a second address space in the memory, the method comprising:

(a) using a first storage manager controlled by the first process to construct, from class code in the first address space, at least one document object including first data representing a part of the XML-compliant document and stored in a region mapped into the first address space;

(b) using a second storage manager controlled by the second Process to construct, from class code in the second address space, at least one document object including second data representing a part of the XML-compliant document and stored in the same region as the first data, but mapped into the second address space; and

(c) locking the region data when the first process is changing the region data in the first address space.

14. A method as recited in claim 13 further comprising requesting a copy of the region data from the first address space if the second address space does not have the most recent copy of the region data.

15. A method as recited in claim 13 wherein step (c) comprises locking the region data when the second process is changing the region data in the second address space.

16. Apparatus as recited in claim 13 wherein the first process can perform read and write operations on the region and wherein the method further comprises (d) grouping a plurality of the read and write operations into a transaction.

17. A method as recited in claim 16 wherein step (c) comprises locking the region data during the processing of all read and write operations in a transaction.

18. A method as recited in claim 17 wherein step (c) further comprises periodically writing recovery log entries to a persistent database during the processing of all read and write operations in a transaction.

19. A method as recited in claim 13 further comprising (e) under the control of the first process, storing a copy of the region data in a non-volatile store.

20. A method as recited in claim 19 wherein the non-volatile store comprises an object store.

21. A method as recited in claim 19 wherein the non-volatile store comprises a file system.

22. A method as recited in claim 13 wherein step (c) is performed by a distributed memory system.

23. A method as recited in claim 13 further comprising (f) manipulating the first and second document objects with equivalent program code in both the first and second address spaces.

24. A method as recited in claim 13 further comprising (g) constructing a cross-process synchronization object that is used to synchronize the first and second processes.

25. A computer program product for representing and managing an XML-compliant document in a memory, the XML-compliant document being updated concurrently by a first process having a first address space in the memory and second process having a second address space in the memory, the computer program product comprising a computer usable medium having computer readable program code thereon, including:

program code for using a first storage manager controlled by the first process to construct, from class code in the first address space, at least one document object including first data representing a part of the XML-compliant document stored in the first address space and stored in a region mapped into the first address space;

program code for using a second storage manager controlled by the second process to construct, from class code in the second address space which class code is identical to the class code in the first address space, at least one document object including second data representing a part of the XML-compliant document stored in the second address space and stored in the same region as the first data, but mapped into the second address space; and

program code for locking the region data when the first process is changing the region data in the first address space.


Description

FIELD OF THE INVENTION

This invention relates to storage and retrieval of information and, in particular, to storage and retrieval of information encoded in Extended Markup Language (XML).

BACKGROUND OF THE INVENTION

Modern computing systems are capable of storing, retrieving and managing large amounts of data. However, while computers are fast and efficient at handling numeric data they are less efficient at manipulating text data and are especially poor at interpreting human-readable text data. Generally, present day computers are unable to understand subtle context information that is necessary to understand and recognize pieces of information that comprise a human-readable text document. Consequently, although they can detect predefined text orderings or pieces, such as words in an undifferentiated text document, they cannot easily locate a particular piece of information where the word or words defining the information have specific meanings. For example, human readers have no difficulty in differentiating the word "will" in the sentence "The attorney will read the text of Mark's will.", but a computer may have great difficulty in distinguishing the two uses and locating only the second such use.

Therefore, schemes have been developed in order to assist a computer in interpreting text documents by appropriately coding the document. Many of these schemes identify selected portions of a text document by adding into the document information, called "markup tags", which differentiates different document parts in such a way that a computer can reliably recognize the information. Such schemes are generally called "markup" languages.

One of these languages is called SGML (Standard Generalized Markup Language) and is an internationally agreed upon standard for information representation. This language standard grew out of development work on generic coding and mark-up languages, which was carried out in the early 1970s. Various lines of research merged into a subcommittee of the International Standards Organization called the subcommittee on Text Description and Processing Languages. This subcommittee produced the SGML standard in 1986.

SGML itself is not a mark-up language in that it does not define mark-up tags nor does it provide a markup template for a particular type of document. Instead, SGML denotes a way of describing and developing generalized descriptive markup schemes. These schemes are generalized because the markup is not oriented towards a specific application and descriptive because the markup describes what the text represents, instead of how it should be displayed. SGML is very flexible in that markup schemes written in conformance with the standard allow users to define their own formats for documents, and to handle large and complex documents, and to manage large information repositories.

Recently, another development has changed the general situation. The extraordinary growth of the Internet, and particularly, the World Wide Web, has been driven by the ability it gives authors, or content providers, to easily and cheaply distribute electronic documents to an international audience. SGML contains many optional features that are not needed for Web-based applications and has proven to have a cost/benefit ratio unattractive to current vendors of Web browsers. Consequently, it is not generally used. Instead, most documents on the Web are stored and transmitted in a markup language called the Hypertext Markup Language or HTML.

HTML is a simple markup language based on SGML and it is well suited for hypertext, multimedia, and the display of small and reasonably simple documents that are commonly transmitted on the Web. It uses a small, fixed set of markup tags to describe document portions. The small number of fixed tags simplifies document construction and makes it much easier to build applications. However, since the tags are fixed, HTML is not extensible and has very limited structure and validation capabilities. As electronic Web documents have become larger and more complex, it has become increasingly clear that HTML does not have the capabilities needed for large-scale commercial publishing.

In order to address the requirements of such large-scale commercial publishing and to enable the newly emerging technology of distributed document processing, an industry group called the World Wide Web Consortium has developed another markup language called the Extensible Markup Language (XML) for applications that require capabilities beyond those provided by HTML. Like HTML, XML is a simplified subset of SGML specially designed for Web applications and is easier to learn, use, and implement than full SGML. Unlike HTML, XML retains SGML advantages of extensibility, structure, and validation, but XML restricts the use of SGML constructs to ensure that defaults are available when access to certain components of the document is not currently possible over the Internet. XML also defines how Internet Uniform Resource Locators can be used to identify component parts of XML documents.

An XML document is composed of a series of entities or objects. Each entity can contain one or more logical elements and each element can have certain attributes or properties that describe the way in which it is to be processed. XML provides a formal syntax for describing the relationships between the entities, elements and attributes that make up an XML document. This syntax tells the computer how to recognize the component parts of each document.

XML uses paired markup tags to identify document components. In particular, the start and end of each logical element is clearly identified by entry of a start-tag before the element and an end-tag after the element. For example, the tags

  • document text . . .


The form and composition of markup tags can be defined by users, but are often defined by a trade association or similar body in order to provide interoperability between users. In order to operate with a predefined set of tags, users need to know how the markup tags are delimited from normal text and the relationship between the various elements. For example, in XML systems, elements and their attributes are entered between matched pairs of angle brackets (< . . . >), while entity references start with an ampersand and end with a semicolon (& . . . ;). Because XML tag sets are based on the logical structure of the document, they are easy to read and understand.

Since different documents have different parts or components, it is not practical to predefine tags for all elements of all documents. Instead, documents can be classified into "types" which have certain elements. A document type definition (DTD) indicates which elements to expect in a document type and indicates whether each element found in the document is not allowed, allowed and required or allowed, but not required. By defining the role of each document element in a DTD, it is possible to check that each element occurs in a valid place within the document. For example, an XML DTD allows a check to be made that a third-level heading is not entered without the existence of a second-level heading. Such a hierarchical check cannot be made with HTML. The DTD for a document is typically inserted into the document header and each element is marked with an identifier such as
However, unlike SGML, XML does not require the presence of a DTD. If no DTD is available for a document, either because all or part of the DTD is not accessible over the Internet or because the document author failed to create the DTD, an XML system can assign a default definition for undeclared elements in the document.

XML provides a coding scheme that is flexible enough to describe nearly any logical text structure, such as letters, reports, memos, databases or dictionaries. However, XML does not specify how an XML-compliant data structure is to be stored and displayed, much less efficiently stored and displayed. Consequently, there is a need for a storage mechanism that can efficiently manipulate and store XML-compliant documents.

SUMMARY OF THE INVENTION

In accordance with one embodiment of the invention, an in-memory storage manager represents XML-compliant documents as a collection of objects in memory. The collection of objects allows the storage manager to manipulate the document, or parts of the document with a consistent interface and to provide for features that are not available in conventional XML documents, such as element attributes with types other than text and documents that contain binary, rather than text, information. In addition, in the storage manager, the XML-compliant document is associated with a schema document (which is also an XML document) that defines the arrangement of the document elements and attributes. The storage manager can operate with conventional storage services to persist the XML-compliant document. Storage containers contain pieces of the document that can be quickly located by the storage manager.

In accordance with another embodiment, the storage manager also has predefined methods that allow it to access and manipulate elements and attributes of the document content in a consistent manner. For example, the schema data can be accessed and manipulated with the same methods used to access and manipulate the document content.

In accordance with yet another embodiment, the schema data associated with a document can contain a mapping between document elements and program code to be associated with each element. The storage manager further has methods for retrieving the code from the element tag. The retrieved code can then be invoked using attributes and content from the associated element and the element then acts like a conventional object.

In all embodiments, the storage manager provides dynamic, real-time data access to clients by multiple processes in multiple contexts. Synchronization among multiple processes accessing the same document is coordinated with event-driven queues and locks. The objects that are used to represent the document are constructed from common code found locally in each process. In addition, the data in the objects is also stored in memory local to each process. The local memories are synchronized by means of a distributed memory system that continually equates the data copies of the same element in different processes.

In still another embodiment, client-specified collections are managed by a separate collection manager. The collection manager maintains a data structure called a "waffle" that represents the XML data structures in tabular form. A record set engine that is driven by user commands propagates a set of updates for a collection to the collection manager. Based on those updates, the collection manager updates index structures and may notify waffle users via the notification system. The waffle user may also navigate within the collection using cursors.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic diagram of a computer system on which the inventive storage manager system can run.

FIG. 2 is a block schematic diagram illustrating the relationship of the in-memory storage manager and persistent storage.

FIG. 3 is a block schematic diagram illustrating the representation of an XML document on the storage manager memory as a collection of objects.

FIG. 4A is a block schematic diagram illustrating the components involved in binding code to XML elements.

FIG. 4B is a flowchart showing the steps involved in retrieving program code bound to an element.

FIG. 5 illustrates the relationship of XML text documents and binary sub-documents.

FIG. 6 is a block schematic diagram illustrating the major internal parts of the storage manager in different processes.

FIG. 7 illustrates the mechanism for synchronizing objects across processes.

FIG. 8 is an illustration that shows the major control paths from the storage manager APIs through the major internal parts of the storage manager.

FIG. 9 is an illustration of the storage manager interface constructed in accordance with an object-oriented implementation of the invention.

FIG. 10 is an illustration of the interfaces constructed in accordance with an object-oriented implementation of the invention, that are defined by the storage manager and may be called during the processing of links or element RPCs.

FIG. 11 is an illustration of the database and transaction interfaces constructed in accordance with an object-oriented implementation of the invention.

FIG. 12 is an illustration of the document and element interfaces constructed in accordance with an object-oriented implementation of the invention.

FIG. 13 is an illustration of the element communication and synchronization interfaces constructed in accordance with an object-oriented implementation of the invention.

FIG. 14 is an illustration that shows the major control paths from the collection manager APIs through the major internal parts of the collection and storage managers.

FIG. 15 is an illustration of the collection manager interfaces constructed in accordance with an object-oriented implementation of the invention.

DETAILED DESCRIPTION

FIG. 1 illustrates the system architecture for an exemplary client computer 100, such as an IBM THINKPAD 600®, on which the disclosed document management system can be implemented. The exemplary computer system of FIG. 1 is discussed only for descriptive purposes, however, and should not be considered a limitation of the invention. Although the description below may refer to terms commonly used in describing particular computer systems, the described concepts apply equally to other computer systems, including systems having architectures that are dissimilar to that shown in FIG. 1 and also to devices with computers in them, such as game consoles or cable TV set-top boxes, which may not traditionally be thought of as computers.

The client computer 100 includes a central processing unit (CPU) 105, which may include a conventional microprocessor, random access memory (RAM) 110 for temporary storage of information, and read only memory (ROM) 115 for permanent storage of information. A memory controller 120 is provided for controlling system RAM 110. A bus controller 125 is provided for controlling bus 130, and an interrupt controller 135 is used for receiving and processing various interrupt signals from the other system components.

Mass storage may be provided by diskette 142, CD-ROM 147, or hard disk 152. Data and software may be exchanged with client computer 100 via removable media, such as diskette 142 and CD-ROM 147. Diskette 142 is insertable into diskette drive 141, which is connected to bus 130 by controller 140. Similarly, CD-ROM 147 can be inserted into CD-ROM drive 146, which is connected to bus 130 by controller 145. Finally, the hard disk 152 is part of a fixed disk drive 151, which is connected to bus 130 by controller 150.

User input to the client computer 100 may be provided by a number of devices. For example, a keyboard 156 and a mouse 157 may be connected to bus 130 by keyboard and mouse controller 155. An audio transducer 196, which may act as both a microphone and a speaker, is connected to bus 130 by audio controller 197. It should be obvious to those reasonably skilled in the art that other input devices, such as a pen and/or tablet and a microphone for voice input, may be connected to client computer 100 through bus 130 and an appropriate controller. DMA controller 160 is provided for performing direct memory access to system RAM 110. A visual display is generated by a video controller 165, which controls video display 170.

Client computer 100 also includes a network adapter 190 that allows the client computer 100 to be interconnected to a network 195 via a bus 191. The network 195, which may be a local area network (LAN), a wide area network (WAN), or the Internet, may utilize general-purpose communication lines that interconnect multiple network devices.

Client computer system 100 generally is controlled and coordinated by operating system software, such as the WINDOWS NT® operating system (available from Microsoft Corp., Redmond, Wash.). Among other computer system control functions, the operating system controls allocation of system resources and performs tasks such as process scheduling, memory management, networking and I/O services.

As illustrated in more detail in FIG. 2, the storage manager 206 resides in RAM 200 (equivalent to RAM 110 in FIG. 1) and provides an interface between an application program 202 which uses XML documents 228 and 230 and the persistent storage 208 in which the documents 228 and 230 are stored. The application 202 can interact with storage manager 206 by means of a consistent application programming interface 204 irregardless of the type of persistent storage 208 used to store the objects. Internally, the storage manager 206 represents each document 210, 218, as a hierarchical series of objects 212-216 and 220-224, respectively. The storage manager 206 can store the documents 210 and 218 in persistent storage 208 as schematically illustrated by arrow 226 using a variety of file systems, such as directory-based file services, object stores and relational file systems.

The inventive system operates with conventional XML files. A complete XML file normally consists of three components that are defined by specific markup tags. The first two components are optional, the last component is required, and the components are defined as follows:
    • 1. An XML processing statement which identifies the version of XML being used, the way in which it is encoded, and whether it references other files or not. Such a statement takes the form:
    • 2. A document type definition (DTD) that defines the elements present in the file and their relationship. The DTD either contains formal markup tag declarations describing the type and content of the markup tags in the file in an internal subset (between square brackets) or references a file containing the relevant markup declarations (an external subset). This declaration has the form:
    • 3. A tagged document instance which consists of a root element, whose element type name must match the document type name in the document type declaration. All other markup elements are nested in the root element.


  • If all three components are present, and the document instance conforms to the document model defined in the DTD, the document is said to be "valid." If only the last component is present, and no formal document model is present, but each element is property nested within its parent elements, and each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string, document instance is said to be "well-formed." The inventive system can work with and generate well-formed XML documents.

    Within the storage manager 206, XML documents are represented by means of data storage partitions which are collectively referred to by the name "Groove Document" to distinguish the representation from conventional XML documents. Each Groove document can be described by a DTD that formally identifies the relationships between the various elements that form the document. These DTDs follow the standard XML format. In addition, each Groove document has a definition, or schema, that describes the pattern of elements and attributes in the body of the document. XML version 1.0 does not support schemas. Therefore, in order to associate a Groove schema document with an XML data document, a special XML processing instruction containing a URI reference to the schema is inserted in the data document. This processing instruction has the form:


  • Some elements do not have, or require, content and act as placeholders that indicate where a certain process is to take place. A special form of tag is used in XML to indicate empty elements that do not have any contents, and therefore, have no end-tag. For example, a


    Where elements can have variable forms, or need to be linked together, they can be given suitable attributes to specify the properties to be applied to them. These attributes are specified in a list. For example, it might be decided that the
      Location ENTITY #REQUIRED
      Size CDATA #IMPLIED
    >


    This tells the computer that the
    XML also permits custom definition statements similar to the #DEFINE statements used with some compilers. Commonly used definitions can be declared of within the DTD as "entities." A typical entity definition could take the form:
    • which defines a file location for the binary document "BinDoc3487." Once such a declaration has been made in the DTD, users can use a reference in place of the full value. For example, the


  • Within the storage manager, each document part is identified by a Uniform Resource Identifier (URI) which conforms to a standard format such as specified in RFC 2396. URIs can be absolute or relative, but relative URIs must be used only within the context of a base, absolute URI. When the document is stored in persistent storage, its parts may be identified by a different STORAGEURI that is assigned and managed by the particular file system in use.

    In accordance with the principles of the invention, within each document part, in the storage manager internal memory is represented by a collection of objects. For example, separate elements in the XML document are represented as element objects in the storage manager. This results in a structure that is illustrated in FIG. 3. In FIG. 3, an illustrative XML document 300 is represented as a collection of objects in storage manager 302. In particular, the XML document 300 contains the conventional XML processing statement 304 which identifies the XML version, encoding and file references as discussed above. Document 300 also contains an XML processing statement 306 which identifies a schema document 320 in storage manager 302 which is associated with the document 300. The illustrative XML document also contains a set of hierarchical elements, including ElementA 308 which contains some text 318, ElementA contains ElementB 310 which has no text associated with it. ElementB also contains ElementC 312, which, in turn, contains two elements. Specifically, ElementC contains ElementD 314 that has an attribute (ID, with a value "foo") and ElementE 316.

    In the storage manager 302, the elements, ElementA—ElementE, are represented as element objects arranged in a hierarchy. In particular, ElementA is represented by ElementA object 322. Each element object contains the text and attributes included in the corresponding XML element. Therefore, element object 322 contains the text 318. Similarly, ElementB 310 is represented by element object 324 and elements ElementC, ElementD and ElementE are represented by objects 326, 328 and 330, respectively. Element object 328, which represents element ElementD, also includes the attribute ID that is included in the corresponding element. Each element object references its child element objects by means of database pointers (indicated by arrows between the objects) into order to arrange the element objects into a hierarchy. There may also be attribute indices, such as index 332 that indexes the ID attribute in element object 328.

    The representation of the XML document 300 by means of an object collection allows the storage manager 302 to manipulate its internal representation of the document 300 with a consistent interface that is discussed in detail below. The storage manager 302 can also provide features that are not available in conventional XML documents, such as collection services that are available via a collection manager that is also discussed in detail below.

    As described above, Groove documents that contain XML data may have a definition, or schema document, that describes the pattern of elements and attributes in the body of the document. The schema document is stored in a distinct XML document identified by a URI. The schema document has a standard XML DTD definition, called the meta-schema, which is shown below: