Universal output constructor for XML queries universal output constructor for XML queries6766330Abstract Methods and apparatus are provided according to the present invention which guarantee that an Extensible Markup Language (XML) query output conforms to a Document Type Definition (DTD) of the user's choice. The present invention allows for: (i) selection of a DTD; (2) integration of one or more XML queries with the DTD; and (iii) in accordance with the provided algorithm, automatic generation of a valid output XML document conforming to the DTD, using the data selected by the XML queries as content of the XML document. Claims What is claimed is: Description FIELD OF THE INVENTION
Term Definition Name given in
repetition `?`, `*`, or `+` previous work
symbol
element name Name used in a element type XML Spec
(ENAME) declaration
choice list a list of cp enclosed by `(` and `)`, XML Spec
(CLIST) and separated by `.vertline.`,
i.e. "(cp .vertline. cp .vertline. . . . .vertline. cp)".
terminal choice a list of "#PCDATA" and ENAMEs, previous work
list (TLIST) each appearing only once, enclosed
by `(` and `)`, and separated by `.vertline.`,
i.e. "(#PCDATA .vertline. ENAME .vertline.
ENAME . . . .vertline. ENAME)"
sequence a list of cp enclosed by `( ` and `)` XML Spec
(SEQ) and separated `, `,
i.e, "(cp, cp , . . . , cp)"
content unit ENAME, CLIST, SEQ, or TCLIST previous work
(CU)
content particle a content-unit followed optionally by XML Spec
(CP) a repetition symbol,
i.e. (Name .vertline. choice .vertline. seq)
(`?` .vertline. `*` .vertline. `+`)?
content spec the part that matches `contentspec` in XML Spec
the DTD production rules. That is,
the part that follows ENAME and
proceeds `>` in a DTD element
type declaration.
children content a content spec that is a choice or XML Spec
spec (children) sequence content unit followed
optionally by a repetition symbol
mixed content a content spec that is a XML Spec
spec (mixed) "(#PCDATA)" subexpression, or a
terminal choice list followed by a `*`
repetition symbol
PCDATA #PCDATA XML Spec
declaration
attribute The part that includes an ENAME, a XML Spec
definition type, e.g., CDATA, ID, IDREF etc.,
(ATD) and a default declaration.
value declaration PCDATA declaration or attribute previous work
definition.
Element type The part that includes a previous work
declaration (ED) "<!ELEMENT", followed by an
ENAME, content-spec, and a ">"
Attribute list The part that includes a previous work
declaration (AD) "<!ATTLIST", followed by an
ENAME, a list of ATDs, and a ">"
DTD declaration element type declaration and attribute previous work
list declaration
DTD construct a DTD declaration, a (sub-expression previous work
of a) content spec, or a (sub-
expression of an) attribute-list
declaration
In addition, column variable is a variable whose value is a column identity. Row variable is a variable whose value is a row of a table. A row variable is defined on a particular table, and can only have a row of that table as its value. Vector variable is a variable whose value is a list of objects, including values (or scalars), variables, and other vectors, enclosed by "<" and ">", e.g., v=<100, <1, "ab", 10>, x>. A detailed description of the invention will now follow. Such descriptions will illustrate the various methods and apparatus of the invention for query and access of data from various data sources as integrated XML documents, for guaranteeing that the query outputs conform to the DTD of users' choice, and for generating XML documents based on the combination of different XML queries. In this invention, as mentioned above, a preferred implementation for the mapping between DTD and relational schema is a DTD source annotation (DTDSA), as described in the above-referenced U.S. Ser. No. 09/466,627. In other words, the mapping between DTD and relational schema is encoded inside the DTD as annotations. The binding and value specifications are defined immediately following the DTD constructs which they are associated with. The method to establish mapping between DTD and underlying data is by associating a value specification with each #PCDATA, attribute definition, and choice list, associating a binding specification with each DTD construct ended with a trailing repetition symbol, and optionally associating binding specifications with other DTD constructs. A value specification is a parameterized formula containing variables, which, when a data object is substituted for each variable in it, produces a text value. A binding specification is a variable and expression pair. The expression can be a list of data objects, a formula that identifies a list of data objects, or a parameterized formula containing variables, which, when a data object is substituted for each variable in it, produces a list of data objects. A more detailed description of the DTDSA mechanism that may be used in a preferred embodiment of the present invention is provided below in a section entitled "DTDSA Mapping" following the detailed description of the present invention. (a) List Producing Function L( ) First, we propose a function L(XQ), which takes the output of an XML query, XQ, and returns the result of the filtering step of the XQ in a list. Assume that XQ has n filtering variables a.sub.i for i=1, . . . ,n, and returns k lists {a.sub.11, a.sub.12, . . . , a.sub.1n }, {a.sub.21, a.sub.22, . . . , a.sub.2n }, . . . , {a.sub.k1, a.sub.k2, . . . a.sub.kn }. L(XQ) returns a list of vectors: {(a.sub.11, a.sub.12, . . . , a.sub.1n), (a.sub.21, a.sub.22, . . . a.sub.2n), . . . (a.sub.k1, a.sub.k2, . . . a.sub.kn)} If XQ uses scalar-based filtering, each a.sub.ij is a scalar. If XQ uses subtree-based filtering, each a.sub.ij is a subtree. (b) Binding Specifications The usage and semantics for binding specification is the same for XML query languages using either scalar or subtree-based filtering. Given a DTD and set of XML queries, we can form a DTDSA that constructs a valid XML document using the data selected, i.e., at the filtering step, of the queries as follows. The filtering step of an XML query can be associated to a binding variable to form a binding specification as follows: ::b:=L(XQ) where XQ is the XML query, b is the binding variable, and a.sub.i for i=1, . . . ,n are filtering variables in XQ. The semantics are as follows: This list of values (vectors) returned by L(XQ) is associated with b using the DTDSA binding specification semantics described herein, i.e., the values are bound to b in turn, one for each instantiation of the annotated DTD construct. The values bound to b can be used in other binding specifications or value specifications. We use the following notation for the references to b: b refers to the whole vector (a.sub.i1, a.sub.i2, . . . a.sub.in), and b.a.sub.j refers to the jth field of the current vector bound to b. If the current vector is the ith in the list returned by L(XQ), then b.a.sub.j =a.sub.ij. (c) XML Queries Using Scalar-based Filtering The usage and semantics of value specifications for XML queries using scalar-based filtering is the same as in the original DTDSA definition described herein. For binding specifications, the usage can include single XML query, multiple XML queries (correlated or uncorrelated), and non-XML queries. 1. One XML query construction: XQ is used as the binding function of a binding specification associated with the root element, each DTD construct such as #PCDATA, choice list, CDATA or enumerate type is annotated by a value specification which is either a b.a.sub.i, or a function on b.a.sub.i. 2. Multiple non-correlated XML query construction: Each XML query is used as the binding function of a binding specification, the binding specification of the functions. 3. Multiple correlated XML query construction: Each XML query is used as the binding function of a binding specification, the binding specification of the functions. The input parameter of an XML query may be of the form b.a.sub.i where b is the binding variable of another binding specification. 4. Construction for multiple correlated and uncorrelated XML and non-XML queries: Same as pure XML query cases, but some query can be non-XML queries. (d) XML Queries Using Subtree-based Filtering The syntax and semantics of the binding specification remains the same. ::b:=L(XQ(C, xs)) where b is the binding variable, XQ is an XML query function, C is an XML query scope, and xs is the query (filtering) string. Note L(XQ) in this case returns {(a.sub.11, a.sub.12, . . . , a.sub.1n), (a.sub.21, a.sub.22, . . . a.sub.2n), . . . (a.sub.k1, a.sub.k2, . . . a.sub.kn)}, where each a.sub.ij is a subtree, and b is bound to a list of vectors of subtrees in turn. Let f( ) be a function that takes references to one or more binding variables as input, and returns a subtree. Let y=f(b.sub.1, b.sub.2, . . . b.sub.m), where b.sub.i, for i=1, . . . m, are references to binding variables. For example, y may be b.a.sub.j. (e) Using Binding Variables in Value Specifications Two types of functions may appear in value specifications: i. g(y) and h(b), where g( ) and h( ) each returns a scalar value. ii. XQ(y, xs) and XQ(b, xs) where XQ( ) is an XML query and xs is some query string. A value specification of the form ":g(y)" or ":h(b)" must annotate a #pcdata, choice list, cdata, or enumerate type, and provide value for the annotated construct. A value specification of the form ":XQ(y, xs)" or ":XQ(b, xs)" must annotate an element E. The subtree ST, returned by XQ, will be used as an instance of E in the final XML document constructed by the methodology of the present invention. The present invention provides two types of policies for using ST as an instance of element E: 1. Test policy: If ST conforms to the portion of DTD described by E and its descendants, ST is accepted as an instance of E. Otherwise, it is rejected, and either some default value is used for the instance, or the query output construction process is aborted. 2. Transform policy: If ST conforms to the portion of DTD described by E and its descendants, ST is accepted as an instance of E. Otherwise, some transformation algorithm is used to transform ST into ST' which conforms to E. (f) Using Binding Variables in Other Binding Specifications Binding variables bound to XML query outputs may be used in other binding specifications as follows: ::x:=L(XQ(b, xs)), or ::x:=L(XQ(f(b), xs)), or ::x:=L(XQ({f.sub.1 (b.sub.i), f.sub.2 (b.sub.2), . . . f.sub.m (b.sub.m)}, xs)) where xs is a query (filtering) string, b, and b.sub.i for i=1, . . . , m are binding variables f, and f.sub.i for i=1, . . . ,m, are functions that can take a list of subtrees and return a subtree. (g) Using Cascade Binding for Nested Filtering/construction Most XML query languages allow construction and filtering stages to be recursively defined. Let F.sub.1 C.sub.1 F.sub.2 C.sub.2 . . . F.sub.n C.sub.n represent a recursively defined query, where F.sub.i is the ith filtering stage and C.sub.i is the ith construction stage, for i=1, 2, . . . , n. Assume that F.sub.1 defines a.sub.11, a.sub.12, . . . , F.sub.2 defines a.sub.21, a.sub.22, . . . , and F.sub.n defines a.sub.n1, a.sub.n2, . . . , and a.sub.ij may be used or referred in F.sub.k, for k>i, and in C.sub.k, for k.gtoreq.i. We can have the following cascade binding: ::x.sub.n :=BF.sub.n . . . ::x.sub.2 :=BF.sub.2 ::x.sub.1 :=BF.sub.1 where BF.sub.i is binding functions, for i=1, 2, . . . , n, and BF.sub.1 returns (defines) <a.sub.11, a.sub.12, . . . >, BF.sub.2 returns <a.sub.21, a.sub.22, . . . >, BF.sub.n returns <a.sub.n1, a.sub.n2, . . . >. Also x.sub.i.a.sub.ij may be a parameter of BF.sub.k, for k>i. Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, a block diagram of the three stages of general XML query languages is depicted. The first stage is the scoping that decides the domain of the query, indicated as block 140. The second stage is the filtering or selection that chooses data to participate in the query, indicated by block 130. The third stage is the construction that builds the XML output from the chosen data, indicated as block 120. These three stages can be recursively defined, as indicated by the arrow 110, to form a nested loop iteration. The result of the three stages is a well-formed XML document. FIG. 2 is a block diagram illustrating a universal output constructor methodology for XML queries according to an embodiment of the present invention. As shown, a DTD 200 can be annotated with different expressions from different query languages, to form a mapping construct DTDSA 312 that guides the access and translation engine 305 to retrieve data from different types of data sources. The user may provide input parameters 310 to the engine, and receive as output a valid XML document 315. The mapping constructs can be written in the original DTDSA mapping expressions 240, the query languages that include the scoping (block 212) and filtering (block 210) stages such as XQL 230, the query languages that include an extra construction stage such as XML-QL 220, and a mixture of all of the above expressions or languages. If the query language includes a construction stage, the third stage shown as the construction stage 120 in FIG. 1 will be ignored. The query languages 220 and 230 may describe the mapping constructs for accessing XML repositories 225 and 235, while the original DTDSA mapping expression 240 may describe those for the relational database systems 245. The original DTDSA mapping expressions include value specifications and binding specifications, for example, as are illustrated and described below in the context of FIGS. 6A through 6E. It is to be appreciated that the universal output constructor is preferably comprised of the engine 305 and the mapping framework DTDSA 312. However, in an alternative embodiment, the universal output constructor may be considered as just comprising the engine 305 which has access to the DTDSA 312. FIGS. 3A and 3B depict a method for generating XML documents based on the new mappings that are included in the DTDSA. The method makes use of the DTDSA method described above with a modification for evaluating XML queries in a binding specification, and assigning a result to the binding variables, as shown in block 350. It should be understood that the elements shown in FIG. 3B may be implemented in various forms of hardware, software or combinations thereof. Preferably, these elements are implemented in software on one or more appropriately programmed general purpose digital computers having a processor and memory and input/output interfaces, an example of which is shown in FIG. 5. In any case, FIG. 3A illustrates a block diagram representing the XML composition algorithm using DTDSA. A document retrieval and composition engine or algorithm 305 receives input parameter name and value pairs 310, e.g., (A=1, B=100), and generates a return XML document 315 based on the provided DTDSA 312. An internal flow diagram of the algorithm 305 is shown in FIG. 3B. Initially, the algorithm parses the DTDSA 312 into some internal format, e.g., a directed acyclic graph, which is easy to manipulate, and prepares the input parameters into environmental variables, as depicted in block 320 (i.e., read in DTDSA as a graph structure; accept input parameters and add into ENV). The algorithm then performs a breadth first search (BFS) traversal on the internal DTDSA structure, using a first-in-first-out queue to keep track of the set of structure nodes visited. The BFS traversal includes a standard procedure which sets up initial values (add document root and initial environmental variables ENV into queue tail) for the queue (block 325), repeats fetching the queue until the queue is empty (block 330), and for every node and environmental variables fetched (block 335--retrieve a node and ENV from queue head), performs operations (block 340 as will be described below) to generate partial XML components and adds all the children nodes, if any, and new environmental variables/values to the queue (block 345). As shown in block 340, the operations for a visited node, denoting a data type or attribute type, include: (1) resolving unbound variables which are associated with the data type or attribute, and defined in binding or value specifications in the DTDSA, using the fetched environmental variables/values (the resolution of unbound variables may involve accessing data sources and a predefined function calculation); (2) evaluating XML queries of different query languages in the binding specification, and assigning a result to the binding variables (step 350); (3) generating partial XML components based on current DTDSA node name (ENAME) as the tag, and the resolved content as the value or attribute; (4) adding the newly created variable/value pairs into the environmental variables (ENV). Referring now to FIG. 4, the association and distribution of variables in a DTDSA is illustrated. As shown, the DTDSA in a particular illustration may be represented as: <!ELEMENT PO (buyer, seller) :: x:=XML-QL/XQL/XPath/original map> <!ELEMENT buyer (name, addr)> <!ELEMENT name (#PCDATA: x.a)> <!ELEMENT addr (#PCDATA: x.b)> where `x` is the binding variable, `a` and `b` are the filter variables in the XML query and `a` and `b` are referenced in DTDSA as `x.a` and `x.b,` respectively. The association relates to the binding specification in that the query results are bound to a variable, say `x` as referenced by numeral 410. The distribution relates to the usages of the bound variables in the DTDSA. For example, `x.a` at reference numeral 420 retrieves the value of column `a` in a row bound by `x` for a relational database, or the content of tag `a` in an XML component bound by `x` for an XML repository. FIG. 5 is a block diagram illustrating a hardware implementation suitable for employing a universal output constructor for XML queries according to an embodiment of the present invention. It is to be understood that the universal output constructor for XML queries, namely, the engine 305 and, preferably, the DTDSA 312 (FIGS. 2 and 3), may be implemented on one such computer system, or on more than one separate such computer system, e.g., in a client-server relationship. For example, the engine 305 can be run as a server site application written in Java Servlet, or a client site application executed by an Internet browser. As shown, the computer system may be implemented in accordance with a processor 502, a memory 504 and I/O devices 506. It is to be appreciated that the term "processor" as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. The term "memory" as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. In addition, the term "input/output devices" or "I/O devices" as used herein is intended to include, for example, one or more input devices, e.g., keyboard, for entering data to the processing unit, and/or one or more output devices, e.g., CRT display and/or printer, for presenting results associated with the processing unit. It is also to be understood that the term "processor" may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices. Accordingly, software components including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (e.g., ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (e.g., into RAM) and executed by a CPU. DTDSA Mapping As mentioned above, a preferred implementation for the mapping between DTD and a relational schema is a DTD source annotation (DTDSA), as described in the above-referenced U.S. Ser. No. 09/466,627. The following is a detailed description of such a DTDSA mechanism. FIG. 6A illustratively includes four relational tables, also known as a relational schema, purchase order ("PO") 1305, company 1310, lineitem 1315, and product 1320. Table 1305 has three columns, purchase order identification ("POID"), buyer, and seller. The rows of the table have numerical index values pointing to values for the columns. Thus purchase order number 100 is associated with buyer 20 and seller 10. Table 1310 has three columns: company identification ("COID"), name, and address ("ADDR"). The rows associate numerical values with actual company names and addresses. Thus the numerical value 10 is associated with the company IBM, having an address in New York, and the numerical value 20 is associated with the company Citibank, also having an address in New York. Table 1315 has three columns: POID, product identification ("PRODID"), and amount. The rows, 1330 and 1335, associate purchase order identification numbers with product identification numbers and quantities. In the figure, purchase order 100 is associated with two product identifications, 35678 and 35694, of which 20 k and 100 k are ordered respectively. Table 1320 has three columns, PRODID, name, and desc. (description). The rows associate product identification 35678 with a "THINKPAD.TM." and product identification 35694 with a server. Arrows in FIG. 6A illustrate foreign key relations among various fields. For example, the record 1325 in PO table with POID=100 is related via arrows 1340 and 1345 to two records 1330, 1335 in the lineitem table 1315 with POID=100. Similarly, records 1330 and 1335 are associated via arrow 1350 to records 1355 and 1360. FIG. 6B shows an example of a Document Type Definition (DTD). As explained above, XML makes use of DTDs to specify documents. DTDs are very flexible and can specify any number of different documents. FIG. 6B shows only one simple example, in which a purchase order is specified. Line 1301 shows the definition of the variable PO. In a tree-like fashion, the definition incorporates child definitions, i.e., "id" defined at line 1302, "buyer" defined at line 1303, "seller" defined at line 1304, and "lineitem" defined at line 1307. The asterisk after "lineitem" at 1320 indicates that this feature may be repeated any number of times in the purchase order. The definitions of "id" 1302, "address" 1311, "prodname" 1308, and "amount" 1309 use the #PCDATA command to get data directly from a data storage device. The definitions of "buyer" and "seller" have attribute lists at 1323 and 1324. The definition of lineitem, also incorporates child definitions, "prodname" at line 1308 and "amount" at line 1310. FIG. 6B is to be annotated based on the relational schema in FIG. 6A, and the resulting annotated DTD (DTDSA) is illustrated in FIG. 6C. FIG. 6C shows a DTD annotated in accordance with the preferred mapping language, i.e., a DTDSA. The preferred mapping language includes two types of constructs: the binding specification and the value specification. FIG. 6D shows a directed acyclic graph specifically related to the example of FIGS. 6A, 6B and 6C. A sequence of resolutions occurs based on the BFS traversal order. The resolutions at numerals 1905, 1910, 1915, 1920, 1925, 1935, and 1945 of FIG. 6D correspond to the binding/value specifications at numerals 1505, 1510, 1515, 1520, 1525, 1535, and 1545 of FIG. 6C, respectively. The resolution for the binding specification at numeral 1510 using x=100 involves table PO access with poid=100 and derives into a record <100,20,10> for r as shown in 1910. The binding specification in 1505 uses the record r to derive its third argument, PO.poid(r), to 100, which is needed to resolution of w, i.e., row(lineitem,poid,100). Since there are two records in table lineitem with POID=100, as shown by numerals 1330 and 1335 in FIG. 6A, w is assigned the two records as shown at numeral 1905. Such binding can be used to derive multiple occurrence of a data type along the edge marked with "*" or "+" as shown at numeral 1902. The two records for variable w can be used to derive two XML components lineitem as shown at numeral 1925. Attribute values with value specification can also be similarly derived. For example, as shown at numeral 1920, the attribute name with a "@" prefix of data type buyer can have a resolved value s from deriving the binding specification at numeral 1535 using r as shown at numeral 1935. FIG. 6E shows the retrieved XML document for the example depicted in FIGS. 6A, 6B, 6C and 6D. Based on the input x=100, the document is a PO with id 100. There are two line items retrieved and composed as shown at numerals 2010 and 2015. Attributes are also illustrated as shown at numerals 2005 and 2010. Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be affected therein by one skilled in the art without departing from the scope or spirit of the invention.
|
Same subclass Same class Consider this |
||||||||||
