Method and apparatus for querying structured documents using a database extender6366934Abstract An extender for a computer-implemented relational database system is disclosed for storing, querying, and retrieving structured documents. The extender provides a new abstract data type and includes a plurality of user defined functions for storing, querying, and retrieving structured documents internally, as character-based large objects (CLOB), or externally, in flat files or URLs, for example. A method and apparatus using an extender having a "conditional select" functionality is further disclosed for querying structured documents. The "conditional select" functionality is based on user defined functions residing within the extender being applied to a projection in a structured query in the database system. A method and apparatus using an extender having a set of user defined functions is further disclosed for creating one or more indexes for structured documents stored in the relational database system. The method uses the set of functions to create the indexes and uses already created B+ tree index structures implemented in the database system to support the indexes. A method and apparatus using an extender is further disclosed for creating and supporting structure indexes. The method includes a tag counting system for indexing structured documents and for implementing the structure indexes within the relational database. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
create adt DB2XML (
DTDid int, /* you need to get DTDid from XML_DTD_REF*/
StType int,
fileName varchar (256),
content clob (10k),
size int,
creator varchar (20))
In this routine, DTDid is an integer value identifying a document type definition element (DTD) of an XML document. This value resides in a DTD reference table also created as a result of the enabling of the DB2.RTM. database. The stType attribute is a storage type as a constant, specifying the way the XML document must be stored in DB2.RTM.. The storage type has the following values: XML_FILE for storing an XML object as external file; XML_URL for storing an XML object as external URL; and XML_CLOB for reading an XML object from a file and storing it as CLOB in DB2.RTM.. The fileName attribute represents the name of an external file or URL of the XML document if stType is XML_FILE or XML_URL, and is null otherwise. The content attribute is a character-based large object (CLOB) containing the XML document if stType is XML_CLOB, and is null otherwise. The size attribute represents the size of the XML document in term of bytes. The creator attribute represents the user ID. Together with the creation of the abstract data type DB2XML, a set of accessor methods is automatically generated when DB2XML is created. The system-generated accessor methods for DB2XML include observers and mutators. The instruction for the observers is DB2XML..ATTRNAME ( ) RETURNS ATTRTYPE, wherein ATTRNAME and ATTRTYPE are the name and type of each attribute described above. The instruction for the mutators is DB2XML..ATTRNAME (ATTRTYPE) RETURNS DBTXML, wherein ATTRNAME and ATTRTYPE are the name and type of each attribute described above. Subsequent to the creation of DB2XML, the XML extender overwrites a system-generated constructor and redefines the constructor as follows:
DB2XML( DTDid int,
stType int,
fileName varchar (256),
content clob(10k),
size int,
creator varchar(20))
RETURNS DB2XML
This redefined constructor can be used to insert and update data together with other UDFs provided by the XML extender. The DB2XML as an abstract data type can be subtyped, i.e. a user can customize the data type of the XML document. For example, the user can add additional attributes to the data type with the following instructions:
EXEC SQL Create adt XML_CDF under DB2XML AS
(channel varchar(256),
title varchar(1000))
In this example, XML_CDF is a subtype of DB2XML and has additional attributes, channel and title. 2. Creation of User Defined Functions Another internal action executed when the database is XML enabled is the creation of user defined functions. The XML extender provides a number of functions for storage, search, and retrieval of XML documents. It is to be understood that the present invention is not intended to be limited to functions solely provided by the XML extender, and that other fractions, provided by the user for example, may be utilized. Storing fractions are used to store XML documents into the DB2.RTM. database. When the XML parser 160 parses the XML document, it finds whether the XML document possesses a DTD. Then, the XML extender searches the XML_DTD_REF reference table to see if the DTD is inserted into the XML_DTD_REF table. If not, the XML extender inserts the DTD into the XML_DTD_REF and gets the new DTDid. Otherwise, the DTDid is retrieved and assigned to the XML object as its attribute. Among the storing functions used within the XML extender are: a) xmlFromFile--xmlFromFile takes the storage type and the name of a file containing the XML document and returns a DB2XML data type. The syntax of this function is
xmlFromFile (stType, int,
fileName varchar (256))
and the return type is DB2XML. b) xmlFromBuff--xmlFromBuff is used for storing short XML documents residing in the memory. It takes the storage type, a buffer containing the XML document, the length of the buffer, and an optional fileName, if the content needs to be stored in a file, and returns a DB2XML data type. The syntax is
xmlFromBuff (stType, int,
buffer varchar (4000)
length int,
fileName varchar (256))
and the return type is DB2XML. The length of buffer is limited to 4000 by the DB2.RTM. database. Therefore, for long XML documents, use of xmlFromFile( ) or xmlFromCLOB( ) is recommended. c) xmlFromCLOB--xmlFromCLOB is used for storing long XML documents residing in the memory. It takes the storage type, a CLOB buffer containing the XML document, and an optional fileName, if the content needs to be stored in a file, and returns a DB2XML data type. The syntax is
xmlFromCLOB (stType, int,
clobdata clob,
fileName varchar (256))
and the return type is DB2XML. Retrieval functions are used to retrieve XML documents from the DB2.RTM. database. The XML extender provides several retrieval functions, such as: a) xmlToFile--xmlToFile takes a DB2XML object and returns a fileName, which contains an XML document. If the XML data is stored inside DB2.RTM. in CLOB, then a temporary file name in directory "templobs" under XML extender's home directory is returned. If the XML data is stored externally to DB2.RTM. in a file, then that file name is returned. The syntax is xmlToFile (xmlobj DB2XML) and the return type is varchar (256). b) xmlToBuff--xmlToBuff takes a DB2XML object and returns a Buffer which contains an XML document. If a user needs to get data as the type of CLOB, then the user can use DB2XML..content if data is stored as CLOB in DB2.RTM. or use xmlFileToCLOB(DB2XML..fileName) if data is stored in an external file. The syntax is xmlToBuff (xmlobj DB2XML) and the return type is varchar (4000) /* 4k is the max length of varchar */ The type of return buffer is varchar. The length of buffer is limited to 4000 by the DB2.RTM. database. Therefore, this UDF is useful for short XML documents. For long XML documents, use of data type CLOB for memory access is needed. Then, the user can use ADT observer DB2XML..content if data is stored as CLOB in DB2.RTM., or use xmlFileToCLOB(DB2XML..fileName) if data is stored in an external file. Search functions are used to search XML documents in a structural manner. The XML extender provides the following search functions: a) xmlContains--xmlContains takes a DB2XML object, a search path and a search content and returns an integer. If the search content is found in the search path, the function returns a 1, otherwise, it returns a 0. The syntax is
xmlContains (xmlobj DB2XML,
path varchar (256)
content varchar (256))
and the return type is Integer. b) xmlNoMatches--xmlNoMatches takes a DB2XML object, a search path and a search content and returns an integer as number of matches found in the document. The syntax is
xmlNoMatches (xmlobj DB2XML,
path varchar (256),
content varchar (256))
and the return type is integer. Transformation functions are used to transfer data between the internal CLOB stored within DB2.RTM. and the external files. As illustrated in FIG. 3, data can travel from DB2.RTM. 300 to the external files 500 via data link 400 and direct from the external files 500 to DB2.RTM. 300. a) xmlFileToCLOB--xmlFileToCLOB takes an input file name and returns a DB2.RTM. CLOB having the file data. The syntaxis xmlFileToCLOB (fileName varchar (256)) and the return type is CLOB. b) xmlCLOBToFile--xmlCLOBToFile takes a DB2.RTM. CLOB and a directory name and returns a file under the input directory which has the CLOB data. The syntax is
xmlCLOBToFile (data CLOB,
directory varchar (256))
and the return type is varchar. Utility functions are used to help an application when the constructor DB2XML is called. The XML extender provides the following utility functions a) xmlFileSize--xmlFileSize takes the file name and returns the size of the file. The syntax is xmlFileSize (fileName varchar (256)) and the return type is integer. When the DB2.RTM. database is XML enabled, DTD reference tables are also internally created. 3. Creation of the DTD ReferenceTable (XML_DTD_REF table) The XML_DTD_REF is a DTD reference table, which stores all of the information about DTDs that can be used by XML documents. The XML_DTD_REF table also serves the role of the DTD repository. Each row of the DTD table represents a DTD with additional metadata information. A user can insert additional rows to add other DTDs. The DTDs can be stored either externally as files or internally as CLOBs, specifying the stType storage type. With this reference table, no duplication information needs to be stored in normal tables with XML columns. The XML_DTD_REF table is created with the following create table statement:
create table xml_dtd_ref (
id int not null,
author varchar (256),
stType int , not null,
URL varchar (256),
content clob (10k),
colCount int,
info1 varchar (2000),
info2 varchar (2000),
constraint xml_ref_pk primary key (id));
4. Creation of Internal Registration Table Another internal action executed when the database is XML enabled is the creation of internal tables, such as an XML_COLUMNS table. The XML_COLUMNS table is used to store common information for each XML enabled column. The table is created with the following statement.
create table xml_columns (
fTableSchema varchar (8) not null,
fTableName varchar (18) not null,
fColName varchar (18) not null,
dtdid int not null,
indexType int not null,
idxSchema varchar (8),
idxName varchar (18),
constraint xmlcol_pk primary key (
fTableSchema, fTableName, fColName),
constraint xmlcol_fk foreign key (dtdid)
references xml_dtd_ref (id));
This table stores meta data regarding each XML column in the database which has been enabled by the xmladm enable_col option to be described later. Finally, enablement of the DB2.RTM. database triggers a grant of privileges to the public. 5. Granting of table privileges to public Any user of the database may query the XML_DTD_REF table and the XML_COLUMNS table, but only certain users can update these tables. The enable_db option of xmladm will also handle the privileges. B. Creation of an XML Table An XML table is a table that includes one or more XML columns. An XML column is a column created with the DB2 XML data type or its subtype. In order to create such a table, a CREATE TABLE statement is provided. The CREATE TABLE statement must include an XML column in its column clause. As an example, consider a bookkeeping system of the first published book of each employee in a publishing company. Employees' IDs, names, years of the service time and the first books are stored in the first_book table. The following CREATE TABLE statement will be used:
CREATE TABLE first_book (
( id char(6),
name varchar (20),
service int,
book db2xml);
C. Enablement of an XML Column Once created, the XML column must be enabled. The column can be enabled with an enable_col option of the xmladm administration tool. The syntax of the option is xmladm enable_col dbName tbName column [DTDid] where the dbName is the name of the XML enabled database, the tbName is the table name in which the XML column resides, the column is the name of XML column, the DTDid is the optional key in the xml_dtd_ref table. An example of the syntax for enabling the XML column of the previously created first_book table in database mydb is: C:.backslash.xml> xmladm enable_col mydb first_book book Connecting to the Database----Connect to Database Successful. Calling Enable Column, Please wait----Enable Column Successful. C:.backslash.xml> The enable_col option of the xmladm administration tool triggers the following actions: a) It inserts a row into the XML_COLUMNS table; b) It updates the XML_DTD_REF table to increase the column counter if DTDid is specified. The DTDid here is provided to specify that this XML column must be tied with a specific DTD. In this case, the same DTDid will be assigned to every DB2XML data for every row. An index can also be bound to the DTD. D. Enablement of an XML Index Since XML columns contain XML documents, applications need to perform structural queries on these columns. Therefore, an index strategy that will quickly identify the search items is needed. The XML extender does key transformation on top of the B+ tree index structures provided by the DB2.RTM. database. The key transformation hides all details from the application and creates the index for structural search. Based on the nature of XML document, the XML extender provides three indexing mechanisms: 1. General Indexing Mechanism In a general indexing mechanism, the XML column is not bound to one DTD. The XML document stored in the XML column, can be well-formed (without a DTD) or valid (including DTD). The index will be created to include all structures of the XML document. Each time a document is stored, the index tree structure may be updated. 2. DTD Bounded Indexing Mechanism In a DTD bounded indexing mechanism, the XML column is mapped to one DTD. The DTD must be in the XML_DTD_REF table and application should get the DTDid before the index is created. Therefore, prior to indexing, the XML extender must: a) Parse the DTD and generate its internal tree structure, and b) Store DTD data into the XML_DTD_REF table. Then, the XML extender can create the index using the create_index option of the xmladm administration tool. In this approach, every time an XML document instance is stored, the index structure will not change. 3. User Defined Indexing Mechanism In this alternative, the user may define a set of elements and attributes needed to be indexed. This approach may be restrictive but provides better performance. The xmladm command for enabling the index is the enable_index option, having the following syntax:
xmladm enable_index dbName tbName column type
[<DTDid> <path list>]
where the dbName is the name of the XML enabled database, the tbName is the table name in which the XML column resides, the column is the name of the XML column, the type is the indexing type, which can be a G for general indexing, a D for DTD bounded indexing, or a U for user defined indexing, the DTDid is the key in the XML_DTD_REF table, and the path list is the list of the structure path. The DTDid is needed if the type=`D` and is otherwise not needed. If the DTDid is specified, then the enable_col option must have the same DTDid specified to the same column. The path list is needed if the type=`U` and is otherwise not needed. The index enabling of the previously created book column in table first_book of database mydb is: C:.backslash.xml> xmladm enable_index mydb first_book book D 1 Connecting to the Database----Connect to Database Successful. Calling Enable Index, Please wait----Enable Index Successful. C:.backslash.xml> This command will create the index of the book column with the DTD bounded indexing mechanism and use DTD specified in the XML_DTD_REF table with reference id `1`. Structure Indexes for the XML Extender It is known in the art that a structure query consists of two parts: a path and a content. The content part of the structure query specifies the terms (with optional AND, OR, and other operators) that the user is searching for. The path part, also called the structure part is a sequence of document elements specifying the regions to be searched for content matches. The content of a document matching the structure query must reside in the specified regions; for example, the query to find all books that contain `XML` and `HTML` terms in the same chapter, "/book/ chapter contains `XML` and `HTML`", has the path (structure) part "/book/chapter" and the content part "`XML` and `HTML`". The query defines a containment relationship, i.e. a connection between the terms of the content part and the path of the structure part using the "contains" function. In general, the path part may also specify attribute name/value pairs associated with an element in the path; for example, the query "/book[@on_sale="80%"]/title contains `XML`" is asking for XML books that are at 80% discount. Furthermore, a wildcard `*` may be used to denote a sequence of zero or more tag names. The XML document presented above has its document structure tree illustrated in FIG. 8. Any attributes and contents have been omitted from the representation of the document structure tree and only the nodes have been shown in FIG. 8. If the XML documents conform to a single DTD, all possible document structures may be known in advance, and the user may specify which regions of the documents need to be indexed. A structure index for the XML document, merging nodes of the same path, e.g. /book/toc/chapter, is illustrated in FIG. 9, but it is to be understood that many other index alternatives may be created using the same XML document. Each node in the structure index corresponding to a path p is associated with a list of occurrences of the path p and with a list of attributes. FIG. 10 describes the data structure at each node in the structure index, wherein the data structure includes the list of occurrences and the list of attributes. As illustrated in FIG. 10, different structures may be used for indexing different attributes of the XML document. The DB2.RTM. UDB version 6 only supports B+ tree index structures. However, it is to be understood that other structures, such as a hash table, also shown in FIG. 10, may be used for indexing purposes. In general, a structure search refers to the searching for the path part, while a content search refers to the searching for the content part. For content searches, the index used in most search engines is based on variations of inverted files, wherein each keyword is mapped to the list of document occurrences: contentIndex(keyword)----> (DocId, Pos)* and where Pos represents the positioning information of an occurrence of the keyword in a document DocId. Correspondingly, for structure searches, a structure index maps a path to a list of occurrences: structureIndex(path)----> (DocId, Pos)* where Pos similarly describes the positioning information of an occurrence of the path in a document DocId. The positioning information plays a significant role in the integration of the structure index with the content index. The positioning information should enable the user to determine the containment relationship. For example, if offset is used for positioning information in XML documents, and an occurrence of /book/title is (D1, 150-168), then the path occurs in document D1 at offset 150-168. In addition, if the content index shows that one occurrence of the word `XML` is (D1, 155-157), then the result is that document D1 satisfies the query "/book/title contains `XML`". However, in XML documents, offset information may not always be useful, because entities may affect the offset information. More particularly, in the XML environment, an entity is declared by <!ENTITY name text>, where name is the entity name, and text represents the definition of the entity. Once the entity is defined, it can be referenced by using &entity_name, which in fact will be replaced by the entity text within the document. For example, assuming that <!ENTITY OutofStock "Out of Stock. Will fill in 3 weeks."> is a declaration of an entity and <book> <availability> &OutofStock </availability> . . . </book> is an XML document, the &OutofStock reference will be equivalent to the "Out of Stock. Will fill in 3 weeks." text. If the user wants to query whether "/book/availability contains `Stock`", several problems associated with offset may be encountered. First, in the <book> document, there may be no offset information for the "Out of Stock. Will fill in 3 weeks." text, because the text can be located anywhere within the document, or may reside in another document. At the same time, if multiple references to &OutofStock are present within the document, even assuming that offset information exists for words in the "Out of Stock. Will fill in 3 weeks." text, the offset cannot be used for positioning information because it cannot have multiple values for the same words. Therefore, a novel tag counting system is introduced for producing more useful positioning information. When parsing an XML document, a counter for the number of existent tags is introduced and maintained. For each element node in the document structure tree, a start count (when first encountering <element>) and an end count (when first detecting </element>) are recorded. Each pure text content (i.e., having no nested tags) is counted as one, so that the terms it contains have the same number and are not individually counted. The content count should be entered in the content index as a part of the positioning information. For an element containing mixed nested elements and pure text content, each piece of pure text content is separately counted. The tag counting system is illustrated in FIG. 11 and assumes that pure text content occurs only at the leaf nodes of the document structure. The occurrence lists in the content index and the structure index of the XML document are shown in FIG. 12. The resulting positioning information in the structure index and the content index makes integration or "index-anding" of the two indexes possible. Referring to FIGS. 11 and 12, and given a query "/book/toc/chapter contains `data` and `variable`, the structure index finds that the occurrences of "/book/toc/chapter" are {(d1,21,28), (d1,29,36), . . . }, while the content index finds that the occurrences of `data` are {(d1,23), (d1, 34), . . . } and the occurrences of `variable` are {(d1, 26), . . .). It is very easy to see that the (d1,21,28) occurrence contains both terms, as 23 and 26 are in the 21-28 range. Therefore, integration of the structure index and the content index is achieved by containment relationship of the positioning information in occurrence lists. The tag counting system described above also helps the user conduct proximity searches on XML documents. A proximity search usually imposes a restriction on occurrences of certain terms to have a separation of no more than a given number of words apart. Other proximity conditions include the requirement that a certain term precede another or that several terms appear in the same paragraph. In the example query presented above, "/book/chapter contains `XML` and `HTML`", the challenge is to find all books that contain `XML` and `HTML` terms in the same chapter. An index that supports proximity searches will distinguish the two documents illustrated in FIGS. 14A and 14B and return the document depicted in FIG. 14A as a correct result, but not the document depicted in FIG. 14B.
//return a list of occurrences of `path` that contains `content`
//content is a list of words connected by AND or OR in an expression tree;
structureSearch(path, content) {
S = structureIndex(path); //occurrence list of path
switch (content->type) {
case WORD:
L = contentIndex(content->word); //the occurrence list of word
return includes(S, L);
case AND:
R1 = structureSearch(path, content->operand1);
R2 = structureSearch(path, content->operand2);
return intersect (R1,R2);
case OR:
R1 = structureSearch (path, content->operand1);
R2 = structureSearch(path, content->operand2);
return union(R1,R2);
}
}
//returns a subset or S, which contains some word in L
includes(S, L) {
result=empty;
for (each e in S)
for (each f in L)
if (within (f.position, e.position))
(result.add(e); break;}
return result;
}
The function within(pos1, pos2) returns TRUE if pos1.DocId=pos2.DocId and pos1.start<pos2<pos1.end. The function includes(S, L) is used to filter S against L. Alternatively, a function includes(L, S) may also be used to filter L against S. It is to be understood that the novel tag counting system also applies to XML documents having different or no DTDs. If the XML documents have different DTDs, and all the different DTDs are known at the time the index is created, then the previous procedure applies and all possible structure paths may be calculated in advance. If the DTDs cannot be known in advance (e.g., newly inserted documents bring in new DTDs), or if the XML documents have no DTDs at all, then the set of all possible paths changes over time, and such information should be consistently stored in proper format. If the set of possible paths is maintained in a tree-structured index, such as the one illustrated in FIG. 9, then the tree structure is not fixed. As a result, new nodes may be created when a new XML document is inserted. If the set of possible paths is encoded as keywords or part of keys, then this encoding or mapping needs to accommodate new paths. As described above, structure indexes may be implemented inside a DB2.RTM. database using its B+ tree support and index extensions implemented within the IBMO DB2.RTM. Spatial Extender, created to support spatial data. Currently, B+ trees are the only index structures supported by the DB2.RTM. UDB version 6. FIG. 13 illustrates an architecture that fully integrates spatial data into relational databases. This architecture is known in the art and has been implemented in the IBM.RTM. DB2.RTM. Spatial Extender. When performing insert/update/delete functions, the B+ tree index manager concatenates the set of colunm values, defined by the CREATE INDEX statement, to form a key to the index, and to store/delete the record ID (RID) into/from the index. Given a search request specified by a pair of start/stop keys, the index manager browses down the B+ tree, looking for keys in the given range, and retrieves the RIDs. The DB2.RTM. Spatial Extender allows UDFs to be invoked by the B+tree index manager. The four modules in which UDFs can be invoked to apply application specific operations are a key transformer, a range producer, a filter, and a DMS filter. The UDF in the key transformer is the only one applied at insert/update/delete time. Given a record, a UDF for the key transformer module, such as a table function, can be invoked to generate a set of keys to be used by the index manager for index maintenance. Multiple entries for a single record can exist in the index. The UDF in the range producer is applied at query time. Given a user search predicate, a UDF for the range producer module, such as a table function, can be invoked to generate a set of start/stop key pairs for searching in the B+tree index. A UDF for the filter module is also applied at query time. Such a boolean filter UDF is invoked right after a RID is retrieved from the index. A common use of this UDF is to remove duplicates, because multiple entries may exist for a RID. A UDF for the DMS filter module is applied at query time. Such a boolean filter UDF is invoked, after the RID is used, to retrieve the data record and before the original predicate is applied. The mapping of the structure index is structureIndexpath)---->(Docld, Pos)*. If each document stored in the database is logically a different document, even though it refers to the same physical document (there is no dependency between records due to document sharing), then RIDs can be used in the mapping of the structure index instead of the DocIds. The following steps transform the structure index into B+ tree style index:
path -->(DocId, Pos)*
=> path x DocId x Pos (flatten out the function)
=> path x Pos --> DocId* (rearrange and make a function)
=> path x Pos --> RID* (replace DocId with RD)
Thus, the mapping of the B+ tree index derived from the structure index is structurelndex(path, Pos)----> RID*. In order to search the paths using this index, the path needs to be a prefix of the index key. Several UDFs may be defined and then invoked by the index manager during a search request in the four areas detailed above. A key transformer UDF takes a document (or a document file name), parses it, and returns a set of (path, Pos), each of which is an occurrence of a path to be indexed. A range producer UDF takes the path and generates a single key whose prefix is the path. A filter UDF gets the content matches from the content index and performs containment check for each (DocId, Pos) obtained from the structure index at fetch time. Creation of an Index for Structured Documents with Rich Data Types Adding new data types to XML documents and associating these data types with XML elements and attributes provides a tool for users to ask range queries, rather than just keyword queries. In the preferred embodiment of the present invention, a method is described that uses already created B+ tree index structures implemented in RDBMS to support new indexes for structured documents with rich data types. A structured document with rich data types can handle most of the SQL data types, for example integer, real, decimal, date, time. This method presents a few obvious advantages. First, no new index structures and index managers need to be created and the existing robust and matured RDBMS indices can fulfill the requirements of fast query performance. Second, using the existing RDBMS B+ tree index support structures allows users to create indexes of all SQL data types that are already in place, such as integer, real, decimal, date, time. Third, this method does not require any extra columns or extra tables for creating and supporting the indexes. Fourth, this method allows the existing powerffil SQL engine to automatically exploit the B+ tree index structures. As illustrated in FIG. 5, an XML document is stored in the DB2.RTM. database and creates an entry directly into an index manager and further into the existing B+ tree of the database. Subsequently, the user can create multiple indexes on columns, using UDFs and the statement: CREATE INDEX idx1 ON t(cl, . . . , ck) USING (udfName(cl, . . . , ck)); where idx1 is the name of the newly created index, cl . . . ck represent the XML columns, and udfName is the name of the selected function. An example of the implementation of this statement is illustrated in FIG. 6. As shown in FIG. 6, the assumption is that xmldoc is the column that stored the XML document (either as a CLOB or as a file name), and that the XML document has the usual <book> schema. Also, let ExtractPrice and ExtractPubDate be the UDFs to extract respectively, the content of /book/price (of data type `real`) and the content of /hook/pubdate (of data type `date`). As a result using the above expression, the user creates an index idx1 on /book/price and another index idx2 on /book/pubdate: CREATE INDEX idx1 ON t(xmldoc) USING (ExtractPrice(xmldoc)); CREATE INDEX idx2 ON t(xmldoc) USING (ExtractPubDate(xmldoc)). Both indexes may be created on the same table column, in this case xmldoc, or may be created on different columns. Once the indexes have been created, the user may input a query such as: SELECT docid from t where ExtractPrice(xmldoc) > 100.0 utilizing the index idx1 to evaluate the predicate without scanning the XML document source. Having this powerful SQL support, the user can also input more complicated queries on XML elements, such as: SELECT count(*) from t where ExtractPrice(xmldoc) > 100.0; and SELECT t1.docid, t2.docid from magazine t1, book t2 WHERE ExtractAuthor(t1.xmldoc) = ExtractAuthor(t2.xmldoc). E. Insertion of Data into the XML Column The DB2XML ADT 142 contains several data fields, so an XML document cannot be entered directly into an XML column. Instead, the XML documents must be loaded using the functions provided by the XML extender and detailed above. The most recommended function to be used for insertion of data into the XML column is xmlFromFile, which takes input parameters of file name and storage type. The storage type specifies whether you want to store XML document as external file or internal CLOB. In the example below, a record is inserted into the first_book table. The function xmlFromFile transforms the file into content of DB2XML.
#include "xml.h"
EXEC SQL BEGIN DECLARE SECTION;
int storage_type;
EXEC SQL END DECLARE SECTION;
storage_type = XML_CLOB;
EXEC SQL INSERT INTO first_book VALUES (`1234`,
`Sriram Srinivasan`
5,
xmlFromFile (:storage_type,
`e:.backslash.xml.backslash.book.xml`))
where the storage type of XML_CLOB will specify to store the data of `e:.backslash.xml.backslash.book.xml` into the first_book table as CLOB. F. Querying of an XML Table The XML table is ready to use when the XML columns are enabled. However, if the indexes have already been constructed, the query process will be faster. The indexing is not required by the search but is recommended for better performance. Direct retrieval of the XML columns requires the XML data to be converted into a file or a buffer. In the preferred embodiment of the invention, the query simply returns the file name. The function xmlToFile returns the original file name if the XML document is stored externally, or returns a temp file with the CLOB data retrieved from DB2.RTM. if the XML document is stored as CLOB in DB2.RTM.. To support the "select * from" statement, a function xml_to_prog( ) is also provided as the default transformation function. In the example below, a small sqc program segment illustrates how an XML document is retrieved from the XML table. This example is based on the inserting example illustrated above.
EXEC SQL BEGIN DECLARE SECTION;
char buffer (1000);
EXEC SQL END DECLARE SECTION;
EXEC SQL CONNECT TO mydb
EXEC SQL DECLARE c1 CURSOR FOR
SELECT xmlToBuff (book) from first_book
WHERE name = `Sriram Srinivasan`
EXEC SQL OPEN c1;
do {
EXEC SQL FETCH c1 INTO :buffer;
if (SQLCODE != 0) {
break;
}
else {
/* do whatever you need to do with the XML doc in buffer */
}
}
EXEC SQL CLOSE c1;
EXEC SQL CONNECT RESET;
If the function is xmlToFile, then the XML document is put into a file, which can be operated on directly. The Conditional Select Functionality In order to obtain XML element contents and attribute values as search results, the XML extender provides a "conditional select" functionality, which will be described below in further detail. The "conditional select" functionality is based on UDFs being applied to a projection in a select clause of a SQL statement in the DB2.RTM. environment. For example:
SELECT
db2xml.ExtractChars (book, `/book/title`,`/book/author[@last-
name="Bob"]`)
FROM book_tab
WHERE db2xml.ExtractDouble(book,`/book/price`) > 25.50
is such a statement, where the db2xml.ExtractChars( ) is an UDF applying the conditional select to XML elements. The above SQL statement returns a table of book titles, which also satisfies the other conditions, namely books with the author name being "Bob" and having their price greater than $25.50. First, the WHERE clause returns all XML documents having a book with price greater than $25.50. Then, the XML extender searches the narrowed pool of documents for the XML documents having the author's first name equal to "Bob." Returning to the semantics of the "conditional select" SQL statement, the first parameter must be the column name of a user defined type of the XML extender. The second parameter is the XML path, which identifies what needs to be selected. The third parameter is the condition, which is a qualified path expression. In return, the "conditional select" functionality will gather a SQL table containing rows of selected element contents and/or attribute values of the XML documents. The UDF of the "conditional select" functionality scans the XML documents for the content of DB2XML data, using the XML parser 160 illustrated in FIG. 2. If the conditions are satisfied, the particular element content or attribute value is returned with the search results. Working in conjunction with the WHERE clause provided by the db2xml.ExtractChars( ) function, which performs the role of filtering the search of XML documents, the "conditional select" functionality further narrows down the projection and returns the search results. The DB2 XML extender provides a set of UDFs to work with the "conditional select" functionality. Each UDF corresponds to one popular SQL data type, as shown in the following table:
Return Type Table Function
Integer extractIntegers (xmlobj DB2XML
path varchar)
Double ExtractDoubles (xmlobj DB2XML
path varchar)
varchar (2k) ExtractChars (xmlobj DB2XML
path varchar)
CLOB ExtractCLOBs (xmlobj DB2XML
path varchar)
date extractDates (xmlobj DB2XML
path varchar)
time extractTimes (xmlobj DB2XML
path varchar)
timestamp ExtractTimestamps (xmlobj DB2XML
path varchar)
G. Updating Values of an XML Column Since the ADT DB2XML is composed of a set of attributes, updating the values of a DB2XML column data translates actually into updating attribute values of the DB2XML. In its preferred embodiment, the present invention provides two ways to update the column values. 1. Update attribute values individually For example, if a user wants to change the storage type of an instance of the book table illustrated above from the external file to the internal CLOB, the following segment of sqc code illustrates the procedure.
#include "xml.h" /*where stType is defined */
EXEC SQL BEGIN DECLARE SECTION;
int storage_type;
EXEC SQL END DECLARE SECTION;
EXEC SQL CONNECT TO mydb
storage_type = XML_CLOB;
UPDATE first_book set book.stType = :storage_type
book..fileName = NULL
book..content=xmlFileToCLOB(book..fileName)
WHERE name = "Sriram Srinivasan`
EXEC SQL CONNECT RESET;
2. Update column data as a whole On the other hand, the XML extender also provides functions to allow updating of the entire DB2XML column data. The following segment of sqc code illustrates this procedure.
#include "xml.h" /* where stType is defined */
EXEC SQL BEGIN DECLARE SECTION;
int dtdid;
int storage_type;
EXEC SQL END DECLARE SECTION;
EXEC SQL CONNECT TO mydb;
EXEC SELECT id into :dtdid from sml_dtd_ref
WHERE URL = `http://w3.stl.ibm.com/xml/book.dtd"
storage_type = XML_CLOB;
WITH fname(book..fileName) AS
(SELECT book..fileName from first_book
WHERE name = `Sriram Sriivasan`)
UPDATE first_book
set book = xmlFrom File(:storage_type,
fname)
WHERE name = `Sriram Srinivasan`
EXEC SQL CONNECT RESET;
The two segments of sqc code include some key differences. The first alternative has better performance since it directly updates attribute values of DB2XML. The second one uses the same function call for inserting and thus needs to retrieve the fileName first, then pass it as the third parameter to the xmlFromFile. Therefore, there is a performance drawback. However, if the content of the XML document needs to be updated from another file, totally different from the original file, the file name can be specified as the third parameter to the xmlFromFile. In addition, the second alternative also shows how to get the DTDid from the XML_DTD_REF table, while the first one does not. H. Searching the Stored XML Documents The above sections have shown how to use XML extender as the document repository for storage and retrieval of XML documents. Another important feature of the XML extender is its structural search capability. The XML extender provides SQL functions that enable a user to include structural subqueries in SQL queries. These functions are provided in addition to the finctions normally available in SQL and are described above in great detail. One of the structural search queries used with the present invention refers to finding the XML documents containing certain terms in the specified structural path. Using the example illustrated above, having the sample XML document "book.xml" inserted into the firs_book table in database mydb, the xmlContains function searches for XML documents which are stored in the XML book column. The function returns integer 1 if the document satisfies the search argument and otherwise returns a 0. EXEC SQL SELECT id, name FROM first_book WHERE xmlContains (book, `/book/title` , `Programming`)=1 This example returns the id and name of each record in the first_book table where the element `book.title` contains the word `Programming`. The xmlContains function can only be used in conjunction with a WHERE clause. The xmlNoMatches function is used to determine the number of matches meeting the search criteria in each XML document. For example: WITH TEMPTABLE (id, name, no_matches) AS(SELECTid,name,xmlNoMatches(book,`/book/title`,`Programming`) From first_book ) SELECT * FROM TEMPTABLE WHERE no_matches>2 The xmlNoMatches function returns an integer value. The function can be used in both a SELECT clause and a WHERE clause. Search arguments are used in the xmlContains and xmlNoMatches functions. For structural search purposes, the search arguments are separated into two parts. A path part, or so-called structure part, is defined as: path: : = element .vertline. path `/` element element: :=tagName (`[` (@attributeName=attributeValue) +`]` ) ?.vertline. `*` where tagName, attributeName and attributeValue are CDATA in XML terminology. The path part is a sequence of document elements that specifies the regions within which the content part is searched against. The path may also specify attribute NameValue pairs associated with an element in the path. For example, the path `/book/toc/chapter[@id=`1`]/section` is identifying the section within the chapter whose id is equal to `1`. The wildcard "*" may be used to denote a sequence of zero or more tag names. A content part specifies the terms, with optional sequence, with "," or AND and OR operation, that the application is searching for. The content part is defined as: Content: :=primaryList primaryList: := primary .vertline. primaryList (OR.vertline.AND) primaryList primary: := atom .vertline. `(` atomList `)` atomList: := atom `,` atom atom: := `"` word `"` where word is #CDATA in XML terminology, "OR" and "AND" are Boolean operators "&" and ".vertline.". Several examples of structural searches are listed below. a) Searching for Several Terms SELECT id, name, hTime FROM first_book WHERE xmlContains(book,`/book/toc/chapter`,`("reference","students")`)=1 In this example, all XML documents containing both `reference` and `student` in path `book.toc.chapter` will satisfy the search criteria. b) Searching with Wildcards SELECT id, name, hTime FROM first_book WHERE xmlContains (book, `/book//section`,`"reference"`)=1 In this example, all XML documents having the word `reference` in the path starting with `book` and ending with `section` will satisfy the search criteria. c) Searching with Boolean Operation on Content SELECT id, name, hTime FROM first_book WHERE xmlContains(book,`/book/toc`, `"reference".vertline."student"`)=1 In this example, all XML documents having either `reference` or `student` will satisfy the search criteria. d) Searching with Path Constrained by Attribute Values SELECT id, name, hTime FROM first_book WHERE xmlContains(book,`/book/toc/chapter[@id=`2`].section`,`"student"`)=1 In this example, only XML documents having `student` in the sections under chapter 2 will satisfy the search criteria. Thus, the present invention may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term "article of manufacture" (or alternatively, "computer program product") as used herein is intended to encompass a computer program accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present invention. The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
|
Same subclass Same class Consider this |
||||||||||
