Apparatus and methods for retrieving information by modifying query plan based on description of information sources5600831Abstract Techniques for optimizing queries in a system in which executing the query requires retrieval of information from a number of different data bases which are accessible via a network. In the techniques, a query results in a query plan which includes subplans for querying the data bases which contain the required information. When a subplan is executed in one of the data bases, the data base returns not only the information which results from the execution of the subplan, but also source and constraint information about the data in the data base. The source and constraint information is then used to optimize the query plan by pruning redundant subplans. An embodiment is disclosed in which queries are made to a domain model implemented using a knowledge base system. The domain model includes a world view of the data, a set of descriptions of the data bases, and a set of descriptions of how to access the data. The information in the domain model is used to formulate the query plan. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
__________________________________________________________________________
query(Name, AC, TelNo) :-
quote(Ag, Al, `Newark, NJ`, `Santiago, Chile`, C, D),
areaCode(`Miami, Fl`, AC), dir(Ag, AC, TelNo),
name(Ag, Name), C < 1000.
__________________________________________________________________________
This query does not explicitly make use of the world-view concept travelAgent, since the type of Ag in the world-view relation quote is constrained to be the concept travelAgent. .quadrature. Typically, languages for querying object/relational databases use SQL-like constructs to access attributes of relations, and "path expressions" to access attributes of objects. In our world-view, concepts can be viewed as unary relations, and object attributes can be viewed as binary relations. Consequently, accessing object attributes using path expressions is equivalent to using a chain of unary and binary relations corresponding to concepts and attributes. For this reason, our queries are conjunctive relational queries expressed in terms of the world-view relations and objects. Sites and Site Descriptions: FIG. 2 Users pose queries in terms of the relations of world view 115. However, the world-view relations constitute just a conceptual view; the information required to answer queries is present in the external information sources 123 described in information source descriptions 113. Information sources 123 can be viewed as providing extensions of site relations from information source descriptions 113, which are type-set in this font. In order to answer user queries, the system needs a precise description of the site relations . Such a description is termed herein a site description. As shown in FIG. 2, a site description 201 in a preferred embodiment includes at least two types of information: content specification 203 which relates the contents of the external relations with the world-view relations . set of query forms 205 (0. . . n) which indicates subsets of queries on the relations that the external site is willing to answer. In a preferred embodiment, there are two subsets of queries indicated by the query forms: those queries which the external site can answer at all and those queries which the external site can answer efficiently. We first present some examples of site descriptions 201 to illustrate specification of content and capability. We then formally describe the language used for content specifications 203. EXAMPLE 5.4 A travel information source provides directory information for travel agents in the relation travel.sub.-- dir(Ag, Ac, TelNo). Content specification 203 for this relation specifies that this relation contains telephone information about travel agents in the dir world-view relation, though not necessarily all travel agents. The query forms 205 for this travel information source specify that this source answers two kinds of queries: first, the information source provides an agent's area code and phone number, given a specific travel agent, and second, the information source provides all travel agents and their phone numbers, given an area code. This information source does not answer queries where none of the arguments is bound to a constant. The Manhattan directory information source provides the relation bigapple.sub.-- dir(Cust, TelNo). The content specification 203 for this relation specifies that this relation contains the phone numbers of customers in the 212 area code. In addition, content specification 203 specifies that it has complete information about the phone numbers of customers in the 212 area code, i.e., there is no phone number in the 212 area code which does not exist in the relation bigapple.sub.-- dir. Specifying completeness information is useful for a query processor to determine that it need not query any other sources for information regarding 212 phone numbers. See O. Etzioni, K. Golden, and D. Weld. "Tractable closed world reasoning with updates", In Proceedings of KR-94, 1994. .quadrature. Details of Content Specifications 203 A content specification 203 describes the contents of external site relations by relating them to the world-view relations . A content specification 203 thus has three parts: a right hand 211 which is a conjunction of expressions involving relations in world view 115, a left hand 207 of expressions involving relations in information source descriptions 113, and a connector 209 between them. In the site description language of the preferred embodiment, a content specification may have one of the following four forms: C.sub.R (Y),R.sub.1 (X.sub.1), . . . R.sub.k (X.sub.k).OR right.C.sub.E (X), E(X) (1) C.sub.R (Y),R.sub.1 (X.sub.1), . . . , R.sub.k (X.sub.k)=C.sub.E (X), E(X)(2) C.sub.R (X),R(X).OR right.C.sub.E (Y),E.sub.1 (X.sub.1), . . . , E.sub.k (X.sub.k) (3) C.sub.R (X),R(X)=C.sub.E (Y),E.sub.1 (X.sub.1), . . . , E.sub.k (X.sub.k)(4 ) The R's (with or without subscripts) refer to the external site relations, the E's (with or without subscripts) refer to the world-view relations, and the C.sub.R 's and C.sub.E 's denote constraints (order constraints and CLASSIC concepts). X (with or without subscripts) and Y denote tuples of variables and/or constants. Each expression must be range-restricted, i.e., X.OR right.X.sub.1 .orgate.. . . .orgate.X.sub.k. The meaning of an expression is the natural one, given by the following relational algebra expressions (where .sigma. denotes selection, .pi. denotes projection, and denotes join). For example, the meaning of content specifications of form (1) is: .pi..sub.X (.sigma..sub.C.sbsb.R(Y) (R.sub.1 (X.sub.1). . . R.sub.k (X.sub.k))) .OR right..sigma..sub.C.sbsb.E(X) (E(X)). The meaning of content specifications of form (4) is: .sigma..sub.C.sbsb.R(X) (R(X))=.pi..sub.X (.sigma..sub.C.sbsb.E(Y) (E.sub.1 (X.sub.1) . . . E.sub.k (X.sub.k))). Expressions of the type (1) and (2) differ from expressions of the type (3) and (4) in the following way. The first two specify how fragments of world-view relations can be computed from the site relations, i.e., the world-view relation fragments are akin to traditional views on the site relations and external database schemas in multidatabases. Sec W. Litwin, L. Mark, and N. Roussopoulos. "Interoperability of multiple autonomous databases", ACM Computing Surveys, 22(3):267-293, Sept. 1990. In contrast, the latter two define the contents of fragments of the site relations as views on the world-view relations. An expression of type (1) specifics that part of the fragment is computed using the description. An expression of type (2) specifics that all of the fragment is computed using the description. The relationship between expressions of type (3) and (4) is the same as the relationship between expressions of type (1) and (2). EXAMPLE 5.5 Consider our airline flight application. Fly-by-Night Airlines provides two site relations 207: fbn.sub.-- fights(Flt, Src, Dest), which denotes that flight Flt of Fly-by-Night Airlines is from Src to Dest, and fbn.sub.-- quote(Ag, Flt, C,D), which denotes that a designated travel agent Ag of Fly-by-Night Airlines quotes a price of C to travel by flight Flt on date D. The world-view relation 211 quote can be related to the contents of the site relations fbn.sub.-- flights and fbn.sub.-- quote using a content specification 203 of the form (1) as follows: fbn.sub.-- flights(Flt, Src, Dest), fbn.sub.-- quote(Ag, Flt, C, D).OR right.quote(Ag, `Fly-by-Night`, Src, Dest, C,D). This content specification 203 states that tuples in the relation quote can be computed by joining tuples in the relations fbn.sub.-- flights and fbn.sub.-- quote. Suppose that only the designated travel agents of Fly-by-Night Airlines were allowed to offer quotes on Fly-by-Night Airlines. Then, all the information about fare quotes for this airline is present in the relations fbn.sub.-- flights and fbn.sub.-- quote. This complete information can be represented using a content specification 203 of the form (2) as follows: fbn.sub.-- flights(Flt, Src, Dest), fbn.sub.-- quote(Ag, Flt, C, D)=quote(Ag, `Fly-by-Night`, Src, Dest, C, D). .quadrature. EXAMPLE 5.6 Consider the external site relations described in Example 5.4. The external site relation travel.sub.-- dir contains a listing of travel agents, though not necessarily all of them. This is specified using a content specification of the form (3) as follows: travel.sub.-- dir(Ag, Name, Ac, TelNo).OR right.dir(Ag, Ac, TelNo), travelAgent(Ag), name(Ag, Name). This content specification 203 states that the site relation travel.sub.-- dir already has a subset of the join of the world-view relations dir and travelAgent. .quadrature. Our site description language does not allow content specifications 203 of the form: C.sub.R (Y),R.sub.1 (X.sub.1), . . . , R.sub.k (X.sub.k).OR left.C.sub.E (X),E(X) C.sub.R (X),R(X).OR right.C.sub.E (Y),E.sub.1 (X.sub.1), . . . , E.sub.k (X.sub.k) Intuitively, these content specifications are not useful because they only provide information about tuples that are "possibly" in the world-view relations, and not about tuples that are "definitely" in the world-view relations. The following example illustrates this point. EXAMPLE 5.7 The external site relation ta.sub.-- ia.sub.-- dir(Ag, Ac, Telno) contains a listing of the phone numbers of all travel agents as well as all insurance agents. The contents of this site relation can be specified using the content specifications: ta.sub.-- ia.sub.-- dir(Ag, Ac, TelNo).OR right.dir(Ag, Ac, TelNo), travelAgent(Ag). ta.sub.-- ia.sub.-- dir(Ag, Ac, TelNo).OR right.dir(Ag, Ac, TelNo), insuranceAgent(Ag). Without any means of distinguishing which number in this site relation is the phone number of a travel agent, and which is the phone number of an insurance agent, this site relation is not useful in answering queries on the world-view relation travelAgent. .quadrature. Specifying Query Forms 205 Information sources in global information systems are autonomous and, for reasons such as security or privacy, may decide to answer only a subset of the possible queries on the site relations. In our site description language, each information source can specify the subset of queries it is willing to answer using a set of query forms 205 on the site relations provided by the information source. For details on query forms, see J. D. Ullman, Principles of Database and Knowledge-base Systems, Volumes I and II, Computer Science Press, 1989. Intuitively, a query form 205 m.sub.R on a k-ary relation R is a string of length k, using the alphabet {b,f}. A `b` in the i'th position indicates that the i'th argument of R must be bound to a constant in a query conforming to m.sub.R ; an `f' in the i'th position indicates that the i'th argument of R can either be free or be bound to a constant. An information source is willing to answer a query on a site relation if and only the query bindings match one of its query forms. EXAMPLE 5.8 Consider the external information sources of Example 5.4. The travel information source specifies the subset of queries on relation travel.sub.-- dir that it is willing to answer as follows: possible.sub.-- queries: travel.sub.-- dir[bff,fbf]. The query form 205 bff indicates that, given a specific travel agent, the information source can provide the agent's area code and phone number. The query form 205 fbf indicates that, given an area code, the information source can provide the travel agents and their phone numbers in that area code. .quadrature. Often it is the case that some of the queries that an external information source is willing to answer can be answered efficiently, because of clustering of tuples in the site relations, availability of indices, etc. Answering queries in a global information system can be optimized if this information were available to the query processor. Hence, our site description language also allows external information sources to specify the subset 215 of queries that it can answer efficiently, again using query forms 205. EXAMPLE 5.9 Consider our airline flight application, and the travel information source which provides the site relation travel.sub.-- dir. This source is willing to answer queries matching either of the query forms bff and fbf (see Example 5.8). These query forms thus make up the set of permitted queries 213. However, answering queries matching bff might be efficient because of the availability of a primary index on the travel agent attribute, while answering queries matching fbf might be quite inefficient because of the absence of any clustering in the site relation travel.sub.-- dir. The subset 215 of queries that can be efficiently answered by the travel information source can be specified as follows: efficient.sub.-- queries: travel.sub.-- dir[bff]. .quadrature.Of course, the access plan would first attempt to use the efficient queries provided by information source 213 to answer the query, and would specify an inefficient query only if there were no other way to obtain the information. In other embodiments, site descriptions 201 may include other useful information such as the cost and reliability of accessing tuples of the site relations. Incorporation of these into the site description language requires the development of algorithms that can use this information effectively in query evaluation. Query Evaluation Users of a global information system 101 formulate queries in terms of relations in world view 115, without regard to the location and distribution of this information. However, the world-view relations are not explicitly stored; all the data that are needed to answer these queries reside in site relations in external information sources 123. It is the task of the query evaluation system to access these external site relations and answer the user's queries. Since the cost of accessing an information source over the network is significant, the main optimization to be performed is to minimize the number of external information sources 123 that need to be accessed in order to answer the query. In this section, we present several techniques that make effective use of site descriptions to minimize access to external information sources. Answering Queries: FIG. 3 Answering a query in a database system typically has two phases: generating the plan for answering the query, and executing this plan. In traditional database systems, a query plan specifies the order of computing the joins of the database relations in the query and the techniques used for each of the joins. This requires that each of the database relations mentioned in the query be either stored explicitly, or computed on demand. Since the world-view relations in a global information system are not stored explicitly, the query plan has to compute the tuples in the world-view relations from the tuples in the site relations. Our algorithm for generating a query plan is shown in FIG. 3. Algorithm 301 operates after a join order for the query has been determined using traditional techniques. Algorithm 301 creates sub-plans for evaluating each of the conjuncts in the query. It does so by determining which external information sources need to be queried in order to obtain tuples of a world-view relation E(W) that satisfies some constraint C(W) (which is statically computed from the query). Our algorithm assumes that each external site has the capability of answering any query form. The algorithm can be straightforwardly extended, using the techniques described in K. A. Morris, "An algorithm for ordering subgoals in NAIL!", In Proceedings of the ACM Symposium on Principles of Database Systems, pages 82-88, Mar. 1988, to handle cases when only certain query forms can be answered, or when certain query forms can be handled more efficiently. Algorithm 301 generates a plan that is guaranteed to be sound, i.e., all answers obtained by executing this plan are indeed answers to the query. If all content specifications are of the forms (1) or (2), executing the plan is also guaranteed to generate all possible answers to the query, i.e., our algorithm is also complete. However, since algorithm 301 tries to answer each conjunct in the query in isolation, it may not find all answers in the presence of content specifications of the forms (3) and (4), as illustrated by the following example. EXAMPLE 5.10 Consider a query that retrieves names and telephone numbers of travel agents in the 212 (Manhattan, N.Y.) area code. query(Name, TelNo) :- travelAgent(Ag), dir(Ag, 212, TelNo), name(Ag, Name). Suppose that the site relation nyTA precisely has the names and telephone numbers of all the travel agents in the 212 area code, specified using the following content specification: nyTA(Name, TelNo)=travelAgent(Ag), dir(Ag, 212, TelNo), name(Ag, Name). The answer to the query can be computed by using just the tuples in the external site relation nyTA. However, our algorithm would not be able to determine that the site relation nyTA is useful, since it would try to separately compute the tuples in the world-view relations travelAgent, dir and name, and the nyTA site relation does not have the variable Ag, which is present in each of the three world-view relations. .quadrature. A complete strategy for answering queries in the presence of content descriptions of the forms (3) and (4) requires solving the problem of answering queries using materialized views. A general solution to this problem which works for a large class of query languages is described in the next section. The work on the general solution resulted in a demonstration that answering queries using materialized views (even when the query and the views arc just conjunctive queries) is NP-complete, whereas algorithm 301 presented here is in polynomial time. A key aspect of algorithm 301 is that it generates a plan that accesses only information sources that can possibly contribute to answering the query, given the static constraints in the query and in the site descriptions. Furthermore, we can extend algorithm 301 to cases in which both the query and the content specifications 203 of the form (1) and (2) involve aggregation, negation and recursion. using techniques described in A. Y. Levy and Y. Sagiv. "Constraints and redundancy in Datalog", In Proceedings of the Eleventh ACM Symposium on Principles of Database Systems, San Diego, Calif., June 1992; A. Y. Levy, I. S. Mumick, Y. Sagiv, and O. Shmueli, "Equivalence, query-reachability and satisfiability in Datalog extensions", In Proceedings of the ACM Symposium on Principles of Database Systems, Washington, D.C., 1993; and A. Y. Levy, I. S. Mumick, and Y. Sagiv. "Query optimization by predicate move-around", In Proceedings of the International Conference on Very Large Databases, Santiago, Chile, Sept. 1994. Answering Queries using Materialized Views Answering a query using materialized views can be done in two steps. In the first step, containment mappings from the bodies of the views to the body of the query are considered to obtain rewritings of the query. The appropriate view literals for the rewriting are added to the query. In the second step, redundant literals of the original query are removed. Once this is done, evaluation of the query is done using one of these new versions which is cheaper to evaluate than the original query. The following discussion begins with some preliminary definitions and a running example and then presents detailed descriptions of the two steps. Preliminaries In our discussion we refer to the relations used in the query as the database relations. We consider conjunctive and unions of conjunctive queries (i.e., datalog without recursion). In addition, queries may contain built-in comparison predicates (=, .noteq., < and .ltoreq.). We use V, V.sub.1, . . . , V.sub.m to denote views that are defined on the database relations. Views are also defined using queries. Given a query Q, our goal is to find an equivalent rewriting Q' of the query that uses one or more of the views: Definition 5.1 A query Q' is a rewriting of Q that uses the views =V.sub.1, . . . , V.sub.m if Q and Q' are equivalent (i.e., produce the same answer for any given database), and Q' contains one or more occurrences of literals of V. .quadrature. We consider only rewritings that have the same form as the original query (i.e., they do not use a more expressive query language than the original query). We say that a rewriting Q' is locally minimal if we cannot remove any literals from Q' and still retain equivalence to Q. A rewriting is globally minimal if there is no other rewriting with fewer literals..sup.1 .sup.1 Note that we do not count literals of built-in predicates. EXAMPLE 5.11 Consider the following query and view: q(X,U) :- p(X,Y),p.sub.0 (Y,Z),p.sub.1 (X,W),p.sub.2 (W,U). v(A,B) :- p(A,C),p.sub.0 (C,B),p.sub.1 (A,D) The query can be rewritten using v as follows: q(X,U) :- v(X,Z),p.sub.1 (X,W),p.sub.2 (W,U). Substituting the view enabled us to remove the first two literals of the query. Note, however, that although the third literal in the query is guaranteed to be satisfied by the view, we could not remove it from the query because the variable W also appears in the last literal. .rect-ver-solid..quadrature. Clearly, we would like to find rewritings that are cheaper to evaluate than the original query. The cost of evaluation will depend on many factors which differ from application to application. In this paper we consider rewritings which reduce the number of literals in the query, and in particular, reduce the number of database relation literals in the query. In fact, we will show that any rewriting of Q that contains a minimal number of literals is isomorphic to a query that contains a subset of the literals of Q and a set of view literals. Although we focus on reducing the number of literals, it should be noted that rewritings can yield optimizations even if we do not remove literals from the query, as illustrated by the following example. EXAMPLE 5.12 Using the same query as in Example 5.11, suppose we have the following view : v.sub.1 (A) :- p(A,C),p.sub.1 (A,D) We can add the view literal to the query to obtain the following rewritten query. q(X,U) :- v(X),p(X,Y),p.sub.0 (Y,Z),p.sub.1 (X,W),p.sub.2 (W,U). The view literal acts as a filter on the values of X that are considered in the query. It restricts the set of values of X to those that appear both in the relation p and p.sub.1. .rect-ver-solid..quadrature. In some applications we may not have access to any of the database relations. Therefore, it is important to consider the problem of whether the query can be rewritten using only the views. We call such rewritings complete rewritings: Definition 5.2: A rewriting Q' of Q, using =V.sub.1, . . . , V.sub.m is a complete rewriting if Q' contains only literals of and built-in predicates. .quadrature. EXAMPLE 5.13 Suppose that in addition to the query and the view of Example 5.11 we also have the following view: v.sub.2 (A,B) :- p.sub.1 (A,C),p.sub.2 (C,B),p.sub.0 (D,E). The following is a complete rewriting of q that uses v and v.sub.2 : q(X, U) :- v(X,Z), v.sub.2 (X,U). It is important to note that this rewriting cannot be achieved in a stepwise fashion by first rewriting q using v and then trying to incorporate v.sub.2 (or the other way around). Finding the complete rewriting requires that we consider the usages of both views in parallel..rect-ver-solid..quadrature. Finding Redundant Literals in the Rewritten Query In this section we describe a polynomial algorithm for the second step. Given mappings from the views to the query, the algorithm determines a set of literals from the query that can be removed. We show that under certain conditions there is a unique maximal set of such literals and the algorithm is guaranteed to find them. In other cases, the algorithm may find only a subset of the redundant literals, but all the literals it removes are guaranteed to be redundant, and therefore the algorithm is always applicable. Note that in such cases, the rest of the query can still be minimized using known techniques. Together with an algorithm for enumerating mappings from the views to the query, our algorithm provides a practical method for finding rewritings. For simplicity, we describe the algorithm for the case of rewriting using a single occurrence of a view. Suppose our query is of the form q(X) :- p.sub.1 (U.sub.1), . . . , p.sub.n (U.sub.n). (5) and we have the following view: v(Z) :- r.sub.1 (W.sub.1), . . . , r.sub.m (W.sub.m). (6) Let h be a containment mapping from the body of v into the body of q, and let the following be the result of adding the view literal to the query: q(X) :- p.sub.1 (U.sub.1), . . . , p.sub.n (U.sub.n), v(Y).(7) where Y=h(Z). Note that we can restrict ourselves to mappings where the variables of Y already appear in the p.sub.i (U.sub.i). To obtain a minimal rewriting, we want to remove as many of the p.sub.i literals as possible. To determine she set of redundant literals, consider the rule resulting from substituting the definition of Rule (6) instead of the view literal in Rule (7). That is, we rename the variables of Rule (6) as follows. Each variable T that appears in Z is renamed to h(T), and each variable of Rule (6) that does not appear in Z is renamed to a new variable (that is not already among the p.sub.i (U.sub.i)). Let the following be the result of this substitution. q(X) :- p.sub.1 (U.sub.1), . . . , p.sub.n (U.sub.n),r.sub.1 (V.sub.1), . . . , r.sub.m (V.sub.m). (8) Note that the variables of Y are the only ones that may appear in both the p.sub.i (U.sub.i) and the r.sub.j (V.sub.j). Given the mapping h, there is a natural containment mapping from Rule (8) into the original rule for q (i.e., Rule (5)) that is defined as follows. Each subgoal p.sub.i (U.sub.i) is mapped to itself and each subgoal r.sub.j (V.sub.j) is mapped to the same subgoal of Rule (5) as in the containment mapping h (from Rule (6) to Rule (5)). We will denote this containment mapping as .phi.. The following is an important observation about .phi.: The containment mapping .phi. maps each variable of Y to itself. Each subgoal p.sub.i (U.sub.i) of Rule (5) is the image (under .phi.) of itself, and maybe a few of the r.sub.j (V.sub.j) literals. We say that the literals r.sub.j (V.sub.j) that map to p.sub.i (U.sub.i) under .phi. are the associates of p.sub.i (U.sub.i). For the rest of the discussion, we choose arbitrarily one of the associates of p.sub.i (U.sub.i) and refer to it as the associate of p.sub.i (U.sub.i). Note that if h maps each subgoal r.sub.j (V.sub.j) to a unique subgoal in Rule (5), then each p.sub.i (U.sub.i) will have at most one associate. Before we define the set of redundant subgoals, we need the following definition: Definition 5.3 A subgoal r.sub.j (V.sub.j) covers a subgoal p.sub.i (U.sub.i) if all of the following hold. The subgoals r.sub.j (V.sub.j) and p.sub.i (U.sub.i) have the same predicate. If p.sub.i (U.sub.i) has a distinguished variable (or a constant) in some argument position .alpha., then r.sub.j (V.sub.j) also has that variable (or constant) in argument position .alpha.. If argument positions .alpha..sub.1 and .alpha..sub.2 of p.sub.i (U.sub.i) are equal, then so are the argument positions .alpha..sub.1 and .alpha..sub.2 of r.sub.j (V.sub.j). .quadrature. The set of redundant literals in Q will be the complement of the needed literals, defined as follows: Definition 5.4: The set is the minimal set satisfying the following four conditions. 1. All the p.sub.i (U.sub.i) that do not have associates are in . 2. If r.sub.j (V.sub.j) is the associate of p.sub.i (U.sub.i) and r.sub.j (V.sub.j) does not cover p.sub.i (U.sub.i), then p.sub.i (U.sub.i) is in . 3. Suppose that all of the following hold. Subgoal p.sub.i (U.sub.i) has the variable T in argument position .alpha..sub.1. The associate of p.sub.i (U.sub.i) has the variable.sup.2 H in argument position .alpha..sub.1. .sup.2 Note that the associate of p.sub.i (U.sub.i) cannot have a constant in argument position .alpha..sub.1 if p.sub.i (U.sub.i) has a variable in that argument position. The variable H is not in Y (hence, H appears only among the r.sub.j (V.sub.j)). The variable T also appears in argument position .alpha..sub.2 of p.sub.1 (U.sub.l). The associate of p.sub.l (U.sub.l) does not have H in argument position .alpha..sub.2. Then p.sub.i (U.sub.i) is in . 4. Suppose that p.sub.i (U.sub.i) is in and that variable T appears in p.sub.i (U.sub.i). If p.sub.l (U.sub.l) has variable T in argument position .alpha. and its associate does not have T in argument position .alpha., then p.sub.l (U.sub.l)is also in . .quadrature. EXAMPLE 5.14 Consider the query and the view of Example 5.11. The result of substituting the view in the query would be the following: q(X,U) :- p(X,Y),p.sub.0 (Y,Z),p.sub.1 (X,W),p.sub.2 (W,U),p(X,C),p.sub.0 (C,Z),p.sub.1 (X,D). The literal p.sub.2 (W,U) is needed because it does not have an associate. The literal p.sub.1 (X,W) is needed by condition 4 in the definition, because its associate p.sub.1 (X,D) does not contain the variable W (which appears in p.sub.2 (W,U)). Consequently, these two literals need to be retained to obtain the minimal rewriting. .quadrature. Further details and the proofs of complexity may be found in A. Y. Levy, A. O. Mendelzon, Y. Sagiv, and D. Srivastava. "Answering queries using views", Submitted for publication, 1994. Using Completeness Information In generating a plan for answering a query, algorithm 301 accesses all (and only) sources that may contribute to answering the query. While this may be necessary in general, there are many cases where a small subset of the relevant site relations contains all the information needed to answer the query. Since completeness information of single sources can be expressed in the content specification 203 (using specifications of the forms (2) and (4)), the query processor can effectively use these forms of content specification 203 to ignore redundant sites. EXAMPLE 5.15 Consider the airline flight application. Let the site relation ta.sub.-- dir contain listings of all travel agents in the U.S. and let the site relation bigapple.sub.-- dir contain listings of all telephone customers in the 212 area code. Accessing both these site relations is redundant in order to answer a query that asks for the phone number of a specific travel agent in the 212 area code, although both these site relations are relevant to answering this query. Querying either of these two site relations suffices. Both these site relations are also relevant to answer the query that asks for the phone number of a specific travel agent (without knowing the area code of the travel agent). However, querying ta.sub.-- dir is sufficient in this case, though querying bigapple.sub.-- dir may not be sufficient. .quadrature. Intuitively, we use content specifications of the form (2) as follows. Given that we are trying to compute tuples of a world-view relation E that satisfy the constraint C, we search for a minimal set SD.sub.1, . . . SD.sub.n of content specifications 205 which together can be used to compute all the tuples of E that satisfy C. Formally, the algorithm for doing this is the following. Suppose we are trying to compute the tuples of E(W) that satisfy the constraint C(W). Our algorithm chooses a set .sub.E ={SD.sub.1, . . . , SD.sub.n } of content specifications of the form (2): C.sub.R.sup.j (Y.sup.j),R.sub.1.sup.j (X.sub.1.sup.j), . . . , R.sub.k.sup.j (X.sub.k.sup.j)=C.sub.E.sup.j (W),E(W) for 1.ltoreq.j.ltoreq.n such that: C(W) C.sub.E.sup.1 (W) . . . C.sub.E.sup.n (W). There is no subset of .sub.E that satisfies the first property. If such a set does not exist for C(W), then let C'(W) be the weakest constraint for which such a set does exist. (The constraint C'(W) can be obtained by conjoining C(W) with the disjunction of the C.sub.E 's of all content descriptions of the form (2).) The tuples of E(W) that satisfy the constraint C'(W) can be computed using content specifications 205 of the form (2), as above. Furthermore, let C"(W) be C(W)/C'(W). The tuples of E(W) that satisfy the constraint C"(W) can be computed using the other content specifications 205, as described in Algorithm 301. Although the above algorithm is not a polynomial time algorithm (even for order constraints), the complexity of the algorithm is in the size of the representation of the query constraints and the site description constraints, not in the size or number of the site relations. Dynamic Query Plans In traditional database systems, the plan execution comes strictly after the query is optimized and the complete plan for evaluating the query is generated. Although such a static query plan is adequate for traditional database system applications, global information systems require dynamic plans, where the query plan generation phase interacts with the plan execution phase. The following example illustrates the benefits of postponing generating plans for sub-queries until run-time, when values are known for some of the query variables. EXAMPLE 5.16 Consider the airline flight application. The following query retrieves the telephone numbers of travel agents in Manhattan, N.Y.: query(AC, TelNo) :- areaCode(`Manhattan, N.Y.`, AC), travelAgent(Ag), dir(Ag, AC, TelNo). The constraint travelAgent(Ag) present statically in the query entails that directory information sources that do not contain listings of travel agents are irrelevant to answering the query. However, in the absence of knowledge about tuples in the world-view relation areaCode (which are computed only at run-time), the query plan would have to treat all other directory information sources (e.g., the one for the 908 area code) as relevant to the query. However, once the sub-query areaCode(`Manhattan, N.Y.`,AC) is evaluated, the bindings for AC (in this case just 212) can be used to restrict the set of relevant directory information sources to only those with area code 212. .quadrature. To be able to perform such optimizations, it is necessary that we pass sideways values computed for some of the query variables to create or modify segments of the query plan dynamically, i.e., at run-time. The following example illustrates the optimization benefits of passing not just values of the query variables, but also additional information obtained at run-time. EXAMPLE 5.17 Suppose that unitedAgent and americanAgent were disjoint subconcepts of the concept travelAgent, i.e., no travel agent is both an agent for United Airlines and for American Airlines. Assume that the United Airlines information source provides a directory service for United Airlines agents ua.sub.-- dir(Ag, AC, TelNo), and American Airlines provides a directory service for American Airlines agents aa.sub.-- dir(Ag, AC, TelNo). The content specifications 205 for these site relations are as follows: ua.sub.-- agents(Ag, AC, TelNo).OR right.unitedAgent(Ag), dir(Ag, AC, TelNo). aa.sub.-- dir(Ag, AC, TelNo).OR right.americanAgent(Ag),dir(Ag, AC, TelNo). Consider now the following query that retrieves the telephone numbers of award-winning travel agents (a subconcept of travelAgent). query(AC, TelNo) :- awardTravelAgent(Ag),dir(Ag, AC, TelNo). If a binding for awardTravelAgent(Ag) was found at a site that only had information about United Airlines agents, this information could be used to determine that the site relation aa.sub.-- dir is irrelevant for answering the query, therefore showing that knowing the source from where the binding for Ag was found can be used to prune the directory sources where no matching listing would be found. .quadrature. The above examples illustrate the two key features of dynamic query plan generation: 1. Postpone planning for sub-queries until run-time, when sufficient information is available to determine a small set of relevant sources. 2. Pass additional information obtained at run-time, not just values of query variables, to the query optimizer. We have identified two additional pieces of information that are very useful for pruning information sources, and which can be easily determined from the site descriptions, and passed in the binding information for query variables: (1) the type of the value, and (2) the location where the value was found. Details concerning the information and how to use it in an algorithm for dynamically generating a query are presented below. A second reason for supporting dynamic query plans in a global information system is that when the external information sources are distributed over a computer network, it is quite likely some external sources are unavailable when required. In the presence of alternative information sources that can provide the same information (because of redundancy in the autonomous information sources), the query plan must be dynamically modifiable. Types of Information which are Useful in Dynamic Query Generation The following discussion provides details about the selection of information which is useful in dynamic query generation. The discussion is based on Craig A. Knoblock and Alon Levy, "Efficient Query Processing for Information Gathering Agents", to appear in working notes of the 1995 AAAI Spring Symposium on Information Gathering in Distributed and Heterogeneous Environments, available from AAAI. In the following, C, C.sub.i etc. denote classes in domain model 111. Binary relations among objects in domain model 111 are represented by roles (denoted by r, r.sub.i etc. ). The discussion also employs a running example in which system 101 has received a query concerning the publications of Ron Brachman, who is a researcher in artificial intelligence at ATT Bell Laboratories. An information source 123 s can be viewed as providing some knowledge about a class in the domain model C.sub.s. It can either provide some or all of the instances of the class C.sub.s. In the latter case we will say that s is a complete source. The source s also provides some role fillers for the instances it knows about. Formally, s provides the role fillers for the roles r.sub.1.sup.s, . . . r.sub.n.sup.s. For each role, s may provide all the fillers or only some of them. The information about which class and roles s knows about it is contained in information source description 113 for s. We can now describe the kinds of information that can be obtained by system 101 at run time and how they can be used. The first set of information types (called domain information) include information about the class hierarchy and individuals in those classes. Specifically, we have identified the following types of information: Membership An individual being a member (or not a member of a class), for example, Ron Brachman being an instance of AI-researcher. Fillers One or more individuals filling a role of another individual (or not being a filler of a role), for example, that the affiliation of Ron Brachman is AT&T Bell Labs. Size The size of a class or the number of fillers of a role. Constraints High level constraints on classes or fillers of roles (e.g., all fillers are in a certain range). Relationships Relationships between different classes or roles (e.g., one class contains another)..sup.3 .sup.3 Note that intensional subsumption relationships between classes are can be inferred in the domain model. This class of information refers to extensional containment relationships, e.g., in the current state, all instances of C.sub.1 are also instances of C.sub.2. The second set of information types (called source information) are like the above types, but concerns knowledge about information sources, and not about the domain model's class hierarchy: Membership An individual being found in an information sources (or not being found there). Fillers One or more individuals filling a role of another individual in a specific in formation source. Size The number of class instances found in a specific information sources. Constraints High level constraints specific to an information source (e.g., an information source only contains Bell Labs researchers). Relationships Relationships between different classes or roles (e.g., source s.sub.1 containing all the data in source s.sub.2). It should be noted that in some cases the domain information can be inferred from the source information, and the description of the sources. Using the Information to Optimize Queries There are several ways in which the information types outlined above can be used to optimize queries: Membership Membership information can be useful in identifying an information source that is likely to contain additional information. If we found the individual .alpha. in source s, and a subsequent subgoal asks for the filler of a role r of .alpha., we will first check whether s contains fillers for r (which will be known in the description). Note that this type of information is especially useful because typically information sources will only have part of the instances of a class, and therefore, finding an instance in a given information sources is a significant piece of information. Fillers Information about specific fillers for roles can be used to constrain the queries to other information sources. For example, if we learn the area code for Bob Jones from one information source, then it can be incorporated into the query sent to another information source. Size Size information about classes and intermediate results is useful in ordering subgoals in a query. Traditional query processing systems estimate sizes before processing starts, but using actual size information may be critical when good estimates are unavailable. Relationships The main use of additional domain model information is to rule out possible information sources. Knowing that an individual belongs to a more specific class that can be inferred from the query enables us to limit the number of sources considered in later subgoals of the query that contain the individual as a binding. For example, knowing that Ron Brachman is an AI researcher enables us to focus on paper repositories that provide AI publications. Knowing that he is an AT&T employee provides a justification for considering first a paper repository from AT&T researchers. Constraints Domain-level constraints can be used by propagating the restrictions from one subgoal to the next. This is similar to some of the reformulations done with semantic query optimization, except that the constraints are identified dynamically instead of using precompiled information. Completeness Completeness information about a class (or the fillers of a role) enable us to stop searching for more instances of the class (or fillers of that role). Obtaining Domain and Source Information A second dimension along which dynamic query processing methods differ is the way that the domain and source information are obtained: Information can be found by simply solving subgoals in the query. Instead of recording only the values of the bindings that are found in solving a subgoal, we can also record the information sources in which they are found. Additional domain knowledge can be inferred from the description of the information source in which the binding was found. For example, if Ron Brachman was found in the AAAI-fellow information source, then we can infer that he is a member of the class AAAI-fellows, which is a subclass of AI-researcher. If Brachman was not found in an information source that contains all physics researchers, then we can infer that he is not a physicist. Details of this technique are presented below. Information about a binding can be found in the process of trying to solve the subgoal that needs the information. For example, we may begin considering a few paper repositories to find Brachman's papers, and by doing so figure out that he is a member of AI-researcher class. This will enable us to prune the subsequent paper repositories we consider. Information gained in solving previous queries can be used. The challenge here is to remember from previous queries only information that may be relevant in future queries, and will not change rapidly. Finally, an the information agent can create new subqueries in order to actively seek information about bindings. For example, by considering the descriptions of information sources providing paper repositories, the agent can determine that knowing the affiliation and field of an author dramatically reduce the number of relevant information sources. Therefore, the agent may first pose a query looking for Brachman's field and affiliation, before solving the query. Algorithm for Dynamically Generating a Query Plan: FIG. 4 In overview, the algorithm shown in FIG. 4 works by using type information received from information source 123 to prune the sub-plans used to compute the tuples for the rest of the query. In detail, algorithm 401 for dynamically generating a query plan first determines a join order using traditional techniques. Then, algorithm 401 operates in two phases when evaluating each conjunct in the query. In the first phase 405, algorithm 401 uses the known bindings for the query variables to generate a sub-plan for evaluating the conjunct. In the second phase 407, algorithm 401 accesses the relevant information sources and generates new bindings for the query variables using type information received from the relevant information sources. The type information appears in algorithm 401 as C.sub.R.sup.SD 409, that is, a constraint on the external site relation. In other embodiments, information other than type binding information may be used. Algorithm 401 alternates between phase 405 and 407 until each conjunct in the query has been evaluated, and the query answered. Although algorithm 401 chooses a join order at compile-time, it is straightforward to extend the algorithm to use the binding information to decide on a join order dynamically. It is important to stress that all the type information 409 that algorithm 401 uses for optimizing queries at run-time is available statically in the query and the various site descriptions. In principle, it is possible to generate all possible query plans at compile-time and merely choose from amongst these plans at run-time. Practically speaking, the large number of information sources makes this approach quite infeasible, and our algorithm creates plans for segments of the query at run-time. Access Plan Generation and Execution 119 in a Preferred Embodiment: FIG. 5 In order to implement algorithm 401, access plan generation and execution component 119 of system 101 must be modified as shown in FIG. 5. Component 119 has two subcomponents: query plan generator 509 and query plan executor 519. Query plan generator 509 responds to an information access description 501 from KBS 109 which contains site descriptions 201 by generating a query plan 511 which is made up of a number of subplans 512. Each subplan 512 is sent in turn to query plan executor 519. Query plan executor 519 executes the current subplan 512 by producing subquery protocol 525 for querying the information source 123 specified in current subplan 512. When the protocol is executed, it returns subquery results 523 and additional information 517 to query plan executor 519, which retains subplan results 523 and returns additional information 517 to query plan generator 509, which then prunes the remaining subplans 512 on the basis of the additional information. When all of the necessary subplans have been executed, the retained subquery results 523 go to graphical user interface 103 as query results 521. In a presently-preferred embodiment, the additional information is treated as a constraint which applies to subplan result 523. That constraint is then applied to the concept for which the subplan was retrieving instances. If query plan 511 has unexecuted subplans 512 which include that concept and a constraint which is mutually satisfiable with the constraint defined by the additional information, those unexecuted subplans 512 may be pruned from query plan 511. Conclusion The foregoing Detailed Description has disclosed to those skilled in the art the best mode presently known to the inventors for implementing certain improvements to the information retrieval system disclosed in the parent of the present patent application. All of these improvements permit optimizations in query plans. One group of improvements are found in the site descriptions. The site descriptions include components that indicate whether an information source can completely answer a query or provide only a partial answer, components that indicate what kinds of queries an information source can answer, and components that indicate which of those queries can be answered efficiently. The information retrieval system uses all of these components to generate more efficient query plans. Another group of improvements involves dynamic query plan generation. The information returned by an information source includes not only the results of the subquery performed on the information source, but also information additional information which can then be used to modify the remaining query plan. Additional information about the source of the subquery results and the type of the results have been found to be particularly useful. While the foregoing improvements are particularly advantageous when implemented in an information retrieval system like that disclosed in the present patent application, they may be implemented in other information retrieval systems as well. In particular, both the site descriptions and the additional information can be used for query plan optimization in standard data base systems, as well as in information retrieval systems which include knowledge bases. It should further be pointed out that what is important for the improvements is the information they provide for optimizing, not the particular manner in which that information is represented. Further, while the specific optimizing algorithms disclosed herein are the best presently known to the inventors, other algorithms may employ the same information to achieve similar ends. All of the above being the case, the foregoing Detailed Description is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the law.
|
Same subclass Same class Consider this |
||||||||||
