Database system index selection using candidate index selection for a workload5960423Abstract An index selection tool helps reduce costs in time and memory in selecting an index configuration or set of indexes for use by a database server in accessing a database in accordance with a workload of queries. The index selection tool attempts to reduce the number of indexes to be considered, the number of index configurations to be enumerated, and the number of invocations of a query optimizer in selecting an index configuration for the workload. Claims What is claimed is: Description FIELD OF THE INVENTION
______________________________________
SELECT * FROM onektup, tenktup1
WHERE (onektup.unique1 = tenktup1.unique1)
AND (tenktup1.unique2 between 0 and 1000)
is {onektup.unique1, tenktup1.unique1, tenktup1.unique2}.
______________________________________
The indexable column set for an index configuration is the minimum set of columns of database schema 306 such that each index in that index configuration is on one or more column of that column set. A required column set for a query with respect to a table of database 210 is the set of on or more columns of that table that are needed to answer the query. Cost Evaluation of a Workload of Queries for Candidate Index Configurations Given the queries of workload 304 and a set of candidate index configurations, cost evaluation tool 320 determines a cost of workload 304 for the candidate index configurations. Cost evaluation tool 320 uses query optimizer 240 to determine costs to execute queries of workload 304 against database 210 for the candidate index configurations. Based on the costs determined by query optimizer 240, cost evaluation tool 320 determines the cost of workload 304 for the candidate index configurations. As illustrated in FIG. 3, cost evaluation tool 320 maintains a cost table 322 for storing a cost of each query of workload 304 and a total cost of workload 304 for each candidate index configuration. Cost table 322 may be implemented in the form of one or more suitable data structures and may also be stored on any suitable computer-readable medium. Cost table 322 for one embodiment is organized in the following table format.
______________________________________
Q.sub.1 Q.sub.2 . . . Q.sub.n Total Cost
______________________________________
C.sub.1
Cost(Q.sub.1,C.sub.1)
Cost(Q.sub.2, C.sub.1)
. . .
Cost(Q.sub.n,C.sub.1)
Cost(W,C.sub.1)
C.sub.2
Cost(Q.sub.1,C.sub.2)
Cost(Q.sub.2,C.sub.2)
. . .
Cost(Q.sub.n,C.sub.2)
Cost(W,C.sub.2)
. . . . . . . . . . . .
. . . . . .
C.sub.m
Cost(Q.sub.1,C.sub.m)
Cost(Q.sub.2,C.sub.m)
. . .
Cost(Q.sub.n,C.sub.m)
Cost(W,C.sub.m)
______________________________________
For this cost table, Q.sub.i represents each query of workload 304, C.sub.j represents each candidate index configuration under evaluation by cost evaluation tool 320, Cost(Q.sub.i,C.sub.j) represents a cost as determined by cost evaluation tool 320 to execute query Q.sub.i against database 210 for candidate index configuration C.sub.j, and Cost(W,C.sub.j) represents a cost to execute workload 304 against database 210 for candidate index configuration C.sub.j and is determined as the sum of the costs of each query Q.sub.i of workload 304 for candidate index configuration C.sub.j. Cost evaluation tool 320 may invoke query optimizer 240 to obtain a cost of each query of workload 304 for each candidate index configuration. As illustrated in FIG. 3, cost evaluation tool 320 for one embodiment presents a request-to-optimize command 324 designating a query and a candidate index configuration to query optimizer 240. Query optimizer 240 estimates a cost to execute the designated query for the designated candidate index configuration and returns an execution plan 241 comprising the cost estimate. For N queries of workload 304 and M candidate index configurations under evaluation by cost evaluation tool 320, cost evaluation tool 320 would have to invoke query optimizer 240 N*M times. Cost evaluation tool 320 attempts to reduce the number of invocations of query optimize 240 by determining costs of queries of workload 304 for one or more candidate index configurations without invoking query optimizer 240 based on costs determined by query optimizer 240 for one or more other candidate index configurations that are atomic for workload 304. Cost evaluation tool 320 determines a candidate index configuration is atomic for a query Q of workload 304 if the execution of the query Q against database 210 for the index configuration may use all of the indexes of the index configuration. That is, a candidate index configuration is atomic for a query Q if each index of the candidate index configuration is on one or more columns of the indexable column set of the query Q. The largest atomic index configurations for a query are also referred to as the maximal atomic index configurations. A candidate index configuration is atomic for workload 304 if the index configuration is atomic for at least one query of workload 304. That is, a candidate index configuration is atomic for workload 304 if query optimizer 240 is to be invoked for at least one query of workload 304 to determine a total cost of workload 304 for the index configuration. Cost evaluation tool 320 for one embodiment determines a cost of a query Q for a candidate index configuration C based on costs for atomic index configurations C'.sub.i in accordance with a flow diagram 500 illustrated in FIG. 5. The candidate index configuration C may or may not be an atomic index configuration. For step 502 of FIG. 5, cost evaluation tool 320 determines a set of n atomic index configurations {C'.sub.1, . . . , C'.sub.n } from among a set of candidate index configurations such that a cost of the query Q for a candidate index configuration C, or Cost(Q,C), can be determined from costs of the query Q for the atomic index configurations C'.sub.i, or {Cost(Q,C'.sub.i)}. Each of the n atomic index configurations C'.sub.i for step 502 is a subset of the candidate index configuration C. Cost evaluation tool 320 for steps 504, 506, 508, 510, 512, and 514 of FIG. 5 determines costs of the query Q for each of the atomic index configurations C'.sub.i. For step 504, cost evaluation tool 320 initializes a variable i to 1 for use in identifying each atomic index configuration C'.sub.i determined for step 502. Cost evaluation tool 320 for step 506 determines whether a cost of the query Q for the atomic index configuration C'.sub.i has been determined. If not, cost evaluation tool 320 for step 508 uses what-if indexes create and load manager 310 to create and/or load any what-if indexes as necessary to enable query optimizer 240 to determine a cost of the query Q for the atomic index configuration C'.sub.i. Cost evaluation tool 320 for step 510 then invokes query optimizer 240 to determine Cost(Q,C'.sub.i). For step 512, cost evaluation tool 320 determines whether costs for each of the n atomic index configurations C'.sub.i have been determined and, if not, increments the variable i for step 514 to evaluate another atomic index configuration C'.sub.i. If cost evaluation tool 320 determines for step 506 that Cost(Q,C'.sub.i) has not been determined, cost evaluation tool 320 proceeds directly to step 512. Cost evaluation tool 320 may have previously determined the cost of the query Q for atomic index configuration C'.sub.i, for example, in determining the cost of the query Q for another candidate index configuration. Cost evaluation tool 320 repeats steps 506 through 514 in this manner until cost evaluation tool 320 determines for step 512 that costs for each of the n atomic index configurations C'.sub.i have been determined. Cost evaluation tool 320 for step 516 then determines a cost of the query Q for the candidate index configuration C from the determined costs of the query for the atomic index configurations C'.sub.i without invoking query optimizer 240. For one embodiment, cost evaluation tool 320 determines a cost of the query Q for the candidate index configuration C as the minimum of the determined costs of the query Q for the atomic index configurations C'.sub.i, that is Cost(Q,C) =Min{Cost(Q,C'.sub.i)}. Because inclusion of an index in an index configuration can only reduce the cost of some queries, such as an SQL Select query for example, cost evaluation tool 320 may determine a cost of such a query Q for a candidate index configuration C as the minimum cost of the query Q for the largest atomic index configurations C'.sub.i. As one example, let I.sub.1, I.sub.2, and I.sub.3 be one-column indexes on onektup.unique1, tenktup1.unique1, and tenktup1.unique2, respectively, for the following query Q.sub.1.
______________________________________
SELECT * FROM onektup, tenktup1
WHERE (onektup.unique1 = tenktup1.unique1)
AND (tenktup1.unique2 between 0 and 1000)
______________________________________
Let I.sub.4 be an index on a different column in the indexable column set of another query. The determined cost to execute the query Q.sub.1 for the index configuration C={I.sub.1, I.sub.2, I.sub.1 } is the minimum cost to execute the query Q.sub.1 for the atomic index configurations for the query Q.sub.1 that are subsets of the index configuration C, that is for the atomic index configurations {}, {I.sub.1 }, {I.sub.2 }, and {I.sub.1, I.sub.2 }. Cost evaluation tool 320 may determine costs of any suitable query, such as SQL Select and Update queries for example, in accordance with flow diagram 500. The cost of a query Q such as an SQL, Insert or Delete query, for example, depends on the following cost factors: (1) the cost of selection, (2) the cost of updating the table and the indexes that may be used for selection, and (3) the cost of updating indexes that do not affect the selection cost. Costs of updating each index I for cost factor (3) are independent of one another and arc also independent of the execution plan selected by query optimizer 240 for cost factors (1) and (2). Query optimizer 240 will therefore select an execution plan that minimizes cost factors (1) and (2) of the query Q for an index configuration C. Similar to the determination of costs for queries such as SQL Select and Update queries, cost evaluation tool 320 for one embodiment determines for cost factors (1) and (2) a cost T of the query Q for the index configuration C as the determined minimum cost over the atomic index configurations for the query Q that are subsets of the index configuration C. Cost evaluation to 320 for one embodiment determines for cost factor (3) a cost of the query Q as the sum of the determined cost of the query Q for each index I.sub.j for cost factor (3), that is .SIGMA..sub.j (Cost(Q,{I.sub.j })-Cost(Q, {})). Cost evaluation tool 320 therefore determines a total cost of the query Q for the index configuration C as the sum of costs for cost factors (1), (2), and (3), that is Cost(Q,C)=T+.SIGMA..sub.j (Cost(Q,{I.sub.j })-Cost(Q,{})). The total number of atomic index configurations for workload 304 can be very large. The number of atomic index configurations for multi-table queries, for example, is exponential relative to the number of tables. Because the characteristics of database server 220 influence whether query optimizer 240 must be invoked to determine a cost of a query for an index configuration, cost evaluation tool 320 for step 502 may reduce the set of atomic index configurations C'.sub.i by excluding one or more candidate index configurations in accordance with one or more constraints of query optimizer 240. Also, every candidate index may potentially be either a clustered index for a table or a non-clustered index. Because the choice of the clustered index for a table affects the choice of indexes over other tables, the choice of the clustered index for a table cannot be made local to the table. As one exemplary optimizer constraint, database server 220 typically intersects at most some predetermined number of indexes, whether clustered or non-clustered, to identify tuples of one relation. Cost evaluation tool 320 for step 502 may therefore exclude from the set of atomic index configurations any candidate index configurations comprising more than the predetermined number of indexes for any one table or correlation of database 210. The predetermined number of indexes for one embodiment is two. For another embodiment where database server 220 does not support index intersection, for example, the predetermined number may be one. As another exemplary optimizer constraint, the first few joins of a multi-table query typically impact the cost to execute the query the most. Cost evaluation tool 320 for step 502 may therefore exclude from the set of atomic index configurations any candidate index configurations comprising indexes on more than a predetermined number of tables. The predetermined number of indexes for one embodiment is two. Atomic index configurations comprising indexes on at most two tables as determined for step 502 in accordance with this optimizer constraint are referred to as single-join atomic index configurations. Similarly, n-join atomic index configurations refers to atomic index configurations comprising indexes on at most n+1 tables. As one example for a Select query with conditions T.sub.1.A<20, T.sub.1.A=T.sub.2.B, T.sub.3.C BETWEEN ›30,50!, and T.sub.3.C=T.sub.2.B, one atomic index configuration for this query would be (T.sub.1.A, T.sub.2.B, T.sub.3.C) because indexes on all three of these tables may be used to answer the query. Because of a single-join atomic index configuration constraint, however, cost evaluation tool 320 would not evaluate this three-table atomic index configuration. Cost evaluation tool 320 therefore determines a cost of the query for this atomic index configuration as the minimum cost for the two-table atomic index configurations (T.sub.1.A, T.sub.2.B), (T.sub.1.A, T.sub.3.C), and (T.sub.2.B, T.sub.3.C). In accordance with another exemplary optimizer constraint, cost evaluation tool 320 for step 502 may also exclude from the set of atomic index configurations any candidate index configurations comprising one index on a table of database 210 for exploiting a join condition and comprising a different index on the same table for exploiting a selection condition. As one example, let I.sub.1, I.sub.2, and I.sub.3 be one-column indexes on onektup.unique1, tenktup1.unique1, and tenktup1.unique2, respectively, for the following query Q.sub.1.
______________________________________
SELECT * FROM onektup, tenktup1
WHERE (onektup.unique1 = tenktup1.unique1)
AND (tenktup1.unique2 between 0 and 1000)
______________________________________
The set of atomic index configurations for the query Q.sub.1 comprises {}, {I.sub.1 }, {I.sub.2 }, {I.sub.3 }, {I.sub.1, I.sub.2 }, {I.sub.1, I.sub.3 }. The index configurations {I.sub.2, I.sub.3 } and {I.sub.1, I.sub.2, I.sub.3 } may be excluded from this set of atomic index configurations because the indexes I.sub.2 and I.sub.3 must use a join predicate and a selection predicate, respectively, on the same table tenktup1. Cost evaluation tool 320 for one embodiment may determine, based on the interaction among indexes of an atomic index configuration, whether the cost of a query for the atomic index configuration needs to be determined in order to determine a cost of the query for another index configuration. With reference to flow diagram 1200 of FIG. 12, cost evaluation tool 320 for one embodiment may determine whether to invoke query optimizer 240 for an atomic index configuration C'.sub.i even though the atomic index configuration C'.sub.i has not been evaluated as determined for step 506 of FIG. 12. If cost evaluation tool 320 determines for step 506 that the atomic index configuration C'.sub.i has not been evaluated, cost evaluation tool 320 may determine for step 507 whether the atomic index configuration C'.sub.i is an extension of another atomic index configuration C'.sub.y that has been determined to comprise interacting indexes. With reference to FIG. 13, cost evaluation tool 320 determines an atomic index configuration C'.sub.i is an extension of another index configuration C'.sub.y for step 507 if the atomic index configuration C'.sub.i comprises all indexes of the index configuration C'.sub.y and only one additional index that is a member of the current set of candidate indexes (step 1302). Cost evaluation tool 320 determines atomic index configuration C'.sub.y comprises interacting indexes for step 507 if the evaluated cost of the query Q for the atomic index configuration C'.sub.y, as determined by invoking query optimizer 240 for the query Q and the atomic index configuration C'.sub.y (step 1304), is at least x% less (step 1308) than the derived cost of the query Q for the atomic index configuration C'.sub.y, as determined by deriving the cost of the query Q for the atomic index configuration C'.sub.y using costs of the query Q for other atomic index configurations (step 1306). The variable x may have any suitable predetermined value and for one embodiment is 20%. The value of the variable x for one embodiment may be determined by a user of index selection tool 300. If cost evaluation tool 320 for step 507 determines the atomic index configuration C'.sub.i is an extension of another atomic index configuration C'.sub.y that has been determined to comprise interacting indexes, cost evaluation tool 320 determines the cost of the query Q for the atomic index configuration C'.sub.i for steps 508 and 510 of FIG. 12. Otherwise, cost evaluation tool 320 proceeds to step 512 without determining the cost of the query Q for the atomic index configuration C'.sub.i. For an Update, Insert, or Delete query Q, cost evaluation tool 320 determines atomic index configuration C'.sub.y comprises interacting indexes for step 507 if the evaluated cost of the query Q for the atomic index configuration C'.sub.y is at least z% different than the derived cost of the query Q for the atomic index configuration C'.sub.y. The variable z may have any suitable predetermined value and for one embodiment is 20%. The value of the variable z for one embodiment may be determined by a user of index selection tool 300. Cost evaluation tool 320 may attempt to further reduce the number of invocations of query optimizer 240 by determining costs of queries for one or more atomic index configurations based on costs determined by query optimizer 240 for other atomic candidate index configurations. Only indexes on one or more columns of the indexable column set P for a query Q such as an SQL Select or Update query, for example, affects the cost of the query Q. Cost evaluation tool 320 for one embodiment may therefore determine a cost of the query Q for an index configuration C as the cost of the query Q for an index configuration C" that is the largest subset of the index configuration C and that has an indexable column set equal to the intersection of the indexable column set P for the query Q and the indexable column set P' for the index configuration C. That is, Cost(Q,C)=Cost(Q,C"). Any index that is not in the index configuration C" does not affect the cost of the query Q for the index configuration C. The determination of costs in this manner is referred to as relevant index set optimization. If the index configuration C" is empty, the determined cost of the query Q for the index configuration C is the same as that over database 210 with no indexes. The determination of costs in this manner is referred to as irrelevant index set optimization. In determining a cost of a query Q for an index configuration C, cost evaluation tool 320 for one embodiment may attempt to reduce the number of atomic index configurations for step 502 of FIG. 5 by determining the set of atomic index configurations C'.sub.i in accordance with the relevant index set optimization technique. As one example of relevant index set optimization, let I.sub.1 and I.sub.2 be indexes on onektup.unique1 and tenktup.unique1, respectively. The determined cost of the following query Q.sub.2 :
______________________________________
SELECT * FROM onektup
WHERE unique1 < 100
______________________________________
for the index configuration {I.sub.1, I.sub.2 } is the determined cost of the query Q.sub.2 for the index configuration {I.sub.1 }, that is Cost(Q.sub.2,{I.sub.1,I.sub.2 })=Cost(Q.sub.2,{I.sub.1 }), as the indexable column set of the query Q.sub.2 is {onektup.unique1} and the indexable column set of the index configuration {I.sub.1, I.sub.2 } is {onektup.unique1, tenktup.unique1}. If cost evaluation tool 320 has already determined Cost(Q.sub.2,{I.sub.1 }), cost evaluation tool 320 may then determine Cost(Q.sub.2,{I.sub.1,I.sub.2 }) without invoking query optimizer 240. Cost evaluation tool 320 may determine costs with the relevant index set optimization technique for any suitable query, such as SQL Select and Update queries, for example. The cost of a query Q such as an SQL Insert or Delete query, for example, may not be determined with the relevant index set optimization technique because insertions and deletions affect all indexes on a table. The costs for updating each index I.sub.i that is not a member of any atomic index configuration for the query Q, however, are independent of one another. Cost evaluation tool 320 may therefore determine a cost of the query Q for an index configuration P.sub.i =P+{I.sub.i } comprising an atomic index configuration P for the query Q and an index I.sub.i that is not a member of any atomic index configuration for the query Q as the sum of the determined cost of the query Q for the atomic index configuration P and the determined cost to update the index I.sub.i. That is Cost(Q,P.sub.i)=Cost(Q,P)+Index-Update-Cost(Q,{I.sub.i }). If cost evaluation tool 320 has already determined Cost(Q,P) and Index-Update-Cost(Q,{I.sub.i }), cost evaluation tool 320 may then determine Cost(Q,P.sub.i) without invoking query optimizer 240. Cost evaluation tool 320 for one embodiment helps reduce the number of invocations of query optimizer 240 by evaluating atomic index configurations for flow diagram 500 of FIG. 5, for example, in order of increasing configuration size. Cost evaluation tool 320 may evaluate single-table atomic index configurations in order of increasing configuration size, that is, single indexes over single columns, single indexes over multi-columns, and single table index configurations with multiple indexes in order of increasing configuration size, and may then evaluate multi-table atomic index configurations in order of increasing configuration size. Cost evaluation tool 320 for one embodiment invokes query optimizer 240 on-demand. That is, cost evaluation tool 320 invokes query optimizer 240 to determine a cost of the query Q or each atomic index configuration C'.sub.i only as needed to determine a cost of the query Q for the index configuration C. Cost evaluation tool 320 for one embodiment invokes query optimizer 240 in batches of queries so that cost estimates for more than one query over the same index configuration may be determined. Batching queries in this manner helps reduce costs otherwise incurred in presenting each index configuration/query pair to query optimizer 240 separately and also helps ensure that the loading of what-if indexes to determine costs of queries for an index configuration is performed only once. Cost evaluation tool 320 for one embodiment invokes query optimizer 240 in a no-execution mode so as to request from query optimizer 240 only cost estimates of queries for index configurations. Candidate Index Selection for a Workload Given the queries of workload 304, candidate index selection tool 330 determines a set of candidate indexes and therefore candidate index configurations for evaluation by index selection tool 300 in selecting index configuration 302. The set of candidate indexes for evaluation by index selection tool 300 may comprise each possible combination of admissible indexes for workload 304. An admissible index belongs to the indexable column set of at least one query of workload 304. The space of all admissible indexes and therefore candidate index configurations over database 210 can be very large and can incur a significant cost in time and memory in searching the space of candidate index configurations to select index configuration 302. Candidate index selection tool 330 attempts to reduce the number of indexes and therefore index configurations for evaluation for the queries of workload 304 by determining a query-specific index configuration for each query of workload 304. As illustrated in FIG. 6, candidate index selection tool 330 determines a set of candidate indexes 338 for evaluation by index selection tool 300 as the union of indexes of the determined query-specific index configurations. This technique is referred to as query-specific-best-configuration candidate index selection and presumes an index that is not a member of an optimal index configuration for any query of workload 304 would not likely be selected as a member of an optimal index configuration for the entire workload 304. Candidate index selection tool 330 for one embodiment comprises a syntactic index selection tool 334 and an index configuration enumeration tool 336, as illustrated in FIG. 6. Syntactic index selection tool 334 and index configuration enumeration tool 336 for one embodiment are each implemented as program modules or computer-executable instructions and may be stored on any suitable computer-readable medium for execution in a suitable operating environment. FIG. 6 illustrates n candidate index selection tools 330 to represent that candidate index selection tool 330 determines an index configuration for each of the n queries of workload 304. Candidate index selection tool 330 for one embodiment determines candidate indexes 338 in accordance with a flow diagram 700 illustrated in FIG. 7. For step 702 of FIG. 7, candidate index selection tool 330 identifies each query Q.sub.i among the queries Q.sub.1, . . . , Q.sub.n of workload 304 to determine an index configuration for each query Q.sub.i. Candidate index selection tool 330 for one embodiment generates n workloads W.sub.1, . . . , W.sub.n 332 from the queries Q.sub.1, . . . , Q.sub.n of workload 304 such that each workload W.sub.i 332 consists of a respective one of the queries Q.sub.i, that is W.sub.i ={Q.sub.i }. For step 704 of FIG. 7, syntactic index selection tool 334 determines a set of one or more indexes X.sub.i of each query Q.sub.i of workload 304. The determined set of indexes X.sub.i are candidate indexes for evaluation in determining an index configuration for the query Q.sub.i. For a query Q.sub.i with no insert, delete, or update, syntactic index selection tool 334 for one embodiment determines the set of indexes X.sub.i as comprising indexes on each column in the set of one or more indexable columns of the query Q.sub.i of each workload W.sub.i 332. For step 706 of FIG. 7, index configuration enumeration tool 336 determines a query-specific index configuration C.sub.i for each workload W.sub.i 332 and therefore for each query Q.sub.i based on the respective set of candidate indexes X.sub.i for the workload W.sub.i 332. Index configuration enumeration tool 336 determines the query-specific index configuration C.sub.i from among possible candidate index configurations of candidate indexes X.sub.i. Index configuration enumeration tool 336 for one embodiment determines a cost to execute each query Q.sub.i against database 210 for each of the possible candidate index configurations of candidate indexes X.sub.i and selects the query-specific index configuration C.sub.i as the candidate index configuration having the least determined total cost. Index configuration enumeration tool 336 for one embodiment uses cost evaluation tool 320 to determine costs for step 706. Index configuration enumeration tool 336 for one embodiment is one in the same as index configuration enumeration tool 340 of FIG. 3. Index configuration enumeration tool 336 may determine each query-specific index configuration C.sub.i with an upper bound on the number of indexes for each query-specific index configuration C.sub.i. That is, index configuration enumeration tool 336 may select each query-specific index configuration C.sub.i from among the possible candidate index configurations having at most a predetermined number of candidate indexes X.sub.i. If a query Q.sub.i does not have any updates and if index configuration enumeration tool 336 does not limit the number of indexes of each query-specific index configuration C.sub.i, the query-specific index configuration C.sub.i selected for the query Q.sub.i will have a less or equal total cost for the query Q.sub.i as compared to any other candidate index configuration for the query Q.sub.i. For a query Q.sub.i that has updates or if index configuration enumeration tool 336 does limit the number of indexes of each query-specific index configuration C.sub.i, index configuration enumeration tool 336 for one embodiment may select more than one query-specific index configuration C.sub.i1, . . . , C.sub.ij for the query Q.sub.i as index configuration enumeration tool 336 may not recommend any index if the determined cost of the query Q.sub.i is relatively high. Index configuration enumeration tool 336 may or may not select the same number of query-specific index configurations C.sub.i1, . . . , C.sub.ij for each query Q.sub.i, and may select relatively more alternative query-specific index configurations C.sub.i1, . . . , C.sub.ij for queries Q.sub.i having relatively higher costs. As one example for a workload comprising a query Q with two indexable columns T.sub.1.C.sub.1 and T.sub.2.C.sub.2 an insert query U on T.sub.1 such that the query-specific index configuration C.sub.i for the query Q consists of an index on T.sub.1.C.sub.1 only, index configuration enumeration tool 336 may not recommend any index if the determined cost of the query U is high because the indexable column T.sub.2.C.sub.2 will not be considered. Index configuration enumeration tool 336 may therefore select more than one query-specific index configuration for the query Q. Because query optimizer 240 will select among non-clustered indexes depending on the clustered indexes selected for each table of database 210, index configuration enumeration tool 336 for one embodiment invokes query optimizer 240 with multiple index configurations each of which corresponds to a distinct set of clustered indexes, one for each table, and with the remaining indexable columns of the query Q.sub.i considered as non-clustered indexes. Although query optimizer 240 may have to be invoked to evaluate many index configurations for multi-join queries, for example, these index configurations need to be evaluated for only one query. The problem of selecting a query-specific index configuration is similar to the overall problem of selecting index configuration 302 for workload 304, only the workload in selecting the query-specific index configuration consists of only one query. Index configuration enumeration tool 336 for one embodiment is therefore index configuration enumeration tool 340 of FIG. 3. That is, the implementation of selecting index configurations for each query of workload 304 may be bootstrapped onto the implementation of selecting index configuration 302 for workload 304. Candidate index selection tool 330 for one embodiment may perform steps 702, 704, and 706 in succession for one query Q.sub.i of workload 304 at a time and repeat steps 702, 704, and 706 for each query Q.sub.i of workload 304 until a query-specific index configuration C.sub.i has been selected for all queries Q.sub.1, . . . , Q.sub.n of workload 304. For another embodiment, candidate index selection tool 330 may perform step 702 for all queries Q.sub.1, . . . , Q.sub.n of workload 304 and then perform step 704 for all queries Q.sub.1, . . . , Q.sub.n of workload 304 and then perform step 706 for all queries Q.sub.1, . . . , Q.sub.n of workload 304. For step 708 of FIG. 7, candidate index selection tool 330 determines a set of candidate indexes 338 as the union of all indexes of the determined query-specific index configuration C.sub.1, . . . , C.sub.n for evaluation by index selection tool 300 to select index configuration 302 for workload 304. Although candidate index selection tool 330 selects indexes for evaluation by index selection tool 300 from index configurations C.sub.i for each query Q.sub.i, one or more indexes that are members of a next-best query-specific index configuration for the query Q.sub.i may appear in the selected index configuration for another query. For selected index configurations C.sub.i having multiple indexes, indexes that are a member of a next-best query-specific index configuration for one query may also be a member of the selected index configuration for another query. By not limiting the number of indexes of each query-specific index configuration C.sub.i, indexes that are members of next-best query-specific index configurations may be more likely included in the determined set of candidate indexes 338. For another embodiment, candidate index selection tool 330 determines one or more query-specific index configurations C.sub.i1, . . . , C.sub.ij for each workload W.sub.i 332 and therefore for each query Q.sub.i to help ensure the set of candidate indexes 338 includes indexes that are members of a next-best query-specific index configuration for each query Q.sub.i. Index configuration enumeration tool 336 determines a cost to execute each query Q.sub.i against database 210 for each of the possible candidate index configurations of candidate indexes X.sub.i and selects for each query Q.sub.i the candidate index configuration C.sub.i1 having the least determined total cost as well as any other candidate index configurations C.sub.ij determined to have a total cost within a predetermined percent of the least determined total cost. The predetermined percent is approximately ten percent for one embodiment. Candidate index selection tool 330 then determines the set of candidate indexes 338 for evaluation by index selection tool 300 as the union of all indexes of the selected one or more candidate index configurations C.sub.i1, . . . , C.sub.ij for each query Q.sub.i. Index Configuration Enumeration for a Workload and Database Given the queries of workload 304 and a set of candidate indexes, index configuration enumeration tool 340 enumerates over the set of candidate indexes to determine a suitable index configuration for workload 304 from among the set of candidate indexes. Index configuration enumeration tool 340 could exhaustively enumerate over all possible candidate index configurations of the candidate indexes using cost evaluation tool 320 in accordance with flow diagram 500 of FIG. 5, for example, and determine an optimal index configuration for workload 304 as the candidate index configuration having the least total cost. Index configuration enumeration tool 340 and cost evaluation tool 320 may, for example, determine query costs for a set of all atomic index configurations for workload 304 from the set of candidate indexes. Based on the costs for the atomic index configurations, a total cost for each candidate index configuration may be determined and the candidate index configuration having the least total cost may be selected as an optimal index configuration for workload 304. Index configuration enumeration tool 340 may, however, incur a significant cost in time and memory in searching a relatively large space of candidate index configurations in this manner. Index configuration enumeration tool 340 for one embodiment attempts to reduce the cost in searching among candidate index configurations to determine a suitable index configuration by reducing the number of atomic index configurations to be evaluated in determining costs for the candidate index configurations and therefore reducing the number of invocations of query optimizer 240. Index configuration enumeration tool 340 for one embodiment uses a greedy enumeration algorithm in accordance with a flow diagram 800 of FIG. 8. The greedy enumeration algorithm of FIG. 8 may be referred to as Greedy(m,k,j) where m, k, and j are input values for the greedy enumeration algorithm. For step 802 of FIG. 8, index configuration enumeration tool 340 determines from a set of candidate index configurations the candidate index configurations having at most m indexes. If the value of m is greater than the value of k, then m is set equal to k for step 802. Index configuration enumeration tool 340 then uses cost evaluation tool 320 to determine costs of the queries of workload 304 for the at most m-index candidate index configurations based on costs of the queries of workload 304 for n-join atomic index configurations comprising indexes on at most j tables, where n=j-1. By restricting the number of candidate index configurations for exhaustive enumeration to those comprising at most m indexes and/or by restricting the set of atomic index configurations for evaluation to those comprising indexes on at most j tables, index configuration enumeration tool 340 may reduce the number of atomic index configurations for evaluation in searching among the set of candidate index configurations. For step 804 of FIG. 8, index configuration enumeration tool 340 selects the m-index candidate index configuration having the least total cost for workload 304 as a seed index configuration for the greedy algorithm. Index configuration enumeration tool 340 for step 806 determines whether the current index configuration contains less than a predetermined number k of indexes. If so, index configuration enumeration tool 340 determines for step 808 whether the addition of any one of the remaining candidate indexes that are not already a member of the current index configuration would further reduce the total cost of workload 304. If both conditions are satisfied, index configuration enumeration tool 340 for step 810 adds the candidate index of the remaining candidate indexes that would reduce the total cost of workload 304 the most. Index configuration enumeration tool 340 repeats steps 806, 808, and 810 until index configuration enumeration tool 340 determines for step 806 that the current index configuration contains k indexes or until index configuration enumeration tool 340 determines for step 808 that the inclusion of any additional candidate index would not reduce the cost of workload 304. Index configuration enumeration tool 340 then determines for step 812 an index configuration as the current index configuration. Index selection tool 300 may use the index configuration determined for step 812 as selected index configuration 302. With the greedy enumeration algorithm, index configuration enumeration tool 340 therefore combines an exhaustive phase for steps 802 and 804 with a greedy phase for steps 806, 808, and 810. The exhaustive phase helps capture index interactions, such as merge join using two clustered indexes and single table index intersection for example, that may have a significant effect the cost of workload 304 while the greedy phase attempts to add one or more candidate indexes that can alone significantly reduce the cost of workload 304 despite any interaction among indexes. As one example, the join order of any single query is often determined primarily by sizes of intermediate relations and the presence or absence of a few significant indexes. Once the join order has been determined, other indexes typically only help reduce the cost of a join locally and do not interact with indexes used in other operations. Using the greedy enumeration algorithm, index configuration enumeration tool 340 may select significant interacting indexes that affect the join order and subsequently select the remaining indexes greedily. Index configuration enumeration tool 340 would determine an index configuration in a purely greedy manner with m=0 or in an exhaustive manner if m=k. If m>k , then m is set equal to k for one embodiment. The greedy enumeration algorithm may be computationally efficient and exhibit relatively near-greedy behavior if the value of m is relatively small compared to the value of k. The value of m relative to the value of k reflects a desired degree of completeness of enumeration. For another embodiment, index configuration enumeration tool 340 for step 804 selects one or more seed index configurations each comprising at most m indexes. Index configuration enumeration tool 340 for one embodiment selects as a seed index configuration the at most m-index candidate index configuration having the least total cost for workload 304 as well as any other at most m-index candidate index configurations having for workload 304 a total cost within a predetermined percent of the least total cost. The predetermined percent is approximately ten percent for one embodiment. Index configuration enumeration tool 340 performs steps 806 through 812 for each selected seed index configuration to determine one or more corresponding current index configurations and then selects one of the one or more corresponding current index configurations. For one embodiment, index configuration enumeration tool 340 selects a corresponding current index configuration that has the minimum total cost for workload 304. Index selection tool 300 may use the selected index configuration as selected index configuration 302. Index configuration enumeration tool 340 for one embodiment determines an index configuration for workload 304 from among a set of candidate indexes in accordance with a two-tier search algorithm of a flow diagram 900 of FIG. 9. For the first tier, index configuration enumeration tool 340 for step 902 of FIG. 9 determines an index configuration for workload 304 from a set of candidate indexes in accordance with the greedy enumeration algorithm Greedy(m, k=.infin.,j) of FIG. 8. Index configuration enumeration tool 340 attempts to reduce the cost in time and memory in reconciling among the set of candidate indexes that were derived from multiple queries of workload 304 in accordance with an n-join atomic index configuration pruning technique, where n=j-1, by restricting the set of atomic index configurations for step 802 of FIG. 8 to those comprising indexes on at most j tables. As the value of k is .infin. for the first tier, index configuration enumeration tool 340 does not limit the number of indexes of the index configuration determined for step 812 of FIG. 8. For one embodiment, index configuration enumeration tool 340 performs the greedy enumeration algorithm with the value of m as 2 and the value of j as 2. With j=2, index configuration enumeration tool 340 performs a single-join atomic index configuration enumeration technique. The n-join configuration enumeration pruning technique for one embodiment also considers whether indexes are clustered or non-clustered in determining candidate indexes for step 812. Each candidate index may be marked as clustered or non-clustered. The set of indexes for step 812 may comprise multiple clustered indexes over the same table or a non-clustered index as well as a clustered index on the same column. Index configuration enumeration tool 340 determines for step 904 of FIG. 9 a current set of candidate indexes for the second tier of flow diagram 900 as the indexes of the index configuration determined for step 812 of FIG. 8 for the first tier. For the second tier, index configuration enumeration tool 340 for step 906 of FIG. 9 determines an index configuration for workload 304 from the current set of candidate indexes as determined for step 904 in accordance with the greedy enumeration algorithm Greedy(m, k, j=.infin.) of FIG. 8. As the value of j is .infin., index configuration enumeration tool 340 for the second tier considers for step 802 of FIG. 8 all multi-table atomic index configurations for workload 340 in evaluating costs for current candidate index configurations. Index selection tool 300 may use the index configuration determined for step 812 as selected index configuration 302. For one embodiment, index configuration enumeration tool 340 performs the greedy enumeration algorithm with the value of m as 2 and any suitable value of k. Index configuration enumeration tool 340 for one embodiment for the second tier also ensures the index configuration determined for step 906 of FIG. 6 comprises at most one clustered index for every table of database 210. Use of the greedy enumeration algorithm to determine an index configuration allows index configuration enumeration tool 340 and cost evaluation tool 320 for one embodiment to interleave the evaluation of atomic index configurations with the selection of indexes for steps 806, 808, and 810 of FIG. 8. Atomic index configurations may therefore be evaluated for step 802 only as needed. That is, index configuration enumeration tool 340 and cost evaluation tool 320 may invoke query optimizer 240 on-demand. Index configuration enumeration tool 340 for another embodiment determines an index configuration for workload 304 from a set of candidate indexes in accordance with an index configuration enumeration branch-and-bound algorithm. For the branch-and-bound algorithm, index configuration enumeration tool 304 for one embodiment uses the greedy enumeration algorithm Greedy(m,k,j) with a relatively low value of m to determine an initial index configuration for workload 304. Index configuration enumeration tool 340 then enumerates index configurations exhaustively with the constraint that the cost of each partial or subset index configuration must be within a predetermined factor of the cost of workload 304 for a corresponding partial or subset index configuration of the index configuration determined using Greedy(m,k,j). The branch-and-bound algorithm for one embodiment is illustrated as now diagram 1000 in FIG. 10. For the algorithm of FIG. 10, V.sub.i denotes a set of index configurations {C.sub.1, . . . , C.sub.Ki } each containing at most i indexes of the set of candidate indexes to be evaluated for the algorithm. For step 1002 of FIG. 10, index configuration enumeration tool 340 initializes a variable i to 1 for use in identifying the size of the candidate index configuration evaluated at any stage in the algorithm. Index configuration enumeration tool 340 for step 1004 initializes a variable t to 1 for use in identifying an index configuration C.sub.t of a current set V.sub.i under evaluation. For step 1006 of FIG. 10, index configuration enumeration tool 340 determines an index configuration G.sub.i as the index configuration selected from all candidate indexes in accordance with the greedy enumeration algorithm Greedy(m,i,j) and determines whether the cost of workload 304 for the index configuration C.sub.t of V.sub.i, or Cost(W,C.sub.t), is within a predetermined factor F of the cost of workload 304 for the index configuration G.sub.i, or Cost(W,G.sub.i). If so, index configuration enumeration tool 340 determines for step 1008 each extension of the index configuration C.sub.t and adds the index configuration C.sub.t and each extension of C.sub.t to the set of index configurations denoted by V.sub.i+1. Index configuration enumeration tool 340 determines an extension of the index configuration C.sub.t by adding one of the candidate indexes not already a member of the index configuration C.sub.t to the index configuration C.sub.t. Index configuration enumeration tool 340 then determines for step 1010 whether all K.sub.i index configurations C.sub.t of the set V.sub.i have been evaluated and, if not, increments the variable t for step 1012 to evaluate another index configuration C.sub.t of the set V.sub.i for step 1006. If index configuration enumeration tool 340 determines for step 1006 that the cost of workload 304 for the index configuration C.sub.t of V.sub.i is not within a predetermined percentage of the cost of workload 304 for the index configuration G.sub.i, index configuration enumeration tool 340 proceeds directly to step 1010. Index configuration enumeration tool 340 repeats steps 1006, 1008, 1010, and 1012 until index configuration enumeration tool 340 determines for step 1010 that all K.sub.i index configurations C.sub.t of the set V.sub.i have been evaluated for step 1006. Index configuration enumeration tool 340 then determines for step 1014 whether all index configurations C.sub.t of all k index configuration sets V.sub.i have been evaluated and, if not, increments the variable i for step 1016 to evaluate index configurations C.sub.t of the next set V.sub.i. Index configuration enumeration too 340 therefore iteratively evaluates sets of candidate index configurations in order of increasing size. If index configuration enumeration tool 340 determines for step 1014 that all index configurations C.sub.t of all k index configuration sets V.sub.i have been evaluated, index configuration enumeration tool 340 determines for step 1018 an index configuration for workload 304 as the index configuration C.sub.t of the set V.sub.k having the least total cost for workload 304. The predetermined factor F for step 1006 may have any suitable value and for one embodiment is 1.3. The predetermined factor F for one embodiment may be determined by a user of index selection tool 300. Multi-Column Index Configuration Selection for a Workload and Database Given the queries of workload 304, index selection tool 300 for one embodiment determines index configuration 302 from among a set of indexes comprising multi-column indexes. The number of indexes and therefore index configurations over database 210 can be very large and can incur a significant cost in time and memory in searching the index configurations to select index configuration 302. This cost is further compounded by the presence of multi-column indexes as the number of indexes may possibly comprise k| multi-column indexes for a given set of k columns and therefore require an increased number of invocations of query optimizer 240 to determine costs for index configurations. Index selection tool 300 for one embodiment attempts to reduce the cost attendant to selecting index configuration 302 from a set of indexes comprising multi-column indexes for the queries of workload 304 by iteratively determining for workload 304 index configurations comprising indexes over one additional column at a time. Index selection tool 300 for one embodiment determines index configuration 302 from a set of indexes in accordance with a flow diagram 1100 illustrated in FIG. 11. For step 1102 of FIG. 11, index selection tool 300 initializes a variable i to 1 for use in identifying the maximum number of columns for indexes of index configurations evaluated for each iteration. For step 1104 of FIG. 11, index selection tool 300 determines an index configuration of at most i-column indexes from among a working set of indexes for workload 304. For an initial iteration, the working set of indexes is an initial set of single-column indexes determined for workload 304. Index selection tool 300 for step 1104 for one embodiment determines the index configuration of at most i-column indexes for workload 304 using index configuration enumeration tool 340. For step 1106 of FIG. 1, index selection tool 300 determines whether a predetermined number n of iterations have been completed to determine index configuration 302 of at most n-column indexes. The predetermined number n corresponds to the maximum width in number of columns for the indexes of index configuration 302. If index selection tool 300 determines i is less than n for step 1106, index selection tool 300 for step 1108 of FIG. 11 determines a new working set of indexes as comprising the i-column indexes of the index configuration determined for step 1104 and comprising from among the initial set of indexes any (i+1)-column indexes having as i leading column(s) a set of columns from one of the i-column indexes of the index configuration selected for step 1104. For one embodiment, the trailing column of each (i+1)-column index determined for the new working set of indexes for step 1106 of each iteration may or may not be one of the columns of the indexes selected for step 1104 of that iteration. Index selection tool 300 for one embodiment uses multi-column index generation tool 350 of FIG. 3 to determine the new working set of indexes for step 1108. As one example for an index configuration S of one-column indexes as determined for step 1104, the new working set of indexes comprises the one-column indexes of index configuration S and any two-column index M(a,b) over column pair (a,b) such that the leading column a of each two-column index M(a,b) is one of the one-column indexes of index configuration S and such that each two-column index M(a,b) is a member of the initial set of indexes. The trailing column b may or may not be one of the columns of the indexes for index configuration S. For step 1110 of FIG. 11, index selection tool 300 increments i by one and proceeds to step 1104 to determine an index configuration of i-column indexes for workload 304 from among the new working set of indexes as determined for step 1108. Index selection tool 300 iteratively repeats steps 1104 through steps 1110 until index selection tool 300 determines for step 1106 that i=n. Index selection tool 300 for step 1112 determines index configuration 302 for workload 304 as the index configuration of n-column indexes determined for step 1104. The implementation of determining index configuration 302 from a set of indexes comprising multi-column indexes for flow diagram 1100 is therefore bootstrapped onto the implementation of determining an index configuration from a given set of indexes. Index selection tool 300 reduces the number of indexes to be evaluated in determining index configuration 302 by iteratively determining a new working set of indexes for step 1108 of each iteration based on the index configuration determined for step 1104 of each iteration. Index selection tool 300 therefore helps reduce the number of invocations of query optimizer 240 to determine index configuration 302. Because the trailing column of each (i+1)-column index determined for the new working set of indexes for step 1108 of each iteration may or may not be one of the columns of the indexes selected for step 1104 of that iteration, index selection tool 300 emphasizes the significance of the interaction between columns of a multi-column index in determining the significance of the multi-column index. As one example, a trailing column b of a multi-column index M(a,b) may not be one of the columns of the indexes selected for step 1104 but the interaction between column b and the leading column a that is one of the columns of the indexes selected for step 1104 may increase the significance of multi-column index M(a,b) as an index. For another embodiment, index selection tool 300 determines the trailing column of each (i+1)-column index determined for the new working set of indexes for step 1108 of each iteration to be one of the columns of the indexes selected for step 1104 of that iteration. That is, index selection tool 300 determines the new working set of indexes for step 1108 of each iteration such that all columns of each index is in the indexable column set for the index configuration selected for step 1104. Index selection tool 300 for this embodiment emphasizes the significance of all columns of a multi-column index in determining the significance of the multi-column index. With multi-column indexes, index selection tool 300 may also consider indexes that help database server 220 answer queries using only indexes without having to access any tables of database 210. That is, index selection tool 300 may also consider indexes that help index-only access. To consider indexes that help index-only access, index selection tool 300 for one embodiment determines the trailing column of each (i+1)-column index determined for the new working set of indexes for step 1108 of each iteration to be one of the columns that is part of a projection list in workload 304. Index selection tool 300 for another embodiment determines the trailing column of each (i+1)-column index determined for the new working set of indexes for step 1108 of each iteration to be a column from the required column set for any query of workload 304 with respect to the table of database 210 having the i leading columns of the (i+1)-column index. Index selection tool 300 may also consider indexes that help index-only access by determining for step 1108 the new working set of indexes as also comprising any (i+1)-column index on the required set of columns for any query of workload 304 with respect to a table of database 210. For each such (i+1)-column index, the required set of columns for a query with respect to a table may be ordered in any suitable manner. For one embodiment, all columns of a table on which the query has one or more conditions precede the remaining columns of the required set of columns for the query with respect to that table. Index selection tool 300 for one embodiment may also attempt to reduce the number of invocations of query optimizer 240 by cost evaluation tool 320 in evaluating atomic index configurations comprising multi-column indexes in accordance with the relevant index set optimization technique. Because database server 220 may reduce the cost of executing queries with conditions on columns a and b and with conditions on column a only using a multi-column index M(a,b) on two columns a and b, cost evaluation tool 320 for one embodiment may determine the cost of a query Q with a condition on column a for the multi-column index M(a,b) as the cost of the query Q for only a single-column index on column a. Using the multi-column index M(a,b) may not, however, reduce the cost of executing a query with a condition on column b only. Cost evaluation tool 320 may also determine the cost of a query Q with conditions on both columns a and b for the multi-column index M(a,b) as the cost of the query Q for the multi-column index M(b,a) because database server 220 may use either multi-column index M(a,b) or multi-column index M(b,a) to help reduce the cost in executing the query Q. For a multi-column index M(a,b) and an arbitrary index configuration J, cost evaluation tool 320 for one embodiment may determine a cost Cost(Q, J.orgate.{M(a,b)}) of a query Q for the index configuration J.orgate.{M(a,b)} as: (1) the cost Cost(Q, J.orgate.{}) of the query Q for the index configuration J if the column a is absent from the indexable column set of the query Q; or (2) the cost Cost(Q, J.orgate.{a}) of the query Q for the index configuration J.orgate.{a} if the column a is present in and the column b is absent from the indexable column set of the query Q; or (3) the cost Cost(Q, J.orgate.{M(b,a)}) of the query Q for the index configuration J.orgate. {M(b,a)} if both columns a and b are present in the indexable column set of the query Q. Cost evaluation tool 320 may extend determining costs in this manner to multi-column indexes with widths greater than two columns. Exemplary Embodiment for Index Selection Tool Referring to FIG. 3, index selection tool 300 for one embodiment determines the set of indexable columns for the queries of workload 304 and generates a set of indexes for candidate index selection tool 330 as the single-column indexes on each indexable column of workload 304. Using this set of candidate indexes, candidate index selection tool 330 performs the query-specific-best-configuration algorithm in accordance with flow diagram 700 of FIG. 7 to determine a set of candidate indexes 338 of FIG. 6. Index configuration enumeration tool 340 then determines a suitable index configuration of single-column indexes for workload 304 from among the set of candidate indexes 338 in accordance with the greedy enumeration algorithm or Greedy(m,k,j) of flow diagram 800 of FIG. 8. For one embodiment, m=2, j=2, and k is determined by a user of index selection tool 300. Based on the determined set of single-column indexes, multi-column index generation tool 350 determines a set of multi-column indexes where the leading column of each multi-column index is in the indexable column set for the determined set of single-column indexes. Using the determined set of single-column indexes and the determined set of multi-column indexes as admissible indexes, index selection tool 300 iteratively repeats this process in accordance with flow diagram 1100 of FIG. 11. In the foregoing description, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit or scope of the present invention as defined in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
|
Same subclass Same class Consider this |
||||||||||
