What-if index analysis utility for database systems6223171Abstract What-if index analysis utility provides the ability to analyze the performance of the existing configuration of a database system with respect to one or more workloads of queries and to propose a hypothetical configuration for the database system to analyze its potential impact on the performance of the database system. The utility may be used, for example, to perform an impact analysis of the set of indexes selected by an index selection tool, for example, with respect to a workload of queries and may also be used to explore what-if scenarios for the database system by analyzing the impact of hypothetical sets of indexes with respect to the execution of various workloads over projected sizes of a database. The utility may be used to perform summarizations of workloads, configurations, and the performance of workloads with respect to the existing configuration and hypothetical configurations. What-if index analysis utility may be used, for example, by a database administrator or a physical database design tool to help improve performance of a database system. Claims What is claimed is: Description TECHNICAL FIELD
TABLE 1
Workload Analysis Data
Workload Name Query ID Frequency Query Properties
Workload_A 1 1 <Query Text>, (T.sub.1, T.sub.2), etc.
In executing the Define Configuration command, HCA engine 240 registers a new configuration associated with a name as designated by <configuration_name> and comprising a set of indexes as designated by (Table.sub.1, column_list.sub.1), . . . , (Table.sub.i, column_list.sub.i). HCA engine 240 for one embodiment may initially set the database size for a newly registered hypothetical configuration to that of the existing configuration. HCA engine 240 for one embodiment generates analysis data for the defined configuration in accordance with the following Table 2.
TABLE 2
Configuration Analysis Data
Scaling
Configuration Name Indexes in Configuration Values
Current_Configuration Index_A, Index_B, Index_D (T.sub.1, 1),
(T.sub.2, 5)
The set of indexes designated for the new configuration may comprise existing indexes created by database server 220 and/or hypothetical indexes. Existing indexes for one embodiment may alternatively be designated in the Define Configuration command by name. HCA engine 240 may obtain information associated with each index by reference to catalog table 226. Database server 220 for one embodiment generates such information in accordance with the following Table 3.
TABLE 3
Index Information
Number
Table Number of of Index
Index Name Name Rows Pages Columns Statistics
Index_A R 100,000 1865 R.a <histogram>
Once a configuration is registered, the Set Database Size command may be used to define a hypothetical database size for the configuration. In executing the Set Database Size command, HCA engine 240 sets the number of rows of each table of database 210 as designated by (Table.sub.1, row_count.sub.1), . . . , (Table.sub.i, row_count.sub.i) for the configuration designated by <configuration_name>. In executing the Estimate Configuration command, HCA engine 240 estimates the cost to execute each query of the workload designated by <workload_name> with respect to the configuration designated by <configuration_name>. HCA engine 240 may also determine the usage of indexes by each query of the workload <workload_name> with respect to the configuration <configuration_name>. HCA engine 240 for one embodiment generates such analysis data in the form of a relation in accordance with the following Table 4.
TABLE 4
Cost and Index Usage Analysis Data
Configuration Name Query ID Cost Indexes Used
New_Configuration 1 0.02 Index_A, Index_D
New_Configuration 2 0.11 Index_B
Conceptually, the relation has as many rows as queries in the workload <workload_name> for each configuration that is evaluated. Configuration Name identifies the configuration <configuration_name>, Query ID identifies each query of the workload <workload_name>, Cost identifies a cost estimate to execute each query of the workload <workload_name> with respect to the configuration <configuration_name> relative to the total cost estimate of the workload <workload_name> with respect to the existing configuration, and Indexes Used identifies those indexes of the set of indexes for the configuration <configuration_name> that are expected to be used by database server 220 to execute each query of the workload <workload_name> with respect to the configuration <configuration_name>. In executing the Remove command, HCA engine 240 removes analysis data generated in executing other commands. In executing the Remove Workload command or the Remove Configuration command, HCA engine 240 removes all analysis data, including any cost and index usage analysis data, about the designated workload <workload_name> or the designated configuration <configuration_name>, respectively. In executing the Remove Cost-Usage command, HCA engine 240 for one embodiment removes only the cost and index usage information for the designated workload <workload_name> and the designated configuration <configuration_name> and retains the workload analysis data about the workload <workload_name> and the configuration analysis data about the configuration <configuration_name>. HCA engine 240 may execute the Remove command by deleting from the analysis data tables rows corresponding to the workload and/or configuration designated by the command. For other embodiments, hypothetical configuration simulation interface 262 may alternatively interface directly with database server 220 using suitable server extensions to simulate hypothetical configurations. What-if index analysis UI 250 provides a user interface for a database administrator to simulate hypothetical configurations for database system 200 using HCA engine 240. What-if index analysis UI 250 prompts the database administrator for input to define and analyze workloads and configurations for database system 200 and generates the appropriate commands supported by hypothetical configuration simulation interface 262 for execution by HCA engine 240. For embodiments using the Microsoft.RTM. SQL Server as database server 220, the database administrator may use the SQL Server Profiler to generate a workload for database system 200 by logging events on database server 220 over a specified period of time into a workload file. The database administrator may also use a filter to log only relevant events. Alternatively, the database administrator may define a workload dynamically by specifying the set of queries for the workload using a highlighted buffer of queries presented to the database administrator by what-if index analysis UI 250. To define a new configuration, what-if index analysis UI 250 prompts the database administrator to define the set of indexes and database size for the configuration. What-if index analysis UI 250 for one embodiment presents to the database administrator a screen with the list of indexes for the existing configuration and prompts the database administrator to add or remove indexes to define a new configuration. An example of one such screen is illustrated in FIG. 4 where a list of indexes is recommended by index selection tool 230 and the database administrator is prompted to change some or all of the indexes for the existing configuration. What-if index analysis UI 250 may also present the database administrator with the list of tables and associated row counts for the existing configuration and prompt the database administrator to either modify the row count or specify a row count multiplier for any table for a new configuration. What-if index analysis UI 250 may present to the database administrator the evaluation of a workload with respect to a configuration in any suitable manner. What-if index analysis UI 250 may present a cost estimate of a workload with respect to a hypothetical configuration, for example, by displaying the cost estimate relative to the cost estimate of the workload with respect to the existing configuration. What-if index analysis UI 250 may also display, for example, the top 25 queries of the workload by cost. What-if index analysis UI 250 may present index usage for a workload with respect to a hypothetical configuration, for example, by displaying the fraction of queries of the workload that uses one or more indexes. What-if index analysis UI 250 may also display, for example, the top 10 indexes by usage. Implementation of HCA Interface HCA engine 240 interfaces with database server 220 using server interface 270 to evaluate a workload with respect to the existing or a hypothetical configuration. HCA engine 240 may interface with database server 220 in any suitable manner to simulate hypothetical configurations and to evaluate a workload with respect to the existing configuration or a simulated hypothetical configuration. HCA engine 240 for one embodiment interfaces with database server 220 to evaluate queries of a workload with respect to the existing configuration by invoking query optimizer 224. Query optimizer 224 generates plans comprising cost estimates and index usage for the execution of queries using existing indexes in database server 220. In determining whether to use any particular one of the existing indexes in executing a query, query optimizer 224 uses statistical information on the column(s) of each existing index. Database server 220 may gather this statistical information in accordance with Table 3 for the columns of tables over which indexes are constructed during the index generation phase and may store this statistical information as existing index entries 227 in catalog table 226. This statistical information for one embodiment comprises a histogram on the column values on which the index is defined. Because database server 220 may similarly gather statistical information for indexes that have not been constructed, the presence of such hypothetical or what-if indexes may be simulated for consideration by query optimizer 224 in evaluating queries with respect to a hypothetical configuration. Database server 220 for one embodiment does not need to write out pages of a what-if index because using query optimizer 224 to evaluate queries with respect to a hypothetical configuration requires only the statistical information for an index and does not require the actual use of the index by database server 220 to access any data in database 210. In gathering statistical information for a what-if index, database server 220 may sample a fraction or percentage of the pages of the table for the what-if index or may read all pages of the table to obtain possibly more accurate cost estimates by query optimizer 224 at the expense of increasing time and memory costs. Database server 220 may gather the statistical information for what-if indexes in accordance with Table 3 and may store this statistical information as what-if index entries 228 in catalog table 226. For one embodiment, what-if index entries 228 are stored similarly as existing index entries 227 only what-if index entries 228 are marked with what-if index tags to distinguish what-if index entries 228 from existing index entries 227. Catalog table 226, existing index entries 227, and what-if index entries 228 may be implemented in the form of one or more suitable data structures and may be stored on any suitable computer-readable medium. HCA engine 240 for one embodiment evaluates queries of a designated workload with respect to a designated hypothetical configuration in accordance with flow diagram 500 of FIG. 5. HCA engine 240 for one embodiment performs flow diagram 500 in response to the Estimate Configuration command designating a workload and a hypothetical configuration. For step 502, HCA engine 240 invokes database server 220 to create any hypothetical or what-if indexes needed to simulate the designated hypothetical configuration. As illustrated in FIG. 3, HCA engine 240 for one embodiment issues to database server 220 a create index command 272 designating a table of database 210 and one or more columns of the table for each hypothetical index to be created. Create index command 272 for one embodiment is in the form CREATE INDEX <index_name> ON <table_name(column_name(s))> WITH STATISTICS ONLY [=<fraction>] to designate that a hypothetical or what-if index named <index_name> is to be created by gathering statistical information on the column(s) named column_name(s) of the table named table_name using a sample of that table as designated by <fraction>. If <fraction> is not designated, database server 220 may use a suitable default fraction for sampling. As one example, the command CREATE INDEX supplier_stats ON Orders(supplier) WITH STATISTICS ONLY=0.10 would be used to create a hypothetical index named supplier_stats on the supplier column of the Orders table by sampling 10% of the rows in the Orders table. Database server 220 may execute create index command 272 in any suitable manner to create a hypothetical or what-if index. HCA engine 240 for one embodiment may invoke database server 220 to create what-if index entries 228 as described in U.S. patent application Ser. No. 08/912,036, U.S. patent application Ser. No. 08/980,830, U.S. patent application Ser. No. 08/982,046, U.S. patent application Ser. No. 08/980,831, and U.S. patent application Ser. No. 08/980,829. Briefly, HCA engine 240 may designate a table of database 210 and one or more columns of that table for the what-if index to be created. Database server 220 may then sample the fraction or percentage of the pages of the designated table and sort the rows of the sampled pages by the designated one or more columns of the table to generate the statistical information for the what-if index. Database server 220 for one embodiment uses an adaptive page-level sampling technique as described in U.S. patent application Ser. No. 09/139,835 (Attorney Docket No. 14-683) filed on the same date as this patent application, entitled HISTOGRAM CONSTRUCTION USING ADAPTIVE RANDOM SAMPLING WITH CROSS-VALIDATION FOR DATABASE SYSTEMS, by Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya, to gather statistical information for the what-if index to be created. This patent application is herein incorporated by reference. Briefly, database server 220 obtains a seed sample of approximately m pages of the table for the what-if index. Database server 220 for one embodiment samples approximately m=n pages of the table where n is the number of pages in the table. Based on this sample, database server 220 generates a sorted list of column values and a set of statistical measures comprising an equi-depth histogram of the column values characterized by step boundaries. Database server 220 samples another approximately m pages of the table and tests how well the values of this new sample fit within the histogram. If this test for convergence fails, database server 220 merges the new sample with the sample(s) collected thus far and updates the sorted list of column values as well as the statistical measures. Database server 220 continues to collect and check new samples in this manner until the values of a new sample fit within the updated histogram within a predetermined degree of accuracy. For step 504, HCA engine 240 invokes query optimizer 224 to simulate the hypothetical configuration. HCA engine 240 for one embodiment requests query optimizer 224 to restrict the consideration of indexes to those of the hypothetical configuration and to consider the table and index sizes in database 210 as adjusted by the database size of the hypothetical configuration. As illustrated in FIG. 3, HCA engine 240 for one embodiment issues to database server 220 a define configuration command 274 to invoke query optimizer 224 with a connection-specific hypothetical configuration (HC) mode call having as arguments (1) the set of indexes of the hypothetical configuration and (2) the database size of the hypothetical configuration. By conveying the information for a hypothetical configuration in a connection-specific manner, database server 220 may concurrently execute operational queries against database 210 while optimizing queries with respect to the hypothetical configuration. For embodiments using an SQL server for database server 220, HCA engine 240 also passes as a third argument the base index for each table of the hypothetical configuration because the leaf node of a non-clustered B+-tree index contains the keys of the clustered index, if any, on that table and because the plan chosen by query optimizer 224 depends on the columns available in each index. The base index for a table is either the clustered index on the table or, in the absence of a clustered index in the configuration, the heap structure for the table. HCA engine 240 projects the size of each index of the hypothetical configuration based on the database size of the hypothetical configuration. For embodiments using an SQL server for database server 220, the size of a non-clustered index depends on the clustered index, if any, on that table. As one example, if a configuration comprises a clustered index on column A of a table and a non-clustered index on column B of that same table, the size of the index on column B is proportional to the sum of the widths of columns A and B. If a hypothetical configuration comprises clustered indexes I.sub.1 and I.sub.2 on the same table and a non-clustered index I.sub.3 on that same table, HCA engine 240 would then consider one scaling value for the size of the index I.sub.3 in simulating the configuration {I.sub.1, I.sub.3 } and a different scaling value for the size of the index I.sub.3 in simulating the configuration {I.sub.2, I.sub.3 }. For step 506, HCA engine 240 invokes query optimizer 224 to produce a plan for execution of a query of the designated workload with respect to the simulated hypothetical configuration and return both the estimated cost to execute the query with respect to the hypothetical configuration and any indexes that are expected to be used to answer the query. As illustrated in FIG. 3, HCA engine 240 for one embodiment issues to database server 220 an optimize query command 276 to invoke query optimizer 224 in this manner, and query optimizer 224 returns to HCA engine 240 an execution plan 278 comprising the cost estimate and the indexes expected to be used to answer the query. HCA engine 240 for one embodiment invokes query optimizer 224 in a no-execution mode so as to request the execution plan 278 from query optimizer 224 without actually executing that plan. For embodiments using the Microsoft.RTM. SQL Server as database server 220, query optimizer 224 returns the execution plan 278 through the Showplan interface. For step 508, HCA engine 240 determines whether all queries of the workload have been evaluated and, if not, performs step 506 for another query of the workload. HCA engine 240 repeats steps 506-508 until all queries of the workload have been evaluated with respect to the simulated hypothetical configuration. HCA engine 240 for step 510 may then determine the estimated cost of the whole workload with respect to the hypothetical configuration as the sum of the cost estimates for all queries of the workload. As one example where database system 200 has an existing configuration comprising a table T with indexes I.sub.1 and I.sub.2, where index I.sub.1 is the clustered index for the table T, and where the table T has one million rows in the existing configuration, HCA engine 240 for one embodiment performs flow diagram 500 as follows to evaluate a workload W with respect to a hypothetical configuration having the hypothetical set of indexes {I.sub.1, I.sub.3 } and the size of table T as ten million rows. For step 502, HCA engine 240 creates the hypothetical index I.sub.3 because this index is absent from the existing configuration. HCA engine 240 for step 504 computes the new sizes S.sub.1 and S.sub.3 of the indexes I.sub.1 and I.sub.3, respectively, when the number of rows of the table T is scaled to ten million, talking into account that index I.sub.1 is the clustered index. HCA engine 240 then invokes query optimizer 224 with a connection-specific hypothetical configuration (HC) mode call HC_mode((I.sub.1, I.sub.3) (S.sub.1, S.sub.3), (1, 0)) to simulate the hypothetical configuration, passing to query optimizer 224 the indexes I.sub.1 and I.sub.3 of the hypothetical configuration to be considered by query optimizer 224 for plan generation, the respective sizes S.sub.1 and S.sub.3 of each index I.sub.1 and I.sub.3, and an indication that index I.sub.1 is the base index for the table T. For steps 506 and 508, HCA engine 240 invokes query optimizer 224 for each query of the workload W to obtain the cost estimate and index usage for each query with respect to the hypothetical configuration. HCA engine 240 for step 510 may then determine the estimated cost of the whole workload W with respect to the hypothetical configuration as the sum of the determined cost estimates for all queries of the workload W. For other embodiments, HCA engine 240 may implement hypothetical configurations by physically altering the existing configuration at the expense of incurring additional overhead to create and remove indexes and of degrading the performance of database system 200 with respect to the execution of operational queries. Maintaining Analysis Data Tables The analysis data tables generated for the simulation of hypothetical configurations may be maintained by database system 200 in any suitable manner. Although presented as relations in non-first-normal form, the analysis data may be stored in multiple normalized tables. For one embodiment, the schema of the analysis data tables can be assumed to be fixed once the properties of entities supported by HCA engine 240, such as queries, indexes, etc., are determined. Database server 220 may maintain each analysis data table as a system catalog for one embodiment. Any naming issues for the tables are then resolved for embodiments where the system catalog names are fixed a priori. When database server 220 is invoked to simulate a hypothetical configuration, database server 220 may write the resulting analysis data to the appropriate system catalog. As system catalogs, the analysis data tables may be accessed directly by a user using suitable queries, for example. For another embodiment, HCA engine 240 may write the analysis data returned by database server 220 into connection-specific temporary tables. When an index analysis session using HCA engine 240 has been completed, the analysis data generated during the session may then be saved by HCA engine 240 using a suitable Save command. The analysis data may be saved, for example, at the option of a user into user specified tables. The user may name the saved tables and access the tables using suitable queries, for example. Summary Analysis HCA engine 240 may also be used to perform summary analyses using the ability of HCA engine 240 to simulate hypothetical configurations. HCA engine 240 may be used to perform summarizations of workloads, configurations, and the performance of workloads with respect to the existing configuration and hypothetical configurations. FIGS. 6, 7, and 8 illustrate exemplary user interfaces each produced by performing a summary analysis that may be useful to a database administrator. FIG. 6 illustrates a breakdown of a workload by the type of queries. FIG. 7 illustrates a breakdown of selection conditions in queries by table after drilling-down on the queries of type Select, for example, to provide the database administrator with a better grasp of a workload. FIG. 8 illustrates the relative frequencies of usage of indexes in the existing configuration, for example, to assist the database administrator in identifying indexes that are rarely used and that therefore may be removed. HCA engine 240 for one embodiment may be used to perform workload analysis to analyze queries and their structural properties, to perform configuration analysis to analyze indexes and their structural properties, and to perform cost and index usage analysis to analyze relationship properties between one or more queries and a configuration. HCA engine 240 performs a summary analysis over a set of one or more objects. For workload analysis, for example, the objects are queries. For configuration analysis, the objects are indexes. For cost and index usage analysis, the objects are relationship objects that capture the interaction between queries of a workload and a configuration. The set of objects for each of these analyses may be implicitly identified. For workload analysis, the queries of a workload may be identified by the name of the workload. For configuration analysis, the indexes of a configuration may be identified by the name of the configuration. For cost and index usage analysis, the queries and the configuration may be identified by a workload name and a configuration name. Each object has associated properties with each property classified as having an atomic value or a list or set value. Boolean is a type of atomic value. For workload analysis, exemplary query properties having an atomic value comprise the type of query (e.g., Insert, Delete, Update, or Select), whether the query has a group by clause (Boolean), whether the query has an order by clause (Boolean), and whether the query has nested subqueries (Boolean). Exemplary query properties having a list or set value comprise the set of tables referenced in the query, required columns from each table, columns on which selection conditions exist, columns on which join conditions exist, and equi-join conditions. HCA engine 240 may identify any other suitable query properties by collecting additional parsed query information. As one example, properties of the query SELECT R.a, R.b, S.c FROM R, S WHERE R.a=S.c AND R.b=v.sub.1 comprise Select as the query type, {R, S} as the set of required tables, {R.a, R.b, S.c} as the set of required columns, {R.b} as the set of columns on which selection conditions exist, and {(R.a=S.c)} as the set of equi-join conditions. For configuration analysis, exemplary index properties having an atomic value comprise the table on which the index is built, the width or number of columns of the index, the storage space for the index, the time of creation of the index, and whether or not the index is clustered (Boolean). Exemplary index properties having a list or set value comprise the list of columns in the index in major to minor order, for example. HCA engine 240 may identify any other suitable index properties. As one example, properties of a clustered index I.sub.1 on columns (C.sub.1, C.sub.2) of table T comprise T as the table on which the index is built, two as the width of the index, TRUE as to whether the index is clustered, and {C.sub.1, C.sub.2 } as the list of columns. The storage space and time of creation properties may also be determined. For cost and index usage analysis, exemplary properties having an atomic value for a relationship object of a query and a configuration comprise the cost of the query with respect to the configuration. Exemplary properties having a list or set value for a relationship object of a query and a configuration comprise the list of indexes of the configuration expected to be used to answer the query. HCA engine 240 may identify any other suitable relationship object properties, for example, by collecting additional information about the query execution plan. The operators used in the plan, for example, may be identified. The objects and properties form the fundamental primitives for summary analysis. To help pose queries against analysis data, one or more numerical measures may be derived from each property of an object for summary analysis. The specification of a measure comprises a property name and an expression that derives a numerical measure from the property value. For an atomic valued property, the value of the property itself, such as the storage space for an index for example, or a user defined function of that value may be used as the numerical measure. For a list or set valued property, the count of the number of elements in the list or set, such as the number of tables referenced in a query for example, may be used as the numerical measure. The numerical measure for a list or set valued property may also be derived by applying an aggregate function, such as Sum or Average for example, on the element values in the list or set. As one example for a property of a set of objects, an aggregate function may be used on a numerical measure for each object of the set to derive a numerical measure for the set of objects. The expression f(p) denotes the numerical measure for a list or set valued property p, where f is an aggregation function such as Count, Min, Max, Sum, or Average for example. Given a numerical measure m for each object of a set of objects, a corresponding aggregate measure f(m) over the set of objects may be derived by applying an aggregate function f on the numerical measure values for the objects of the set of objects. As one example, the average number of tables referenced per query in a workload may be derived. Summary Analysis Interface As illustrated in FIG. 3, HCA interface 260 comprises a summary analysis interface 264 to support the ability to perform summary analyses using HCA engine 240. Summary analysis interface 264 for one embodiment supports the following paradigm of steps, as illustrated in FIG. 9, for summary analysis of analysis data generated using hypothetical configuration simulation. For step 902 of FIG. 9, a class of analysis is determined. The class of analysis may be, for example, workload analysis, configuration analysis, or cost and index usage analysis. For step 904, the set of objects to be analyzed are identified. As one example, a workload name may be specified to identify queries for analysis. For step 906, a subset of objects may optionally be filtered from the set of objects based on their properties, for example. As one example, queries that reference the table Orders may be filtered from a workload of queries. Successive filtering operations may also be supported. Using the same example, queries that reference the column Supplier in the table Orders may be filtered from the subset of queries that reference the table Orders. Successive filtering operations may be supported through drill-down using user interfaces, for example. For step 908, the objects surviving any filtering may optionally be partitioned in a set of classes by a measure. The partitions need not be disjoint. Using the same example as for step 906, the queries that reference the table Orders may be partitioned by their query type. For step 910, the set of objects surviving any filtering and/or partitioning are ranked or summarized. Objects may be ranked by associating a measure with each object, such as the number of tables referenced in a query for example, and using the measure to order the objects. The top k objects ranked by the measure may therefore be determined. Objects may also be summarized based on an aggregate measure, such as the average number of tables referenced by a set of queries. If the set of objects surviving any filtering are not partitioned, all such objects are ranked or summarized. Otherwise, the object(s) in each partition are separately ranked or summarized. For one embodiment, all partitions share the same ranking or summarization criteria. Summary analysis interface 264 for one embodiment provides a generic querying model that supports the paradigm of FIG. 9 to formulate ad-hoc requests for summary analysis and also provides an extensible framework for generating new summary statistics from the analysis data tables. Summary analysis interface 264 supports, for example, the following query-like interface. ANALYZE [WORKLOAD .vertline. CONFIGURATION .vertline. COST-USAGE] WITH <parameter-list> [TOP <number> .vertline. SUMMARIZE USING <aggregation-function>] BY <measure> WHERE <filter-expression> {PARTITION BY <partition-parameter> IN <number> STEPS} Although described in the context of this particular command having a particular syntax, summary analysis interface 264 may use any suitable set of one or more commands to support the ability to perform summary analyses. In executing the ANALYZE command, HCA engine 240 produces results in a format depending on whether the command specifies a ranking or summarization of analysis data. In executing the ANALYZE command using SUMMARIZE, HCA engine 240 produces an output table having for each partition one row with two columns: one column for the value of the partitioning parameter and one column for the summarized value for that partition. As one example, HCA engine 240 counts the number of queries of each query type in Workload_A in executing the following query Q1.
Q1: ANALYZE WORKLOAD WITH Workload_A
SUMMARIZE USING Count
PARTITION BY Query_Type
In executing the ANALYZE command using SUMMARIZE without the PARTITION BY clause, HCA engine 240 produces a scalar output representing the summarized value for all objects analyzed for the query. As one example, HCA engine 240 counts the total number of indexes of Current_Configuration in executing the following query Q2.
Q2: ANALYZE CONFIGURATION WITH Current_Configuration
SUMMARIZE USING Count
In executing the ANALYZE command using TOP, HCA engine 240 produces an output table having for each partition one row with three columns: one column for the value of the partitioning parameter, one column for the ranked object, and one column for the rank of that object within that partition. In executing the ANALYZE command using TOP without the PARTITION BY clause, HCA engine 240 produces <number> 2-column tuples. As one example, HCA engine 240 returns the 20 most expensive queries in Workload_B with respect to Current_Configuration in executing the following query Q3.
Q3: ANALYZE COST-USAGE WITH Workload_B,
Current_Configuration
TOP 20 BY Cost
The form for each tuple produced by HCA engine 240 for the query Q3 may be (Query, Rank). Measures may be specified in the BY <measure> clause and the PARTITION BY <partition-parameter> clause. The <measure> specification may be omitted when the <aggregation-function> of the SUMMARIZE USING clause is Count. As one example for a summary analysis based on an atomic valued property of an object, HCA engine 240 returns the top three indexes ranked by storage in New_Configuration in executing the following query Q4.
Q4: ANALYZE CONFIGURATION WITH New_Configuration
TOP 3 BY Storage
As an example for a summary analysis based on a list or set valued property of an object, HCA engine 240 counts in Workload_A the number of queries that reference a given number of tables in executing the following query Q5.
Q5: ANALYZE WORKLOAD WITH Workload_A
SUMMARIZE USING Count
PARTITION BY Count(Tables)
The syntax of <filter-expression> for one embodiment is a Boolean expression where base predicates are composed using Boolean connectors. For atomic valued properties, the base predicates have the form <property> <operator> <value>. The <operator> may be any comparison operator, such as >, <, =, or .noteq. for example. For Boolean properties, only an equality check may be used for one embodiment. As one example, HCA engine 240 counts the number of two-column, non-clustered indexes in Current_Configuration in executing the following query Q6.
Q6: ANALYZE CONFIGURATION WITH Current_Configuration
SUMMARIZE USING Count
WHERE (Num-Columns = 2) AND (Is-Clustered = FALSE)
For set valued properties, the base predicates may have the form <set property> [SUBSET-OF .vertline. SUPERSET-OF .vertline.=]<set>. As one example, HCA engine 240 returns the ten most expensive queries in Workload_A with respect to Current_Configuration such that the query references at least the tables part and supplier in executing the following query Q7.
Q7: ANALYZE COST-USAGE WITH Workload_A,
Current_Configuration
TOP 10 BY Cost
WHERE Tables SUPERSET-OF {part, supplier}
The base predicates for set valued properties may also have the form f (<set property>) <operator> <value>, where f is an aggregation function and <operator> is any comparison operator. As one example, the filter Count(Tables)>1 is satisfied for queries that reference at least two tables. For list valued properties, the base predicates for set valued properties may be used by interpreting the list as a set. The base predicates for list valued properties may also be based on prefix matching in the form <list property> [SUBLIST-OF .vertline. SUPERLIST-OF .vertline.=]<list>. As one example, HCA engine 240 counts in Configuration_A the number of indexes that have part.size as the leading column, where Columns is the list property of an index that contains the columns in the index, in executing the following query Q8.
Q8: ANALYZE CONFIGURATION WITH Configuration_A
SUMMARIZE USING Count
WHERE Columns SUPERLIST-OF (part.size)
The objects being analyzed may be partitioned either by a property that may or may not be numeric or by a numeric measure. The query Q1 is an example query that specifies a partition by a non-numeric property. The query Q5 is an example query that specifies a partition by a numeric measure. For a <partition-parameter> of the name of a list or set valued property, a separate partition is used for each distinct value d of the list or set. A set valued object S belongs to the partition for d if and only if d is a member of the set S. As one example, HCA engine 240 computes the average number of indexes used for queries on each table, but eliminating join queries, in executing the following query Q9.
Q9: ANALYZE COST-USAGE WITH Workload_A,
Current_Configuration
SUMMARIZE USING Average BY Count (Indexes-Used)
WHERE Count (Join-Columns) = 0
PARTITION BY Tables
For the query Q9, a query of Workload_A belongs to the partition of each table that is referenced in that query. When the partitioning domain is numeric, the number of steps allows partitions to be coalesced into fewer steps. The <parameter-list> depends on the class of analysis and may be a workload name for workload analysis, a configuration name for configuration analysis, or a combination of a workload name and a configuration name for cost and index usage analysis. Summary analysis interface 264 may also be extended to compare workloads and/or configurations by specifying multiple workloads and configurations for <parameter-list>. As one example, HCA engine 240 compares the cost of queries of Workload_A that reference the table Orders with respect to Current_Configuration and Proposed_Configuration in executing the following query Q10.
Q10: ANALYZE COST-USAGE WITH Workload-A,
(Current_Configuration, Proposed_Configuration)
SUMMARIZE USING Sum BY Cost
WHERE Tables SUPERSET-OF Orders
As another example, HCA engine 240 compares the cost of Configuration_A, Configuration_B, and Configuration_C corresponding to projected database sizes A, B, and C, respectively, in executing the following query Q11.
Q11: ANALYZE COST-USAGE WITH Workload_A,
(Configuration_A, Configuration_B, Configuration_C)
SUMMARIZE USING Sum BY Cost
What-if index analysis UI 250 provides a user interface for a database administrator to perform summary analysis for database system 200 using HCA engine 240. What-if index analysis user interface UI 250 prompts the database administrator for input to specify the summary analysis to be performed for database system 200 and interfaces with HCA engine 240 using summary analysis interface 264 to perform the specified summary analysis. What-if index analysis UI 250 also presents summary analysis results to the database administrator, for example, by displaying the results in graph form to help the database administrator visualize the results. FIGS. 6, 7, 8, 10, 11, 12, 13, 14, and 15 illustrate exemplary user interfaces each produced by performing a summary analysis. Exemplary workload analyses that may be performed using HCA engine 240 comprise counting the number of queries of a workload by type as illustrated in FIG. 6, drilling-down on the queries of the type Select of a workload and counting the number of selection conditions in those queries per table as illustrated in FIG. 7, listing the top five tables on which most queries of a workload are posted, comparing summary statistics from two workloads, and drilling-down at a table level to find which columns of a table have the most conditions posted on them as illustrated in FIG. 10. Exemplary configuration analyses that may be performed using HCA engine 240 comprise counting the number of indexes for each table, listing the top six tables ranked by the count of the multi-column indexes on those tables as illustrated in FIG. 11, and analyzing the distribution of indexes in a configuration by width as illustrated in FIG. 12. Exemplary cost and index usage analyses that may be performed using HCA engine 240 comprise analyzing the frequency of usage of each index of a configuration for a workload as illustrated in FIG. 8, counting the number of queries of a workload that used a given number of indexes with respect to a configuration as illustrated in FIG. 13, comparing the cost of two configurations for a workload by query type as illustrated in FIG. 14, and analyzing the cost of each query of a workload with respect to a proposed configuration relative to an existing configuration as illustrated in FIG. 15. Summary Analysis Interface Implementation HCA engine 240 for one embodiment maps the ANALYZE query into an equivalent query supported by database server 220, such as an SQL query for example, over the analysis data tables stored by database server 220. HCA engine 240 may further process the results of the query, such as by bucketizing the results for example, before completing the summary analysis. Because the ANALYZE query resembles SQL, HCA engine 240 may map the ANALYZE query into an SQL query with relative ease. HCA engine 240 may also exploit the ability to compose operations using table expressions in the FROM clause of an SQL query. The class of analysis specified in the ANALYZE query determines the superset of analysis data tables that are to be joined to answer the query. For workload analysis, for example, HCA engine 240 accesses only workload analysis data tables. HCA engine 240 then generates a table expression T.sub.1 that joins the required analysis data tables, only retrieving objects that are specified in the <parameter-list>. Any <filter-expression> specified in the query is included in the table expression T.sub.1. HCA engine 240 then generates a table expression T.sub.2 that partitions the table expression T.sub.1 using any <partition-parameter>, using the GROUP BY construct, and the <aggregation-function> is applied to the attribute specified by <measure>. If instead, the query requests the TOP <number> of rows by <measure>, then the table expression T.sub.2 is generated by applying an ORDER BY clause on the <measure> attribute of the table expression T.sub.1. A cursor is opened for the table expression T.sub.2 to return the first <number> of rows. For other embodiments, summary analysis interface 264 may support a set of canned queries for performing predetermined summary analyses. Such canned queries, however, may not provide the extensible framework as the ANALYZE command. Summary analysis interface 264 may also directly interface with database server 220 using queries supported by database server 220, such as SQL queries for example, for direct access of the analysis data tables stored by database server 220. Using such queries, however, shifts the burden of analysis to a user or client tool of database server 220 and may require the manual generation of complex queries over the analysis data tables. Exemplary What-If Index Analysis Session Using what-if index analysis UI 250, a database administrator may use HCA engine 240 in a variety of manners to analyze the design of database system 200 and modify the design as desired. The database administrator may begin by specifying a workload for the session. The workload may be a log of queries executed against database 210 over the past week. The database administrator asks for a breakdown of the queries by type to try to understand the workload mix. The results are displayed as illustrated in FIG. 6. The database administrator may then decide to focus on the most expensive queries in the workload for the existing configuration by requesting the top 25 queries ordered by cost. To decide which tables are good candidates for indexing, the database administrator may wish to see the distribution of conditions in queries on tables as illustrated in FIG. 7. Having picked a table that has many conditions on it, the database administrator may decide to further drill-down and look at the distribution of conditions in queries over columns of that table as illustrated in FIG. 10. This gives a good idea of which columns on the table are likely candidates for indexes. The database administrator finds that columns A and B of the table T.sub.2 look promising. The database administrator may then wish to see if indexes on columns A and B of the table T.sub.2 already exist in the existing configuration. The database administrator does this by requesting to see all indexes on the table T.sub.2 in the existing configuration ordered by their storage requirement. In this case, an index on column A exists but an index on column B does not exist. The database administrator then decides to explore hypothetical configuration scenarios that include an index on column B. The database administrator decides to explore two what-if scenarios and evaluate each relative to the existing configuration. The database administrator first proposes a hypothetical configuration comprising the existing configuration with an additional non-clustered single-column index on column B of the table T.sub.2. The database administrator compares the cost of the workload for this configuration with the cost of the workload for the existing configuration. Adding a single-column index on column B produces a 5% improvement in total cost of the workload as illustrated in FIG. 15. By studying index usage in the proposed configuration as illustrated in FIG. 8, the database administrator sees that the new index was used in three queries. Not being impressed with the improvement in performance, the database administrator decides to explore a different hypothetical configuration having a two-column index (B, A) on the table T.sub.2 in addition to the existing configuration. Once again, the database administrator compares the cost of this configuration with the existing configuration and sees an 18% improvement for the workload. The database administrator then looks at the top five queries in the workload that are affected by adding the index and notices that two of the most expensive queries under the existing configuration were positively affected and that there were no queries that were negatively affected. The database administrator then decides to build the two-column index (B, A) and schedules the index to be built at midnight. Although described in the context of serving as an impact analysis tool for index selection, the framework of HCA engine 240 may be extended to analyze other features of physical database design, such as materialized views for example. In the foregoing description, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit or scope of the present invention as defined in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
|
Same subclass Same class Consider this |
||||||||||
