Access augmentation or optimizing

Method of generating attribute cardinality maps

6865567

Abstract

This invention provides a novel means for creating a histogram for use in minimizing response time and resource consumption when optimizing a query in a database, and other like structures, the histogram being created by placing ordered elements into specific range until the next element to be considered for inclusion in the range is a predetermined distance from the (generalized) mean value associated with the elements within the range, whereupon that next element is placed in the following range. Similarly, the following ranges are closed when the next element to be considered for inclusion in the range is greater than a predetermined distance from the (generalized) mean value associated with the elements in that range, whereupon that next element is placed in the following range. For each range, the location and size of the range is recorded with, for example, the mean value, the slope or other attribute characterizing one or more elements in the range. The invention has also applications in pattern recognition, message routing, and in actuarial sciences.


Claims

What is claimed is:

1. A computer-implemented method for estimating the computational efficiency of a search within a database comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements from the database and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range, each of the at least one range having at least an element, an arithmetic mean of each range equal to the arithmetic mean of the values associated with the at least an element within said range, a specific range from the at least one range comprising a plurality of elements from the data set adjacent each other within the defined order, wherein the arithmetic mean of the specific range is within a predetermined maximum distance from a value associated with an element within the specific range, the predetermined maximum distance being independent of the number of elements within the specific range and their associated values;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing said value related to an estimate of a value associated with an element within the range and data relating to the size and location of the range, to provide a histogram of said value associated with an element within the range and data relating to the size and location of the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

2. A method as defined in claim 1 wherein the at least one range comprises a plurality of ranges, and wherein some ranges from the plurality of ranges have a different number of elements and some ranges from the plurality of ranges have different areas, an area of each range equal to the product of the arithmetic mean of said range and the number of elements within said range.

3. A method as defined in claim 2 wherein the step of determining the at least one range is performed so as to limit variance between the values associated with the elements within a same range from the at least one range, the limitation forming further data of the histogram.

4. A method as defined in claim 3 wherein a value associated with each element within a range from the at least one range is within ##EQU239##

where .tau. is the tolerance value used in generating the R-ACM and i is the location of the element within a histogram sector of length l.

5. A method as defined in claim 3 wherein the at least a value related to an estimate of a value for an element within the range includes a value relating to the arithmetic mean, and wherein the at least data relating to the range comprises data relating to both endpoints of the range.

6. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. using a suitably programmed processor, defining a first bin as a current bin;

b. using the suitably programmed processor, selecting an element and adding it to the current bin as the most recently added element to the current bin;

c. selecting an element from within the data set, the selected element not within a bin and adjacent the most recently added element;

d. determining at least a mean of the values associated with elements within the bin;

e. when the most recently selected element differs from a mean from the at least a mean by an amount less than a predetermined amount, adding the most recently selected element to the current bin as the most recently added element to the current bin and returning to step (c);

f. when the selected element differs from the mean from the at least a mean by an amount more than the predetermined amount, creating a new bin as the current bin and adding the selected element to the new bin as the most recently added element to the current bin and returning to step (c); and,

g. providing data relating to each bin including data indicative of a range of elements within each bin as the determined at least one range.

7. A method as defined in claim 6 comprising the step of: providing a value, .tau., of the predetermined maximum distance.

8. A method as defined in claim 6 wherein the at least a mean comprises an arithmetic mean and wherein the mean from the at least a mean is the arithmetic mean.

9. A method as defined in claim 6 wherein adjacent elements are selected from a start location within the data set in order sequentially toward an end of the data set.

10. A method as defined in claim 6 wherein adjacent elements are selected from a start location in an alternating fashion toward a beginning of the data set and toward an end of the data set.

11. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. selecting an element from within the data set;

b. determining a bin with which to associate the element;

c. when the determined bin is empty, adding the element to the bin;

d. when the determined bin is other than empty, determining at least a mean of the values associated with elements within the determined bin;

e. when the most recently selected element differs from a mean from the at least a mean by an amount less than a predetermined amount, adding the most recently selected element to the determined bin and returning to step (a);

f. when the selected element differs from the mean from the at least a mean by an amount more than the predetermined amount, adding the selected element to the determined bin and dividing the determined bin into one of two bins and three bins, one of which includes the selected element and returning to step (a); and,

g. providing data relating to each bin including data indicative of a range of elements within the bin.

12. A method as defined in claim 11 wherein the at least a mean comprises an arithmetic mean and wherein the mean from the at least a mean is the arithmetic mean.

13. A method as defined in claim 12 comprising the step of: determining a first arithmetic mean of a first selected bin; determining a second arithmetic mean of a second selected bin adjacent the first selected bin; comparing the first and second arithmetic means; and, when the arithmetic means are within a predetermined distance of each other, merging the first selected bin and the second selected bin to form a single merged bin including all the elements of the first selected bin and all the elements of the second selected bin.

14. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. using a suitably programmed processor, defining a first bin as a current bin;

b. using the suitably programmed processor, selecting an element and adding it to the current bin as the most recently added element to the current bin;

c. selecting elements adjacent the most recently added element(s);

d. determining a first mean of the values associated with elements within the bin and determining a second mean of the selected elements;

e. when the second mean differs from the first mean by an amount less than a predetermined amount, adding the most recently selected elements to the current bin as the most recently added elements and returning to step (c);

f. when the second mean differs from the first mean by an amount more than the predetermined amount, creating a new bin as the current bin and adding at least one of the selected elements to the new bin as the most recently added element(s) to the current bin and returning to step (c); and,

g. providing data relating to each bin including data indicative of a range of elements within the bin.

15. A method as defined in claim 14 wherein the step of (f) includes the steps of:

(F1) determining a first element within the selected elements to add to the new current bin;

(F2) adding the selected element(s) before the first element to the previous current bin; and,

(F3) adding the selected element(s) from and including the first element to the new current bin.

16. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. using a suitably programmed processor, defining a first bin as a current bin;

b. using the suitably programmed processor, selecting an element and adding it to the current bin as the most recently added element to the current bin;

c. selecting an element from within the data set, the selected element not within a bin and adjacent the most recently added element;

d. determining the Generalized positive-2 mean of the current bin as the sum of the squares of the values associated with elements within the bin divided by the number of the elements within the bin and taking the square root of the resulting quotient;

e. determining the Generalized negative-2 mean of the current bin as the sum of the square-roots of the values associated with element within the bin divided by the number of the elements within the bin and squaring the resulting quotient;

f. when the value associated with the selected element is lower than the said Generalized positive-2 mean, determining a difference between the value associated with the selected element and the said Generalized positive-2 mean, and when the value associated with the selected element is higher than the said Generalized negative-2 mean, determining a difference between the value associated with the selected element and the said Generalized negative-2 mean;

g. when a difference is other than greater than the predetermined amount, adding the selected element to the current bin as the most recently added element to the current bin and returning to step (c);

h. when a difference is greater than the predetermined amount, defining a new bin as the current bin, adding the selected element to the current bin as the most recently added element to the current bin, and returning to step (c); and,

i. providing data relating to each bin including data indicative of a range of elements within each bin as the determined at least one range.

17. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. using a suitably programmed processor, defining a first bin as a current bin;

b. using the suitably programmed processor, selecting an element and adding it to the current bin as the most recently added element to the current bin;

c. selecting an element from within the data set, the selected element not within a bin and adjacent the most recently added element;

d. determining the Generalized positive-k mean of the current bin for a predetermined k as the sum of the k.sup.th powers of the values associated with elements within the bin divided by the number of the elements within the bin and taking the k.sup.th root of the resulting quotient;

e. determining the Generalized negative-k mean of the current bin as the sum of the k.sup.th roots of the values associated with elements within the bin divided by the number of the elements within the bin and raising the quotient to the k.sup.th power;

f. when the value associated with the selected element is lower than the said Generalized positive-k mean, determining a difference between the value associated with the selected element and the said Generalized positive-k mean, and when the value associated with the selected element is higher than the said Generalized negative-k mean, determining a difference between the value associated with the selected element and the said Generalized negative-k mean;

g. when a difference is other than greater than the predetermined amount, adding the selected element to the current bin as the most recently added element to the current bin and returning to step (c);

h. when a difference is greater than the predetermined amount, defining a new bin as the current bin, adding the selected element to the current bin as the most recently added element to the current bin, and returning to step (c); and,

i. providing data relating to each bin including data indicative of a range of elements within each bin as the determined at least one range.

18. A method as defined in claim 3 wherein the step of determining at least one range comprises the steps of:

a. using a suitably programmed processor, defining a first bin as a current bin;

b. using the suitably programmed processor, selecting an element and adding it to the current bin as the most recently added element to the current bin;

c. selecting an element from within the data set, the selected element not within a bin and adjacent the most recently added element;

d. determining a current largest value as the largest of the values associated with elements within the bin;

e. determining a current smallest value as the smallest of the values associated with elements within the bin;

f. when the value associated with the selected element is lower than the current largest value, determining a difference between the value associated with the selected element and the current largest value, and when the value associated with the selected element is higher than the current smallest value, determining a difference between the value associated with the selected element and the current smallest value;

g. when a difference is other than greater than the predetermined amount, adding the selected element to the current bin as the most recently added element to the current bin and returning to step (c);

h. when a difference is greater than the predetermined amount, defining a new bin as the current bin, adding the selected element to the current bin as the most recently added element to the current bin, and returning to step (c); and,

i. providing data relating to each bin including data indicative of a range of elements within each bin as the determined at least one range.

19. A method as defined in claim 18 wherein the predetermined amount is equal to 2(.tau.).

20. A method as defined in claim 1 wherein the step of determining at least one range comprises the steps of:

a. selecting a group of elements within the data set and adjacent one another within the ordering;

b. determining a mean of the values associated with each selected element within the selected group of elements;

c. comparing a value associated with each selected element in the group to the mean value to determine a difference;

d. when a value is different from the mean by more than a predetermined amount, returning to step (a); and,

e. when all values are different from the mean by less than or equal to the predetermined amount, creating a bin including the selected group of elements and returning to step (a).

21. A method as defined in claim 20 comprising the steps of

(F1) selecting an element adjacent the bin including the selected group of elements, the selected element other than an element within a bin;

(F2) determining a mean of the values associated with each element within the bin and the selected element; and

(F3) when the value of the selected element differs from the mean by less than or equal to the predetermined amount, adding the selected element to the bin and returning to step (F3).

22. A method as defined in claim 1 comprising the step of: providing the mean associated with the range as an estimate of a value associated with an element within said range.

23. A method as defined in claim 22 comprising the step of: estimating a reliability of the estimated value.

24. A method as defined in claim 23 wherein the estimated value is used for estimating the computational efficiency of a search within a database.

25. A method as defined in claim 1 comprising the step of: estimating a value associated with a selected range of elements as a sum of the products of the arithmetic mean for each range and a number of elements within both the range and the selected range.

26. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements from the database and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range, each of the at least one range having at least an element, an arithmetic mean of each range equal to the arithmetic mean of the values associated with the at least an element within said range, a specific range from the at least one range comprising a plurality of elements from the data set adjacent each other within the defined order, wherein the arithmetic mean of the specific range is within a predetermined maximum distance from a value associated with an element within the specific range, the predetermined maximum distance independent of the number of elements within the specific range and their associated values;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing at least a value related to an estimate of a value associated with an element within the range and at least data relating to the size and location of the range;

b. determining a routing table in dependence upon the histogram; and,

c. estimating a value for use in network routing in dependence upon the routing table.

27. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements from the database and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range, each of the at least one range having at least an element, an arithmetic mean of each range equal to the arithmetic mean of the values associated with the at least an element within said range, a specific range from the at least one range comprising a plurality of elements from the data set adjacent each other within the defined order, wherein the arithmetic mean of the specific range is within a predetermined maximum distance from a value associated with an element within the specific range, the predetermined maximum distance independent of the number of elements within the specific range and their associated values;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing at least a value related to an estimate of a value associated with an element within the range and at least data relating to the size and location of the range;

b. providing the mean associated with the range as an estimate of a value associated with an element within said range; and

c. using the estimate for determining an approach to searching within a plurality of different databases given a predetermined limited time for conducting the search, wherein the approach is selected to approximately maximise the probability of successfully completing the search.

28. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining a plurality of ranges, each of the plurality of ranges having at least an element, a known statistical correlation existing between values associated with elements in a same range, some ranges from the plurality of ranges comprising a plurality of elements from the data set adjacent each other within the defined order, the statistical correlation for those ranges indicative of a maximum error between an estimated value associated with an element within the range and the value associated with the element, the maximum error other than the difference between the total area of the range and the estimated value, wherein an area of each range is equal to the product of the arithmetic mean of said range and the number of elements within said range;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing at least a value related to an estimate of a value associated with an element within the range and at least data relating to the size and location of the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

29. A method as defined in claim 28 wherein a range from the some ranges define a range of values associated with elements within the range, and wherein the range of values has a maximum upper limit and a minimum lower limit, the lower limit other than zero and the upper limit other than the area of the range.

30. A method as defined in claim 29 wherein a value associated with each element within a range from the some ranges is within a known maximum error of an estimated value for that element and wherein the maximum error is different for some elements in a same range than for others.

31. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range having a length such that the value associated with at least one element within the range is within a predetermined maximum distance of at least one other element within the range;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing at least a value related to an estimate of a value associated with an element within the range and at least data relating to the size and location of the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

32. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range having a length such that the value associated with every element within the range is within a predetermined maximum distance of every other element within the range;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. for each range storing at least a value related to an estimate of a value associated with an element within the range and at least data relating to the size and location of the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

33. A computer readable program storage medium, the medium embodying one or more instructions executable by the computer to perform method steps, the method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining at least one range, each of the at least one range having at least an element, an arithmetic mean of each range equal to the arithmetic mean of the values associated with the at least an element within said range, a specific range from the at least a range comprising a plurality of elements from the data set adjacent each other within the defined order, wherein the arithmetic mean of the specific range is within a predetermined maximum distance from a value associated with an element within the specific range, the predetermined maximum distance independent of the number of elements within the specific range and their associated values;

iii. determining at least a value related to an estimate of a value associated with an element within the range; and

iv. storing for each range at least a value related to the mean and at least data relating to the size and location of the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

34. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. determining a range within the data set, the range comprising a plurality of elements from the data set and adjacent within the order; and,

iii. storing a plurality of values indicative of a straight line defining an approximate upper boundary of the values associated with each element within the range, the straight line indicating different values for different elements within the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

35. A method as defined in claim 34 wherein the range has a mean equal to the mean of the values associated with the plurality of elements within said range and an area of the range equal to the product of the mean of said range and the number of elements within said range; and comprising the steps of:

iv. providing a value associated with specific elements within the range; and

v. determining a straight line approximating the determined value for each of the specific elements and having an area therebelow determined according to the value of the straight line at the first element in the range plus the value of the straight line at the last element in the range which sum is multiplied by the number of elements within the range divided by 2, the area below the straight line being approximately equal to the area of the range.

36. A method as defined in claim 35 comprising the steps of:

vi. providing a second range within the data set, the second range adjacent the first range and comprising a plurality of elements from the data set and adjacent within the order, the provided second range having a mean equal to the mean of the values associated with the at least an element within said range and an area of the range equal to the product of the mean of said range and the number of elements within said range;

vii. providing a value associated with specific elements within the second range;

viii. determining a second straight line approximating the determined value for each of the specific elements within the second range and having an area therebelow determined according to the value of the second straight line at the first element in the second range plus the value of the second straight line at the last element in the second range which sum is multiplied by the number of elements within the second range divided by 2, the area below the straight line being approximately equal to the area of the second range; and, storing a plurality of values indicative of the second straight line.

37. A method as defined in claim 36 wherein the step of determining the straight line and of determining the second straight line are performed such that the adjacent endpoints of the first straight line and the second straight line are a same point.

38. A method as defined in claim 37 wherein the step of determining the second straight line is performed in dependence upon the endpoint of the first straight line adjacent the second range.

39. A method as defined in claim 35 wherein the step of determining a range is performed so as to limit variance between values associated with elements in the range and the straight line, the limitation forming further statistical data of the histogram.

40. A method as defined in claim 39 wherein the step of determining a range is performed so as to limit average error between some values associated with elements in the range and the straight line.

41. A method as defined in claim 39 wherein the step of determining a range is performed so as to limit least squared error between some values associated with elements in the range and the straight line.

42. A method as defined in claim 34 wherein the plurality of values are indicative of a range beginning, a range ending, a value at the range beginning and a value at the range ending.

43. A method as defined in claim 42 wherein the plurality of values includes data for determining a straight line approximating the values associated with elements within the range, and differing therefrom by an amount less than a known amount less than each associated value.

44. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. providing a range within the data set, the range comprising a plurality of elements from the data set and adjacent within the order;

iii. determining a straight line indicating different values for different elements within the range; and,

iv. storing a plurality of values indicative of a straight line defining an approximate upper boundary of the values associated with each element within the range, the straight line indicating different values for different elements within the range; and

b. using the histogram of data for estimating the computational efficiency of a search within a database.

45. A method as defined in claim 44 wherein the provided range has a mean equal to the mean of the values associated with the at least an element within said range and an area of the range equal to the product of the mean of said range and the number of elements within said range; and wherein the step (iii) comprises the steps of:

(iii1) providing a value associated with specific elements within the range; and,

(iii2) determining a straight line approximating the determined value for each of the specific elements and having an area therebelow determined according to the value of the straight line at the first element in the range, plus the value of the straight line at the last element in the range which sum is multiplied by the number of elements within the range divided by 2, the area below the straight line being approximately equal to the area of the range.

46. A method as defined in claim 45 comprising the steps of:

v. providing a second range within the data set, the second range adjacent the first range and comprising a plurality of elements from the data set and adjacent within the order, the provided second range having a mean equal to the mean of the values associated with the at least an element within said range and an area of the range equal to the product of the mean of said range and the number of elements within said range;

vi. providing a value associated with specific elements within the second range;

vii. determining a second straight line approximating the determined value for each of the specific elements within the second range and having an area therebelow determined according to the value of the second straight line at the first element in the second range, plus the value of the second straight line at the last element in the second range which sum is multiplied by the number of elements within the second range divided by 2, the area below the straight line approximately equal to the area of the second range; and, storing a plurality of values indicative of the second straight line.

47. A method as defined in claim 46 wherein the step of determining the straight line and of determining the second straight line are performed such that the adjacent endpoints of the first straight line and the second straight line are a same point.

48. A method as defined in claim 47 wherein the step of determining the second straight line is performed in dependence upon the endpoint of the first straight line adjacent the second range.

49. A method as defined in claim 48 wherein the step of determining the straight line is performed so as to limit variance between values associated with elements in the range and the straight line, the limitation forming further statistical data of the histogram.

50. A method as defined in claim 49 wherein the step of determining the straight line is performed so as to limit average error between some values associated with elements in the range and the straight line.

51. A method as defined in claim 49 wherein the step of determining the straight line is performed so as to limit least squared error between some values associated with elements in the range and the straight line.

52. A method as defined in claim 34 comprising the step of estimating a value associated with an element based on the location of the straight line at the element.

53. A method as defined in claim 57 comprising the step of estimating a reliability of the estimated value.

54. A method as defined in claim 44 comprising the step of estimating a value associated with an element based on the location of the straight line at the element.

55. A method as defined in claim 54 comprising the step of estimating a reliability of the estimated value.

56. A method as defined in claim 54 comprising the step of using the estimated value, estimating the computational efficiency of a search within a database.

57. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. providing a range within the data set, the range comprising a plurality of elements from the data set and adjacent within the order;

iii. determining a straight line indicating different values for different elements within the range; and,

iv. storing a plurality of values indicative of a straight line defining an approximate upper boundary of the values associated with each element within the range, the straight line indicating different values for different elements within the range;

b. determining a routing table in dependence upon the histogram; and,

c. determining an estimate of a value within the routing table for determining a network routing.

58. A computer-implemented method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. providing a range within the data set, the range comprising a plurality of elements from the data set and adjacent within the order;

iii. determining a straight line indicating different values for different elements within the range; and,

iv. storing a plurality of values indicative of a straight line defining an approximate upper boundary of the values associated with each element within the range, the straight line indicating different values for different elements within the range;

b. estimating a value associated with an element based on the location of the straight line at the element; and

c. using the estimate for determining an approach to searching within a plurality of different databases given a predetermined limited time for conducting the search, wherein the approach is determined to approximately maximise the probability of successfully completing the search.

59. A method as defined in claim 34 comprising the step of estimating a value associated with a selected range of elements as a sum of areas below the straight lines for portions of ranges within the selected range.

60. A method as defined in claim 44 comprising the step of estimating a value associated with a selected range of elements as a sum of areas below the straight lines for portions of ranges within the selected range.

61. A computer readable program storage medium, the medium embodying one or more instructions executable by the computer to perform method steps, the method comprising the steps of:

a. generating a histogram of data, comprising the steps of:

i. providing a data set representing a plurality of elements and a value associated with each element, the data set having a property defining an order of the elements therein;

ii. providing a range within the data set, the range comprising a plurality of elements from the data set and adjacent within the order;

iii. determining a straight line indicating different values for different elements within the range; and,

iv. storing a plurality of values indicative of a straight line defining an approximate upper boundary of the values associated with each element within the range, the straight line indicating different values for different elements within the range and

b. using the histogram of data for estimating the computational efficiency of a search within a database.


Description

1. FIELD OF THE INVENTION

The present invention relates generally to estimation involving generalized histogram structures and more particularly to a method of result size estimation in query optimization.

2. BACKGROUND OF THE INVENTION

One of the main uses of computers is storing and retrieving large amounts of data efficiently. The computer system used for this purpose are known as database systems and the software that manages them are known as database management systems (DBMS). The DBMS facilitates the efficient management of data by (i) allowing multiple users concurrent access to a single database, (ii) restricting access to data to authorized users only, and (iii) providing recovery from system failures without loss of data integrity. The DBMS usually provides an easy to use high-level query/data manipulation language such as the Structured Query Language (SQL) as the primary interface to access the underlying data.

SQL, the most commonly used language in modern-day DBMSs, is a declarative language. Thus, it shields users from the often complex procedural details of accessing and manipulating data. Statements or commands expressed in SQL are generally issued by the user directly, using a command-line interface. The advantage of the declarative SQL is that the statements only need to specify what answer is expected, and not how it should be computed. The actual sequence by which an SQL command is computed is known as the procedural Query Evaluation Plan (QEP). The procedural QEP for a given non-procedural SQL statement is generated by the DBMS and executed to produce the query result. Typically, for a given query, there are many alternative procedural QEPs that all compute the result required. Each QEP, however, has its own cost in terms of resource use and response time. The cost is usually expressed in terms of the I/O operations such as the number of disk reads and writes, and the amount of CPU work to execute a given QEP. The problem of devising the best procedural QEP for a query so as to minimize the cost is termed query optimization.

In all brevity, given a declarative SQL query, the DBMS's Query Optimizer module determines the best possible procedural QEP to answer it. In order to do this, the query optimizer uses a model of the underlying system to select from a large set of candidate plans an efficient plan as quickly as possible. Efficiency of the QEP is measured in terms of resource utilization and response time.

The cost incurred in evaluating a QEP is proportional to the number of operations, including disk reads, writes and CPU work required to compute the final answer from the base relations. The size of the final result of a query as well as the sizes of the base relations will be the same regardless of which QEP, from among many possible candidate QEPs, is chosen by the query optimizer. Hence the cost of a QEP depends on the size of the intermediate relations generated during the computation of the query, as this is the single most important factor responsible for the difference in the costs of various QEPs of the given query. Hence by choosing a QEP that has smaller intermediate relations then other QEPs, it is possible to minimize the cost involved in computing the final result of the given query. Although this is easy to explain, due to the large number of possible alternative QEPs, computing the sizes of the intermediate relations accurately for each possible QEP is virtually an impossible task. Hence, one approach is to approximately estimate the sizes of the intermediate relations.

A complete set of glossary is found in many undergraduate database text books, including [Elmasri and Navathe, 1994], pp 137-177.

2.1 Query Optimization: An Overview

Query optimization for relational database systems is a combinatorial optimization problem, which makes exhaustive search unacceptable as the number of relations in the query increases. The query optimization process is generally divided into three distinct phases, namely query decomposition, query optimization and query execution as shown in FIG. 1.

In the query decomposition module, the declarative SQL query is first scanned, parsed and validated. The scanner sub-module identifies the language components in the text of the query, while the sparser sub-module checks the query syntax. The validator checks that all attribute and relation names are valid and semantically meaningful. The query is then translated into an internal format expressed in relational algebra in the form of a query Operator Tree. Using this operator tree as its input, the query optimizer module searches for a procedural plan with an optimal ordering of the algebraic operators. This optimal procedural plan is represented by an annotated query tree. Such trees encode procedural choices such as the order in which operators are evaluated and the method for computing each operator. Each tree node represents one or several relational operators. Annotations on the node represent the details of how it is to be executed. For example, a join node may be annotated as being executed by a hash-join, and a base relation may be annotated as being accessed by an index-scan. The choices of the execution algorithms are based on the database and system characteristics, for example, the size of the relations, available memory, type of indexes on a given attribute etc.

This fully annotated operator tree with the optimal QEP is then passed on to the query execution engine where all the low level database operations are carried out and the answer to the query is computed. An example annotated query tree is shown in FIG. 2.

2.1.1 An Example

FIG. 3 shows an example relational database from a Property Tax Assessor's office.

This database consists of three relations, namely,

TaxPayer(SIN, CityCode, Name, DOB, Balance),

Property(Type, OwnerSIN, Tax) and

City(CityCode, CityName, Population).

Assume that these relations have the following cardinalities:

.vertline.TaxPayer.vertline.=12.times.10.sup.6 tuples,

.vertline.Property.vertline.=3.times.10.sup.6 tuples, and

.vertline.City.vertline.=4000 tuples.

Given these pieces of information, it is possible to answer the following declarative query:

Query: Find the name, city and tax information of all property owners.

To illustrate the large difference in the execution cost that exists among various QEPs, two QEP's are compared for this query using a simple cost model based on the number of I/O operations. Since the disk I/O is usually the dominant factor.sup.1 in the query response time, it is assumed the cost of evaluation a QEP is given by the total number of all tuples read and written to generate the final answer. In addition, the cost of reading the two input relations and then writing the resulting relation back on to the disk is also considered.

.sup.1 Actually, a realistic cost model should include many other factors such as various join algorithms, availability of indexes and other auxiliary access methods, effects of caching and available memory, data skew etc.

From the set of many alternatives, two possible QEPs shown in FIG. 4 are analyzed below. The QEPs execute the answer to the query as described below:

Query Evaluation Plan 1:

In the first QEP, the relations TaxPayer and City are joined to determine the city information for each person where he or she lives. This join generate an intermediate relation. This intermediate relation is then joined with the relation Property to compute the final results.

The first join requires the relations TaxPayer and City to be read. This results in 12.times.10.sup.6 +4000 reads. Assuming the result of this join is written back to the disk, it would require 12.times.10.sup.6 writes. Note that the size of the intermediate result is 12.times.10.sup.6. The second join requires the above intermediate relation and the relation Property to be read. This involves an additional 12.times.10.sup.6 +3.times.10.sup.6 =15.times.10.sup.6 reads. Since the final result contains 3.times.10.sup.6 tuples, it requires that many write operations. Hence the total number of operations required for the first QEP is N.sub.1 =(12.times.10.sup.6 +4000)+(12.times.10.sup.6)+(15.times.10.sup.6)+(3.times.10.sup. 6)=42,004,000.

Query Evaluation Plan 2:

Joining the relation TaxPayer and Property, an intermediate relation with all the property owners can be obtained. This intermediate relation is then joined to the relation City to determine the information of the city for each tax payer.

In the second, QEP, the first join requires 12.times.10.sup.6 +3.times.10.sup.6 reads and 3.times.10.sup.6 writes. The second join requires another 3.times.10.sup.6 +4000 reads and 3.times.10.sup.6 writes. Hence the total number of operations required for the second QEP is N.sub.2 =(12.times.10.sup.6 +3.times.10.sup.6)+(3.times.10.sup.6)+(3.times.10.sup.6 +4000)+(3.times.10.sup.6)=24,004,000, which is almost half of the first QEP.

Using a simple cost model, this short example with two different QEPs illustrates that the cost of one plan is sometimes half of the other. Most of the real-world queries are complex queries with many relations, and with a more sophisticated realistic cost model, one can generate QEPs with substantially different costs. The task of a query optimizer is to judiciously analyze the possible QEPs and choose the one with the minimal cost.

2.2 Goals of Query Optimization

The main goals of a query optimizer are that of minimizing both the response time and resource consumption. These optimization goals are often conflicting. For example, a QEP which computes the result of a query quickly but requires all available memory and CPU resources is probably not desirable because it would virtually deprive other users from accessing the database.

Finding good solutions for a query is resource intensive, but can reduce the actual evaluation cost considerably. For example, let T.sub.o be the time taken to find an optimal QEP and let T.sub.c be the time taken to execute the optimal QEP and obtain the results. If the average time taken to execute a random QEP is T.sub.avg, then ideally T.sub.o +T.sub.c <<T.sub.avg.

As queries get more complex in terms of the number of relations involved or alternative algorithms for computing an operator the number of potential alternative QEPs that to be considered explodes. The number of alternatives quickly increases with the number of relations etc. into the order of millions, while the differences between the cheapest and most expensive QEP can easily be several orders of magnitude. Even with simple string queries with n relations, there are (2(n-1))!/(n-1)! different join orders. For joins involving small number of relations, this number is acceptable; for example, with n=5, the number is 1680. However, as n increases, this number rises quickly. With n=10, the number of different join orders is greater than 176 billion!

As a rule of thumb, for small queries with no more than four or five relations all QEPs are generated within a few seconds. In this case the optimization time, T.sub.o, is often a fraction of the response time improvement gained.

Apart from the fact that the execution space of QEPs is usually very large, the computation of a single join operation itself is one of the most time consuming tasks. To compute the join of N relations, (N-1) dyadic join operations have to be performed. Since the size of the relations joined determines the cost of a single join operation, the order in which all N relations are joined is preferably chosen in such a way so as to minimize the overall cost of computing the join of N relations.

Unfortunately, finding the optimal join order has been proven to be an NP-hard problem while, at the same time, it is one area where considerable cost benefits can be derived. Only for specific join ordering cases, exact and optimal solutions have been obtained [Boral and Zaniolo, 1986, Ibaraki and Kameda, 1984]. In addition, most of the optimization techniques proposed in the literature cannot be applied to large queries. For example, two popular early day database systems, namely, System R and Ingres use algorithms that essentially perform an exhaustive search over the entire QEP domain. This is probably adequate in their implementation because they do not allow large queries with more than 15 joins.

The problem of optimizing a query can be formally states as follows:

Problem 1 Given a relational query Q, an execution space E and a cost function C(QEP) over elements of E, find the query evaluation plan QEP.di-elect cons.E such that,

(1) QEP computes Q

(2) There does not exist QEP'.di-elect cons.E such that QEP' also computes Q and C(QEP')<C(QEP).

where the execution space E is the space of QEPs considered by the optimizer.

Finding an optimal QEP is a computationally intractable problem, especially when the query involves more than, say, ten relations. One obvious approach is to decrease the size of the execution space E, but although it would reduce the time and/or space requirement, it has the tradeoff of reducing the chances of finding good plans. Another approach consists of estimating the costs of QEPs using the standard statistical information available in the database catalog. In fact, most commercial DBMSs use some form of statistics on the underlying data information about the available resources in order to estimate the cost of a query plan approximately. Since these

                             TABLE 1
    Statistical Information found in a typical DBMS Catalogue
        Notation    Explanation
        N.sub.R     Number of tuples in relation R
        b.sub.R     Number of disk blocks storing relation R
        s.sub.R     Size of a tuple of relation R in bytes
        bf.sub.R    Blocking factor of relation R. This is the number of
                    tuples of relation R that fit into one disk block
        .delta.(A, R) Number of distinct values of attribute A in
                    relation R.
        .phi.(A, R) Selection cardinality of attribute A in relation R.
                    Given a relation R and an attribute A of the relation,
                    .phi.(A, R) is the average number of records that
                    satisfy an equality condition on attribute A.


statistics are used to provide approximate estimates of query costs, the validity of the optimizer's decisions may be affected. On the other hand, since optimizers use these costs only for comparison purposes, approximate estimates for the costs of the QEPs are usually sufficient as long as these estimates are reasonably close to the actual cost of executing the QEP. For example, if the actual time units of execution of two plans P.sub.1 and P.sub.2 are 10 and 20 respectively, and the estimated times are 5 and 15, the optimizer will still pick P.sub.1 as the less expensive plan, and this will be done correctly. On the other hand, if the estimated times are 5 and 4, the optimizer will pick the suboptimal plan P.sub.2. This is an extremely important issue because costs of plans often differ considerably, and choosing a suboptimal plan often results in severe performance degradation, defeating the very purpose of query optimization.

Query optimizers make use of the statistical information stored in the DBMS catalogue to estimate the cost of a QEP. Table 1 lists some of the commonly used statistical information utilized in most of the current commercial DBMSs.

In addition to these pieces of information, most DBMSs also maintain information about the indices.sup.2 in their catalogues. These pieces of information include values such as the average fan-out, number of levels in an index etc.

.sup.2 Indices refer to the actual data structures used to store and retrieve the physical records of base and intermediate relations.

Since the actual evaluations of all possible QEPs in space E is very difficult, like all other investigators in this area, one approach is to approximately estimating the costs of QEPs.

Consider the cost C of joining two relations R.sub.1 and R.sub.2 on a common attribute X, using the nested-loop join algorithm [Selinger et al., 1979]. If b.sub.R.sub..sub.1 and b.sub.R.sub..sub.2 are the number of disk blocks required for relations R.sub.1 and R.sub.2 respectively, then the cost formula in terms of number of disk accesses is, ##EQU1##

where js is the join selectivity.sup.3 of these other operators and bf.sub.R.sub..sub.1 .sub.R.sub..sub.2 is the blocking factor for the resulting relation.

.sup.3 Selectivity of an operator is the ratio of the result size and the product of its input sizes. For the join operator, the join selectivity js=.vertline.(R.sub.1 {character pullout}.sub.c R.sub.2).vertline./(.vertline.R.sub.1.vertline..times..vertline.R.sub. 2.vertline.)

From the above cost formula, it is apparent that the following quantities are crucial for the query optimizer in order to estimate the cost of such an operation:

(a) Query Result Size: The cost formula for the above join algorithm depends on the sizes of the input relations, which is true for nearly all relational operators. In a query with several operators, an input to an operator may itself be the result of another operator(s). This shows the importance of developing techniques to estimate the result sizes of these other operators.

(b) Query Result Distributions: The result size of a join or any other relational operator depends mainly on the data distribution(s) of its input relation(s). Again the input to this operator may itself be the relation which results from invoking another operator(s). This means that techniques to estimate the distribution of a query result are also desirable.

(c) Disk Access Costs: This is the cost of searching for, reading, and writing data blocks that reside on secondary storage, mainly on disk. The cost of searching for records in a file depends on the type of access structures on that file, such as its ordering, hashing, and primary or secondary indexes. In addition, factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered on the disk affect the access cost.

The design of a good query optimizer involves developing techniques to efficiently and accurately estimate the various query result sizes. The objective of this invention is to provide new techniques that would provide more accurate query result size estimation results than the state-of-the-art methods used in the current database systems. There are other topics of concern in query optimization such as implementation algorithms for various relational operations, physical access plans for retrieving records etc., but they are not addressed herein.

In order to completely solve the above estimation problem, it is useful to develop result size estimation techniques for each of the relational operators. Most of the current commercial DBMSs employ methods based on one or more of the following techniques to estimate the above quantities:

1. Sampling based technique

2. Parametric techniques

3. Probabilistic counting techniques and

4. Non-parametric or histogram based techniques.

These techniques are briefly reviewed below. The accuracy of these estimates is often of critical importance since the selection of an optimal QEP from an exponential number of alternatives solely depends on these estimates. Despite the widespread usage of the above techniques in many commercial database systems, their accuracy has not been studied extensively.

3 PRIOR ART

Query optimization has been a very active research field in the database community for the last two decades.

Apart from the area itself being vast, the techniques used to optimize a query search itself are varied. At the top most level, the query optimizer has to distinguish the various query evaluation plans. This is a problem in its own right. The question of pruning these query evaluation plans, and discarding the less promising ones without actually processing them in any depth, is itself, a huge area of research. Additionally, for any given query evaluation plan, the question of estimating its efficiency without actually processing the query is a problem that has captivated much research.

3.1 A Top-down Approach to Query Optimization

Early work in query optimization essentially followed two distinct tracks, namely, the bottom up and top down. Database researchers found the query optimization problem, when posed in a general context, to be both difficult and complex--and indeed, intractable with regard to optimality. They, in the early days, mainly undertook a bottom-up approach, considering only special cases. This included, for example, finding "optimal" strategies to implement major relational operations and finding "optimal" methods to execute simple subclasses of queries. As better strategies emerged with time, researchers attempted to obtain "optimal" strategies for relatively larger and more complex queries by composing the results from the smaller building blocks.

As the number and sizes of the commercial databases grew, the need for functional optimization systems resulted in the development of full-scale query evaluation techniques. These techniques provided more general solutions and handled query optimization in a uniform, though heuristic manner [Astrahan and Chamberlin, 1975, Makinouchi et al., 1981, Niebuhr and Smith, 1976, Wong and Youseffi, 1976]. Obviously, generating a query optimization plan out of smaller building blocks, often did not achieve optimal system efficiency. This led the researchers to move towards a top-down approach that incorporated more knowledge about special-case optimization opportunities into the general procedures. This top-down approach also incorporated many general algorithms, which were augmented by combinatorial cost-minimization techniques for choosing between the various potential query evaluation plans.

Using the top-down approach, the general query optimization is divided into the following steps:

(1) Use logical transformation techniques to change the given query so that (a) the query is represented in a standardized form, (b) the query is simplified such that the same operation is not computed twice, and (c) the query is ameliorated so as to organize the computation to apply procedures developed for special case scenarios.

(2) Represent the transformed query as alternative sequences of simple elementary relational operations that are computed using known implementation techniques. This makes it possible to compute the associated execution cost for the evaluation plan. This step generates a set of potential access plans that simplify the process of evaluating the original query.

(3) Compute the execution cost for each access plan in the above access plan set.

(4) Execute the cheapest access plan and return the results.

The first step of this procedure involves logical transformation of the query, and thus it is, generally, independent of the underlying data. Hence this step is often handled at compile time. But in order to properly carry out steps (2) and (3), and to generate optimal query evaluation plan, sufficient knowledge about the underlying data distribution is required.

The face that steps (2) and (3) require some prior knowledge about the underlying data poses a number of difficulties. First of all, if the data distribution is dynamic, steps (2) and (3) can be carried out only at run time. Consequently some gain in the overall execution efficiency is sacrificed in order to minimize the optimization cost. Another difficulty is that the DBMS should maintain a catalogue or a meta-database about the pertinent statistical information about the underlying data distributions. It is obvious that in order to benefit from such as optimization strategy one must compare the costs of gathering and maintaining such information in the DBMS catalogue against the cost savings in the overall optimization process.

3.1.1 Query Transformation

Any relational algebraic query can be transformed into a number of semantically equivalent expressions. The work on transformation of a given expression into an equivalent query by means of well defined transformation rules is discussed below. The objectives of query transformation can be described as the following:

1. Obtaining a normalized starting point for query optimization (called standardization).

2. Eliminating the duplication of effort (called simplification) and

3. Generating query expressions that have superior evaluation performance compared to the initial query (called amelioration).

A number of works [Jarke and Schmidt, 1981, Kim, 1982, Palermo, 1972] have been done to define a normalized starting post for representing a query for the purpose of query optimization.

One of the main differences between any two equivalent expressions is their degree of redundancy [Hall, 1976]. In other words executing a query with redundant expression results in carrying out a number of unnecessary duplicated operations. Hence query simplification steps should involve the transformation of a redundant query expression into an equivalent non-redundant one by applying simple logical transformation rules listed below.

3.1.2 Transformation Rules

A transformation or equivalence rules says that expressions of two forms are equivalent. This implies that one expression can be transformed to the other while preserving equivalence. When two relations have the same attributes ordered in a different manner, and equal number of tuples, the two expressions that generate them are considered to be preserve equivalent. Modern-day query optimizers heavily use equivalence rules to transform initial expressions into simpler logically equivalent expressions.

A number of commonly used general equivalence rules for transforming relational algebraic expressions are given below. Three equivalent transformations are shown in FIG. 5. The predicates in the query expressions are denoted by .theta..sub.i, attributes are denoted by L.sub.i, and the algebraic expression is denoted by E.sub.i. Since the result of a relational algebraic expression itself is a relation, a relation name R can appear anywhere in place of an expression.

1. Conjunctive selection operations can be simplified into a sequence of individual selections. This transformation operation is known to as a cascade of .sigma..

.sigma..sub..theta..sub..sub.1 .sub..LAMBDA..theta..sub..sub.s (E)=.sigma..sub..theta..sub..sub.1 (.theta..sub..theta..sub..sub.2 (E)).

2. Selection operations applied successively onto a relation are commutative.

.sigma..sub..theta..sub..sub.1 (.theta..sub..theta..sub..sub.2 (E))=.sigma..sub..theta..sub..sub.2 (.sigma..sub..theta..sub..sub.1 (E)).

3. In a sequence of projection operations, only the final operation is required to compute the results. This transformation operation is known as a cascade of .pi..

.pi..sub.L.sub..sub.1 (.pi..sub.L.sub..sub.2 ( . . . (.pi..sub.L.sub..sub.n (E)) . . . )=.pi..sub.L.sub..sub.1 (E).

4. Selections can be processed together with Cartesian products and .theta.-joins.

.sigma..sub..theta. (E.sub.1.times.E.sub.2)=E.sub.1 {character pullout}.sub..theta. E.sub.2. (a)

This expression is indeed the definition of the theta join.

.sigma..sub..theta..sub..sub.1 (E.sub.1.times..sub..theta..sub..sub.2 E.sub.2)=E.sub.1 {character pullout}.sub..theta..sub..sub.1 .sub..di-elect cons..theta..sub..sub.2 E.sub.2. (b)

5. A theta-join operation is commutative.

E.sub.1 {character pullout}.sub..theta. E.sub.2 =E.sub.2 {character pullout}.sub..theta. E.sub.1.

Note that the natural join operation is also commutative, since the natural-join operator is actually a special case of the theta-join operator.

6. (a) Natural join operations are associative.

(E.sub.1 {character pullout}E.sub.2){character pullout}E.sub.3 =E.sub.1 {character pullout}(E.sub.2 {character pullout}E.sub.3).

(b) Theta joins can be associative as shown below.

(E.sub.1 {character pullout}.sub..theta..sub..sub.1 E.sub.2){character pullout}.sub..theta..sub..sub.2 .sub..di-elect cons..theta..sub..sub.3 E.sub.3 =E.sub.3 =E.sub.1 {character pullout}.sub..theta..sub..sub.1 .sub..di-elect cons..theta..sub..sub.3 (E.sub.2 {character pullout}.sub..theta..sub..sub.2 E.sub.3).

where .theta..sub.2 involves attributes from only E.sub.2 and E.sub.3. Since any of these predicates can be empty, the Cartesian product (x) is an associative operation. It is important to observe that the properties of commutativity and associativity of join operations are important for join reordering in query optimization.

7. The selection operation and the theta join operation are distributive under the conditions states below:

(a) Selection operation distributes over the theta join operation when all the attributes in selection predicate .theta..sub.0 involve only the attributes of one of the expressions (E.sub.1) being joined.

.sigma..sub..theta..sub..sub.0 (E.sub.1 {character pullout}.sub..theta. E.sub.2)=(.sigma..sub..theta..sub..sub.0 (E.sub.1)){character pullout}.sub..theta. E.sub.2.

(b) Selection operation distributes over the theta distribution when selection condition .theta..sub.1 involves only the attributes of E.sub.1 and .theta..sub.2 involves only the attributes of E.sub.2.

.sigma..sub..theta..sub..sub.1 .sub..di-elect cons..theta..sub..sub.2 (E.sub.1 {character pullout}.sub..theta. E.sub.2)=(.sigma..sub..theta..sub..sub.1 (E.sub.1)){character pullout}.sub..theta. (.sigma..sub..theta..sub..sub.2 (E.sub.2)).

8. The projection operation and the theta join operation are distributive.

(a) Suppose the attributes L.sub.1 and L.sub.2 belong to the expressions, E.sub.1 and E.sub.2, respectively. Assume that the join predicate .theta. has only attributes in L.sub.1.orgate.L.sub.2. This means,

.pi..sub.L.sub..sub.1 .sub..orgate.L.sub.2 (E.sub.1 {character pullout}.sub..theta. E.sub.2)=(.pi..sub.L.sub..sub.1 (E.sub.1)){character pullout}.sub..theta. (.pi..sub.L.sub..sub.2 (E.sub.2)).

(b) Consider a join E.sub.1 {character pullout}.sub..theta. E.sub.2. Assume the attributes, L.sub.1 and L.sub.2 belong to the expression E.sub.1 and E.sub.2, respectively. Assume the attribute L.sub.3 is from the expression E.sub.1 which are involved in join predicate .theta., but are not in L.sub.1.orgate.L.sub.2, and let attribute L.sub.4 of E.sub.2 that are involved in join condition .theta., but are not in L.sub.1.orgate.L.sub.2. This means,

.pi..sub.L.sub..sub.1 .sub..orgate.L.sub..sub.2 (E.sub.1 {character pullout}.sub..theta. E.sub.2)=.pi..sub.L.sub..sub.1 .sub..orgate.L.sub..sub.2 ((.pi..sub.L.sub..sub.1 .sub..orgate.L.sub..sub.3 (E.sub.1)){character pullout}.sub..theta. (.pi..sub.L.sub..sub.2 .sub..orgate.L.sub.4 (E.sub.2))).

9. The union operation and the intersection are commutative set operations.

E.sub.1.orgate.E.sub.2 =E.sub.2.orgate.E.sub.1 and E.sub.1.andgate.E.sub.2 =E.sub.2.andgate.E.sub.1.

It is important to note that set difference operation is not commutative.

10. The union operation and intersection operation are associative set operations.

(E.sub.1.orgate.E.sub.2 (.orgate.E.sub.3 =E.sub.1.orgate.(E.sub.2.orgate.E.sub.3) and (E.sub.1.andgate.E.sub.2).andgate.E.sub.3 =E.sub.1.andgate.(E.sub.2.andgate.E.sub.3).

11. The selection operation distributes with the union, intersection, and set difference operations.

.sigma..sub.P (E.sub.1 -E.sub.2)=.sigma..sub.P (E.sub.1)-.sigma..sub.P (E.sub.2).

When substituting the set difference (-) with either one of .orgate.or .andgate., selection operation is distributive with the union and intersection operations.

12. The projection and union operations are distributive.

.pi..sub.L (E.sub.1.orgate.E.sub.2)=(.pi..sub.L (E.sub.1)).orgate.(.pi..sub.L (E.sub.2)).

The proof of the above equivalence is straightforward and is presented in the work of [McKenna, 1993]. Also, the preceding list is only a partial list of equivalences. More equivalence involving extended relational operators, such as the outer join and aggregation, are constructed from the above list.

The process of query simplification does not always produce a unique expression. Many non-redundant expressions exist that are equivalent to the one generated by a given equivalence transformation. The evaluation of expressions corresponding to a given query may differ significantly with respect to performance parameters, such as the size of the intermediate results and the number of relation elements accessed.

3.2 Query Evaluation

3.2.1 One-Variable Expressions

These are the simplest algebraic expressions used to select a given set of attribute values from a single relation. A simple approach to evaluate such an expression is to read every corresponding attribute value of the relation, and to check whether it satisfies each term of the expression. Obviously this approach is very expensive, especially when the query expression involves large relations. In these situations, there have been many techniques proposed to minimize the number of attribute values accessed and the number of tests applied to an accessed attribute value. These are briefly discussed below.

One of the commonly employed techniques to minimize the number of record accesses is by means of data structures that provide direct access paths, so that an exhaustive sequential search is avoided. This is accomplished by maintaining the relation in a sorted form with respect to one or more attributes so that records are easily accessed. This also facilitates a binary search on the sorted attribute values. A use of such a technique is the evaluation of range queries [Bolour, 1991, Davis and Winslow, 1982].

Another commonly used technique is hashing. Hashing technique has many variants and its major advantage is that it provides fast direct access to records without requiring that the attribute values are maintained in a sorted order.

In addition to hashing, techniques used indexes play an important role is providing direct and ordered access to records. Indexes are often combined with multi-list structures. The idea of an index is based on a binary relationship between records being accessed. Specifically the index method uses tuple identifiers or keys to establish a binary relationship between records, so that traversal becomes much faster than a sequential search. The indexing method is either one-dimensional or multi-dimensional A one-dimensional index implements record access based on a single relation attribute, whereas a multidimensional index implements record access based on a combination of attributes. Two popular approaches for implementing one-dimensional indexes are ISAM [Corporation, 1966] and B-tree [Bayer and McCreight, 1972] structures. A detailed discussion on multidimensional index structures is found in the work of Bentley and Friedman [Bentley and Friedman, 1979].

The logical transformation rules are often applied at run time to a query expression to minimize the number of tests applied to an accessed record or attribute value of a relation. Changing the order in which individual expression components are evaluated sometimes results in performance increases. A number of techniques utilizing a priori probabilities of attribute values, to obtain optimal evaluation sequences for specific cases are discussed in [Hanani, 1977].

3.2.2 Two-Variable Expressions Query expression that describe conditions for the combination of records from two relations are known as two-variable expressions. Usually two-variable expressions are formed by monadic sub-expressions and dyadic sub-expressions. Monadic expressions restrict single variables independently of each other, whereas dyadic expressions provide the relationship between both variables. In what follows the basic strategies for the evaluation of a single dyadic expression, and methods of the evaluation of any two-variable expressions are discussed.

Method 1 Nested_Iteration_Join

for i:=1 to N.sub.1 do

read the i.sup.th element of R.sub.1 ;

for j:=1 to N.sub.2 do

read j.sup.th element of R.sub.2 ;

compute the join based on join predicate;

end;

end;

There are two major approaches to implementing the join operation, namely order-dependent and order-independent strategies. The nested iteration method is a simple approach that is independent of the order of record access. In this method, every pair of records from the relation is accessed and concatenated if the join predicate is satisfied. A nested iteration based join list is Method 1.

Let N.sub.1 and N.sub.2 be the number of records of the relations read in the outer and inner loops respectively. It is clear that N.sub.1 +N.sub.1 *N.sub.2 disk accesses are required to evaluate the dyadic term, assuming that each record access requires one disk access.

The nested iteration method is improved by using an index on the join attribute(s) of the relation R.sub.2. This strategy does not require accessing R.sub.2 sequentially for each record of R.sub.1 as the matching R.sub.2 records are accessed directly [Griffeth, 1978, Klug, 1982]. Hence using an index N.sub.1 +N.sub.1 *N.sub.2 *j.sub.12 record accesses are needed, where j.sub.12 is a join selectivity factor representing the Cartesian product of R.sub.1 and R.sub.2.

In a paged-memory environment, the idea of the method iteration method is extended to be nested block method [Kim, 1982]. In this method, a main memory buffer is used to hold one or more pages of both relations, where each page contains a set of records.

The nested block algorithm is essentially similar to the nested iteration method. The only difference is that memory pages are red instead of single records of the relation. Using this approach the number of disk accesses required to compute the join is reduced to P.sub.1 +(P1/B1)*P.sub.2, where P.sub.1 and P.sub.2 are the number of pages required to store the outer and inner relations of the algorithm respectively and B.sub.1 is the number of pages of the outer relation occupying the main memory buffer. It is easy to observe that it is always more efficient to read the smaller relation in the outer loop (i.e., make P.sub.1 <P.sub.2). If the smaller relation is maintained entirely in the main memory buffer, then only P.sub.1 +P.sub.2 disk accesses are needed to form the join.

Another approach, based on the order in which records of the relations are retrieved, is known as the merge method. In this approach, both relations are scanned in ascending or descending order of join attribute values and merged according to the join predicate. Almost N.sub.1 +N.sub.2 +S.sub.1 +S.sub.2 disk accesses are required to form the join, where S.sub.1 and S.sub.2 are the number of disk accesses needed to sort the relations being joined. This method is an excellent choice when both relations are already sorted by the same criterion. Of course, when the smaller relation is completely maintained in the main memory buffer, the nested block method is more efficient. Whenever an indexing mechanism exists, then when one relation is much larger than the other, nested iteration with indexes is preferable.

The strategies developed to evaluate one-variable expressions and computation of dyadic terms are also used to evaluate any arbitrary two-variable expressions. These strategies mainly differ in the manner they implement the access paths and methods, namely indexing mechanisms and the order in which the component expressions are evaluated.

[Blasgen and Eswaran, 1976], [Niebuhr and Smith, 1976], and [Yao, 1979] describe many different approaches and compare their conclusion that usually there is no a priori best algorithm for a general evaluation process and thus the query optimizer must rely on heuristics or alternatively choose the optimal plan through an expensive cost comparison of potentially many alternatives for each query.

3.2.3 Multi-variable Expressions

A query optimizer designed to process any arbitrary query should incorporate strategies for evaluating multi-variable expressions.

One of the popular techniques, known as parallel processing of query components evaluates an expression by avoiding repeated access to the same data. This is usually achieved by simultaneous or parallel evaluation of multiple-query components. Usually all monadic terms associated with a given variable are evaluated first and all dyadic terms in which the same variable participates are partially evaluated parallely [Palermo, 1972]. This also includes parallel evaluation of logical connectors such as AND existing among the various terms, that always minimizes the size of intermediate query results [Jarke and Schmidt, 1982]. A strategy where aggregate functions and complex sub-queries are computed concurrently, has been proposed by Klug [Klug, 1982]. Parallel processing of query components requires appropriate scheduling strategies for the overall efficient optimization. A few interesting techniques for parallel scheduling for query processing have been proposed by Schmidt [Schmidt, 1979].

Another strategy that is useful for scheduling parallel optimization is the pipelining of operations that works on the partial output of preceding operations. This is described in [Smith and Chang, 1975, Yao 1979].

Another method known as the feedback technique combines the tasks of simultaneous evaluation of query components and pipelining, using partial results of a join operation [Clausen, 1980].

3.3 Access Plans

The combination of these building blocks into an efficient and optimal evaluation procedure for an arbitrary standardized, simplified, and ameliorated query expression is discussed below.

The access plan selection strategies use the transformed and simplified query, the storage structures and access paths, and a simple mathematical cost model to generate an optimal access plan. This is usually achieved by the following procedure:

1. Find all possible logical access plans for evaluating the given query. A logical access plan is defined as a sequence of operations or of intermediate results leading from existing relations to the final result of a query.

2. Obtain details of the physical representation of data such as access paths, statistical information stored in the data-dictionary.

3. Using the mathematical cost model based on the access paths and processing costs, select the cheapest access plan.

3.3.1 Generation of Access Plans

The sequence of operations or intermediate results leading from the existing data structures to a query result is known as an access plan for a query. An ideal query optimizer generates an optimal access plan with a minimal optimization effort.

Two completely different methods were proposed in [Smith and Chang, 1975] and [Yao, 1979]. Smith and Change use a fixed set of query transformation rules. This approach usually results in a single access plan, which need not be optimal. Yao's approach generates all non-dominated access plans that are possible in a given scenario. Obviously generating every single possible access plan becomes prohibitively expensive for complex queries.

In addition to the above methods, there are other strategies that resort to methods that lie between heuristic selection and detailed generation of alternative access plans. This includes an SQL nested block method based on a hierarchical technique used in System R [Selinger et al., 1979]. In this strategy, in addition to generating and comparing the evaluation plans for each query block, the sequence in which how query blocks are evaluated is also computed. The disadvantage of this method is that as this approach requires a user-specified block structure, the initial query normalization step is pushed into the actual query processing phase [Kim, 1982], INGRES also uses a somewhat similar approach [Wong and Youseffi, 1976].

3.3.2 Cost Analysis of Access Plans

The query optimizer chooses a physical access plan for a given query either on the basis of a set of heuristic rules or on the basis of a mathematical cost model involving the underlying data structures and access methods. Merrett discussed a top-down approach for cost analysis based on the storage structures [Merrett, T. H., 1977]. Though the techniques presented in [Merrett, T. H., 1977] have been significantly improved on since 1977, it is worth mentioning that Merrett's ideas were seminal and initiated the work in the area of cost analysis in relational database systems. In particular, Merrett provides general principles that allow a database analyst either to quickly estimate the costs of a wide variety of alternatives, or to analyze fewer possibilities in depth. He also provides a number of techniques that can be used at any level of the iterative design process.

A brief review of some of the cost models and their use in the query optimization is given below.

Even though working storage requirements or CPU costs are considered to be important factors by some researchers, the number of secondary storage accesses or I/O operations is considered to be the most important one in designing a cost model for analyzing any physical access plan. The number of secondary storage accesses is mainly dependent on the size of the operands used in the query, the type of access structures used, and the size of the temporary buffers available in the main memory.

In the initial stages of the evaluation, the sizes of most of the operands are known, at least approximately. In addition, available index methods for the attribute values involves are also known. But in the later stages of the evaluation, most intermediate relations are the results of preceding operations, and consequently, the cost model estimates their size by using information about the original data structures and any other information available from the DBMS catalogue, such as the selectivity of the operations already performed on them. Unfortunately, there are no generally accepted well defined formulae for estimating the size of intermediate results.

It is possible to construct a simpler model, using only a few parameters by imposing restrictions on the assumptions about the data. The result size estimate analysis discussed in [Selinger et al., 1979] uses only information already available in the database catalogue. Obviously this requires many assumptions about data and queries to complete the cost model [Astrahan and Chamberlin, 1975]. On the other hand, the result size estimate analysis proposed by Yao assumes that detailed selectivity data is known [Yao, 1979], without stating how this detailed information is obtained.

In the late 1980s and early 1990s, researchers began to carefully describe and critically analyze all underlying assumptions about database characteristics. The objective was to formally generate valid parameters for the cost model that enable the optimizer to do the following:

1. estimate a result size for any relational operation, and

2. estimate parameter values for intermediate results based on the previous results and base relations.

Cost models based on such techniques relate the database state as run time as the result of a random process. This random process is assumed to generate relations from the Cartesian product of the attribute value domains, based on some probability distribution and by invoking certain other general principles such as the functional dependencies of the underlying database schema [Glenbe and Grady, 1982, Richard, 1981]. These assumptions enable one to derive parameters in order to compute the size of any intermediate result of more complex relational operations.

These assumptions have been critically involved in [Christodoulakis, 1981] and [Montgomery et al., 1983]. They have shown that these assumptions often lead to a bias against direct-access structures in selection plans. However, to date, there has been no practical and simple cost model or formula with more general assumptions, which does not require detailed catalogue information.

3.3.3 Selection of Access Plans

Selection of access plans is often based on either heuristic rules or cost estimates. Some approaches, proposed in the literature, combine heuristic reduction of access plan choices with enumerative cost minimization techniques. A number of experimental studies show that combinatorial analysis significantly improves the database performance.

Cost estimates can be used in the access plan selection process in two different ways. The first approach estimates the costs of each alternative access plan completely. The major advantage of this approach is that it is possible to make use of techniques like parallelism or feedback in such situations. But, obviously the overall optimization effort is high.

In the second method, a parallel incremental generation of the cost of strategies is proposed. The main advantage of this method is that it permits whole families of strategies with common parts to be evaluated concurrently. Hence this approach results in a significant reduction in the overall optimization effort. One good example using this approach is the heuristic method proposed by Rosenthal and Reiner [Rosenthal and Reiner, 1982], in which the "optimizer" keeps only the current-optimal way to generate each intermediate result, and discards any other sub-optimal methods.

A more popular method is the dynamic-programming query optimization strategy, This is actually an extension to the second method, in which, at each level, only the next operation to be performed is determined based on the optimal sub-plans. As in any dynamic programming technique, this method has actual information about all its operands, including intermediate result sizes that are used to update the estimates of the remaining steps. This dynamic approach has two obvious drawbacks. Firstly, it has a potential of getting stuck in local optima if no adequate look ahead is planned. Furthermore, in the more general setting, the method leads to prohibitively high optimization cost.

3.4 Query Optimization Using Join Re-Ordering

3.4.1 Query Graphs and Join Trees

Query graph is a commonly used pictorial representation of a relational query. In a query graph, nodes denote the relation names, and edges represent the respective predicates. If a predicate expression, p, references the attributes of two relations, say R.sub.i and R.sub.j, then an edge labeled p is said to exist between the nodes of these two relations. Denoting the edges of the query graph as {p.sub.1, . . . , p.sub.n } and nodes as {R.sub.1, . . . , R.sub.m }, the result of a query graph G=(V, E) is defined as a Cartesian product followed by relational selection: .sigma..sub.p1.LAMBDA.p2 . . . .LAMBDA.pn (R.sub.1 x . . . x R.sub.m).

Instead of using the straight definition of product followed by selection of the query tree stated above, simple Join trees are used to evaluate queries. A join tree is defined as an operator tree whose inner nodes are labeled by the relational join operators and whose leaves are labeled by base relations. The computation of the result of a join tree is always carried out in a computed bottom-up fashion. The nodes of the join trees are usually annotated to reflect the type of join-algorithm used. These generally include algorithms such nested loops, hash, sort merge and so on. It should be noted that some of the binary trees on the relations of the query are not actually join trees since they may involve the use of Cartesian products.

The query graphs and join trees are two different forms of representing a given query. A query graph merely represents a collection of relations and their corresponding predicate expressions. It does not specify any evaluation order for the expressions involved. On the other hand, a join tree specifies clearly the inputs to each relational operator, and the order in which how they should be evaluated.

FIG. 6 shows a query graph, and two possible operator trees that solve the query.

3.4.2 Join Tree Topologies

Join trees are classified into two major categories, namely linear join trees and bushy join trees. For every join operator in a join tree, if at least one of the two input relations is a base relation, then the join tree is called linear, otherwise it is called bushy. Using this classification, bushy search space is defined as a set of join trees when there is no restriction on the topology of the trees. Similarly a linear search space is defined as a set of join trees where the topology of the trees is restricted to only linear join trees. The linear search space is actually a subset of the bushy search space. Since join trees do not distinguish between right and left subtrees, they are unordered. Since each of the n-1 internal nodes of an unordered tree of n leaves can have two choices, a total of 2.sup.n-1 ordered trees can be generated from such an unordered tree.

The number of alternative join trees exponentially increases when considering the various implementation algorithms available for each join operator in the tree. Some of the commonly used join algorithms are hash-join, nested-loop and merge-scan. For example, for a query graph involving n relations where each join has x alternative implementation schemes available, then for each query tree which was previously unordered, one can generate a pool of x.sup.n-1 ordered trees.

3.4.3 Structured Query Graphs

When presented with an arbitrary query graph, finding the total number of valid join trees is a computationally intractable problem. On the other hand, when the topology of the query graph is "structured", it is relatively easy to find the exact number of join trees using standard combinatorial methods [Lanzelotte et al., 1993]. The number of unlabeled binary trees is shown in [Knuth, 1968]to be equal to the Catalan number. Three well known query graph topologies, string, completely connected and star query graphs are discussed below.

String Query: Both end nodes of a string query have degree 1 and all other n-2 nodes have degree 2 as shown in FIG. 7. String queries are frequently encountered

                             TABLE 2
        Number of join trees for structured query graphs.
                                          Completely
        Search Space      String Query     Connected     Star
        Linear join trees    2.sup.n-2          n!        (n-1)!
        Bushy join trees     ##EQU2##       ##EQU3##     (n-1)!


in the verification of foreign key relationships and in object-oriented database environments.

Completely Connected Query:

This is also known as a clique. These type of query graphs are not usually found in day to day operations of a DBMS. The main use of such a graph is as a test case for its large number of evaluation orders. It is clear that the query graph for a completely connected query of n relations, has n nodes of degree n-1. Since every node of this graph is connected to every other node, these graphs are also called complete or n-1 regular graphs. FIG. 8 depicts a clique on four nodes.

Star Query:

Online analytical processing (OLAP) applications often use star queries. In the query graph of a star query of n relations, there are n-1 nodes with degree 1 and one central node with degree n-1 as shown in FIG. 9. It is interesting to note that for a star query, all valid join trees involve join operators that have at least one operator as a base relation. Hence in a star query, the number of bushy search space is the same as the linear join tree search space.

For the three query graph topologies discussed above, namely string, completely connected and star the exact number of unordered linear and bushy join trees is listed in Table 2.

3.5 Join Tree Selection Methods

Optimal join tree selection is a search problem which depends on the three factors, namely the join tree search space, the coat model and the search strategy or search algorithm. The join tree search space contains all valid join trees, both bushy and linear. The cost model is usually a simple mathematical model that annotates a cost to each operator in the tree. The search algorithm is the actual procedure that finds the optimal join tree in the search space using the cost model. Most of the modern-day query optimizers use three different search strategies, namely (a) exhaustive search (b) probabilistic techniques (c) non-exhaustive deterministic algorithms.

3.5.1 Exhaustive Search

Exhaustive search is sometimes efficient when the join tree search spaces are not too large. For most of the relational queries this should be less than ten relations. There are two different exhaustive search techniques commonly in use. These are namely, dynamic programming and transformation-based search methods. Dynamic programming and transformation-based search are also known as the bottom-up and top-down search methods.

1. Dynamic Programming: Early database systems such as System R described in [Salinger et al., 1979] and Starburst described in [Hasan and Pirahesh, 1988, Ono and Lohman,1990] used dynamic programming to find optimal join trees or query evaluation plans. This strategy is very simple and easy to comprehend. The dynamic programming algorithm first considers the individual base relations in a bottom-up fashion and adds new relations until all pairs of relations are joined to generate the final result. A table for storing the intermediate result sizes is maintained so that the sequence of the next join operation can be determined based on the cheapest alternative currently known.

2. Transformation-based: The Volcano query optimizer and the Cascades described in [Graefe, 1995] query optimizer use the transformation based exhaustive search approach. This strategy works by applying various transformation rules to the initial query tree recursively until no new join trees are generated. These optimizers use an efficient storage structure for storing the information about the partially explored search space, in order to avoid generating the same query trees.

3.5.2 Probabilistic Algorithms

Finding a globally optimal join tree in a very large search space, specially in a bushy search space involving more than 10 relations, is usually a difficult problem. One approach is to use probabilistic algorithms to eliminate the less promising portions of the search space. These algorithms improve the search performance by avoiding the worst join trees during the search process, instead of looking for the best join tree. A number of different probabilistic algorithms such as Iterative Improvement (II) [Swami and Gupta, 1988], Simulated Annealing (SA) [Ioannidis and Wong, 1987]and their variations Toured Simulated Annealing (TSA) [Lanzelotte et al., 1993] and Two Phase Optimization (2PO) [Ioannidis and Kang, 1990] are currently in use for searching large query tree search space. All these algorithms are based on rewriting join trees based on the algebraic properties of the join operator such as commutativity and associativity in order to generate alternative query trees.

All of the above probabilistic algorithms look for an optimal join tree in the following manner. First, an arbitrary join tree is generated by the algorithm to represent the given query. Then, applying the logical transformation rules, new join trees are generated. For each new join tree, using a simple cost model, the algorithm computes an estimated cost. The new join tree is retained as the current solution, if the cost is found to be below a specified value. The transformation rules are applied again on the current solution to generate new join trees in a recursive manner. This recursive process is carried out until a stopping condition, such as a time limit or level of cost improvement, is met.

The algorithm TSA finds the optimal join tree by performing simulated annealing repeatedly. Each time the algorithm picks a new query tree as its starting point. The 2PO algorithm combines the features of the II algorithm and the SA algorithm. It first performs II for a given period of time, then using the output from the II phase as its starting query tree, it performs SA until the final globally optimal query tree is obtained.

3.5.3 Non-exhaustive Deterministic Algorithm

An efficient non-exhaustive deterministic algorithm using a block-wise nested-loop join algorithm was first proposed in [Ibaraki and Kameda, 1984]. This algorithm generates optimal linear join trees, especially for acyclic queries. A variant of this algorithm was described in [Boral and Zaniolo, 1986], in which the search strategy employs an arbitrary join algorithm based on the available access paths instead of a nested-loop join algorithm.

3.6 Statistical Estimations in Database Systems

The quantitative properties or statistics that summarize the instances of a database are important for query optimization. In this section the major estimation techniques used to obtain the database statistics that have some commonalities with the present invention are reviewed.

Query optimization is an NP-hard problem due to the exponential growth of the number of alternative QEPs with the number of relations in the query. Due to this reason, most database systems are forced to make a number assumptions in order to quickly select an optimal QEP. Christodoulakis investigated the relationships of a number of frequently used assumptions and the accuracy of the query optimizer [Christodoulakis, 1983a, Montgomery et al., 1983]. In specific, he considered the following issues in his study:

1. Uniformity of attribute values: Every attribute value in the corresponding value domain has the same frequency.

2. Attribute independence: The data distributions of various attributes in a relation are independent of each other. The value of two attributes (say A and B) are independent if the conditional probability of an A value given a B value is equal to the probability of obtaining the A value by itself.

3. Uniformity of queries: Queries reference all attribute values in a given attribute value domain with equal probabilities.

4. Constant number of records per block: The number of tuples in each file block is the same. Hence the probability of referencing any block is 1/B, where B is the number of blocks.

5. Random placement: The probability of a record in a file appearing in a query is the same, regardless of its location among the different pages of a disk. That is, the probability of referencing any tuple is 1/N, where N is the number of tuples.

One can easily observe that assumptions 1 and 2 have a direct relationship to the estimates of the sizes of plan operations, thus determine their accuracy. Similarly assumption 3 impacts the size estimate of queries that are associated with a given parameter, as well as the physical database design problems. Assumptions 4 and 5 affect the estimation of logical block accesses for a given attribute value,

Though these statistical assumptions simplify the factors involved in the estimation of the cost of a QEP, they also decrease accuracy of these estimates. In order to increase the estimation accuracy, some modem query optimizers incorporate more detailed cost models for assumptions 1 and 2. Few optimizers incorporate assumptions 3, 4, and 5 in more detail, since these assumptions are more difficult to analyze.

Christodoulakis studied the effects of the above assumptions on a number of frequently encountered problems such as the estimation of the (1) number of block accesses for a given number of tuples, (2) number of block accesses for all queries on a given attribute, (3) number of block accesses for queries with multi-attributes, and (4) number of distinct attribute values resulting from a selection operation. Using these studies, he proved that the expected cost of a query estimated using these assumptions is an upper bound on the actual cost of the QEP. For example, he showed that the uniformity and independence assumptions lead to a worst-case estimate for the number of distinct attribute values. He further demonstrated that most commercial database systems using these assumptions are prone to use expensive query evaluation strategies. This indicates that non-uniformity, non-independence, and non random placement that usually exist in a real-world data distribution could be exploited in a query optimizer in order to reduce the system cost. He also argued that optimizers that use these often erroneous assumptions tend to ignore direct access structures because these cost estimates often favor the simpler structures. His investigation in this area was a catalyst in motivating further research in this area for better result size estimation techniques. He also discussed a number of mathematical techniques such as Schur concavity [Marshall and Olkin, 1979] in database performance evaluation,

The size estimation of select and join queries on a heavily skewed data distribution were thoroughly investigated by Montgomery et al [Montgomery et al., 1983]. In this study, they estimated the result sizes of select and join operations using uniformity assumptions and compared them with the actual result sizes. They also constructed a few models of size estimation using formulae that closely describe the skewed data distributions under study. They reported that the size estimates for the select operations based on the widely used uniformity assumption actually overestimate the actual result size by 200-300%. The size estimates for the join operations were reported to be 200-300% lower than the actual join sizes.

The local and distributed cost models based on the uniformity assumption that is used for single table sorting and two table joins in R* were experimentally verified and validated in [Mackert and Lohman, 1986a, Mackert and Lohman, 1986b]. They also further reported that the CPU costs are usually a significant portion of the overall, cost for sort operations. In addition, they also found that size estimation is a significant issue in the overall cost, including I/0 and CPU costs. Considering the nested loop join method, they reported that most query optimizers based on uniformity assumption overstate the evaluation cost, when an access structure is used to retrieve the attribute values and when the inner table fits in main memory. This indicates that the cost of evaluating the nested loop join is usually sensitive to parameters such as join cardinality, the outer table's cardinality, and the memory used to maintain the inner table.

The relationship of join selectivity and the selection of the optimal nesting order involving four and five variable queries were studied by Kumar and Stonebraker [Kumar and Stonebraker, 1987]. A query processor that behaves similarly to the System R optimizer was built. This query processor considered the nested loop and merge scan join algorithms, using only two-way joins. It also considered the availability of any secondary indexes on the inner join table. The sensitivity of a query with respect to changes in the joint selectivity was derived. This was obtained as the ratio between the cost of the optimal evaluation plan and the plan used by the query processor. They found that most optimal evaluation plan under varying selectivities was either the one that minimizes the average cost ratio or the one that minimizes the maximum cost ratio. Using the sensitivity factor as a parameter, they further measured the sensitivity of four and five variable queries under a number of join selectivities. Their experimental results showed that, assuming the query evaluation plan was selected using their criteria, the best query evaluation plan is usually insensitive to changing join selectivities.

The relationship of the assumption of random placement of tuples to pages and the dependence of multiple attribute values were investigated by Vander Zander et al. [Vander Zander et al., 1986]. A query of the form, "Select the tuples from relation R where R.A=constant" when R is clustered according to another attribute, say B, was used for this type of experiments. They found that whenever there is a high correlation between the attributes A and B, the assumption of random placement of tuples to pages is always violated. This was evident from the skewed pattern of tuple distributions to the pages.

Recently in [Ioannidis and Christodoulakis, 1991] proved that the worst case errors incurred by the uniformity assumption propagate exponentially as the number of joins in the query increase. As a result, except for very small queries, errors may become extremely high, resulting in inaccurate estimates for result sizes and hence for the execution costs.

Despite the overwhelming evidence that most of the common assumptions either overestimate or underestimate the true statistics by a wide margin, the query optimization literature abounds with a variety of estimation techniques, most of them discussed in the extensive survey by Mannino, Chu, and Sager [Mannino et al., 1988]. The various estimation techniques can be grouped into four broad classes as follows:

1. Probabilistic counting techniques

2. Sampling based techniques

3. Parametric techniques

4. Non-parametric techniques

3.6.2 Probabilistic Counting Techniques

The probabilistic counting techniques are useful for estimating the number of distinct attribute values in the result of projecting a relation over a given subset of attributes. Flajolet and Martin [Flajolet and Martin, 1985] proposed a technique for estimating the number of distinct values in a multi-set, which generates an estimate during a single pass through the data. This technique was extended by Shukla et al for estimating the size of multidimensional projections such as the cube operator [Shukla et al., 1996]. Their experiments have shown that these techniques based on multidimensional projections usually result in more accurate estimates than sampling based techniques. The applicability of these techniques to other operators has not yet been resolved.

3.6.2 Sampling based Techniques

Sampling based techniques are mainly used to compute result estimates during query optimization time. They compute their estimates by collecting and processing random samples of the data The major advantage of these techniques is that since they do not rely on any precomputed information about the data, they are not usually affected by database changes. Another advantage is that they do not require any storage overhead. These techniques also provide a probabilistic guarantee on the accuracy of the estimates. The major disadvantages of these techniques include the disk I/Os, CPU overhead during query optimization and the extra overhead for recomputing the same information since they are not preserved across queries. The sampling based techniques are highly desirable whenever a parameter needs to be estimated once with high accuracy in a dynamic environment. One good example is in the context of a query profiler. A number of different methods on the sampling based techniques have been extensively discussed in the Prior Art literature.

Traditionally the sampling based size estimation techniques use two distinct models, namely, point space model [Hou et al., 1991] and urn model [Lipton et al., 1990] for result size estimation. These size estimation techniques can be classified into random sampling and systematic sampling.

The idea behind the random sampling technique is that the tuples are chosen randomly from their set so that every tuple has an equal probability of being chosen. The random sampling techniques are either random sampling with replacement or random sampling without replacement. In random sampling with replacement, a tuple that is drawn randomly is put back into the data set and available for subsequent selections. But, in random sampling without replacement, any selected tuple is not returned to the original data set.

In systematic sampling technique, all the tuples in a sampled block or page are be used as sampled tuples. This obviously increases efficiency by reducing the number of I/O operations. But its accuracy is highly related to the uniformity of data across blocks. This means, whenever the tuples are clustered, sampled data could become highly biased resulting in increased estimation error [Hou et al., 1991].

Most of the sampling techniques used in modern-day database systems are based on random sampling with replacement. Some of the widely used techniques are discussed briefly in the following sections.

3.6.3 Adaptive Sampling

Adaptive sampling technique and its variants were proposed in [Lipton et al., 1990]. This technique expresses the stopping condition for sampling in terms of the sum of samples and the amount of sample taken. They provide an asymptotic efficiency of their algorithm and discuss the practicality of their method for estimating the result sizes of select and join operations. Their method provides a bound on the sample size necessary to obtain a desired level of estimation accuracy, providing sanity bounds to handle queries for which the underlying data distribution is skewed.

3.8.4 Double Sampling

In Double sampling technique [Hou et al., 1991], sampling is divided into two phases. In the first phase, x number of tuples are sampled. These samples are used to obtain the estimated mean and variance and to compute the number of samples y to be obtained in the second phase. The number of samples y is computed based on the desired level estimation accuracy, confidence level, and the estimated mean and variance obtained in the first phase. If y>x, then an additional y-x samples are obtained during the second phase of sampling. Otherwise, the number of samples, x, obtained in the first phase is sufficient to provide the desired level of estimation accuracy and confidence level.

3.6.5 Sequential Sampling

Sequential sampling is a random sampling technique for estimating the sizes of query results [Hass and Swami, 1992]. This is a sequential process in which the sampling is topped after a random number of steps based on a stopping criterion. The stopping criterion is usually determined based on the observations obtained from the random samples so far. For a sufficient number of samples, the estimation accuracy is predicted to lie within a certain bound. The method is asymptotically efficient and does not require any ad hoc pilot sampling or any assumptions about the underlying data distributions.

3.8.8 Parametric Techniques

Parametric techniques use a mathematical distribution to approximate the actual data distribution A few popular examples include the uniform distribution, multi-variate normal distribution or Zipf distributions. The parameters used to construct the mathematical distribution are usually obtained from the actual data distributions, and consequently the accuracy of this parametric approximation mainly depends on the similarity between the actual and parameterized distributions. Two major advantages of the parametric technique are (a) their relatively small overhead and (2) small run-time costs. But their downside is that the real-world data hardly conform to any simple mathematical formula. Consequently parametric approximations often result in estimation errors. Another disadvantage of this method is that since the parameters for the mathematical distribution are always precomputed, this technique results in further errors whenever there is a change in the actual data distribution.

An interesting variant of this approach is the algebraic technique, where the actual data distribution is approximated by a polynomial function. Regression techniques are usually used to obtain the coefficients of this polynomial. Another promising algebraic technique involves adaptively approximating the distribution by a six-degree polynomial, whose coefficients are obtained dynamically based on a query feedback mechanism [Gelenbe and Gardy, 1982]. The main disadvantages in using the above noted algebraic techniques include the difficulties in choosing the degree of the polynomial model and uniformly handling result-size estimates for operators other than simple selection predicates, Some of the positive results obtained in the work of Wei Sun et al [Sun et al., 1993]on algebraic techniques indicate that they have a potential of generating closely matching parametric distributions.

The uniform distribution is the simplest and applies to all types of variables, from categorical to numerical and from discrete to continuous. For categorical or discrete variables, the uniform distribution provides equal probability for all distinct categories or values. In the absence of any knowledge of the probability distribution of a variable, the uniform distribution is a conservative, minimax assumption. Early query optimizers such as System R relied on the uniformity and independence assumptions. This was primarily due to the small computational overhead and the ease of obtaining the parameters such as maximum and minimum values.

The use of the uniform distribution assumption has been criticized, however, because many attributes have few occurrences with extreme values. For example, few companies have very large sales, and few employees are very old or very young. The Zipf distribution has been suggested by Fedorowicz [Fedorowicz, 1984, Fedorowicz, 1987] and Samson and Bendell [Samson and Bendell, 1983] for attributes with a skewed distribution such as the occurrence of words in a text.

Christodoulakis [Christodoulakis, 1983b] demonstrated that many attributes have unimodal distributions that can be approximated by a family of distributions. He proposed a model based on a family of probability density functions, which includes the Pearson types 2 and 7, and the normal distributions. The parameters of the models such as mean, standard variation, and other moments are estimated in one pass and dynamically updated. Christodoulakis demonstrated the superiority of this model over the uniform distribution approach using a set of queries against a population of Canadian engineers.

Faloutsos [Faloutsos et al., 1996] showed that using multi-fractal distribution and 80-20 law, one better approximates real-world data than with the uniform and Zipf distributions. His method also provides estimates for supersets of a relation, which the uniformity assumption based schemes cannot provide.

3.6.7 Non-parametric or Histogram-based Techniques

In order to avoid the restrictions of particular parametric methods, many researchers have proposed non-parametric methods for estimating a distribution. The oldest and most common of these methods is the histogram. These techniques approximate the underlying data distribution using precomputed histograms. They are probably the most common techniques used in the commercial database systems such as DB2, Informix, Ingres, Microsoft SQL Server, Oracle, Sybase, and Teradata--see Table 3. The essence of the histogram is to divide the range of values of a variable into intervals, or buckets and, by exhaustive scanning or sampling, tabulate frequency counts of the number of observations falling into each bucket. The frequency counts and the bucket boundaries are stored as a distribution table. The distribution tables are used to obtain upper and lower selectivity estimates. Within those bounds, a more precise estimate is then computed by interpolation or other simple techniques. Since the histograms are precomputed, they often incur errors in estimation if the database is updated and hence require regular re-computation. It has been shown

                             TABLE 3
               Histograms used in commercial DBMSs.
    Vendor      Product                Histogram Type
    IBM         DB2-6000 (Client-Server) Compressed(V, F) Type
    IBM         DB2-MVS                Equidepth, Subclass of End-
                                       Biased(F, F)
    Oracle      Oracle7                Equidepth
    Sybase      System 11              Equidepth
    Tandem      NonStop SQL/MP         Equidepth
    NCR         Teradata               Equidepth
    Informix    Online Data Server     Equidepth
    Microsoft   SQL Server             Equidepth


that significant effort is required to identify the correct form of histograms that incur small errors for all estimation problems.

Most of the database research on histograms is in the context of simple elementary relational operations such as selections. One of the first works on histograms was by Piatetsky-Shapiro and Connel where they analyzed the effect of histograms on minimizing the error for selection queries [Piatetsky-Shapiro and Connell, 1984]. These histograms involved two distinct classes, namely, equi-width histograms and equi-depth histograms [Kooi, 1980]. One of their major results is that the worst-case and average case errors in both the equality and range selection queries are usually smaller in the equi-depth histograms than the equi-width histograms.

Extending the work of Kooi, [Muralikrishna and Dewitt, 1988] investigated the use of multi-dimensional equi-depth histograms for result size estimations. Multidimensional attributes are very common in geographical, image and design databases. A typical query with a multidimensional attribute is to find all the objects that overlap a given grid area. By building histograms on multiple attributes together, their techniques were able to capture dependencies between those attributes. They proposed an algorithm to construct equal-height histograms for multidimensional attributes, a storage structure, and two estimation techniques. Their estimation methods are simpler than the single-dimension version because they assume that multidimensional attributes will not have duplicates.

Many other related works have been done using variable-width histograms for estimating the result size of selection queries, where the buckets are chosen based on various criteria [Kamel and King, 1985, Kooi, 1980, Muthuswamy and Kerschberg, 1985]. A detailed description of the use of histograms inside a query optimizer can be found in Kooi's thesis [Kooi, 1980]. It also deals with the concept of variable-width histograms for query optimization. The survey by Manning, Chu, and Sager [Mannino et al., 1988] provides a list of good references to work in the area of statistics on choosing the appropriate number of buckets in a histogram for sufficient error reduction. Again this work also mainly deals with the context of selection operations.

Merrett and Otoo [Merrett, T. H., and Otoo, E., 1979] showed how to model a distribution of tuples in a multidimensional space. They derived distributions for the relations that result from applying the relational algebraic operations.

In their work, a relation was modeled as a multidimensional structure where the dimension was equal to the number of attributes in the relation. They divided each dimension, i, into c.sub.i number of equal width sectors. The values for c.sub.i, 1.ltoreq.i.ltoreq.n, are chosen such that the resulting structure can completely fit into the available memory, where n is the dimension of the multidimensional array. Scanning through the relation for each attribute, the cells of the multidimensional structure are filled with integer numbers corresponding to the number of tuples that occupy the respective value ranges of the attributes.

Denoting the value of the count for a cell (a, b) of relation R(A, B) as d.sup.AB.sub.ab (R), and assuming uniformity of tuples within each cell, they showed that the expected number of tuples within a cell from a projection is, ##EQU4##

Using the multidimensional representation, they show that the number of tuples resulting from joining a sector a.sub.1 of R and a.sub.2 of S is equal to, ##EQU5##

where n.sub.1 is the number of values in sector a.sub.1, n.sub.2 is the number of values in sector a.sub.2, n is the overlap between a.sub.1 and a.sub.2, and b and c are the remaining attributes in the respective relations.

The above value for the join is valid only when the two cells from the relations have overlapping sectors on the joining attribute. The above join estimation depends on the number of overlapping attribute values, n. Since there is no way to compute the value of n, they resort to estimate it based on how the two sectors overlap along the joining attributes. They showed that there are 11 possible ways in which the sectors can overlap. By defining a rule to approximately guess the value for n for each of the 11 possible ways of overlap, they show how to estimate the result size of join from two cells.

Estimating the result sizes of selection operations using the equi-width histogram is very simple. The histogram bucket where the attribute value corresponding to the search key lies is selected. Then the result size of the selection operation is simply the average number of tuples in that bucket. But the main problem with this simple strategy is that the maximum estimation error is related to the height of the tallest bucket. Consequently a data distribution with widely varying frequency values will result in very high estimation error rate. Another problem associated with the equi-width histogram is that there are no proven statistical method to choose the boundaries of a bucket as this would require some prior knowledge about the underlying data distribution. This invariably has an effect on the estimation accuracy of the selection operations.

Fortunately there are some statistical techniques that can be used to select the bucket widths or the number of buckets based on the number of tuples in the attribute value domain being mapped. During early days of statistical analysis of data using histograms, a popular rule known as Sturge's rule was used to obtain the number of buckets. The Sturge's rule, based on the binomial distribution, states that the number of buckets for n tuples is equal to 1+log.sub.2 n, which is a reasonably accurate value under most circumstances. The modern statistical techniques provide more optimal rules for any given attribute value domain and error rates [Tapia and Thompson, 1978]. It is important to note that the simple Sturges' rule usually yields estimation results which are closer to the modern statistical techniques for up to 500 tuples. The modern statistical literature also provides a number of parameters such as the mean error rate and mean-squared error rate, which are very useful in computing the size of the error within a multiple of a power of the given attribute value domain without any further knowledge of the data