Database system with methods for performing cost-based estimates using spline histograms6012054Abstract Database system and methods are described for improving execution speed of database queries (e.g., for decision support) by provides methods employing spline histograms for improving the determination of selectivity estimates. The general approach improves histogram-based cost estimates as follows. The constant associated with a predicate (e.g., in r.a>5, the constant is "5") is used to do a binary search in an array of histogram boundary values, for determining a particular histogram cell. Once a cell has been found, the system employs interpolation to find out how much of the cell has been selected. Once this interpolation value is found, it is used with a cell weighting and a spline value or weighting to estimate the selectivity of the predicate value, which takes into account how data values are distributed within the cell. As a result of increased accuracy of estimates, the system can formulate better query plans and, thus, provides better performance. Claims What is claimed is: Description COPYRIGHT NOTICE
______________________________________
Step no weight boundary value
Linear Spline
______________________________________
1 0.0 10 0.0
2 0.20 1000 0.05
3 0.80 2000 -0.10
______________________________________
The above spline histogram list has components for two cells in which the first one has values between 10 and 1000 which select 20% of the relation and has a positive spline of 5%. The second cell selects 80% of the relation and has a negative spline of -10%. The histogram can be represented in a database by storing values for characterizing each histogram cell. In an exemplary embodiment employing Sybase.RTM. SQL Server.TM., for instance, the information can be stored as three records in the system statistics (SYSSTATISTICS.) table. If desired, the linear spline components can be stored as a separate record. 2. Detailed methodology Spline-based histograms are useful for modeling data distributions which are typically not uniform within a cell. In cases in which data is totally uniform within a cell, only two step values are required for the distribution, since an interpolation algorithm may be employed to obtain an accurate estimate based on the assumption of uniform distribution. In cases in which the distribution is not uniform, however, there typically is a gradual increase or decrease in the "spacing" of data within the column. In such instances, spline histograms may reduce the "number of steps required," for instance, by an order of magnitude (e.g., from 100 or 200 to 20). As previously described, a particular disadvantage of increasing the number of cells to reduce estimation errors is the concomitant dramatic increase in system resources--the amount of disk space, memory resources, and processing--which are required when spline histograms are not employed. Suppose, for example, that data for a particular step 600 is distributed as shown in FIG. 6. Clearly in such a case the assumption of uniform distribution is not accurate: any "<" estimate would probably be too high and any ">" estimate would probably be too low. Here, it would likely be more accurate to estimate this cell with a "positive spline" in which more values are near the end of the cell rather than the beginning. An extremely accurate estimate of a data distribution with fewer steps than a normal histogram can be obtained by applying the methodology of the present invention. Suppose, as another example, that data is distributed within a step 700 as shown in FIG. 7. The figure represents a linear spline for a step with boundary values 5.0 to 10.0. If a normal histogram were employed (i.e., uniform assumption), the sum of the areas of A+B+C+D would be represented by the float between the boundaries of 5.0, 10.0, with the result that the selectivity of the search argument r.a<7.0 would be overestimated. The spline component of the step 700 is represented by the sum of areas of A+B; the uniform component of the above diagram is represented by the sum of areas of C+D. Thus, the absolute value of the spline component added to the uniform component equals the normal histogram component: ABS (spline component)+uniform component=normal histogram component Suppose that the sum of the areas of A, B, C, and D are 0.20 (i.e., A+B+C+D=0.20), which is the float stored in a normal histogram. Suppose also that the sum of the areas of A and B are 0.05 (A+B=0.05), which is the float stored in a spline array. Using interpolation, the estimate of "r.a<7" for this cell is (7-5)/(10-5)=0.40. If the spline component were not available, then the estimate for the selectivity of the range search argument in this cell would be 0.08 (calculated as 0.40*0.20=0.08). If the spline is considered then the estimate becomes 0.40 (calculated from (0.40**2)*0.05+(0.20-0.05)*0.40). From calculus, one can determine or integrate the area as a triangle proportional to the square of the x-axis coordinate. In this case, the selectivity of 0.40 is squared to get "area of triangle A." Since everything is proportional, the total area of "A+B=0.05" can be multiplied to get the selectivity of triangle A. The second term is derived by removing the spline component from the total selectivity of the cell, that is (0.20-0.05), to get the uniform component; a uniform distribution is assumed on this remaining component, that is (0.20-0.05)* 0.40. In order to avoid an extra multiplication step, the computation can be factored to be 0.40*(0.40*0.05+0.20-0.05)=0.068. The selectivity within a step is, therefore, computationally very simple. Intuitively, if the spline component is large (A+B), then the uniform component (C+D) is small and, thus, would represent a steeper slope in the graph. If the spline component were close to zero, the behavior would be that of a uniform distribution. A spline component which is negative can be represented by a negative sloped line. Note that the histogram cell (i.e., sum of the areas A+B+C+D) always includes the spline component so that the uniform component of a cell (i.e., sum of the areas C+D) is calculated by removing the spline component ((A+B+C+D)-(A+B)). This simplifies the computation by allowing a trivial computation for the total range selectivity: sum all the steps which fall below the selected step in which "r.a<7" fell. Here, only one spline value is used in the selectivity computation per sarg, and this is only used if the sarg constant falls between steps. 3. Source code implementation a. Overview Methods are provided for determining the selectivity estimate for a histogram. The general approach improves histogram-based cost estimates as follows. The constant associated with a predicate (e.g., in r.a>5, the constant is "5") is used to do a binary search in the array of histogram boundary values, for determining a particular cell. Once a cell has been found, the system employs interpolation to find out how much of the cell has been selected. Once this interpolation value is found, it is used with a cell weighting and a spline value or weighting--which takes into account how data values are distributed within the cell--to estimate the selectivity of the predicate value. b. Data Structures At the outset, it is helpful to review core data structures employed when implementing the methodology of the present invention. A first data structure, statistics.sub.-- rp, is used to read statistics from and write statistics to disk. It may be defined as follows (using the C programming language).
__________________________________________________________________________
/* structure associated with SYSSTATISTICS tuple used to read/write
statistics from disk */
typedef struct statistics.sub.-- rp
/* row locked table format */
uint16 st.sub.-- crno; /* row number */
uint16 st.sub.-- statbyte; /* status field */
uint16 st.sub.-- eigthyone; /* 81 variable length fields */
B1MBDEF(SLID, b1senscol)
/* B1 only: sensitivity label */
B1MBDEF(SLID, b1infocol)
/* B1 only: information label */
statid.sub.-- t
st.sub.-- statid;
/* future expansion */
objid.sub.-- t
st.sub.-- tabid;
/* statistics on this table id */
sequence.sub.-- t
st.sub.-- seqno;
/* sequence number for record */
DATE st.sub.-- moddate;
/* date of last modification */
formatid.sub.-- t
st.sub.-- formatid;
/* formatid of stats record */
BYTE st.sub.-- usedcount;
/* number of valid varbinary
** entries in tuple
*/
/* beginning of VARBINARY data */
int16 st.sub.-- len;
/* length of row */
BYTE st.sub.-- colidarray[MAXKEY];
/* array of colids associated
** with stats record
*/
union stf.sub.-- u
{
/* - this structure defines the various formats which
** can be found in a SYSSTATISTICS row
** - fmtrow will understand how to decode the FMT10.sub.-- COLSTAT
** FMT20.sub.-- SYSTABSTATS and FMT30.sub.-- PARTITIONS
** - FMT11.sub.-- BOUNDARY and FMT12.sub.-- WEIGHTS will use the
** ptrfmtrow( ) API to fmtrow
*/
double st.sub.-- align;
/* align st.sub.-- histsteps */
FMT10.sub.-- COLSTAT
st.sub.-- colstat;
/* column statistics descriptor
*/
FMT11.sub.-- BOUNDARY
st.sub.-- histsteps;
/* histogram boundary values */
FMT12.sub.-- WEIGHTS
st.sub.-- weights;
/* histogram weights */
FMT12.sub.-- WEIGHTS
st.sub.-- splines;
/* spline values */
FMT20.sub.-- SYSTABSTATS
st.sub.-- systabstats;
/* systabstats snapshot */
FMT30.sub.-- PARTITIONS
st.sub.-- partitions;
/* partition information */
}stf;
}STATISTICS.sub.-- RP;
__________________________________________________________________________
where FMT12.sub.-- WEIGHTS is defined as follows.
______________________________________
/* FORMATID 12 - array of weights used for uniform weight component
and any spline component*/
typedef struct fmt12.sub.-- weights
weight.sub.-- t
st.sub.-- weightarray[MAXSTATVARBINARY];
}FMT12.sub.-- WEIGHTS;
______________________________________
Of interest to the present invention is an array of weights associated with the ordinary histogram and an array of weights which are the spline weights. Use of these values is illustrated later. Once the foregoing information (including the spline weights) is read from disk, it is stored in in-memory structures used for manipulation. In particular, these structures are optimized for run-time, in-memory use. A "virtual array", which is employed for storing weights/splines, is defined as follows.
______________________________________
/* VIRTUAL array - organized as a linked list of array fragments which
** are accessed via a virtual array API, used for weights/splines to
** create a virtual array of values
*/
typedef struct st.sub.-- virtual
struct st.sub.-- virtual *st.sub.-- nextvirtual; /* if list of st.sub.--
values is
** too large to fit into a page
** then go to next page
*/
BYTE *st.sub.-- values; /* this will be coerced to an
** array of values, or an array
** of pointers
*/
}ST.sub.-- VIRTUAL;
______________________________________
A description of the histogram itself, histogram statistics, may be defined as the following data structure.
______________________________________
/* HISTOGRAM STATISTICS
*/
typedef struct st.sub.-- histogram
stepct.sub.-- t
st.sub.-- requestedstepct; /* number of weights
requested */
stepct.sub.-- t
st.sub.-- actualstepct; /* number of weights */
int32 st.sub.-- status; /* set of booleans */
#define STHT.sub.-- PTR 0x00000001L
/* is histogram datatype a PTR */
ST.sub.-- STEPS
st.sub.-- steps; /* ptr to list of step structs */
ST.sub.-- VIRTUAL
*st.sub.-- weights;
/* ptr to list of weights
** associated with st.sub.-- steps
*/
ST.sub.-- VIRTUAL
*st.sub.-- splines;
/* if non-NULL then a ptr
** to the spline component associated with
** the st.sub.-- weightp
*/
}ST.sub.-- HISTOGRAM;
______________________________________
The "requested step count," st.sub.-- requestedstepct, stores the number of steps (and therefore weights) that the user requested for creating the histogram. The "actual step count," st.sub.-- actualstepct, stores the actual number used. In the currently-preferred embodiment, the system has a default value of 20, which the user can override on a per table basis. The next data member, st.sub.-- steps, is a pointer to a list of step structures; this is effectively a virtual array of boundary values. The st.sub.-- weights data member provides a virtual array of weights (for the ordinary histogram case). Finally, st.sub.-- splines provides a virtual array of spline weights. With an understanding of these data structures, the internal method steps employed by the currently-preferred embodiment may now be examined. c. Internal Methods To determine the percentage of a step selected by a qualification (e.g., query predicate), the system employs an "interpolate" method, STU.sub.-- INTERPOLATE. The method may be constructed as follows.
__________________________________________________________________________
/*
** STU.sub.-- INTERPOLATE
**
** Purpose:
** Determine the percentage of a step selected by a qualification
** Assumption is that lowerp <= valuep <= upperp. The fraction
determined
** is logically (valuep - lowerp) / ( upperp - lowerp)
**
** Parameters:
** datatypep
ptr to datatype structure of values
** lowerlen
length of lower bound value
** lowerp
ptr to lower bound value
** valuelen
length of value to interpolate
** valuep
ptr to value to interpolate
** upperlen
length of upper bound value
** upperp
ptr to upper bound value
** charstatp
ptr to structure of statistical anaiysis
** of character positions
**
** Returns
** 0.0 <= value <= 1.0 which represents the percentage
** selected by a qualification
**
*/
percent.sub.-- t
stu.sub.-- interpolate(
ST.sub.-- DATAVALUE
*lb.sub.-- dvp,
ST.sub.-- DATAVALUE
*mid.sub.-- dvp,
ST.sub.-- DATAVALUE
*ub.sub.-- dvp,
ST.sub.-- CHARSTAT
*charstatp,
SYB.sub.-- BOOLEAN
*estimatep)
length.sub.-- t
lowerlen;
BYTE *lowerp;
length.sub.-- t
valuelen;
BYTE *valuep;
length.sub.-- t
upperlen;
BYTE *upperp;
percent.sub.-- t
width;
/* estimate of width of cell */
percent.sub.-- t
fraction;
/* fraction of cell selected */
double ldouble;
double vdouble;
double udouble;
datatype.sub.-- t
left.sub.-- dt;
datatype.sub.-- t
right.sub.-- dt;
*estimatep = TRUE;
lowerlen = lb.sub.-- dvp->st.sub.-- vallen;
lowerp = lb.sub.-- dvp->st.sub.-- valuep;
valuelen = mid.sub.-- dvp->st.sub.-- vallen;
valuep = mid.sub.-- dvp->st.sub.-- valuep;
upperlen = ub.sub.-- dvp->st.sub.-- vallen;
upperp = ub.sub.-- dvp->st.sub.-- valuep;
SYB.sub.-- ASSERT(lowerlen && upperlen);
if (!valuelen)
{
fraction = 0.0;
}
else
{
/* assert that datatype are compatible */
SYB.sub.-- ASSERT(
((left.sub.-- dt = stu.sub.-- n.sub.-- datatype(&lb.sub.--
dvp->st.sub.-- dt)) ==
(right.sub.-- dt = stu.sub.-- n.sub.-- datatype(&mid.sub.--
dvp->st.sub.-- dt)) .vertline..vertline.
(left.sub.-- dt == NUME && right.sub.-- dt == DECML)
.vertline..vertline.
(left.sub.-- dt == DECML && right.sub.-- dt == NUME)
)
&&
((left.sub.-- dt. = stu.sub.-- n.sub.-- datatype(&lb.sub.--
dvp->st.sub.-- dt)) ==
(right.sub.-- dt = stu.sub.-- n.sub.-- datatype(&ub.sub.--
dvp->st.sub.-- dt)) .vertline..vertline.
(left.sub.-- dt == NUME && right.sub.-- dt == DECML)
.vertline..vertline.
(left.sub.-- dt == DECML && right.sub.-- dt == NUME)
)
);
SYB ASSERT((ISCSARTYPE(lb.sub.-- dvp->st.sub.-- dt.st.sub.-- datatype))
.vertline..vertline.
(lb.sub.-- dvp->st.sub.-- dt.st.sub.-- datatype == DT.sub.-- VARBINARY)
.vertline..vertline.
(lb.sub.-- dvp->st.sub.-- dt.st.datatype == DT.sub.-- BINARY)
.vertline..vertline.
( (lb.sub.-- dvp->st.sub.-- dt.st.sub.-- length ==
mid.sub.-- dvp->st.sub.-- dt.st.sub.-- length)
&&
(lb.sub.-- dvp->st.sub.-- dt.st.sub.-- length
ub.sub.-- dvp->st.sub.-- dt.st.sub.-- length ))
);
switch (stu.sub.-- n.sub.-- datatype(&lb.sub.-- dvp->st.dt))
{
case DT.sub.-- INT4:
{
width =
(double) (*((int32. *)upperp)) -
(double) (*((int32 *)lowerp));
fraction =
((width <= 0.0) ? 1.0 :
((double) (*((int32 *)valuep)) -
(double) (*((int32 *)lowerp)))
/ width);
break;
}
case DT.sub.-- INT2:
{
width =
(int) (*((int16 *)upperp)) -
(int) (*((int16 *)lowerp));
fraction =
((width <= 0.0) ? 1.0 :
((int) (*((int16 *)valuep)) -
(int) (*((int16 *)lowerp)))
/ width);
break;
}
case DT.sub.-- INT1:
{
width =
(int) (*((unsigned char *)upperp)) -
(int) (*((unsigned char *)lowerp));
fraction =
((width <= 0.0) ? 1.0 :
((int) (*((unsigned char *)valuep)) -
(int) (*((unsigned char *)lowerp)))
/ width);
break;
}
case DT.sub.-- FLT4:
{
fraction = stu.sub.-- doubleinter(*(float *)lowerp,
*(float *)valuep,
*(float *)upperp);
break;
}
case DT.sub.-- FLT8:
{
fraction = stu.sub.-- doubleinter(*(double *).lowerp,
*(double *)valuep,
*(double *)upperp);
break;
}
case DT.sub.-- CHAR:
case DT.sub.-- VARCHAR:
{
fraction = stu.sub.-- bincharinter(' ', charstatp,
lowerlen; lowerp,
valuelen, valuep,
upperlen, upperp);
break;
}
case DT.sub.-- BINARY:
case DT.sub.-- VARBINARY:
{
fraction = stu.sub.-- bincharinter((char)0,
charstatp,
lowerlen, lowerp,
valuelen, valuep,
upperlen, upperp);
break;
}
case DT.sub.-- SHORTMONEY:
{
(void) stu.sub.-- mny4toflt8(lowerp, &ldouble);
(void) stu.sub.-- mny4toflt8(valuep, &vdouble);
(void) stu.sub.-- mny4tofin8(upperp, &udouble);
fraction = stu doubleinter(ldouble,
vdouble, udouble);
break;
}
case DT.sub.-- MONEY:
{
(void) com.sub.-- mnytoflt8(lowerp, lowerlen, (BYTE
*) &ldouble,
sizeof(double), 0);
(void) com.sub.-- mnytoflt8(valuep, valuelen, (BYTE
*) &vdouble,
sizeof(double), 0);
(void) com.sub.-- .sub.-- mnytoflt8(upperp, upperlen, (BYTE
*) &udouble,
sizeof(double), 0);
fraction = stu.sub.-- doubleinter(ldouble,
vdouble, udouble);
break;
}
case DT.sub.-- DATETIME:
{
fraction = stu.sub.-- dateinter((CS.sub.-- DATETIME *) lowerp,
(CS.sub.-- DATETIME *)valuep,
(CS DATETIME *)upperp);
break,
}
case DT.sub.-- SHORTDATE:
{
fraction = stu.sub.-- shortdateinter(
(CS.sub.-- DATETIME4 *)lowerp,
(CS.sub.-- DATETIME4 *) valuep,
(CS.sub.-- DATETIME4 *) upperp);
}
case DECML:
case NUME:
{
*estimatep = stu.sub.-- decimalinter(
lb.sub.-- dvp, mid.sub.-- dvp, ub.sub.-- dvp, &fraction);
break;
}
default:
{
/* return 50% if datatype not supported */
*estimatep = FALSE;
fraction = 0.5;
break;
}
}
}
SYB.sub.-- ASSERT((fraction >= 0.0) && (fraction <= 1.0));
if (fraction < 0.0)
{
fraction = 0.0;
}
else if (fraction > 1.0)
{
fraction = 1.0;
}
return (fraction);
}
__________________________________________________________________________
This method calculates the percentage of a cell selected. This method takes the step value which falls within the cell and then determines the percentage of the cell which has been selected. When the method is invoked, it is passed the lower bound of the cell, the upper bound of the cell, and the actual value (i.e., the predicate value). The parameters can reference any valid data type in the system, including an integer, float, character string, or the like. Based on the particular data type encountered, the method then switches to a particular handler (case arm) for appropriate processing of the data type. In the instance of a data type which is a 4-byte integer (DT.sub.-- INT4), for instance, the method calculates the width of the cell by subtracting the lower bound value from that of the upper bound value. A fraction can then be calculated by the predicate value divided by the width. The fraction indicates that portion of the cell which is selected (between 0.0 and 1.0). The other case arms or branches function in a similar manner to calculate a fraction for the predicate value which is between 0.0 and 1.0. After ensuring that the calculated fraction is between the acceptable range of 0.0 and 1.0, the method returns the fraction value. Given a histogram with boundary values and given a predicate value, the system employs a "histogram hit type" method, ST.sub.-- HIST.sub.-- HITTYPE, for determining which cell the predicate value falls within. The method may be constructed as follows.
__________________________________________________________________________
/*
** ST.sub.-- HIST.sub.-- HITTYPE
**
** Calculate the step the sarg is associated with, or other
relationship
** with histogram if no step can be selected. Use binary search and
return
** HIT.sub.-- INTERPOLATE if interpolation should be subseguently used,
along
with
** spline histogram estimation. Spline histograms useful when a range
sarg
** hits a cell with interpolation.
**
** Parameters:
** histp - ptr to descriptor of histogram for a column
** optype - operator type =, !=, >, >=, <, <=, IS NULL
** and IS NOT NULL
** stepnop - step upon which the sarg constant hit
** dvp - ptr to structure describing the data value
**
** Returns:
** hit type - the way in which the sarg fell into the histogram
** steps
**
** Side Effects:
** None
**
*/
SYB.sub.-- STATIC hittype.sub.-- t
st.sub.-- hist hittype(
ST.sub.-- HISTOCRAM
*histp,
relop.sub.-- t optype,
ST.sub.-- DATAVALUE
*dvp,
stepct.sub.-- t
*stepnop)
stepct.sub.-- t
maxstep;
/* number of steps in the histogram */
stepct.sub.-- t
lowstep;
stepct.sub.-- t
highstep;
stepct.sub.-- t
midstep;
stepct.sub.-- t
frequency.sub.-- step;
ST.sub.-- VIRTUAL
**virtualpp;
SYB.sub.-- BOOLEAN
onstep;
SYB.sub.-- BOOLEAN
isptr;
ST.sub.-- DATAVALUE
hist.sub.-- dt;
SYB.sub.-- BOOLEAN
compare.sub.-- result;
percent.sub.-- t
selectivity;
hittype.sub.-- t
hittype;
length.sub.-- t
cellwidth;
if (!dvp->st.sub.-- vallen)
{
return(HIT.sub.-- NULL);
}
maxstep = histp->st.sub.-- actualstepct - 1;
if (maxstep <0 1)
{
/* only one step implies all values are NULL in the histogram
** so no rows are selected, since the dvp value is not NULL
*/
return(HIT.sub.-- NULLHISTOGRAM);
{
isptr = histp->st.sub.-- status & STHT.sub.-- PTR != 0;
virtualpp = &histp->st.sub.-- steps.st.sub.-- boundary;
STRUCTASSIGN(histp->st.sub.-- steps.st.sub.-- dt, hist.sub.-- dt.st.sub.--
dt);
hist.sub.-- dt.st.sub.-- vallen = hist.sub.-- dt.st.sub.-- dt.st.sub.--
length;
lowstep = 0;
highstep = maxstep;
midstep = maxstep/2;
onstep = FALSE;
cellwidth = (histp->st.sub.-- status & STHT.sub.-- PTR) ?
sizeof(BYTE *) : hist.sub.-- dt.st.sub.-- dt.st.sub.-- length;
/* binary search to determine where constant falls in step array */
while (lowstep < highstep)
{
hist.sub.-- dt.st.sub.-- valuep = stu.sub.-- findslot((PROC.sub.-- HDR
*)NULL, virtualpp,
cellwidth, midstep, maxstep);
if (isptr)
{
/* ptr to a ptr
** EARL RESOLVE - assume character string for now
*/
hist.sub.-- dt.st valuep = * (BYTE **)hist.sub.-- dt.st.sub.--
valuep;
hist.sub.-- dt.st.sub.-- vallen = * (hist.sub.-- dt.st.sub.--
valuep++);
}
compare.sub.-- result = stu.sub.-- compare(&hist.sub.-- dt, dvp);
if (onstep)
{
/* a previous iteration found that the constant was
** on a step boundary so now the check is for a frequency
** count cell
*/
if (!compare result)
{
/* frequency count cell has been selected */
if (midstep == highstep)
{
/* frequency.sub.-- step needs to reference the
** higher step value of two equal step
** values
* /
frequency.sub.-- step = highstep;
}
*stepnop = frequency.sub.-- step;
return(HIT.sub.-- FREQUENCY);
}
if (!frequency.sub.-- step)
{
/* since the frequency step fell on the first
** boundary, and this first boundary is not a
** frequency count cell then by convention the
** constant is iess than all values in the
** histogram
*/
return (HIT.sub.-- UNDERELOW);
}
if (midstep > frequency.sub.-- step)
{
/* check to see if the lower step forms a
** a frequency count
*/
mid step = frequency.sub.-- step - 1;
continue;
}
/* check if the next lower step is part of a dense
** sequence of frequency steps, making this a
** frequency value otherwise...
** no adjacent steps form a frequency count so this
** is a case of a constant falling on the step
** boundary of a range cell
*/
*stepnop = frequency.sub.-- step;
return(stu.sub.-- dense.sub.-- frequency(&hist.sub.-- dt, dvp) ?
HIT.sub.-- FREQUENCY : HIT.sub.-- RANGEFREQUENCY);
}
if (!compare.sub.-- result)
{
if (midstep == maxstep)
{
/* cannot be a frequency cell since only can
** occur if lowstep was tested as less than
*stepnop = midstep;
return(HIT.sub.-- RANGEFREQUENCY);
}
/* constant value is on a step boundary so check for
** a frequency count cell
*/
onstep = TRUE;
frequency.sub.-- step = midstep.;
if (!midstep)
{
highstep = 1;
midstep = 1;
}
else
{
highstep = ++midstep;
}
continue;
}
else if (compare.sub.-- result > 0)
{
if (lowstep == midstep)
{
/* no change since last comparison */
SYB.sub.-- ASSERT(lowstep == (highstep - 1));
/* loop once more to check for frequency count
** or an overflow.sub.-- step
*/
midstep = highstep;
continue;
}
else if (highstep == midstep)
{
/* greater than all histgram values */
SYB.sub.-- ASSERT(highstep == maxstep);
return (HIT.sub.-- OVERFLOW);
}
else
{
lowstep = midstep;
}
}
else
{
if (lowstep == midstep)
{
/* constant is less than all steps in
** histogram
*/
SYB.sub.-- ASSERT (!lowstep);
return(HIT.sub.-- UNDERFLOW);
}
else if (highstep == midstep)
{
/* normal exit for constant which is
** lowstep < dvp < highstep
** thus interpolation is needed to obtain
** the fraction of the cell selected
*/
break;
}
else
{
highstep = midstep;
}
{
midstep = (highstep + lowstep)/2;
}
*stepnop = midstep;
return (HIT.sub.-- INTERPOLATE);
}
__________________________________________________________________________
The method functions by performing a binary search on all boundary values for determining which cell the predicate value falls within. If the predicate value falls directly on a cell boundary, the method selects the entire cell, thus eliminating the need to interpolate or to employ a spline histogram. Upon conclusion of the binary search, the method returns (by reference) the cell number which the predicate value fell within together with a "hit type"--that is, how the value "hit" the histogram. In the currently-preferred embodiment, hit types are defined as follows.
__________________________________________________________________________
/* define a set of cases in which the sarg is related to a histogram
** - a histogram is defined by an order set of boundary values, so that
in
** order to estimate the selectivity, the predicate constant is in a
** binary search of the histogram values, to see how it hits the
histogram
** cell
** - e.g. r.a = 5 is a predicate, in which "5" is the constant, so that
** if a histogram had a "range cell" with lower bound of "1" and upper
bound
** of "10" then the hittype.sub.-- t would be HIT.sub.-- INTERPOLATE
since the predicate
** constant is between two boundary values of a range cell
*/
typedef int32
hittype.sub.-- t;
#define HIT.sub.-- FREQUENCY
1 /* boundary value of histogram is
** equal to predicate constant and
** moreover, the cell represents a single
** high frequency domain value
*/
#define HIT.sub.-- OVERFLOW
2 /* predicate constant is higher than
** all values in the histogram
*/
#define HIT.sub.-- UNDERFLOW
3 /* predicate constant is lower than
** all values in the histogram
*/
#define HIT.sub.-- NULL 4
/* predicate is IS NULL or IS NOT NULL
*/
#define HIT.sub.-- NULLHISTOGRAM
5 /* histogram has no values except for
** NULL
*/
#define HIT.sub.-- RANGEFREQUENCY
6 /* predicate constant is equal to the
** upper bound value of a range cell
*/
#define HIT.sub.-- INTERPOLATE 7
/* predicate constant fall inbetween
** the upper and lower bounds of a range
** cell
*/
__________________________________________________________________________
For instance, "hit frequency" indicates a hit to a cell which is a single value. Suppose that a hit occurred on a cell having a value of just 17. Here, there is really no upper or lower bound, since the cell stores only a single value. Further, a spline would not be applicable to the cell, as the cell again only stores a single value, not a range. A hit type of "hit underflow" indicates that the predicate value is less than any cell of the histogram, that is, the value is outside the range of the histogram. Of particular interest to the present invention is the hit type of "hit interpolate." This indicates that the value hit the histogram such that the value fell between two boundary values and, thus, the system needs to perform an interpolation (i.e., invoke the interpolate method). When the system has returned from the binary search method, it has at that point obtained a cell number. From the previously-described virtual arrays, the system can lookup for the cell its associated values (i.e., weight and spline). These values are passed to a "spline estimate" method, st.sub.-- spline.sub.-- estimate, which calculates the selectivity of the cell. The method may be constructed as follows.
______________________________________
/*
** ST.sub.-- SPLINE.sub.-- ESTIMATE
**
** Given the interpolation fraction and the spline weight calculate
the
** selectivity of the cell including contribution of the spline
component.
**
** Parameters:
** total
total weight of cell
** spline
spline component of cell
** interpolate
percent of cell selected (between 0.0 and
1.0)
**
** Returns:
** selectivity of cell and spline component if it exists
**
** Side Effects:
** None
**
*/
SYB.sub.-- STATIC percent.sub.-- t
st.sub.-- spline.sub.-- estimate(
weight.sub.-- t
total, // total wt of the cell
weight.sub.-- t
spline, // the positive or neg. wt of the
spline
percent.sub.-- t
interpolate)
// estimate of cell selected
percent.sub.-- t
selectivity;
if (spline >= 0.0)
{
/* positive spline */
/* calculate uniform component */
selectivity = (total - spline) * interpolate;
/* add spline component */
selectivity += spline * interpolate * interpolate;
}
else
{
/* negative linear spline */
spline = -spline;
/* get absolute value */
/* calculate uniform component */
selectivity = (total - spline) * interpolate;
/* add spline component */
interpolate = 1.0 - interpolate;
selectivity += spline * (1.0 - interpolate * interpolate);
}
return (selectivity);
}
______________________________________
As shown, the method is invoked with three parameters: total, spline, and interpolate. The total parameter reflects the total weight of the cell. The spline parameter indicates the positive or negative weighting of the spline. The interpolate parameter is the estimate of the portion of the cell selected, which is provided by the interpolate method. The functionality of the method is divided according to whether the spline is positive or negative. Selectivity is calculated in the case of a positive spline by adding the uniform component to the spline component. First, the uniform component of selectivity is determined by subtracting the spline value or weighting from the total cell weighting and then multiplying that quantity by the interpolate fraction, as follows. selectivity=(total-spline)*interpolate; Now, the method adds to the selectivity value the spline component, which is calculated as the spline weighting multiplied by the interpolate fraction squared. selectivity+=spline*interpolate*interpolate; In a similar manner, the calculation for a negative spline also adds the uniform component to the spline component. However in that instance, the uniform component of selectivity is calculated by subtracting the spline weight from the total cell weight and multiplying that value by the interpolate fraction. selectivity=(total-spline)*interpolate; To add the spline component, the method first calculates a new interpolate fraction by subtracting the previously-calculated interpolate fraction from the value of 1 as follows. interpolate=1.0-interpolate; Now, the selectivity calculation adds the spline component, which is determined by multiplying the spline value by the quantity of 1 minus the interpolate fraction squared, as follows. selectivity+=spline*(1.0-interpolate*interpolate); While the invention is described in some detail with specific reference to a single-preferred embodiment and certain alternatives, there is no intent to limit the invention to that particular embodiment or those specific alternatives. Thus, the true scope of the present invention is not limited to any one of the foregoing exemplary embodiments but is instead defined by the appended claims.
|
Same subclass Same class Consider this |
||||||||||
