Systems and methods for data quality management5842202Abstract Systems and methods that model and measure the propagation of error within information systems. The invention provides data management systems that determine an error measure that represent the accuracy, or inaccuracy of a query result achieved for processing a structured data set. In one embodiment, the invention provides systems that have a model of error which exists within a structured data set. The system can further include an error propagation monitor that processes the error model and the structured data set to determine errors within the structured data set that will propagate to a query result generated by performing a query process on the structured data set. The propagated error represents the error that exists within the query result signal. Claims What is claimed is: Description FIELD OF THE INVENTION
TABLE 1
__________________________________________________________________________
Sample data from the California Data Co. table
Connected to:
ORACLE7 Server Release 7.0.15.4.0-Production
With the procedural and distributed options
PL/SQL Release 2.0.17.1.0-Production
SQL> select * from disc where comp.sub.-- name like `A%`;
COMP.sub.-- NAME NUM.sub.-- EMPLOYEES
TOTAL.sub.-- ASSETS
NET.sub.-- SALES
__________________________________________________________________________
A O SMITH CORP 10800 823099 1193870
A SCHULMAN INC 1582 407865 685112
ABBOTT LABORATORIES
49659 7688569 8407843
ACX TECHNOLOGIES INC
4200 653999 641852
ADOLPH COORS CO 6200 1350944 1946592
ADVANCED MICRO DEVICES INC
12060 1929231 1648280
AG PROCESSING INC
2128 465796 1218614
AGWAY INC 7900 1204764 1710577
AIR PRODUCTS & CHEMICALS INC
14075 4761500 3327700
ALBERTO CULVER CO.
8600 593046 1147990
. . .
__________________________________________________________________________
Table 5 shows the output from an SQL Plus program (of the type sold by the Oracle Company, Redwood Shores, Calif.) as it operates on the financial database. The parser compiler of the error propagation processor 22 creates an associated script resulting in an SQL Plus script, whose execution is shown as the SQL Plus program of Table 5.
TABLE 5
______________________________________
Sample of SQL compiler execution-the logical calculus implementation
1. Connected to:
2. ORACLE7 Server Release 7.0.15.4.0-Production
3. SQL>>
4. SQL> EPC-processing standard select-from-where query:
5. SQL>>
6. SQL> dropping 3 output error tables: inacc, incompl, mismem . . .
7. Table dropped.
8. Table dropped.
9. Table dropped.
10. SQL> The user query was:
11. select total.sub.-- assets, comp.sub.-- name from disc where
total.sub.-- assets > 100,000,000;
12. SQL> table = ›disc!
13. SQL> key.sub.-- attr = ›comp.sub.-- name!
14. SQL> attr.sub.-- list = ›total.sub.-- assets, comp.sub.-- name!
15. SQL> where.sub.-- clause = › total.sub.-- assets > `100000000`!
16. SQL> INACCURACY
17. SQL> compiling inaccuracy EPC.
18. SQL> result:
19. SQL>> CREATE table error.sub.-- table.sub.-- inacc as
20. SQL>> SELECT total.sub.-- assets, comp.sub.-- name
21. SQL>> FROM disc.sub.-- ea
22. SQL>> WHERE
23. SQL>> (total.sub.-- assets > `100000000`)
24. SQL>> AND
25. SQL>> (comp.sub.-- name IN
26. SQL>> ( SELECT comp.sub.-- name
27. SQL>> FROM disc
28. SQL>> WHERE ( total.sub.-- assets > `100000000`)
29. SQL>> )
30. SQL>> )
31. SQL>> AND
32. SQL>> (total.sub.-- assets, comp.sub.-- name) NOT IN
33. SQL>> ( SELECT total.sub.-- assets, comp.sub.-- name
34. SQL>> FROM disc)
35. SQL>>
36. SQL>> executing EPC to compute output inaccuracy
37. Table created.
38. SQL> INCOMPLETENESS
39. SQL> compiling incompleteness EPC
40. SQL> result:
41. SQL>> CREATE table error.sub.-- table.sub.-- incomp as
42. SQL>> SELECT total.sub.-- assets comp.sub.-- name
43. SQL>> FROM disc.sub.-- ei
44. SQL>> WHERE ( total.sub.-- assets >`100000000" )
45. SQL>> UNION
46. SQL>> SELECT total.sub.-- assets, comp.sub.-- name
47. SQL>> FROM disc.sub.-- ea
48. SQL>> WHERE
49. SQL>> ( total.sub.-- assets > `100000000` )
50. SQL>> AND
51. SQL>> comp.sub.-- name NOT IN
52. SQL>> (SELECT comp.sub.-- name
53. SQL>> FROM disc
54. SQL>> WHERE ( total.sub.-- assets >`100000000` )
55. SQL>> )
56. SQL>>
57. SQL> executing EPC to compute output incompleteness
58. Table created.
59. SQL> MISMEMBERSHIP
60. SQL> compiling mismembership EPC
61. SQL> result:
62. SQL>> CREATE table error.sub.-- table.sub.-- mismem as
63. SQL>> SELECT total.sub.-- assets, comp.sub.-- name
64. SQL>> FROM disc.sub.-- em
65. SQL>> WHERE ( total.sub.-- assets > `100000000" )
66. SQL>> UNION
67. SQL>> SELECT total.sub.-- assets comp.sub.-- name
68. SQL>> FROM disc
69. SQL>> WHERE
70. SQL>> ( total.sub.-- assets > `100000000` )
71. SQL>> AND
72. SQL>> comp.sub.-- name IN
73. SQL>> (SELECT comp.sub.-- name
74. SQL>> FROM disc.sub.-- ea
75. SQL>> WHERENOT ( total.sub.-- assets > `100000000` )
76. SQL>> )
77. SQL>>
78. SQL> END OF EPC OUTPUT
79. SQL> executing EPC to compute output mismembers
80. Table created.
______________________________________
In this example, the user query is: select total.sub.-- assets comp.sub.-- name, from disc where total.sub.-- assets>"100000000". By this query, the database user wishes to see which companies have greater than one hundred billion dollars and what is their asset amount. The listing of Table 5 shows the output of the script as executed in the SQL Plus environment. The user query is shown on line 11. The LEX and YACC parse and transform input queries into error propagation expressions in lines 16, 38, and 58 of the listing of Table 5 respectively. The compiler generates a separate SQL expression set for inaccuracy, incompleteness, and mismembership errors. Lines 19-35 provide an example of the error propagation instructions generated by the error propagation processor 22. The parser generates from the SQL command select, a subset of instructions that identifies the set of tuples that are inaccurate within the query results generated by the query mechanism 14. The selection operation is a logical formula on the data of database memory 18 that will select from that data a subset of the data which meets the criterion of the selection operation. Inaccuracies result in the selection operation due to the existence of an inaccurate tuple in the input data structure that will be selected during the SQL selection operation entered by the user. This relationship can be represented by the following formula: S.sub.ea ={r.sub.2 .vertline..E-backward.r.sub.1 .epsilon.r, .E-backward.r.sub.2 .epsilon.r.sub.ea, r.sub.1.R.sub.k =r.sub.2.R.sub.k, f(r.sub.1)f(.sym.r.sub.1)} Wherein the expression .E-backward.r.sub.1 .epsilon.r means there exists a tuple in r.sub.1 in R. "r.sub.1.R.sub.k =r.sub.2.R.sub.k " matches r.sub.1 in R with a tuple r.sub.2 in R.sub.ea on the key columns R.sub.k. So, r.sub.1 is inaccurate and r.sub.2 contains attribute value corrections to r.sub.1. "f(r.sub.1)" indicates that r.sub.1 was selected by f. "f(r.sub.1)"indicates that r.sub.1 should have been selected, i.e. if the attribute values in the input data r.sub.1 had been corrected before applying f, the tuple would still have been selected. This is understood to ensure that the inaccuracy in r.sub.1 did not result in an S mismember due to false selection of r.sub.1. This formula is represented by the SQL query instructions set forth in lines 19-35 of Table 5. Selection incompleteness errors arise from two conditions. An incomplete from the input data would have been selected had the tuple been present, and inaccuracy in the input data caused the select condition to fail where it would have succeeded otherwise. These criteria are set forth by the expressions 3a and 3b below. Selection mismemberships in the query results can arise from two causes; a mismember of the input data is selected by a selection operation so that the mismember remains, and where an inaccurate tuple is selected by the selection operation only due to its inaccuracy and should not have been selected. This logical formula is set forth in expressions 2a and 2b. Other query operations are described besides SELECT in Appendix A attached hereto, and still others will be apparent to those of ordinary skill in the art. Accordingly, it will be understood the invention set forth herein which measures error propagation is not to be limited to any particular set of operations, nor to any particular set of error definitions. The parser of the error propagation of monitor 22 generates, as set forth in lines 41 through 56, a set of query instructions for processing the error model 24 and the data sets stored in database memory 18 to generate an error measure of the incompleteness within the query results. Similarly, the parser at lines 62 through 78 of Table 5 generate a set of query instructions for determining mismemberships within the query results. These three derived SQL queries relate to the error propagation formalisms set forth in formulas 1 through 3. When executed against the error model 24 and the data in database 18, the error propagation processor 22 generates an error measure represented in FIG. 1 as three output tables 28a, 28b, and 28c. Each of these tables corresponds to one portion of an error triple, i.e. inaccuracy, mismembership, or incompleteness. The contents of these three tables are set forth in Table 6 below. A short review of Table 6 indicates that the error measurement system 20 according to the invention has identified those errors that exist within the query result 16 depicted in FIG. 1. The system is mathematically closed in that the output of one error propagation calculus expression can serve as input to another.
TABLE 6
______________________________________
Output error triple: <output.sub.-- ea, output.sub.-- ei, output.sub.--
em>
81. SQL> ************** OUTPUT INACCURACIES ******
82. SQL> select * from output.sub.-- ea;
83. TOTAL.sub.-- ASSETS COMP.sub.-- NAME
84. --
85. 251506000 GENERAL ELECTRIC CO
86. SQL> ***************** OUTPUT INCOMPLETES ******
87. SQL> select * from output.sub.-- ei;
88. TOTAL.sub.-- ASSETS COMP.sub.-- NAME
89. --
90. 101132000 AMERICAN EXPRESS CO
91. 198938000 FORD MOTOR CO
92. SQL> ************* OUTPUT MISMEMBERS ******
93. SQL> select * from output.sub.-- em;
94. TOTAL.sub.-- ASSETS COMP.sub.-- NAME
95. --
96. 101000000 INTERNATIONAL BUSINESS MACHINES CORP
97. 216574000 CITICORP
98. 9.0926E + 10 DAIMLER BENZ CORP
99. SQL> Disconnected from ORACLE7 Server Release 7.0.15.4.0-Production
______________________________________
The above description describes a method for measuring error within query results and provides a deterministic evaluation of the error within those query results. To this end, the system depicted in FIG. 1 employs an error model 24 that contains, as described in the above example, values which indicate the actual errors existing in the input data stored in the database memory 18. Accordingly, this logical error model represents error as deterministic corrections to individual facts. Such a deterministic error model 24 allows a database administrator to document or verify the integrity and value of the structured data set stored within the database memory 18. For example, the database administrator can generate a deterministic error model 24 by selecting a portion of the data stored within the database memory 18 and manually researching the data therein to identify any inaccuracies, incompleteness, or mismemberships. For example, a database administrator can be in charge of a database that stores records on 35,000 alumni. The database administrator can select a representative 500 records from that database. The database administrator can review and analyze these records to create an error model 24 that is representative of a portion of the data records stored in the database memory 18. The database administrator can then conduct query searches of those 500 records to get error measures that indicate the accuracy of the query result being generated from the database. The error measures generated by the error measurement system of the invention provides the database administrator with the integrity of the search results being produced on searches of the full database. If error measurements indicate that the search results are of little value, the database administrator can draw the inference that the database needs to be overhauled. Alternatively, if the error measures indicated that query results being produced from the database 18 are generally accurate, the database administrator can choose to avoid the expensive process of updating the database. Similarly, a database administrator that is choosing between a number of available databases that contain similar information can select the highest quality database, perhaps the most expensive, and determine differences between that high quality database and other less expensive ones. These differences can be employed to generate a deterministic error model 24. If error measures produced upon processing less expensive databases indicates an acceptable level of error within the query results, then the database administrator can select to purchase the less expensive database. Alternatively, if the error measure results indicate that the less expensive database produces query results highly inaccurate and of little value, the database administrator can justify the expense of purchasing the more expensive database. It is understood, that the system allows a database administrator to get a measure of how error in an input database propagates to query results. The propagation of error is not necessarily intuitive. Accordingly, an error model that indicates a high level of inaccuracy in a certain attribute may nonetheless fail to translate into any type of significant error for the queries generally performed by the database administrator. Consequently, the invention provides database administrators with systems and method for comparing, maintaining, and selecting databases. In an alternative embodiment of the invention, the system employs an error model 24 that represents error within the structured data set by probabilistic data. In a probabilistic sense, knowing error fully implies having a probability distribution on the error sample space so that each error state can be individually quantified for likelihood. A probabilistic error representation consisting of expressions such as those described below in Table 7, can be one way of defining these probability distributions. FIG. 5 depicts one embodiment of the invention for use with probabilistic error models. This error and error propagation model define a simulation mechanism for iterating over the logical error states. FIG. 5 depicts the induced probabilistic error measurement system 40 that includes a logical error propagation monitor 42, an induced probabilistic error model 44, a discrete logical error model memory 46, a discrete logical error measure memory 48 having storage for error measure triples 48a, 48b, and 48c, an aggregation controller 50, and an induced probabilistic output error measure memory 52. In the system 40 depicted in FIG. 5, the probabilistic error data stored in the probabilistic error model 44 is employed to generate an error measure that defines error within query results as a probability distribution. In this embodiment, the system 40 generates probabilistic error measures by iteratively testing each possible error state of the database memory 18 to generate a plurality of error measures, each being representative of one possible output error state for the query results. Thus the system 40 can also be termed the induced probability error measurement and propagation system, as this model results from probabilistic iterations over the logical model. As further depicted in FIG. 5, the system 40 includes an aggregation controller 50 that aggregates the probabilities of the individual logical output error measurements to generate an output error measure that represents the probability distribution of error within the query result separately for each type of error and for various subsets of the query result. To this end, the error propagation monitor 42 contains an iteration processor that employs the probabilistic data stored in the probabilistic error model 44 to generate randomly one discrete logical error measure which is stored in error model memory 46. This discrete logical error measure represents one possible logical error model for the data of database memory 18. The system 40 processes this single discrete logical error measure as described above with reference to the system 20 of FIG. 1, to generate an error measure representative of an output error that would occur for this one discrete logical error measure stored in error model memory 46. This output measure is then stored within the error measure memory 48. The iteration processor of error propagation monitor 42 then randomly generates, using the probability error model 44, another discrete logical error measure which can be stored in error model memory 46. Again, the error propagation monitor 42 generates an output error measure signal for this possible error measure and stores the output measure within the error measure memory 48. This iterative process continues until sufficient iterations have been performed to generate a sufficient number of logical output error measures to define the probability distribution of the error within the query result signal. In particular, as further illustrated in FIG. 5, each iteration of the logical model produces an error triple that can be stored within the error measure memory 48. Each error measure stored within error measure memory 48 represents one point of the error space of the logical measure of error for the query results signal 16. In the depicted embodiment, the error measure memory 48 couples to the aggregation controller 50. The aggregation controller 50 can, optionally, sum together the multiple logical executions provided by the iterative processor of the propagation monitor 42. In particular, the aggregation controller generates an aggregate statistic over the multiple logical executions performed by the error propagation monitor 42. For example, the iterative processor can perform one hundred iterations of the simulation. Random logical error measures are generated according to probabilistic error statements such as those in Table 7 and in formulas 1, 2, and 3 below. The output of each iteration is a logical output error measure. This output error measure is a probability density function of its outputs over one hundred iterations and constitutes the induced model's output distribution for each of the error types. This output distribution can be graphically plotted to depict the probability distribution of the error within the query result signal 16. The probabilistic output error measure memory 52 can store the aggregated output error data for subsequentive use by the database administrator. It will be apparent that other query operations and error propagation calculus expressions can be employed with the invention without departing from the scope thereof. In a further alternative embodiment of the invention, the systems can employ error model data that is represented as probabilistic expressions. For example a statistical methodology can determine that an error term on income for Alumni is some (discretized) normal distribution (call it P (income-error)). These probabilistic error statements can be processed by statistical operations to directly determine the error probability distribution of the query result signal 16. Continuing with the example, if a selection predicate asks for the number of individuals with income greater than some amount, this input normal distribution of error become a distribution on inaccuracy, incompleteness, and mismembership by operation of a functional processor that employs a sum of binomial random variables to model the number of incompletes and mismembers in the output. Such a functional view of error representation and propagation directly determines the probabilistic output error without requiring the iterative processes previously described with reference to FIG. 5. The statistical processing operations employed by the functional processor for combining probability statements of error are well known in the art of probability and statistics. One such system is the functional probabilistic error measurement system 60 depicted in FIG. 6. The depicted system 60 includes a functional error propagation monitor 62, a functional probabilistic error model memory 64 having probability data stored therein, a statistical function library 68, and functional probabilistic error measure memory devices 70a, 70b, and 70c. Accordingly, in this embodiment, the error model data stored in memory device 64 can be stated as closed form statements of probability distributions. For example, an error statement that alumni under-report or over-report their income as a function of region (Europe, Texas, etc.) and employment category (CEO, manager, etc.), can take the form: P(Income-error=x.vertline.Region=y Employment-category=z) (Statement 1) Further, mismembers may be represented as: (1) "the probability that a randomly chosen tuple in the table is a mismember is 0.1", and (2) "the distribution of values in a tuple given it is a mismember is P.sub.x (x)" where x.epsilon.X is a tuple variable on the table's scheme. These can be stated in the form: P (mismember) and P(X=x.vertline.mismember). Other probabilistic statements of error will be apparent to those of ordinary skill in the art, and the use thereof does not depart from the scope of the invention. Table 7 illustrates probability statements in an error model data set. In Table 7, t is a random tuple in a relation r, .sym.t is the correction to t necessary to correct any error therein, o is a random object in the world which is in fact in r's class, x and y are tuple variables on r's scheme. Then the following statements describe alternative ways of structuring knowledge about data error for each error type. Member (t) is the event that the tuple t is in fact a member of the table's class. Mismember (t) is the event that the tuple t is in fact not a member of the table's class. Incomplete (o) is the event that the object o is in fact a member of the table's class, but is not listed in the table.
______________________________________
error type
item functional representation expression
______________________________________
inaccuracy
1 P (.sym.t = y .vertline. member(t) t = x)
1' P (member(t) t = x)
or
2 P (t = y .vertline. member(t) .sym.t = x)
2' P (member(t) .sym.t = x)
incompleteness
3 P (incomplete(o).vertline.o = x)
3' P (o = x)
or
4 P (o = x .vertline. incomplete(o))
4' P(incomplete(o))
mismembership
5 P (mismember(t) .vertline. t = x)
5' P (t = x)
or
6 P (t = x .vertline. mismember(t))
6' P (mismember(t))
______________________________________
For encoding knowledge about the error type inaccuracy, an error data model can structure the information based on either of the two expressions of item 1 and 1' or on 2 and 2' above as these are alternative statements. This can be similarly done for incompleteness and mismembership. Accordingly, memory 64 can store an error triple data model having three tables of errors each providing tuple level representation of error recited as statements of probability. This system 60 can also be termed the functional probabilistic error measurement and propagation system. This is because it operates directly on probability functions. The functional probability representation can also provide error models that include conditional descriptions of error. To this end, the probability statement of error expressed for a tuple in the error model can provide a first statement of error probability that applies under a first condition, and a second statement of probability of error that applies under a second condition. For example, the probability statement of error for the accuracy of an address field in the input data set may be conditional upon the age of the associated alumni member, given that younger people have a tendency to move more frequently that older people. Accordingly, the parameters of a probability statement, e.g., the mean of a binomial distribution, may be different conditioned upon the age of the alumni. Error can may also be conditioned upon evidence of other error contained in the table. The evidence of error can be any information which leads to a belief that error may exist in a particular subset of the data which is different than the error that exists for the other portions of the data. One such form of evidence includes a recording of the processing history of data, including its source, age, and collection method. As with the embodiments described above, the error propagation monitor 62 determines errors in the input data set that will propagate through to the query results 70 given the input errors and the operations employed in the query. For example, mismembers can propagate for a select operation. For example, variable r represents a table that is input to a database query, and variable s represents a table that is output from that query. The scheme of both is R, with r.sub.e and s.sub.e are the respective error triples. s.sub.e can be computed by the calculus. Let K.OR right.R be the key of R. As an example of the probabilistic events, let s.sub.1 be a random tuple drawn from the output table s. Let s.sub.1.K.epsilon.s.sub.em.K be the event that s.sub.1.K is a mismember of s. Then P(s.sub.1.K.di-elect cons.s.sub.em.K) is the probability of this event. As discussed above, a conditional distribution such as P (s.sub.1.K .epsilon.s.sub.em.K .vertline.s.sub.1 =x) allows assignment of a higher likelihood or degree of error to some subsets of the data(world) than to others (where x is a tuple variable on R). mismembership in the select result ##EQU1## Expressions 2a and 2b above state that two exclusive events among input tuples can result in an s mismember. 2a covers the event that s.sub.1.K was a mismember in r, in which case (by definition of a selection) it is also a mismember in s. 2b describes the other way a tuple may be a mismember in s--when an inaccuracy in r causes a tuple to be wrongly selected into s. The probability of an output mismembership is a function of the probabilities of the these two types of input error events. The probability that a random output tuple s.sub.1 .epsilon.s is a mismember (given s.sub.1 =x) is the probability that, for the tuple s.sub.1 .epsilon.r, and given .function.(s.sub.1), then what is the conditional probability--in r--that s.sub.1 is either a mismember of r or s.sub.1 is inaccurate resulting in false selection by .function.. And, because of the conditionals, a probabilistic "filtering" of error occurs. The selectivity of .function. over conditioning variables may lead to different proportions of tuples in each error category. Inaccuracy error concerns an attribute value vis-a-vis an object. As in mismembership, a conditional interpretation of inaccuracy can be adopted. y below is another tuple variable on R. inaccuracy in the select result ##EQU2## This equation describes the single event in the input event space that results in an inaccuracy in the output. This is the case where an inaccurate tuple s.sub.1 of r satisfies .function., and the satisfaction is not spurious, i.e., it would have occurred even if the inaccuracy were corrected. For incompleteness, let o be a random tuple missing from .sym.s where .sym.s represents the true output. Let t be the corresponding inaccurate tuple in r such that t.K=o.K. Two conditions can cause incompleteness: an incomplete from r would have been selected had the tuple been present and an inaccuracy in r causes the select condition to fail where it would have succeeded otherwise. P.sub.s and P.sub.r represent probabilities on s and r respectively. incompleteness in the select result ##EQU3## An error calculus can be provided to detect errors that arise from attribute level errors crossing over into class-level errors. For example, Major was an attribute column in context of Alumni, but will generate an object in the Major table due to a projection operation. A propagation calculus can account for such semantic transformations and convert across error measures from one interpretation of data to another. Let r and s be input and output respectively for a projection: s=.PI..sub.s (r). Probabilistic project propagation depends on the relationship between the projection list S and the key K of the input scheme R. If S includes the entire key of R, then the key of S and R are the same, and the incompleteness and mismembership of s and r are the same. If the key is removed, then a new key arises (as in Major above) and error is to be computed accordingly. Another factor in the projection can be the relationship between S and the set of columns that are conditioning in the error representation. If conditioning columns are removed, then a new marginal distribution of error is to be computed for the remaining columns in order to maintain (the now reduced) error information. For example, the formula below describes the calculus for projection incompleteness when the conditioning columns are kept and the key is removed. Let R.sub.k be the key of R. Because S is disjoint from R.sub.k, there is a key change so that S.sub.k =S. Define o as in 3a-c above. The functional processor can compute incompleteness as: ##EQU4## This error propagation calculus expression indicates that object o.S.sub.k will be incomplete from s if either incompleteness or inaccuracy in r masked the fact that a member of r in fact had S value o.S.sub.k. A probabilistic aggragation calculus can also be determined. A count operation reflects how many tuples are in a given table. In the above described error model, the true count can be derived random variable. Let x be the number of tuples actually present in a table. Let y be the number of incomplete objects, and let z be the number of mismember tuples. Then by simple observation the true count is equal to x+y-z. The data defines x. So as long as the probabilistic error term gives a distribution on the number of incompletes (y) and mismembers (z), then the true count is fully (i.e., probabilistically) defined. For example, if the alumni DBA states: 300 people are incorrectly listed as deceased then the query select name from alumni where Deceased=`no` would result in a table of exactly 300 missing. The DBA might have stated instead: the likelihood of any individual reporting him or herself as dead is 1 in 1,000 Then, given 70,000 alive and 30,000 dead tuples, a simple binomial model of lying about death can determine the distribution of numbers of incompletes in the result. A probabilistic calculus for a sum operation can also be determined. Let T be a table having a numeric attribute column a.sub.1, and having n tuples. Let the aggregation be over .alpha..sub.1. Let .SIGMA..alpha..sub.1 i=1, . . . , n be the sum over .alpha..sub.1, counting blanks as zeros. First make adjustment to S..alpha..sub.1 to address the incompletes. Let P.sub.y (y) be the probability that the number of incomplete objects is y. Then, to adjust S..alpha..sub.1 for incompleteness in T, the functional processor can add z, where z is the random sum of a random variable. The random sum is over the random number (.vertline.T.sub.ei .vertline.) of incomplete objects. The random variable is the value t.sub.1. a.sub.1 for a missing tuple t.sub.1. A similar operation corrects for mismembers, but the random sum of random variables is subtracted. Let m be that mismember adjustment, Then the true sum random variable can be expressed as: .SIGMA..alpha..sub.1 +z-m. Thus, the probability distribution for the statistic "total-income of Boston area alumni" can be computed, from which various derivative metrics can be computed, such as confidence intervals on this statistic. The count and sum calculi described above for the functional processor, compute the same output distributions as the embodiment depicted in FIG. 5. The semantics are a clearer from this discussion, however, because they acknowledge the conditional structure of error and the formulation of error as random sums of random variables, allowing for (increasingly) closed-form solutions. Propagating such sums functionally (e.g., without the simulations of the induced model of FIG. 5) will depend on the particular underlying distributions involved. Accordingly, the functional model manipulates conditional probability distributions explicitly and leads, where possible, to increasingly closed form analytic formulas for probability distribution propagation. Many uncertainty models embody assumptions about the "shape" of uncertainty (e.g., uniformity, independence, and normality). These may or may not be valid in a given setting. The current model makes no assumptions about distributions, but specifies what probabilities are relevant. It will be apparent that other aggregate query expressions, such as average, and union, can be employed with the invention without departing from the scope thereof. In operation the propagation monitor 62 monitors the instructions generated by query mechanism 14 and parsers the queries, as described above, to generate a set of propagation queries to determine how errors from the probabilistic error model 64 propagate to the output error measure in memory device 70. As described above, the operations of the logical instructions for processing the database 18 determine how error propagates to the query results. For example, for a select operation, errors that exist within the structured data set generally, if selected, propagate right through to the output error measure stored in 70. However, when the query requests data to be aggregated, either for counting, summing, or averaging, the propagation monitor 62 determines the type of probability statement, e.g., normal distribution, associated with the appropriate attributes in the error model 64 and accesses the statistical function library 68 to select a statistical function, i.e. a binomial sum of normal random variables, for generating the probability statements to achieve the proper probability statement for the error measure. In this way, the propagation monitor 62 acts as a functional processor for directly determining probability statements of error within the error measure signal stored in the memory devices 70a, 70b, and 70c. In the alternative embodiment depicted in FIG. 6, the error model can also store possibility data. Possibility data is understood to be probabilistic data which is less certain than probability data, in that it merely indicates that error can exist within the data measurement. However, possibility data typically provides no measure as to the likelihood of that error. However, in some embodiments, it is useful for database administrator to determine whether or not the possibility of error propagates through into his query results. The above description illustrates the systems and methods according to the invention that are suitable for determining an error measure signal representative of the error that occurs within a query result generated from processing a structured data set. The error models shown herein can be provided to the systems of the invention or can be generated for use with such systems. As described above, the error models can be generated by comparing a reference data set to an existing data set to determine an error measure, which in one embodiment can be represented as the corrections which are necessary to make to the structured data set to bring the structured data set into correspondence with the reference data set. Additionally, the error models can include probability data which can be gathered either through known statistical processes for measuring error within a set of data, or by less deterministic, and empirical methods, wherein a database administrator who has substantial knowledge of what the accuracy of data within the database is interviewed to determine rough estimates or subjective statement about error within the data. Other techniques, including database integrity constraints, can be employed by the invention for measuring the error within database and for generating the error models suitable for use with the invention described herein. The systems and methods of the invention can be employed for determining or stimulating the integrity of an existing database as well as for allowing a database administrator to compare multiple databases for purposes of selecting between the multiple databases. Additionally, systems of the invention can be employed for determining measures of error produced by application of an interpretation map to the query results of a database system. In this application, an interpretation map can be applied to the query results provided by a database system for purposes of translating the query results received from a first context to a second context. For example, an interpretation map can be provided to translate query results achieved from processing a database having pre-tax financial information into query results represented in post-tax dollars. As the interpretation from pre-tax to post-tax dollars can be inexact, and create errors within the query results, systems of the invention can be employed for modeling the generated error and for determining the error that gets propagated through to the post-tax query results. Other applications that employ these systems and methods of the invention described herein will be apparent to those of ordinary skill in the art of database systems and statistical analyses. It will thus be seen that the invention provides systems and methods for measuring error within a structured data set and for measuring and modeling the propagation of that error through the structured data set to a query result. ##SPC1##
|
Same subclass Same class Consider this |
||||||||||
