| |
|
|
DATABASE OR FILE ACCESSING |
Pattern recognition using generalized association rules6311173
Abstract
A method and system for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest. Known attribute values regarding a training sample of items within the population including the attribute of interest are stored in a memory. The stored attribute values are processed to determine association rules regarding the training sample, including at least one generalized association rule, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule including a logical combination of a plurality of such conditions using at least one logical operation from the group consisting of disjunction and negation. Data are received from an input device, the data including values of at least some of the attributes of the given item. The association rules including the at least one generalized association rule are applied to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
Claims
What is claimed is:
1. A method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest; and
processing the stored attribute values to determine association rules regarding the training sample, including at least one generalized association rule, each association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule comprising a logical combination of a plurality of such conditions using at least one logical operation from a group consisting of disjunction and negation.
2. A method according to claim 1, wherein processing the attribute values comprises finding the at least one generalized rule such that a support of the rule is maximized on the training sample.
3. A method according to claim 2, wherein finding the at least one generalized rule comprises finding a generalized association rule predicting that the attribute of interest has a given value, such that the support of the generalized rule in the training sample comprises at least ten times as many items having the given value of the attribute of interest as having another value, not equal to the given value.
4. A method according to claim 3, wherein finding the generalized association rule comprises finding a rule whose support in the training sample comprises substantially only items having the given value of the attribute of interest.
5. A method according to claim 1, wherein processing the attribute values comprises finding a plurality of generalized association rules such that an overlap of the respective supports of two or more of the plurality of rules on the training sample is minimized.
6. A method according to claim 1, wherein processing the attribute values comprises finding a plurality of generalized association rules such that substantially all of the items in the training sample are included in the support of one or more of the generalized rules.
7. A method according to claim 6, wherein finding the plurality of generalized rules comprises finding first and second groups of generalized rules that are respectively predictive of first and second values of the attribute of interest, such that there is an approximately equal number of rules in each group.
8. A method according to claim 1, wherein the at least one generalized association rule comprises a rule predicting that the attribute of interest has a first value if a condition substantially of a form {character pullout}(C.sub.j1.sup.(i) vC.sub.j2.sup.(i) v . . . vC.sub.jk.sup.(i)) is fulfilled, wherein j1, j2, . . . , jk are indices enumerating conditions represented collectively as C.sub.jm.sup.(i), m an arbitrary index running from 1 to k, and i an index representing a conclusion of the conditions, and wherein each C.sub.jm.sup.(i) represents a condition on a known attribute value of the item other than the attribute of interest, which is predictive that the attribute of interest will have another value, different from the first value.
9. A method according to claim 1, wherein the at least one generalized association rule comprises a rule predicting that the attribute of interest has a first value if a condition substantially of a form (C.sub.j1.sup.(i.sup..sub.1 .sup.) vC.sub.j2.sup.(i.sup..sub.1 .sup.) v . . . vC.sub.jk.sup.(i.sup..sub.1 .sup.)) {character pullout}(C.sub.jk+1.sup.(i.sup..sub.2 .sup.) vC.sub.jk+2.sup.(i.sup..sub.2 .sup.) v . . . vC.sub.jK.sup.(i.sup..sub.2 .sup.)) is fulfilled, wherein j1, j2, . . . , jk, jk+1, jk+2, . . . , jK are indices enumerating conditions represented collectively as C.sub.jm.sup.(i.sup..sub.1 .sup.) and C.sub.jm.sup.(i.sup..sub.2 .sup.) m an arbitrary index running from 1 to K, and i1 and i2 are indices representing conclusions of the conditions, and wherein each C.sub.jm.sup.(i.sup..sub.1 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have the first value, and each C.sub.jm.sup.(i.sup..sub.2 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value.
10. A method according to claim 9, wherein processing the attribute values comprises finding the set of rules {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) } and then searching for rules in the set {C.sub.jk+1.sup.(i.sup..sub.2 .sup.) . . . C.sub.jK.sup.(i.sup..sub.2 .sup.) } on the support of {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) }.
11. A method according to claim 1, wherein processing the attribute values comprises finding simple association rules, wherein each simple association rule comprises one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule.
12. A method according to claim 11, and comprising:
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules including the at least one generalized association rule to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
13. A method according to claim 12, wherein applying the association rules comprises applying both the simple and the at least one generalized association rules jointly to predict the unknown value.
14. A method according to claim 13, wherein applying the rules jointly comprises computing a weighted sum of values of the attribute of interest predicted by the rules.
15. A method according to claim 14, wherein computing the weighted sum comprises computing probabilities respectively associated with the simple and generalized rules, and weighting the predicted values by the respective probabilities.
16. A method according to claim 11, wherein finding the association rules comprises finding the at least one generalized association rule by combining a plurality of the simple association rules.
17. A method according to claim 16, wherein combining the plurality of the simple association rules comprises combining the rules to find a generalized rule which includes a disjunction of two or more of the simple rules.
18. A method according to claim 16, wherein combining the plurality of the simple association rules comprises combining the rules to find a generalized rule which includes a negation of one or more of the simple rules.
19. A method according to claim 11, wherein determining the simple association rules comprises determining substantially all simple association rules pertaining to the sample having respective probability and support greater than predetermined minimum values thereof.
20. A method according to claim 1, wherein processing the attribute values comprises encoding values of the attributes according to the frequency of their occurrence in the training sample.
21. A method according to claim 20, wherein encoding the values comprises calculating hash functions.
22. A method according to claim 20, wherein encoding the values comprises assigning a distinguishable code to values occurring at less than a predetermined frequency in the training sample, whereby such values are substantially excluded from the determination the of association rules.
23. A method according to claim 1, and comprising:
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules including the at least one generalized association rule to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
24. A method according to claim 23, wherein applying the association rules comprises applying a subset of the rules consisting of rules whose one or more conditions are fulfilled by known values of attributes of the given item other than the item of interest.
25. A method according to claim 23, wherein processing the attribute values comprises finding probabilities corresponding to the determined association rules, and wherein applying the association rules comprises applying the probabilities to compute a cumulative probability that the attribute of interest has a given value.
26. A method according to claim 25, wherein computing the cumulative probability comprises computing a weighted sum of the probabilities corresponding respectively to the association rules applied in predicting the value.
27. A method according to claim 25, and comprising determining a probability decision point such that when the cumulative probability is greater than the decision point, the attribute of interest is predicted to have a first value, and when the probability of interest is less than the decision point, the attribute of interest is predicted to have a different, second value.
28. A method according to claim 27, wherein determining the decision point comprises defining an ambiguity range of probabilities including the decision point in which the predicted value is ambiguous.
29. A method according to claim 28, wherein defining the ambiguity range comprises comparing the training sample and at least a portion of the overall population from which the given item is taken, and determining an extent of the ambiguity range responsive to a measure of the similarity of the training sample and the at least portion of the overall population.
30. A method according to claim 27, wherein determining the decision point comprises defining a point such that a total number of prediction errors is minimized.
31. A method according to claim 27, wherein an error cost is assigned to each of a plurality of types of prediction errors, and wherein determining the decision point comprises defining a point such that a total cost of prediction errors is minimized.
32. A method according to claim 23, wherein the items comprise records in a database, and the attributes comprise fields in the records, and wherein applying the association rules comprises predicting the unknown value of a database field.
33. A method according to claim 32, wherein predicting the unknown value comprises predicting a Boolean value.
34. A method according to claim 23, wherein the items comprise sounds, and the attribute values comprise characteristics of sound signals corresponding to the sounds, and wherein applying the association rules comprises identifying a sound signal.
35. A method according to claim 34, wherein identifying the sound signal comprises finding a word corresponding to the signal.
36. A method according to claim 34, wherein identifying the sound signal comprises identifying a speaker who generated the sound signal.
37. A method according to claim 34, wherein receiving the data comprises receiving data from a microphone.
38. A method according to claim 23, wherein the items comprise images, and the attribute values comprise image features, and wherein applying the association rules comprises processing an image.
39. A method according to claim 38, wherein processing the image comprises identifying a subject of the image.
40. A method according to claim 38, wherein receiving the data comprises receiving data from a camera.
41. A method according to claim 38, wherein receiving the data comprises receiving data from a scanner.
42. A method according to claim 23, and comprising outputting an indication of the predicted value to an output device.
43. A method according to claim 42, wherein outputting the indication comprises displaying the predicted value and a probability thereof.
44. A method according to claim 42, wherein outputting the indication comprises controlling an access responsive to the predicted value.
45. A method according to claim 42, wherein outputting the indication comprises sorting the given item responsive to the predicted value.
46. A method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest;
processing the attribute values to determine simple association rules regarding the training sample, each simple association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using a logical conjunction operation in defining the conditions of the rule,
wherein substantially all simple association rules applicable to the sample having respective probability and support greater than predetermined minimum values thereof are determined.
47. A method according to claim 46, wherein processing the attribute values comprises constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and a given, respective value of another one of the attributes, and wherein the association rules are determined with respect to the contingency table.
48. A method according to claim 46, and comprising:
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
49. A method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest;
processing the attribute values to determine association rules regarding the training sample, each association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest,
wherein the attribute values are processed by constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and satisfying a given, respective condition on one or more of the attributes other than the attribute of interest, and wherein the association rules are determined with respect to the contingency table.
50. A method according to claim 49, wherein constructing the contingency table comprises constructing a table of 1-conditions, characterized in that the condition on the one or more of the attributes comprises a condition on a single one of the attributes.
51. A method according to claim 50, wherein constructing the contingency table comprises constructing a table of 2-conditions, characterized in that the condition on the one or more of the attributes comprises a condition on two of the attributes, using the table of 1-conditions.
52. A method according to claim 51, wherein constructing the contingency table comprises constructing a plurality of respective tables of q-conditions, for a sequence of one or more integers q.gtoreq.3, characterized in that for each q, the condition on the one or more of the attributes comprises a condition on a group of q of the attributes, wherein for each q, the corresponding table is constructed using the table of q-1-conditions previously constructed.
53. A method according to claim 52, wherein each of the tables of q-conditions is stored in the memory as it is constructed, and wherein for each q, the corresponding table of q-1-conditions is deleted from the memory after the table of q-conditions is constructed.
54. A method according to claim 49, and comprising:
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
55. A system for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
an input device, which receives data indicative of values of at least some of the attributes of the given item;
a memory, which stores association rules regarding the population, the association rules including at least one generalized association rule, each association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule comprising a logical combination of such conditions using at least one logical operation from a group consisting of disjunction and negation in defining the conditions of the rule; and
a processor, which receives the data from the input device and reads the association rules from the memory, and which applies the association rules including the at least one generalized association rule to the values included in the data so as to predict the unknown value of the attribute of interest and which generates an output responsive to the prediction.
56. A system according to 55, wherein the processor applies a subset of the rules consisting of rules whose one or more conditions are fulfilled by known values of attributes of the given item other than the item of interest.
57. A system according to claim 55, wherein the processor finds probabilities corresponding to the determined association rules and applies the probabilities to compute a cumulative probability that the attribute of interest has a given value.
58. A system according to claim 57, wherein the processor computes a weighted sum of the probabilities corresponding respectively to the association rules applied in predicting the value.
59. A system according to claim 57, wherein the processor determines a probability decision point such that when the cumulative probability is greater than the decision point, the attribute of interest is predicted to have a first value, and when the probability of interest is less than the decision point, the attribute of interest is predicted to have a different, second value.
60. A system according to claim 59, wherein the processor defines an ambiguity range of probabilities including the decision point in which the predicted value is ambiguous.
61. A system according to claim 60, wherein the processor compares the training sample and at least a portion of the overall population from which the given item is taken, and determines an extent of the ambiguity range responsive to a measure of the similarity of the training sample and the at least portion of the overall population.
62. A system according to claim 59, wherein the processor determines the decision point such that a total number of prediction errors is minimized.
63. A system according to claim 59, wherein an error cost is assigned to each of a plurality of types of prediction errors, and wherein the processor determines the decision point such that a total cost of prediction errors is minimized.
64. A system according to claim 55, wherein the items comprise records in a database, and the attributes comprise fields in the records, and wherein the processor applies the association rules to predict the unknown value of a database field.
65. A system according to claim 64, wherein the unknown value comprises predicting a Boolean value.
66. A system according to claim 55, wherein the items comprise sounds, and the attribute values comprise characteristics of sound signals corresponding to the sounds, and wherein the processor applies the association rules to identify a sound signal.
67. A system according to claim 66, wherein the processor finds a word corresponding to the signal.
68. A system according to claim 66, wherein the processor identifies a speaker who generated the sound signal.
69. A system according to claim 66, wherein the input device comprises a microphone.
70. A system according to claim 55, wherein the items comprise images, and the attribute values comprise image features, and wherein the processor applies the association rules to process an image.
71. A system according to claim 70, wherein the processor identifies a subject of the image.
72. A system according to claim 70, wherein the input device comprises a camera.
73. A system according to claim 70, wherein the input device comprises a scanner.
74. A system according to claim 55, and comprising an output device, which receives the output from the processor and performs an action responsive thereto.
75. A system according to claim 74, wherein the output device comprises a display, which displays the predicted value.
76. A system according to claim 75, wherein the display displays a probability associated with the predicted value.
77. A system according to claim 74, wherein the output device comprises an access controller, which controls an access responsive to the predicted value.
78. A system according to claim 74, wherein the output device comprises a sorter, which sorts the given item responsive to the predicted value.
79. A system according to claim 55, and comprising a computer, which receives a training sample of items within the population having known respective attribute values including the attribute of interest, and which determines the association rules and finds the at least one generalized association rule by processing the known attribute values.
80. A system according to claim 79, wherein the computer includes the processor.
81. A system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes; and
a computer, which reads the values from the memory and determines association rules regarding the population, the association rules including at least one generalized association rule, each association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule comprising a logical combination of such conditions using at least one logical operation from a group consisting of disjunction and negation in defining the conditions of the rule, and which stores the association rules in the memory.
82. A method according to claim 81, wherein the computer finds the at least one generalized rule such that a support of the rule is maximized on the training sample.
83. A system according to claim 82, wherein the at least one generalized association rule predicts that the attribute of interest has a given value, and the support of the generalized rule in the training sample comprises at least ten times as many items having the given value of the attribute of interest as having another value, not equal to the given value.
84. A system according to claim 83, wherein the support of the generalized rule in the training sample comprises substantially only items having the given value of the attribute of interest.
85. A system according to claim 81, wherein the computer finds a plurality of generalized association rules such that an overlap of the respective supports of two or more of the plurality of rules on the training sample is minimized.
86. A system according to claim 81, wherein the computer finds a plurality of generalized association rules such that substantially all of the items in the training sample are included in the support of one or more of the generalized rules.
87. A system according to claim 86, wherein the plurality of generalized rules comprises first and second groups of generalized rules that are respectively predictive of first and second values of the attribute of interest, such that there is an approximately equal number of rules in each group.
88. A system according to claim 81, wherein the at least one generalized association rule comprises a rule predicting that the attribute of interest has a first value if a condition substantially of a form {character pullout}(C.sub.j1.sup.(i) vC.sub.j2.sup.(i) v . . . vC.sub.jk.sup.(i) is fulfilled, wherein j1, j2, . . . , jk are indices enumerating conditions represented collectively as C.sub.jm.sup.(i), m an arbitrary index running from 1 to k, and i an index representing a conclusion of the conditions, and wherein each C.sub.jm.sup.(i) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value.
89. A system according to claim 81, wherein the at least one generalized association rule comprises a rule predicting that the attribute of interest has a first value if a condition substantially of a form (C.sub.j1.sup.(i.sup..sub.1 .sup.) vC.sub.j2.sup.(i.sup..sub.1 .sup.) v . . . vC.sub.jk.sup.(i.sup..sub.1 .sup.)) {character pullout}(C.sub.jk+1.sup.(i.sup..sub.2 .sup.) vC.sub.jk+2.sup.(i.sup..sub.2 .sup.) v . . . vC.sub.jK.sup.(i.sup..sub.2 .sup.) is fulfilled, wherein j1, j2, . . . , jk, jk+1, jk+2, . . . , jK are indices enumerating conditions represented collectively as C.sub.jm.sup.(i.sup..sub.1 .sup.) and C.sub.jm.sup.(i.sup..sub.2 .sup.), m an arbitrary index running from 1 to K, and i1 and i2 are indices representing conclusions of the conditions, and wherein each C.sub.jm.sup.(i.sup..sub.1 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have the first value, and each C.sub.jm.sup.(i.sup..sub.2 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value.
90. A system according to claim 89, wherein the computer finds the set of rules {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) } and then searches for rules in the set {C.sub.jk+1.sup.(i.sup..sub.2 .sup.) . . . C.sub.jK.sup.(i.sup..sub.2 .sup.) } on the support of {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) }.
91. A system according to claim 81, wherein the computer finds simple association rules, wherein each simple association rule comprises one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule.
92. A system according to claim 91, wherein the simple and the at least one generalized association rules are applied jointly to predict the unknown value.
93. A system according to claim 92, wherein a weighted sum of values of the attribute of interest predicted by the rules is used by the computer to predict the unknown value.
94. A system according to claim 93, wherein the computer computes probabilities respectively associated with the simple and generalized rules, which are used to weight the predicted values in computing the weighted sum.
95. A system according to claim 93, wherein the computer finds the at least one generalized association rule by combining a plurality of the simple association rules.
96. A system according to claim 95, wherein the at least one generalized rule includes a disjunction of two or more of the simple rules.
97. A system according to claim 95, wherein the at least one generalized rule includes a negation of one or more of the simple rules.
98. A system according to claim 91, wherein the computer determines substantially all simple association rules pertaining to the sample having respective probability and support greater than predetermined minimum values thereof.
99. A system according to claim 81, wherein the computer encodes values of the attributes according to the frequency of their occurrence in the training sample.
100. A system according to claim 99, wherein the computer encodes the values by calculating hash functions.
101. A system according to claim 99, wherein the computer assigns a distinguishable code to values occurring at less than a predetermined frequency in the training, sample, whereby such values are substantially excluded from the determination the of association rules.
102. A system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes; and
a computer, which reads the values from the memory and determines simple association rules regarding the population, each simple association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using a logical conjunction operation in defining the conditions of the rule, and which stores the association rules in the memory,
wherein substantially all simple association rules applicable to the sample having respective probability and support greater than predetermined minimum values thereof are determined.
103. A system according to claim 102, wherein the computer determines the association rules by constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and a given, respective value of another one of the attributes.
104. A system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, comprising:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes; and
a computer, which reads the values from the memory and determines association rules regarding the population, each association rule comprising one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and which stores the association rules in the memory,
wherein the computer determines the association rules by constructing one or more contingency tables, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and satisfying a given, respective condition on one or more of the attributes other than the attribute of interest.
105. A system according to claim 104, wherein the one or more contingency tables comprise a table of 1-conditions, characterized in that the condition on the one or more of the attributes comprises a condition on a single one of the attributes.
106. A system according to claim 105, wherein the one or more contingency tables comprise a table of 2-conditions, characterized in that the condition on the one or more of the attributes comprises a condition on two of the attributes, which is constructed by the computer using the table of 1-conditions.
107. A system according to claim 106, wherein the one or more contingency tables comprise a plurality of respective tables of q-conditions, for a sequence of one or more integers q.gtoreq.3, characterized in that for each q, the condition on the one or more of the attributes comprises a condition on a group of q of the attributes, wherein for each q, the computer constructs the corresponding table using the table of q-1-conditions previously constructed.
108. A system according to claim 107, wherein the computer stores each of the tables of q-conditions in the memory as it is constructed, and wherein for each q, the computer deletes the corresponding table of q-1-conditions from the memory after the table of q-conditions is constructed.
Description
FIELD OF THE INVENTION
The present invention relates generally to systems and methods for pattern recognition, and specifically to methods of pattern recognition based on association rules.
BACKGROUND OF THE INVENTION
Automated pattern recognition is well known in the art, in a variety of applications. Common applications of pattern recognition include image analysis, speech recognition, and predicting unknown fields for records in a database. Typically, a template or a collection of rules is determined, which are believed to constitute a pattern that is characteristic of a certain class. Items in the set are then evaluated to determine how closely they fit the pattern. A close fit indicates a high probability that the item being evaluated falls within the class. Thus, a face may be found to belong to a certain individual; or a spoken sound may be found to correspond to a certain word; or a bank customer may be predicted to be a good or bad credit risk.
In order to build the template or rules, "data mining" of a training database is frequently used. The training database is selected and is assumed to be a real and representative sample of the overall population. The training database generally contains variables (fields, representing various attributes of the items) of different types, among which a field is selected as the "Field to Predict" (also referred to in the database art as the "output", or "result", or "dependent" variable). The training database may be represented as a temporary file of codes of attribute values, given by the matrix:
A={x.sub.1, . . . , x.sub.n,y}.sub.1.sup.N (1)
consisting of N records, where a record is a vector of previously encoded values of n input (predicting) fields x.sub.1, . . . , x.sub.n and of the output (predicted) field y. The purpose is to discover all essential regularities within the investigated file A, thus enabling one to predict the unknown value of output field y based on the known values of variables x.sub.1, . . . , x.sub.n. The same encoding methods may be applied when the rules derived from the training database are applied to the rest of the population.
Typically, the Field to Predict is Boolean, i.e., the output field can be specified as y .epsilon.{0,1}. It will be understood, however, that similar methods may be applied, mutatis mutandis, when the output field has a wider range of possible values.
The accuracy of prediction of the value of y may be verified by testing on the training database. However, such testing may not be sufficient, since there exists the problem of "overfitting," wherein regularities discovered on the training database turn out to be found by chance, and are therefore not valid on other samples from the overall population. Thus, the purpose of data mining is to discover regularities within the training database which possess the property of likely stability of their validity on the whole population.
Regularities of this sort are sought within the training database in the form of association rules. Methods of deriving such association rules are described, for example, by Agrawal, et al., in "Fast Discovery of Association Rules," in Advances in Kowledge Discovery and Data Mining (AAAI Press/MIT Press, 1996), pages 307-328; and by Zaki, et al., in "New Algorithms for Fast Discovery of Association Rules," in Proc. 3rd Int. Conf. KDD (California), pages 283-286. These publications are incorporated herein by reference.
A formal definition of an association rule is as follows: Let I.sub.i ={a.sub.i1, . . . , a.sub.im.sub..sub.i } be the set of codes of values of a variable x.sub.i. A single condition (referred to herein as a 1-condition) is a condition of the type: x.sub.i =a.sub.ij, j .epsilon.{1, . . . , m.sub.i }. A composite condition is a conjunction of q single conditions, wherein q=2, . . . , n, referred to herein as a q-condition. Thus, a q-condition is a condition of the type:
(x.sub.i.sub..sub.1 =a.sub.i.sub..sub.1 .sub.j.sub.1) (x.sub.i.sub..sub.2 =a.sub.i.sub..sub.2 .sub.j.sub..sub.2 ) . . . (x.sub.i.sub..sub.q =a.sub.i.sub..sub.q .sub.j.sub..sub.q ) (2)
Association rules are "if-then" rules, wherein the condition is the single or conjunctive expression, as in equation (2). Such rules are referred to in the context of the present patent application and in the claims as "simple association rules." Thus, a simple association rule is a statement of the following type:
if (q-condition) then y=y.sub.1 (3)
The possible values of y.sub.1 are y.sub.1 =1 or y.sub.1 =0. The number s of records at which both the condition of the rule (the q-condition) and the rule's conclusion (y=y.sub.1) are fulfilled is referred to as the support, s, of the rule. The probability p of the rule is the probability that for a random record satisfying the rule's condition, the rule's conclusion is also fulfilled. Hence ##EQU1##
wherein m is the number of records satisfying the rule's condition.
Let p.sub.a be the a priori probability that the Field to Predict y possesses the predicted value 1, that is, ##EQU2##
where r is the number of records in the training database at which y=1. As mentioned above, the training database is assumed to be a real and representative sample of the investigated population. Hence, it can be assumed that the same a priori probability p.sub.a is valid on the overall population.
Which rules are interesting? In other words, about which rules can it be said that they were discovered not by chance, and are likely to be valid on the overall population? It can be assumed that these are rules fulfilled at a sufficiently large number of records and whose probability significantly deviates from p.sub.a. A formal statement for this intuitive notion is as follows: A user specifies a minimum support S.sub.min, and minimum admissible probabilities for a rule with y=1 (denoted by p.sub.1) and with y=0 (denoted by p.sub.0), 1-p.sub.0 <p.sub.a <p.sub.1. The objective of the data mining is then to determine association rules (rules of the type of equation (2)) for which s.gtoreq.S.sub.min and p.gtoreq.p.sub.1 (if y=1), or p.gtoreq.p.sub.0 (if y=0). Methods of data mining known in the art do not necessarily find all such rules exhaustively on the training database, and furthermore tend to require very substantial computing resources.
The rule defined by equation (3) can be expressed as a statement of conditional probability:
P(y=y.sub.1.vertline.(the q-condition))=p (4)
However, unlike equation (3), the number of records at which this statement is fulfilled (support s) is absent in equation (4). Therefore, association rules in a sense contain more information than conditional probabilities, which are applied in Bayes methods. Such methods are described, for example, by Friedman, in "On Bias, Variance, 0/1--Loss, and the Curse-of-Dimensionality," in Data Mining and Knowledge Discovery, 1(1), pages 54-77; and by Heckerman, in "Bayesian Networks for Data Mining," in Data Mining and Knowledge Discovery, 1(1), pages 79-119. These publications are incorporated herein by reference.
As mentioned above, the object of deriving association rules from the training database is to predict accurately the Field to Predict value for data outside the training database. To predict the value of output field y for a given record, all relevant association rules are selected, and using the respective probabilities of these rules, the conclusive probability that the Field to Predict possesses the predicted value is calculated. Such prediction gives especially accurate results when a great number of "strong" association rules (rules with a high probability and support) are found, and many of them are independent or nearly independent. Methods of finding strong rules are described, for example, by Piatetsky-Shapiro, in "Discovery, Analysis, and Presentation of Strong Rules," in Knowledge Discovery in Databases (Menlo Park, Calif.: AAAI Press), pages 229-248, which is incorporated herein by reference. These methods apply to finding simple association rules, as defined by expression (2), above.
The number of simple association rules determined is usually large and may even exceed (sometimes considerably) the number of records in the training database. Therefore, the average number of relevant rules for each record may also be large. Among the relevant rules there are often a large number of "weak" rules caused by "noise" --random data fluctuations--which complicate prediction. On the other hand, if only "strong" rules are allowed, there will be many records to which no relevant rules apply.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide improved systems and methods of pattern recognition. In some aspects of the present invention, these systems and methods are used for predicting an attribute of one or more items in a population.
It is an object of some aspects of the present invention to enable the attributes to be predicted with greater accuracy and reliability.
It is a further object of some aspects of the present invention to enable the attributes to be predicted more rapidly and/or at a lower expenditure of computing resources.
In preferred embodiments of the present invention, a training database is selected out of an overall population of items, and simple association rules regarding the training database are determined. Preferably, the simple association rules that are determined have a support greater than or equal to a predetermined minimum support and a probability greater than or equal to a predetermined minimum probability. The simple association rules are used to determine generalized association rules, as defined hereinbelow. The generalized association rules and, optionally, the simple association rules are then applied to one or more members of the population outside the training database so as to predict an unknown value of an attribute or attributes of interest of the one or more members.
The term "generalized association rules," as used in the present patent application and in the claims, refers to rules that use logical operations of conjunction, disjunction, and negation in defining their conditions. As described above, methods of data mining and prediction known in the art use only simple association rules, based only on the logical conjunction operation. Methods in accordance with the present invention, using generalized association rules, provide "stronger" and more stable rules, which afford more accurate and reliable prediction, at lower expenditure of computing resources, than methods known in the art. Furthermore, the present invention substantially overcomes the problem of overfitting, which exists in methods known in the art.
In some preferred embodiments of the present invention, before determining, the generalized association rules, substantially all of the simple association rules meeting the minimum support and minimum probability criteria are found, using generalized contingency tables and sets of potentially representative q-conditions for each possible value of q. Methods of data mining known in the art do not use contingency tables and are not capable of conclusively finding all simple association rules. Because methods in accordance with the present invention find an exhaustive set of rules, they allow the attributes of the items in the database to be predicted with the highest possible level of confidence.
In some preferred embodiments of the present invention, the unknown value of the attribute of interest is predicted based on a cumulative probability, which is calculated based on probabilities corresponding to the generalized and/or simple association rules. A probabilistic decision point is defined, such that the unknown value is predicted based on whether the cumulative probability is above or below the decision point. Preferably, the decision point is defined so as to minimize a total number of prediction errors or, alternatively, when error costs are given, to minimize a total cost of errors. In a preferred embodiment of the invention, an ambiguous range of probabilities is defined, including the decision point, wherein the extent of the range is dependent on a measure of similarity or dissimilarity between the training sample and the overall population for prediction.
Preferred embodiments of the present invention are described herein with reference to methods and systems for data mining and prediction of unknown fields in a database. It will be appreciated, however, that the principles of the present invention may similarly be applied in other areas of pattern recognition. For example, generalized association rules may be derived for use in image or voice recognition, and may thus be used to identify with improved accuracy, reliability, and computation speed the identity of an item in the image or the word associated with a spoken sound pattern.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest; and
processing the stored attribute values to determine association rules regarding the training sample, including at least one generalized association rule, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule including a logical combination of a plurality of such conditions using at least one logical operation from the group consisting of disjunction and negation.
Preferably, the method includes receiving data from an input device, the data including values of at least some of the attributes of the given item; and applying the association rules including the at least one generalized association rule to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
Preferably, processing the attribute values includes finding the at least one generalized rule such that a support of the rule is maximized on the training sample. Further preferably, finding the at least one generalized rule includes finding a generalized association rule predicting that the attribute of interest has a given value, such that the support of the generalized rule in the training sample includes at least ten times as many items having the given value of the attribute of interest as having another value, not equal to the given value, and most preferably, finding a rule whose support in the training sample comprises substantially only items having the given value of the attribute of interest.
Preferably, processing the attribute values includes finding a plurality of generalized association rules such that an overlap of the respective supports of two or more of the plurality of rules on the training sample is minimized.
Further preferably, processing the attribute values includes finding a plurality of generalized association rules such that substantially all of the items in the training sample are included in the support of one or more of the generalized rules. Most preferably, finding the plurality of generalized rules includes finding first and second groups of generalized rules that are respectively predictive of first and second values of the attribute of interest, such that there is an approximately equal number of rules in each group.
In a preferred embodiment, the at least one generalized association rule includes a rule predicting that the attribute of interest has a first value if a condition substantially of a form {character pullout}(C.sub.j1.sup.(i) vC.sub.j2.sup.(i) v . . . vC.sub.jk.sup.(i) is fulfilled, wherein each C.sub.jm.sup.(i) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value.
In another preferred embodiment, the at least one generalized association rule includes a rule predicting that the attribute of interest has a first value if a condition substantially of a form (C.sub.j1.sup.(i.sup..sub.1 .sup.) vC.sub.j2.sup.(i.sup..sub.1 .sup.) v . . . vC.sub.jk.sup.(i.sup..sub.1 .sup.)) {character pullout}(C.sub.jk+1.sup.(i.sup..sub.2 .sup.) vC.sub.jk+2.sup.(i.sup..sub.2 .sup.) v . . . vC.sub.jK.sup.(i.sup..sub.2 .sup.)) is fulfilled, wherein each C.sub.jm.sup.(i.sup..sub.1 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have the first value, and each C.sub.jm.sup.(i.sup..sub.2 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value. Preferably, processing the attribute values includes finding the set of rules {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) } and then searching for rules in the set {C.sub.jk+1.sup.(i.sup..sub.2 .sup.) . . . C.sub.jK.sup.(i.sup..sub.2 .sup.) } on the support of {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) }.
Preferably, processing the attribute values includes finding simple association rules, wherein each simple association rule includes one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule. Further preferably, applying the association rules includes applying both the simple and the at least one generalized association rules jointly to predict the unknown value, wherein applying the rules jointly preferably includes computing a weighted sum of values of the attribute of interest predicted by the rules. Preferably, computing the weighted sum includes computing probabilities respectively associated with the simple and generalized rules, and weighting the predicted values by the respective probabilities.
Preferably, finding the association rules includes finding the at least one generalized association rule by combining a plurality of the simple association rules, most preferably by finding a generalized rule which includes a disjunction of two or more of the simple rules or, additionally or alternatively, by finding a generalized rule which includes a negation of one or more of the simple rules.
Preferably, determining the simple association rules includes determining substantially all simple association rules pertaining to the sample having respective probability and support greater than predetermined minimum values thereof
There is further provided, in accordance with a preferred embodiment of the present invention, a method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest;
processing the attribute values to determine simple association rules regarding the training sample, each simple association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule,
wherein substantially all simple association rules applicable to the sample having respective probability and support greater than predetermined minimum values thereof are determined;
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
Preferably, processing the attribute values includes constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and a given, respective value of another one of the attributes, and wherein the association rules are determined with respect to the contingency table.
There is also provided, in accordance with a preferred embodiment of the present invention, a method for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
storing in a memory known attribute values regarding a training sample of items within the population including the attribute of interest;
processing the attribute values to determine association rules regarding the training sample, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest,
wherein the attribute values are processed by constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and satisfying a given, respective condition on one or more of the attributes other than the attribute of interest, and wherein the association rules are determined with respect to the contingency table;
receiving data from an input device, the data including values of at least some of the attributes of the given item; and
applying the association rules to the values included in the data so as to predict the unknown value of the attribute of interest of the given item.
Preferably, constructing the contingency table includes constructing a table of 1-conditions, characterized in that the condition on the one or more of the attributes includes a condition on a single one of the attributes. Further preferably, constructing the contingency table includes constructing a table of 2-conditions, characterized in that the condition on the one or more of the attributes includes a condition on two of the attributes, using the table of 1-conditions. Most preferably, constructing the contingency table includes constructing a plurality of respective tables of q-conditions, for a sequence of one or more integers q.gtoreq.3, characterized in that for each q, the condition on the one or more of the attributes includes a condition on a group of q of the attributes, wherein for each q, the corresponding table is constructed using the table of q-1-conditions previously constructed. Preferably, each of the tables of q-conditions is stored in the memory as it is constructed, and wherein for each q, the corresponding table of q-1-conditions is deleted from the memory after the table of q-conditions is constructed.
Preferably, processing the attribute values includes encoding values of the attributes according to the frequency of their occurrence in the training sample, most preferably by calculating hash functions. Preferably, encoding the values includes assigning a distinguishable code to values occurring at less than a predetermined frequency in the training sample, whereby such values are substantially excluded from the determination the of association rules.
Preferably, applying the association rules includes applying a subset of the rules consisting of rules whose one or more conditions are fulfilled by known values of attributes of the given item other than the item of interest.
Further preferably, processing the attribute values includes finding probabilities corresponding to the determined association rules, and applying the association rules includes applying the probabilities to compute a cumulative probability that the attribute of interest has a given value, wherein computing the cumulative probability preferably includes computing a weighted sum of the probabilities corresponding respectively to the association rules applied in predicting the value.
In a preferred embodiment, a probability decision point is determined such that when the cumulative probability is greater than the decision point, the attribute of interest is predicted to have a first value, and when the probability of interest is less than the decision point, the attribute of interest is predicted to have a different, second value. Preferably, determining the decision point includes defining an ambiguity range of probabilities including the decision point in which the predicted value is ambiguous, most preferably by comparing the training sample and at least a portion of the overall population from which the given item is taken, and determining an extent of the ambiguity range responsive to a measure of the similarity of the training sample and the at least portion of the overall population.
Preferably, determining the decision point includes determining a point such that a total number of prediction errors is minimized.
Alternatively, an error cost is assigned to each of a plurality of types of prediction errors, and determining the decision point includes determining a point such that a total cost of prediction errors is minimized.
In a preferred embodiment, the items include records in a database, and the attributes include fields in the records, and applying the association rules includes predicting the unknown value of a database field. Preferably, predicting the unknown value includes predicting a Boolean value.
In another preferred embodiment, the items include sounds, and the attribute values include characteristics of sound signals corresponding to the sounds, and applying the association rules includes identifying a sound signal. Preferably, identifying the sound signal includes finding a word corresponding to the signal. Alternatively or additionally, identifying the sound signal includes identifying a speaker who generated the sound signal. Preferably, receiving the data includes receiving data from a microphone.
In still another preferred embodiment, the items include images, and the attribute values include image features, and applying the association rules includes processing an image. Preferably, processing the image includes identifying a subject of the image. Further preferably, receiving the data includes receiving data from a camera or, alternatively or additionally, from a scanner.
Preferably, the method includes outputting an indication of the predicted value to an output device.
In a preferred embodiment, outputting the indication includes displaying the predicted value and a probability thereof.
In another preferred embodiment, outputting the indication includes controlling an access responsive to the predicted value.
In still another preferred embodiment, outputting the indication includes sorting the given item responsive to the predicted value.
There is further provided, in accordance with a preferred embodiment of the present invention, a system for predicting an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
an input device, which receives data indicative of values of at least some of the attributes of the given item;
a memory, which stores association rules regarding the population, the association rules including at least one generalized association rule, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule including a logical combination of such conditions using at least one logical operation from the group consisting of disjunction and negation in defining the conditions of the rule; and
a processor, which receives the data from the input device and reads the association rules from the memory, and which applies the association rules including the at least one generalized association rule to the values included in the data so as to predict the unknown value of the attribute of interest and which generates an output responsive to the prediction.
Preferably, the processor applies a subset of the rules consisting of rules whose one or more conditions are fulfilled by known values of attributes of the given item other than the item of interest.
Preferably, the processor finds probabilities corresponding to the determined association rules and applies the probabilities to compute a cumulative probability that the attribute of interest has a given value, wherein the processor preferably most preferably computes a weighted sum of the probabilities corresponding respectively to the association rules applied in predicting the value.
In a preferred embodiment, the processor determines a probability decision point such that when the cumulative probability is greater than the decision point, the attribute of interest is predicted to have a first value, and when the probability of interest is less than the decision point, the attribute of interest is predicted to have a different, second value. Preferably, the processor defines an ambiguity range of probabilities including the decision point in which the predicted value is ambiguous. Most preferably, the processor compares the training sample and at least a portion of the overall population from which the given item is taken, and determines an extent of the ambiguity range responsive to a measure of the similarity of the training sample and the at least portion of the overall population.
Preferably, the processor determines the decision point such that a total number of prediction errors is minimized.
Alternatively or additionally, an error cost is assigned to each of a plurality of types of prediction errors, and the processor determines the decision point such that a total cost of prediction errors is minimized.
In a preferred embodiment, the items include records in a database, and the attributes include fields in the records, and the processor applies the association rules to predict the unknown value of a database field. Preferably, the unknown value includes predicting a Boolean value.
In another preferred embodiment, the items include sounds, and the attribute values include characteristics of sound signals corresponding to the sounds, and the processor applies the association rules to identify a sound signal. Preferably, the processor finds a word corresponding to the signal. Additionally or alternatively, the processor identifies a speaker who generated the sound signal. Preferably, the input device includes a microphone.
In still another preferred embodiment, the items include images, and the attribute values include image features, and the processor applies the association rules to process an image. Preferably, the processor identifies a subject of the image. Preferably, the input device includes a camera or, additionally or alternatively, a scanner.
Preferably, the system includes an output device, which receives the output from the processor and performs an action responsive thereto.
In a preferred embodiment, the output device includes a display, which displays the predicted value, preferably along with a probability associated with the predicted value.
In another preferred embodiment, the output device includes an access controller, which controls an access responsive to the predicted value.
In still another preferred embodiment, the output device includes a sorter, which sorts the given item responsive to the predicted value.
Preferably, the system includes a computer, which receives a training sample of items within the population having known respective attribute values including the attribute of interest, and which determines the association rules and finds the at least one generalized association rule by processing the known attribute values, wherein the computer preferably includes the processor.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes, and
a computer, which reads the values from the memory and determines association rules regarding the population, the association rules including at least one generalized association rule, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and the at least one generalized rule including a logical combination of such conditions using at least one logical operation from the group consisting of disjunction and negation in defining the conditions of the rule, and which stores the association rules in the memory.
Preferably, the computer finds the at least one generalized rule such that a support of the rule is maximized on the training sample, wherein the at least one generalized association rule preferably predicts that the attribute of interest has a given value, and the support of the generalized rule in the training sample includes at least ten times as many items having the given value of the attribute of interest as having another value, not equal to the given value. Most preferably, the support of the generalized association rule includes substantially only items having the given value of the attribute of interest.
Preferably, the computer finds a plurality of generalized association rules such that an overlap of the respective supports of two or more of the plurality of rules on the training sample is minimized.
Further preferably, the computer finds a plurality of generalized association rules such that substantially all of the items in the training sample are included in the support of one or more of the generalized rules.
In a preferred embodiment, the plurality of generalized rules includes first and second groups of generalized rules that are respectively predictive of first and second values of the attribute of interest, such that there is an approximately equal number of rules in each group.
In another preferred embodiment, the at least one generalized association rule includes a rule prediction that the attribute of interest has a first value if a condition substantially of a form {character pullout}(C.sub.j1.sup.(i) vC.sub.j2.sup.(i) v . . . vC.sub.jk.sup.(i) is fulfilled, wherein each C.sub.jm.sup.(i) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value.
Alternatively or additionally, the at least one generalized association rule includes a rule predicting that the attribute of interest has a first value if a condition substantially of a form (C.sub.j1.sup.(i.sup..sub.1 .sup.) vC.sub.j2.sup.(i.sup..sub.1 .sup.) v . . . vC.sub.jk.sup.(i.sup..sub.1 .sup.) {character pullout}(C.sub.jk+1.sup.(i.sup..sub.2 .sup.) vC.sub.jk+2.sup.(i.sup..sub.2 .sup.) v . . . vC.sub.jK.sup.(i.sup..sub.2 .sup.)) is fufilled, wherein each C.sub.jm.sup.(i.sup..sub.1 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have the first value, and each C.sub.jm.sup.(i.sup..sub.2 .sup.) represents a condition on a known attribute value of the item other than the attribute of interest which is predictive that the attribute of interest will have another value, different from the first value. Preferably, the computer finds the set of rules {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.(i.sup..sub.1 .sup.) } and then searches for rules in the set {C.sub.jk+1.sup.(i.sup..sub.2 .sup.) . . . C.sub.jK.sup.(i.sup..sub.2 .sup.) } on the support of {C.sub.j1.sup.(i.sup..sub.1 .sup.) . . . C.sub.jk.sup.i.sup..sub.1 .sub.) }.
Preferably, the computer finds simple association rules, wherein each simple association rule includes one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule.
Preferably, the simple and the at least one generalized association rules are applied jointly to predict the unknown value.
In a preferred embodiment, a weighted sum of values of the attribute of interest predicted by the rules is used by the computer to predict the unknown value. Preferably, the computer computes probabilities respectively associated with the simple and generalized rules, which are used to weight the predicted values in computing, the weighted sum.
Preferably, the computer finds the at least one generalized association rule by combining a plurality of the simple association rules, wherein the at least one generalized rule preferably includes a disjunction of two or more of the simple rules or, alternatively or additionally, wherein the at least one generalized rule includes a negation of one or more of the simple rules.
Preferably, the computer determines substantially all simple association rules pertaining to the sample having respective probability and support greater than predetermined minimum values thereof
There is moreover provided, in accordance with a preferred embodiment of the present invention, a system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes; and
a computer, which reads the values from the memory and determines simple association rules regarding the population, each simple association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, such that if the simple association rule includes more than one such condition, the conditions are combined using the logical conjunction operation in defining the conditions of the rule, and which stores the association rules in the memory,
wherein substantially all simple association rules applicable to the sample having respective probability and support greater than predetermined minimum values thereof are determined.
Preferably, the computer determines the association rules by constructing a contingency table, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and a given, respective value of another one of the attributes.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a system for determining association rules for prediction of an unknown value of an attribute of interest of a given item from a population of items, each item in the population having a plurality of variable attributes including the attribute of interest, including:
an input device, which receives data indicative of values of attributes of a training sample of items within the population including the attribute of interest;
a memory, which stores the values of the attributes; and
a computer, which reads the values from the memory and determines association rules regarding the population, each association rule including one or more conditions on one or more respective attribute values of the items predictive of the value of the attribute of interest, and which stores the association rules in the memory,
wherein the computer determines the association rules by constructing one or more contingency tables, each of whose entries corresponds to the number of items in the sample having a given value of the attribute of interest and satisfying a given, respective condition on one or more of the attributes other than the attribute of interest.
Preferably, the one or more contingency tables include a table of 1-conditions, characterized in that the condition on the one or more of the attributes includes a condition on a single one of the attributes. Further preferably, the one or more contingency tables include a table of 2-conditions, characterized in that the condition on the one or more of the attributes includes a condition on two of the attributes, which is constructed by the computer using the table of 1-conditions. Most preferably, the one or more contingency tables include a plurality of respective tables of q-conditions, for a sequence of one or more integers q.gtoreq.3, characterized in that for each q, the condition on the one or more of the attributes includes a condition on a group of q of the attributes, wherein for each q, the computer constructs the corresponding table using the table of q-1-conditions previously constructed. Preferably, the computer stores each of the tables of q-conditions in the memory as it is constructed, and for each q, the computer deletes the corresponding table of q-1-conditions from the memory after the table of q-conditions is constructed.
Preferably, the computer encodes values of the attributes according to the frequency of their occurrence in the training sample, preferably by calculating hash functions. Further preferably, the computer assigns a distinguishable code to values occurring at less than a predetermined frequency in the training sample, whereby such values are substantially excluded from the determination the of association rules.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which:
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flow chart illustrating a generalized method for data prediction based on pattern recognition, in accordance with a preferred embodiment of the present invention;
FIG. 2 is a block diagram that schematically illustrates a system for implementing the method of FIG. 1, in accordance with a preferred embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for encoding values of variables, for use particularly in the method of FIG. 1, in accordance with a preferred embodiment of the present invention;
FIG. 4 is a flow chart illustrating a method for determining simple association rules, in accordance with a preferred embodiment of the present invention;
FIGS. 5A, 5B an 5C are tables illustrating exemplary data arrays used in the method of FIG. 4, in accordance with a preferred embodiment of the present invention;
FIG. 6 is a table illustrating types of association rules, useful particularly in finding generalized association rules, in accordance with a preferred embodiment of the present invention;
FIG. 7 is a table illustrating types of generalized association rules, which are found using the method of FIG. 1, in accordance with a preferred embodiment of the present invention;
FIG. 8A is a table illustrating a sample database, regarding which association rules are found;
FIG. 8B is a table summarizing generalized association rules found with reference to the database of FIG. 8A, in accordance with a preferred embodiment of the present invention,
FIG. 9 is a table summarizing characteristics of simple association rules found with reference to the database of FIG. 8A;
FIG. 10 is a table summarizing characteristics of generalized association rules found with reference to the database of FIG. 8A, in accordance with a preferred embodiment of the present invention; and
FIG. 11 is a flow chart illustrating a method of predicting data values using association rules, in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Reference is now made to FIG. 1, which schematically illustrates a generalized method for data prediction, and to FIG. 2, which schematically illustrates a system 20 for carrying out the method, in accordance with preferred embodiments of the present invention.
As shown in FIG. 2, system 20 comprises an input device 22, which receives data relating to a plurality of items in a population and conveys the data to a processor 24, preferably a general purpose computer. Device 22 preferably comprises a keyboard or digital data link, through which database records regarding the items in the population are input to the processor. Alternatively, device 22 may comprise an electronic camera or scanner, for inputting image data; or a microphone, for inputting audio data; or any other suitable type of sensor or other input device known in the art. Although preferred embodiments are described hereinbelow mainly with reference to analysis and prediction of database records, in alternative embodiments of the present invention, system 20 and the method of FIG. 1 may be applied to pattern recognition and prediction on other types of data, as well.
As a first step 30 in the method shown in FIG. 1, processor 24 receives the data from device 22 relative to a training sample within the overall population. Each of the items in the training sample has a known value of a predetermined Field to Predict, y. In step 32, processor 24 then encodes the data regarding the training sample, preferably in a form substantially as given by equation (1), above, and stores the training database thus generated in a memory device 26, such as a magnetic disk. Methods of encoding the data are described in further detail hereinbelow.
Once the training database has been generated and encoded, processor 24 analyzes the database in step 34 to determine all simple association rules having a minimum support S.sub.min, and minimum admissible probabilities, as defined by a user of system 20. The processor then analyzes the set of simple association rules to find generalized association rules regarding the training database in step 36.
The following is a formal definition of generalized association rules. Let J be the set of numbers of all simple association rules. Let J.sub.1 (J.sub.0) be the subset of J including all rules with y=1 (y=0) in the "then" part, wherein J.sub.1, J.sub.0.OR right.J, J.sub.1.orgate.J.sub.0 =J, and J.sub.1 J.sub.0 =.0.. Let us denote the condition of rule j (with a single or conjunctive expression in the "if" part) by C.sub.j.sup.(1) (if j.epsilon.J.sub.1) and by C.sub.j.sup.(0) (if j.epsilon.J.sub.0) Consider each C.sub.j.sup.(1), C.sub.j.sup.(0) as a propositional (logical) variable. C.sub.j.sup.(1), C.sub.j.sup.(0) will denote that the condition of rule j is fulfilled, and {character pullout}C.sub.j.sup.(1) {character pullout}C.sub.j.sup.(0) will denote that the condition of rule j is not fulfilled.
We now define a Boolean algebra function .function..sub.1 =.function..sub.1 (C.sub.j1.sup.(1), . . . ,C.sub.jk.sup.(1) of selected k propositional variables C.sub.j1.sup.(1), . . . ,C.sub.jk.sup.(1), wherein {j.sub.1, . . . ,j.sub.k }.OR right.J.sub.1. We further define a Boolean algebra function .function..sub.2 =.function..sub.2 (C.sub.&jcirc ;.sub.1.sup.(0), . . . , C.sub.&jcirc ;l.sup.(0)), wherein {&jcirc ;.sub.1, . . . ,&jcirc ;.sub.1 } .OR right.J.sub.0. Moreover, we define a Boolean algebra function .function.(.function..sub.1 (C.sub.j1.sup.(1), . . . ,C.sub.jk.sup.(1)), .function..sub.2 (C.sub.&jcirc ;1.sup.(0))) of two propositional variables given by .function..sub.1 and .function..sub.2. A generalized association rule is a statement of the kind:
if (the Boolean algebra function of {C.sub.j.sup.(1) } and/or {C.sub.j.sup.(0) }) then y=y.sub.1 (5)
wherein y.sub.1 =1 or y.sub.1 =0, the probability of the rule on the training database is 1, and the support of the rule is not less than S.sub.min.
Processor 24 selects and applies only those generalized association rules having a clear and intelligible interpretation, which rules are therefore likely to be valid on the overall population to which the training database belongs. An example of such a rule is as follows:
if {character pullout}C.sub.j1.sup.(1) C.sub.j2.sup.(1) . . . {character pullout}C.sub.jk.sup.(1)) then y=0 (6)
Methods of determining all of the simple association rules and of finding and selecting the generalized association rules are described further hereinbelow. The rules thus found are preferably stored by processor 24 in memory 26.
Returning to FIG. 1, in step 38, after all the association rules are found, data regarding one or more items having an unknown value of the Field to Predict are input to processor 24, preferably via device 22, and are encoded in similar fashion to the training data. Although in the system of FIG. 2, a single processor is used both to find the association rules and to predict the unknown value, it will be understood that a separate computer could also be used to find the association rules, as described above, which rules are then stored in memory 26 for application by processor 24. The association rules, including the generalized association rules, are applied by processor 24 to the encoded data in step 40, and are used to predict the value of the item's Field to Predict and, preferably, the probability that the prediction is correct.
Processor 24 outputs the predicted value and, optionally, the probability of correctness, via an output device 28. Device 28 preferably comprises a monitor or other data display, but may alternatively comprise an access control device, such as an electronic lock, which allows access responsive to the predicted value; or a sorter, which sorts items according to the predicted value; or a printer, or any other suitable type of output device known in the art. Methods of applying the generalized association rules to predict database fields are described further hereinbelow.
ENCODING OF VARIABLES
Each individual field of the training database is considered to be a qualitative (categorical) or quantitative variable. The variable type is defined by the types of comparisons that can be applied to its values. If only the relations "equal" and "unequal" can be introduced on the set of a variable's values, then such a variable is said to be a qualitative (categorical) variable. Unlike such qualitative variables, the set of values of a quantitative variable has the structure and properties of the axis of real numbers. All alphanumeric data are considered as attributes. Therefore, an attribute can be recognized by definition of the source file when the field is defined as alphanumeric. However, if the field is defined as numeric, it is impossible to ascertain whether the variable is quantitative or qualitative. In this case, the user must state the type of variable corresponding to this field. In the present patent application, we will not distinguish between a categorical and an orderable qualitative variable, wherein an ordered relation between any pair of variable values can be introduced. All such variables are collectively referred to herein as "attributes."
FIG. 3 is a flow chart that schematically illustrates a method for encoding the values of the variables, corresponding generally to step 32 in FIG. 1, in accordance with a preferred embodiment of the present invention. The method maps the set of a field's values into a virtually minimum possible number of codes necessary for determining all rules of the type given by equation (3), with a support not less than S.sub.min. The procedure for encoding the attributes' values is based on the use of hash functions and byte-statistics. It requires two readings of an initial training database. A dictionary table listing the correspondence of the codes thus generated to the field values is constructed simultaneously. The method makes use of the fact that attribute values having a frequency less than S.sub.min are uninformative, i.e., an association rule containing such a value cannot be established.
During the first reading of the investigated database, one-dimensional statistics are computed for each position (byte) of the values of each attribute. Simultaneously, H hash functions h.sub.1 (z.sup.(k)),h.sub.2 (z.sup.(k)), . . . ,h.sub.H (z.sup.(k)), . . . , as are known in the art, are calculated for each attribute k, wherein z.sup.(k) =(z.sub.1.sup.(k), . . . ,z.sub.l.sub..sub.k .sup.(k)) is a vector whose components respectively correspond to the positions (bytes) of attribute k, which is l.sub.k bytes long. The values of functions h.sub.i (z.sup.(k)),i=1, . . . ,H are used as addresses in which the corresponding quantities of records of the investigated database are computed.
During the second reading of the training database, different successive codes are assigned to attribute values whose frequency could be greater than or equal to S.sub.min. In assigning the codes, using the tables of byte-statistics and the hash tables, the following two statements are applied:
(1) If at least one of the positions (bytes) of the fixed value of any fixed attribute occurs less than S.sub.min times in the investigated database, then this attribute value occurs less than S.sub.min times too.
(2) If the frequency of at least one of the values h.sub.1 (z.sup.(k)), . . . ,h.sub.H (z.sup.(k)) is less than S.sub.min, then the value z.sup.(k) of attribute k occurs less than S.sub.min times in the investigated database.
The different successive codes 1, . . . ,m.sub.k are thus assigned only to those values of attribute k which have a frequency not less than S.sub.min in all the corresponding places in the tables of byte-statistics and in hash tables. Here m.sub.k is the number of different values of attribute k taken into account when searching for association rules. A special code .alpha. is assigned to any value of attribute k which has at least one frequency less than S.sub.min in the above-mentioned tables, wherein .alpha. is preferably an integer sufficiently large such that m.sub.k <.alpha. for all k. (For example, ##EQU3##
Code 0 is preferably used for encoding missing values (if all z.sub.i.sup.(k) are blanks).
Encoding for a quantitative variable is preferably carried out by dividing the set of values of each quantitative variable into nonintersecting intervals. When constructing such intervals for values of the quantitative variable, the corresponding frequency distribution function is analyzed. Intervals having a relatively great density of the variable's values are determined, and successive serial codes are assigned to them. Each detected interval satisfies the condition that the number of values belonging to this interval is not less than S.sub.min.
Based on the method described above, substantially any data field can be treated as an attribute. The methods described hereinbelow for determining the association rules use substantially only the codes of the values of the fields. For the sake of brevity, such codes are referred to themselves hereinafter, as values.
DETERMINING SIMPLE ASSOCIATION RULES
FIG. 4 is a flow chart that schematically illustrates a method for determining all the simple association rules with respect to the training database, corresponding generally to step 34 in FIG. 1, in accordance with a preferred embodiment of the present invention. To determine all association rules with 1-conditions and to construct a set of all representative 2-conditions, contingency tables are used. Then for each value of q=2, . . . ,n-1, a set of all potentially representative (q+1)-conditions is determined recursively from a set of all representative q-conditions. The set of potentially representative conditions for each value of q is used to determine all of the association rules for that value of q that are actually representative of the data, by applying a generalized contingency table.
A contingency table between attributes x.sub.k.sub..sub.1 and x.sub.k.sub..sub.2 is two-dimensional matrix .parallel.N.sub.ij.sup.(k.sup..sub.1 .sup.,k.sup..sub.2 .sup.).parallel., wherein N.sub.ij.sup.(k.sup..sub.1 .sup.,k.sup..sub.2 .sup.) is a number of records such that x.sub.k.sub..sub.1 =i and x.sub.k.sub..sub.2 =j. In a general case, if N.sub.ij.sup.(k.sup..sub.1 .sup.,k.sup..sub.2 .sup.).gtoreq.S.sub.min, i, j.noteq..alpha., then we get the statement:
If x.sub.k.sub..sub.2 j then there is a probability ##EQU4##
In the preferred embodiments described herein, the rules of interest are those with field y in the "then" part. Therefore, contingency tables between attributes y and x.sub.k, k=1, . . . ,n are constructed. For a given k, let us denote an element of the corresponding contingency table by N.sub.ij.sup.(k), wherein i=0,1. If N.sub.1j.sup.(k).gtoreq.S.sub.min and N.sub.1j.sup.(k) /(N.sub.1j.sup.(k) +N.sub.0j.sup.(k)).gtoreq.p1, then we obtain the following rule:
if x.sub.k =j then y=1
Probability p=N.sub.1j.sup.(k) /(N.sub.1j.sup.(k) +N.sub.0j.sup.(k))
support=N.sub.1j.sup.(k) (8)
On the other hand, if N.sub.0j.sup.(k).gtoreq.S.sub.min and N.sub.0j.sup.(k) /(N.sub.1j.sup.(k) +N.sub.0j.sup.(k)).gtoreq.p0, then we obtain the rule:
if x.sub.k =j then y=0.
Probability p=N.sub.0j.sup.(k) /(N.sub.1j.sup.(k) +N.sub.0j.sup.(k))
Support=N.sub.0j.sup.(k) (9)
In this manner all association rules with 1-conditions are discovered.
A q-condition is called representative if a number of records at which this condition is fulfilled is not less than S.sub.min. The q-condition is called potentially representative if any (q-1)-condition which could be derived from the q-condition is representative, for q>1. It will be appreciated that if the q-condition is representative, then it is necessarily potentially representative, too. Hence, the set of all potentially representative q-conditions contains all representative q-conditions.
Let us denote the set of all representative q-conditions by C.sub.q, and the set of all potentially representative q-conditions by &Ccirc ;.sub.q. Set &Ccirc ;.sub.q, as well as C.sub.q, is represented as an array (denoted by the same symbols &Ccirc ;.sub.q, C.sub.q, respectively), wherein line j describes the jth q-condition. Arrays &Ccirc ;.sub.q and C.sub.q have q pairs of fields (k.sub.l,i.sub.l), l=1, . . . ,q, wherein x.sub.k.sub..sub.l =i.sub.l is the lth single condition in the q-condition. By this construction, k.sub.l.sub..sub.1 <k.sub.l.sub..sub.2 for l.sub.1 <l.sub.2.
Once array &Ccirc ;.sub.q has been created for a given q, a two-dimensional matrix .parallel.N.sub.ij.parallel. is constructed, wherein N.sub.ij is a number of records satisfying the jth q-condition of array &Ccirc ;.sub.q and at which y=i (i=0, 1). The construction of this matrix is analogous to constructing the contingency table between attributes y and x.sub.k. Matrices of the type of .parallel.N.sub.ij.parallel. are referred to herein as "generalized contingency tables." If N.sub.1j.gtoreq.S.sub.min and N.sub.1j /(N.sub.1j +N.sub.0j).gtoreq.p.sub.1, then the following rule applies:
if (the jth q-condition) then y=1
Probability p=N.sub.1j /(N.sub.1j +N.sub.0j)
Support=N.sub.1j (10)
If N.sub.0j.gtoreq.S.sub.min and N.sub.0j /(N.sub.1j +N.sub.0j).gtoreq.p.sub.0, then we have the rule:
if (the jth q-condition) then y=0
Probability p=N.sub.0j /(N.sub.1j +N.sub.0j)
support=N.sub.0j (11)
The definition of the jth condition is in the jth line of array &Ccirc ;.sub.q. Thus, using the generalized contingency table enables substantially all association rules to be determined for each given q>1.
To construct array C.sub.2, a contingency table .parallel.N.sub.ij.sup.(k.sup..sub.1 .sup.k.sup..sub.2 .sup.).parallel. between each pair of attributes x.sub.k.sub..sub.1 and x.sub.k.sub..sub.2 is computed, for k.sub.1 =1, . . . , n-1; k.sub.2 =k.sub.1 +1, . . . , n. All pairs (k.sub.1,i.sub.l) and (k.sub.2,i.sub.2) for which N.sub.i.sub..sub.1 .sup.i.sub..sub.2 .sup.(k.sup..sub.1 .sup.k.sup..sub.2 .sup.).gtoreq.S.sub.min are then recorded in array C.sub.2.
For each value of q>1, an array &Ccirc ;.sub.q is created from array C.sub.q-1, using a single scan of array C.sub.q-1. We will denote values of line j of array C.sub.q-1 by (k.sub.l [j],i.sub.l [j]), l=1, . . . ,q-1. Because k.sub.l.sub..sub.1 [j]<k.sub.l.sub..sub.2 [j] for l.sub.1 <l.sub.2, the lines of C.sub.q-1 are sorted by a key made up of the sequence of all fields of C.sub.q -1. As a result, array &Ccirc ;.sub.q is sorted automatically.
Since array C.sub.q-1 is sorted by the first pairs (k.sub.1,i.sub.1) of its fields, a corresponding interval of lines within which each pair of values of k.sub.1 and i.sub.1 is located in array C.sub.q-1 is easily determined. Such sorting is facilitated by introduction of a vector (g.sub.r), r=0, 1, . . ., M, of a dimension ##EQU5##
wherein m.sub.k is the number of encoded values of attribute k. Components of vector (g.sub.r) are computed as follows:
Let ##EQU6##
(If k.sub.1 =1, then m.sub.k.sub..sub.1 =0.) First, g.sub.r is assigned values (k.sub.1,i.sub.1), wherein r=m.sub.k.sub..sub.1 +i.sub.1, k.sub.1 =1, . . . ,n -1, and i.sub.1 and i.sub.1 =1, . . . ,m.sub.k.sub..sub.1 . After that, ##EQU7##
is assigned to g.sub.r, r=2, . . . ,M. Now, if g.sub.r.noteq.g.sub.r-, then g.sub.r is the number of the last line in array C.sub.q-1 where a pair of values (k.sub.1,i.sub.1) is located such that r=m.sub.k.sub..sub.1 +i.sub.1 ; Therefore, if g.sub.r.noteq.g.sub.r-, then numbers of lines of array C.sub.q-1 containing values k.sub.1,i.sub.1 in the first two columns respectively are located only within an interval [g.sub.r-1 +1, g.sub.r ], with r=m.sub.k.sub..sub.1 +i.sub.1. If g.sub.r =g.sub.r-1, this means that there is no a pair of values (k.sub.1,i.sub.1) in the first two columns of array C.sub.q-1. Note that vector (g.sub.r) is computed simultaneously with constructing array C.sub.q-1. The vector enables us not only to speed up retrieval of values from C.sub.q-1, but also to reduce the number of fields in array C.sub.q-1 by eliminating the first pair of fields (k.sub.1, i.sub.1).
For each q, array &Ccirc ;.sub.q is constructed in one scan of array C.sub.q-1 and one scan of the vector (g.sub.r) relating to C.sub.q-1. Upon scanning components of vector (g.sub.r), values r.sub.0 are found such that g.sub.r.sub..sub.0 .noteq.g.sub.r.sub..sub.0 .sub.-1. This means that (q-1)-conditions having a first single condition x.sub.k.sub..sub.1 =i.sub.1, wherein r.sub.0 =m.sub.k.sub..sub.1 +i.sub.1, are located within an interval [g.sub.r.sub..sub.0 .sub.-1 +1, g.sub.r.sub..sub.0 ] of lines of array C.sub.q-1. For j=g.sub.r.sub..sub.0 .sub.-1 +1, . . . , g.sub.r.sub..sub.0 , the lines of C.sub.q-1 are scanned. A given line j.sub.0.epsilon.[g.sub.r.sub..sub.0 .sub.-1 +1, g.sub.r.sub..sub.0 ] in array C.sub.q-1 is represented as (k.sub.2 [j.sub.0 ],i.sub.2 [j.sub.0 ]), . . . ,(k.sub.q-1 [j.sub.0 ],i.sub.q-1 [j.sub.0 ]), taking k.sub.1 [j.sub.0 ]=k.sub.1 and i.sub.1 [j.sub.0 ]=i.sub.1. A value of r.sub.10 is calculated using r.sub.1 =m.sub.k.sub..sub.2 [.sub.j .sub..sub.0 ]+.sub.i.sub..sub.2[j .sub.0]. If g.sub.r.sub..sub.1 =g.sub.r.sub..sub.1 .sub.-1, this means that there is no line in C.sub.q-1 with values (k.sub.2 [j.sub.0],i.sub.2 [j.sub.0 ]) of the first pair of fields (k.sub.1,i.sub.1), and we proceed to the next line j.sub.0 +1 within interval [g.sub.r.sub..sub.0 .sub.-1 +1, g.sub.r.sub..sub.0 ]. Otherwise, if g.sub.r.sub..sub.1 .noteq.g.sub.r.sub..sub.1 .sub.-1, we scan lines of array C.sub.q-1 within interval [g.sub.r.sub..sub.1 .sub.-1 +1, g.sub.r.sub..sub.1 ]. For each line j.sub.1 belonging to this interval and (for q>3) satisfying the condition:
(k.sub.2 [j.sub.1 ]=k.sub.3 [j.sub.0, i.sub.2 [j.sub.1 ]=i.sub.3 [j.sub.0 ]), . . . ,(k.sub.q-2 [j.sub.1 ]=k.sub.q-2 [j.sub.1 ]i.sub.q-2 [j.sub.1 ]=i.sub.q-1 j.sub.0 ]) (12)
we find the corresponding (k.sub.q-1[j.sub.1 ],i.sub.q-1 [j.sub.1 ]). If there is no such j.sub.1 (for q>3), then we proceed to consider the next line j.sub.0 30 1 in array C.sub.q-1 within interval [g.sub.r.sub..sub.0 .sub.-1 +1, g.sub.r.sub..sub.0 ]. For q=3, equalities of the type of equation (12) do not exist, and we instead consider each (k.sub.q-1 [j.sub.1 ],i.sub.q-1 [j.sub.1 ])=(k.sub.2 [j.sub.1 ],i.sub.2 [j.sub.1 ]) for j.sub.1.epsilon. [g.sub.r.sub..sub.1 .sub.-1 +1, g.sub.r.sub..sub.1 ].
If a line j.sub.1 is found that satisfies equation (12), there may exist a potentially representative q-condition corresponding to:
(k.sub.1,i.sub.1), (k.sub.2 [j.sub.0 ]i.sub.2 [j.sub.0 ]), . . . , (k.sub.q-1 [j.sub.0 ],i.sub.q-1 [j.sub.0 ]), (k.sub.q-1 [j.sub.1 ],i.sub.q-1 [j.sub.1 ]) (13)
For the q-condition to exist, it is necessary that all the following q-2 lines within interval [j.sub.0 +1, g.sub.r.sub..sub.0 ] exist in array C.sub.q -1 (for k.sub.1, i.sub.1 such that r.sub.0 =m.sub.k.sub..sub.1 +i.sub.1). Each of these lines is formed from the line
(k.sub.2 [j.sub.0 ], i.sub.2 [j.sub.0 ]), . . . , (k.sub.q-1 [j.sub.0 ], i.sub.q-1 [j.sub.0 ]), (k.sub.q-1 [j.sub.1 ], i.sub.q-1 [j.sub.1 ]) (14)
by deletion of one of pairs (k.sub.2 [j.sub.0 ], i.sub.2 [j.sub.0 ]), . . . ,(k.sub.q-1 [j.sub.0 ],i.sub.q-1 [j.sub.0 ]).
If array C.sub.q-1 contains all of these lines within interval [j.sub.0 +1, g.sub.r.sub..sub.0 ], then the q-condition given by equation (13) is recorded in the current line of array &Ccirc ;.sub.q. Simultaneously, the number of q-condition recorded in array &Ccirc ;.sub.q, which were constructed from lines within the interval [g.sub.r.sub..sub.0 .sub.-1 30 1, g.sub.r.sub..sub.0 ] of array C.sub.q-1, is computed in component g.sub.r.sub..sub.0 of the vector (g.sub.r), relating to &Ccirc ;.sub.g. This construction gives the sorted array &Ccirc ;.sub.q.
FIGS. 5A-5C are tables illustrating calculation of an array &Ccirc ;.sub.3 (FIG. 5C) of 3-conditions, from array C.sub.2 (FIG. 5A) of 2-conditions on a sample training database, using auxiliary vectors (m.sub.k.sub..sub.1 ) and (g.sub.r) (FIG. 5B), in accordance with the method described hereinabove.
There are n=6 input fields in the training database, and the set of codes of values for each of input fields x.sub.k,k=1, . . . ,6, is taken to be {1,2,3}. Assume that we have found the set C.sub.2 of all representative 2-conditions presented in FIG. 5A. The first pair of fields (k.sub.1,i.sub.1) of array C.sub.2 is not saved, and it is given for information only. Instead of this pair of fields, auxiliary vectors (m.sub.k.sub..sub.1 ) and (g.sub.r) are used.
Vector (g.sub.r) is constructed in the process of creating array C.sub.2, as described above. In FIG. 5B, the line above (g.sub.r) illustrates the process of the construction of (g.sub.r), wherein the frequency of each pair of values (k.sub.1,i.sub.1) is first computed. For example, the frequency of pair (2,3) is 2, and it is recorded in (g.sub.r) as the component with index r=m.sub.2 +3=6.
Let us construct array &Ccirc ;.sub.3 from C.sub.2. Vector (g.sub.r) shows that the first pair of (k.sub.1,i.sub.1) is (1,1), and it belongs to the interval [0+1,3]=[1,3] of the lines of C.sub.2. Now, for (k.sub.2 [1],i.sub.2 [1])=(2, 1), r=m.sub.k.sub..sub.2 .sub.[1] +i.sub.2 [1]=m.sub.k.sub..sub.2 +1=4. This means that the pair (2, 1) of fields (k.sub.1,i.sub.1) is located within interval of lines [g.sub.3 +1, g.sub.4 ]=[6, 6]. In line 6 of C.sub.2, (k.sub.2 [6], i.sub.2 [6])=(3, 1). Pair (3,1), however, does not exist among pairs (k.sub.2 [j], i.sub.2 [j]) for j=2, 3.
In the next line in the interval [1,3], line 2, (k.sub.2 [2], i.sub.2 [2])=(3, 2). For these values, r=m.sub.3 +2=8, so that pair (3,2) of fields (k.sub.1, i.sub.1) is located within the interval of lines [g.sub.7 +1, g.sub.8 ]=[10, 10]. In line 10 of C.sub.2 (k.sub.2 [10], i.sub.2 [10])=(4, 1). Similarly, (k.sub.2 [3], i.sub.2 [3])=(4, 1), so that the 3-condition {(1,1),(3,2),(4,1)} is potentially representative, and it is recorded in array &Ccirc ;.sub.3. Continuing in such a way to scan array C.sub.2, we find no other potentially representative 3-conditions.
After array &Ccirc ;.sub.q for a given q has been created, a generalized contingency table between output attribute y and the set of lines of array &Ccirc ;.sub.q is constructed in order to determine all the simple association rules based on q-conditions. Simultaneously, q-conditions having a frequency less than S.sub.min are detected and deleted from array &Ccirc ;.sub.q as it is transformed into C.sub.q. Then array &Ccirc ;.sub.q+1 is constructed from C.sub.q, and so on.
After creation of array &Ccirc ;.sub.q+1 and finding all simple association rules for a given q, array C.sub.q is not further required, and therefore, it is preferably deleted to reduce the demand on memory 26. Typically, the number of lines in array C.sub.q rapidly decreases relatively to the number in array C.sub.q-1 beginning with q=4 or q=5. The condition for terminating the procedure is an absence of lines in array &Ccirc ;.sub.g when constructing the array from C.sub.q-1, or q=n.
FINDING GENERALIZED ASSOCIATION RULES
As described above, a generalized association rule (also referred to hereinafter as a "generalized rule") is a determinate statement of the value of the Field to Predict given generally by equation (5). After all the simple association rules (also referred to hereinafter as "simple rules" or simply as "rules") are determined, they are used to find such generalized rules, as indicated by step 36 in FIG. 1. There are a number of types of generalized rules, which are found in accordance with preferred embodiments of the present invention as described hereinbelow.
Reference is now made to FIG. 6, which is a table illustrating a matrix C, containing a classification of simple rules according to a set of numbers of records I in the database and a set of numbers of simple rules applying thereto J, which are preferably determined in accordance with the methods described hereinabove. J.sub.1 denotes the subset of J containing all rules with y=1 in the "then" part. The subset of numbers of rules with y=0 is denoted by J.sub.0. J.sub.1.orgate.J.sub.0 =J, J.sub.1.andgate.J.sub.0 =.O slashed.. The condition of rule j (the expression in the "if" part of the rule) is considered as a propositional variable and denoted by C.sub.j.sup.(1) (if j.epsilon.J.sub.1) and C.sub.j.sup.(0) (if j.epsilon.J.sub.0). We further denote the subset of numbers of records with y=1 (y=0) by I.sub.1 (I.sub.0) respectively. I.sub.1, I.sub.0.OR right.I, I.sub.1.orgate.I.sub.0 =I, I.sub.1.andgate.I.sub.0 =.O slashed..
Matrix C is given by C=.parallel.c.sub.ij.parallel., i.epsilon.I, j.epsilon.J, wherein c.sub.ij =1 if the condition of rule j is fulfilled at record i; and otherwise, C.sub.ij =0. Matrix C serves as input data for finding generalized rules. Without loss of generality and for the sake of simplicity, we represent matrix C=.parallel.c.sub.ij.parallel. with its rows ordered by Field to Predict value, and any rule number j.epsilon.J.sub.1 is less than any j.epsilon.J.sub.0.
As shown in FIG. 6, matrix C may be divided into four quadrants, marked I through IV. Quadrant I, for example, contains all c.sub.ij for i.epsilon.I.sub.1, j.epsilon.J.sub.1. From the definition of association rules, it follows that the relative density of 1's in quadrants I and IV is greater than in quadrants II and III, respectively. For any given column j of matrix C, corresponding to a rule j.epsilon.J.sub.1 with support s and probability p, the relative frequency of 1's in quadrant I is ##EQU8##
whereas in quadrant II it is ##EQU9##
pI>pII, since by the definition of set J.sub.1, p.gtoreq.p.sub.1 >p.sub.a. (Here, as previously, p.sub.a is the a priori probability that y=1, and p.sub.1 is the minimum admissible probability for a rule with y=1.) Note that if p>>p.sub.a, then p.sub.I >>p.sub.II. Analogously, based on the definition of set J.sub.0, it will be appreciated that for a rule j.epsilon.J.sub.0, the relative frequency of 1's in quadrant IV is greater than in quadrant III.
This property of matrix C is used in finding the generalized rules. Although in the preferred embodiments described herein, C is limited to four quadrants by our definition of y as Boolean, it will be appreciated that C may be generalized to encompass variables of other types, as well.
FIG. 7 is a table that illustrates a classification of generalized association rules into four different types, in accordance with a preferred embodiment of the present invention. Each type is derived from the simple association rules, preferably using methods described hereinbelow.
Type 1. The number of association rules determined is usually great, and on average, there are many relevant rules for each record of the training database. In terms of matrix C=.parallel.c.sub.ij.parallel., this means that there are, on average, many 1's in each line i. Let us assume that for each line i.epsilon.I.sub.1, there is at least one c.sub.1ij =1, j.epsilon.J.sub.1. Thus, all of the rule conditions j.epsilon.J.sub.1, taken in aggregate, "cover" all the records that have y=1. But at the same time, this aggregate of rules may cover almost all or even all records with y=0, since only the rules j.epsilon.J.sub.1 having a probability of 1 "cover" only relevant records i.epsilon.I.sub.1 and do not cover any records with i.epsilon.I.sub.0. It would be desirable to select a small number of rules j.epsilon.J.sub.1 such that together they cover all records i.epsilon.I.sub.1 and a minimal number of records i.epsilon.I.sub.0. Then for each "uncovered" record, it can be stated definitively that this record belongs to set I.sub.0, i.e., there is a probability of 1 that y=0.
Formally, assume that we have found the subset of rules {j.sub.1, . . . ,j.sub.k }.OR right.J.sub.1 such that {i.epsilon.I.sub.1.vertline..E-backward.j.epsilon.{j.sub.1, . . . ,j.sub.k } c.sub.ij 1}=I.sub.1. This means that we have found a generalized rule of type 1, as shown in FIG. 7. The support of this generalized rule is equal to the number of records at which none of rules j.sub.1, . . . , j.sub.k is fulfilled. As described above, to establish the rule it is necessary that the rule's support be not less than S.sub.min. Note that the condition of the generalized rule thus found can be written in the following equivalent form:
{character pullout}C.sub.j1.sup.(1) {character pullout}C.sub.j2.sup.(1) . . . {character pullout}C.sub.jk.sup.(1) (15)
It will thus be understood that the association rules predicting value y=1 of the Field to Predict with a probability greater than the a priori one, taken in aggregate, identify records having y=0 with a maximal level of confidence.
In a preferred embodiment of the present invention, to find a generalized rule of type 1, we solve a covering problem formulated as follows: A vector of Boolean variables z=(z.sub.j), j.epsilon.J.sub.1 is introduced, wherein z.sub.j =1 if the condition of rule j is entered in the subset of conditions to be found; and otherwise z.sub.j =0. In order to find the generalized rule, the minimum of a functional F(z) is found: ##EQU10##
subject to ##EQU11## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.1 (18)
Here sign (x)=1 if x>0; and sign (x)=0 if x=0.
Assuming z.sup.(o) =(z.sub.j.sup.(o) to be the optimal solution of equations (16)-(18), then the subset {j.sub.1, . . . ,j.sub.k } of numbers of rules such that z.sub.jl.sup.(o) =1, l=1, . . . ,k defines the aggregate of conditions {C.sub.j1.sup.(1), . . . , C.sub.jk.sup.(1) } to be used in the generalized rule of type 1.
The optimization criterion minimizing the functional of equation (16) is equivalent to maximizing the support of the generalized rule of type 1 to be found. Therefore, to solve equations (16)-(18), rules having a large support and a large probability that y=1 should be chosen. The rules should be incompatible with or independent of one other on subset I.sub.1 of the database records and simultaneously (if possible) should be dependent on subset I.sub.0.
Preferably, all generalized, mutually independent rules of type 1 are found. The independence of generalized rules is understood as follows: Assume that L generalized rules of type 1 have been found. Let J.sub.1.sup.(l) be the subset of numbers of rules whose conditions are entered in the lth generalized rule, J.sub.1.sup.(l).OR right.J.sub.1, l=1, . . . , L. These generalized rules are considered independent if ##EQU12##
i.e., there is no simple rule entered in more than one generalized rule of type 1.
Independent generalized rules are preferably found by solving another covering problem relative to the lth such generalized rule (l>1), wherein the minimum of the following functional F(z) is found: ##EQU13##
subject to ##EQU14## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.1.backslash.&Jcirc ;.sub.1.sup.(l-1) (21)
wherein ##EQU15##
It will be appreciated that equations (16)-(18) represent a special case of equations (19)-(21), for &Jcirc ;.sub.1.sup.(0) =.O slashed. and l=1.
Before solving the lth model of the type given by equations (9)-(21), the condition for existence of a solution is verified. This condition is formulated as follows: that the number of records belonging to a set {i.epsilon.I.sub.1.vertline..E-backward.j.epsilon.J.sub.1.backslash.&Jcirc ;.sub.1.sup.(l-1) c.sub.ij =1} is equal to the total number of records with y=1 in the training database. In other words, the condition is {i.epsilon.I.sub.1.vertline..E-backward.j.epsilon.J.sub.1.backslash.&Jcirc ;.sub.1.sup.(l-1) c.sub.ij =1}=I.sub.1. If this condition is not fulfilled, the search for generalized rules of type I is terminated.
Each generalized rule of type 1 is associated with to a certain subset of records for which y=0. Let us denote the subset of records associated with (i.e., satisfying the condition of) the lth generalized rule of type 1 by I.sub.0.sup.(l). A number s.sub.l of records i .epsilon.I.sub.0.sup.(l) constitute the support of the lth generalized rule. To establish the rule, it is necessary that s.sub.l >S.sub.min. Since the optimization criterion of equation (19) is equivalent to maximization of the support of the generalized rule, such generalized rules typically have s.sub.l >>S.sub.min.
The lth generalized rule of type 1, having the set of conditions {C.sub.j1.sup.(1), . . . , C.sub.jk.sup.(1) }, wherein J.sub.1.sup.(l) ={j.sub.1, . . . ,j.sub.k } is the optimal solution of lth set of equations (19)-(21), thus states that on the set I.sub.0.sup.(l) of records satisfying none of conditions C.sub.j1.sup.(1), . . . , C.sub.jk.sup.(1) the Field to Predict y=0. Any rule or rules j.epsilon.J.sub.1.backslash.&Jcirc ;.sub.1.sup.(l) whose conditions are fulfilled only on the set of records I.backslash.I.sub.0.sup.(l) can be added to the set {C.sub.j1.sup.(1), . . . , C.sub.jk.sup.(1) }. The numbers of such rules are then entered in set J.sub.1.sup.(l). The addition of such conditions does not change set I.sub.0.sup.(l) of records corresponding to the lth generalized rule, and therefore, it has no effect on the prediction results of the generalized rule on the training database. It reinforces, however, the aggregate condition of the lth generalized rule, and consequently, increases confidence in the accuracy of prediction when the generalized rules are applied to predict y for records (items) in the overall population outside the training database. It may be that the number of records in the overall population corresponding to this generalized rule will be slightly decreased.
The choice of the above-mentioned additional conditions from set J.sub.1.backslash.&Jcirc ;.sub.1.sup.(l) guarantees the independence of the generalized rules thus found (in the sense of the definition given above). An excessive number of conditions in a generalized rule may, however, decrease the possibilities for finding subsequent independent generalized rules of this type. This possible disadvantage is preferably overcome by permitting slightly dependent generalized rules. In this case, the above-mentioned additional conditions, fulfilled only on the set of records I.backslash.I.sub.0.sup.(l), are selected from set J.sub.1.backslash.J.sub.1.sup.(l).
In accordance with preferred embodiments of the present invention, the criterion defined by equation (19) is used to find each set I.sub.0.sup.(l) in a way that maximizes the accuracy of prediction on the overall population. If the same records are covered by different independent generalized rules predicting the same value of y, there will be a high level of confidence in the prediction accuracy for these records. Records belonging to the intersection of multiple sets I.sub.0.sup.(l), l=1, . . . , L (if it is non-empty) have a maximal number of corresponding generalized rules.
It is generally sufficient to have approximately five independent generalized rules predicting the same value of y, to be sure of the prediction accuracy. Therefore, if generalized rules of type 1 are found with strongly intersecting sets I.sub.0.sup.(l), but there are other records belonging to I.sub.0 for which such rules have not been found, then beginning at some stage (for example, from l=6), a modified optimization criterion is preferably used in place of equation (19), minimizing the value of the functional: ##EQU16##
Generalized rules are thus found corresponding to records having y=0 with greater uniformity.
Type 2. Generalized rules of type 2 are preferably found in a manner analogous to that described above for finding generalized rules of type 1. The lth independent generalized rule of type 2 is thus found by finding the minimum of ##EQU17##
subject to ##EQU18## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.0.backslash.&Jcirc ;.sub.0.sup.(l-1) (25)
wherein ##EQU19##
&Jcirc ;.sub.0.sup.(0) =.O slashed.; J.sub.0.sup.(l.sup..sub.1 .sup.) is the subset of numbers of rules belonging to set J.sub.0 whose conditions are entered in the l.sub.1 th generalized rule of type 2.
Each generalized rules of type 2 corresponds to a subset of records at which y=1, based on simultaneous nonfulfilment of conditions of selected rules predicting y=0.
Types 3 and 4. To define a generalized rule of type 3, consider an lth generalized rule of type 1, with a set I.sub.0.sup.(l) of records satisfying the condition {character pullout}(C.sub.j1.sup.(1) C.sub.j2.sup.(1) v . . . vC.sub.jk.sup.(1)) of this generalized rule. For records i.epsilon.I.sub.0.sup.(l), the value of the Field to Predict is known to be y=0. An optimal covering of all records remaining in set I.sub.0 is then found using rules selected from set J.sub.0, wherein {j.sub.k+1, . . . j.sub.K } .OR right.J.sub.0 is the subset containing the numbers of the selected rules. Records not covered belong to set I.sub.1, containing solely records with y=1. Thus, a generalized rule of type 3 is defined, as shown in FIG. 7.
Type 3 generalized rules are preferably found by solving the covering problem of finding the minimum of. ##EQU20##
subject to ##EQU21## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.0 (28)
The condition for existence of a solution to equations (26)-(28) is that {i.epsilon.I.sub.0.backslash.I.sub.0.sup. (l).vertline..E-backward.j.epsilon.J.sub.0 c.sub.ij =1}=I.sub.0.backslash.I.sub.0.sup.(l), for a subset I.sub.0.sup.(l) defined by the lth generalized rule of type 1. Uniting the corresponding generalized rules of type 1 and 3, we obtain the following statement:
if {character pullout}(C.sub.j1.sup.(1) vC.sub.j2.sup.(1) v . . . vC.sub.jk.sup.(1)) then y=0
else if {character pullout}(C.sub.jk+1.sup.(0) vC.sub.jk+2.sup.(0) v . . . vC.sub.jK.sup.(0)) then y=1 (29)
In a manner analogous to finding the generalized rules of type 3, a corresponding generalized rule of type 4 can be found for each generalized rule of type 2. If I.sub.1 .sup.(l).OR left.I.sub.1 is the subset of records corresponding to the lth generalized rule of type 2, then the generalized rule of type 4 can be found by determining an optimal covering of records i.epsilon.I.sub.1.backslash.I.sub.1.sup.(l) using the selected subset of rules belonging to set J.sub.1.
The joint statement of corresponding generalized rules of type 2 and 4 is then:
if {character pullout}(C.sub.j1.sup.(0) C.sub.j2.sup.(0) v . . . vC.sub.jk.sup.(0)) then y=1
else if {character pullout}(C.sub.jk+1.sup.(1) vC.sub.jk+2.sup.(1) v . . . vC.sub.jK.sup.(1)) then y=0 (30)
In some cases, a complete covering of records i.epsilon.I.sub.1 by rules j.epsilon.J.sub.1 may not exist, since there may be a small subset I.sub.1.OR right.I.sub.1 of records at which no rule j.epsilon.J.sub.1 is fulfilled, I.sub.1 {i.epsilon.I.sub.1.vertline..A-inverted.j.epsilon.J.sub.1 c.sub.ij =0}. In this case, a generalized rule of type 1 cannot be found, but a generalized rule of type 4 (and even several independent such rules) may exist.
A method is now described for finding an optimal covering of records i.epsilon.I.sub.1.backslash.I.sub.1 by a selected subset {C.sub.jk+1.sup.(1), . . ,C.sub.jK.sup.(1) } of rules belonging to set J.sub.1, in accordance with a preferred embodiment of the present invention. Let I.sub.0.sup.(1).OR right.I.sub.0 be the subset of "uncovered" records with y=0. For an "uncovered" record, there is a probability of ##EQU22##
that y=0, wherein the symbol .vertline.I.vertline. designates the number of records in set I. When I.sub.1.noteq..O slashed., a generalized rule of type 4 is sought if .vertline.I.sub.1.vertline. is relatively small, as defined by .vertline.I.sub.1.ltoreq.r .vertline.I.sub.1.vertline., wherein r is a predetermined constant (for example, r=0.05). If, as desired, subset I.sub.0.sup.(1) is relatively large, then the above-mentioned probability p will be close to 1. To find a generalized rule with probability p=1, however, it is necessary to find the rules that enable us to distinguish between subsets I.sub.1 and I.sub.0.sup.(1).
To define a generalized rule of type 4, we find the covering of records i.epsilon.I.sub.0.sup.(1) by rules selected from subset J.sub.0.backslash.&Jcirc ;.sub.0, wherein &Jcirc ;.sub.0 ={j.epsilon.J.sub.0.vertline..E-backward.i.epsilon.I.sub.1 c.sub.ij =1}. In other words, when searching for this covering, rules having c.sub.ij =1 on subset I.sub.1 of records are excluded from consideration. The generalized rule of type 4 may then include rules C.sub.j1.sup.(0), . . . , C.sub.jk.sup.(0).
The generalized rule of type 4 is preferably found by solving to find the minimum of: ##EQU23##
subject to ##EQU24## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.1 (33)
Taking z.sup.(o) =(z.sub.j.sup.(o)), j.epsilon.J.sub.1 to be the optimal solution of equations (31)-(33), the subset {j.sub.k+1, . . . j.sub.K }={j.epsilon.J.sub.1.vertline.z.sub.j.sup.(o) 1} of corresponding rules is used to find the subset I.sub.0.sup.(1) of records satisfying the condition {character pullout}(C.sub.jk+1.sup.(1) vC.sub.jk+2.sup.(l) v . . . vC.sub.jK.sup.1)). Then the minimum of the following functional is found: ##EQU25##
subject to ##EQU26## z.sub.j.epsilon.{0, 1}, j.epsilon.J.sub.0.backslash.&Jcirc ;.sub.0 (36)
wherein p.sub.j is the probability of rule j; and .beta.>1 is a constant described hereinbelow, for example, .beta.=1.1. The desired subset of rules for finding the generalized rule of type 4 is {j.sub.1, . . . ,j.sub.k }={j.epsilon.J.sub.0.backslash.&Jcirc ;.sub.0.vertline.z.sub.j.sup.(o) =1}, wherein z.sup.(o) =(z.sub.j.sup.(o)) is the optimal solution of equations (34)-(36).
Covering subset I.sub.0.sup.(1) of the records by a minimum number of rules belonging to J.sub.0.backslash.&Jcirc ;.sub.0 could be accomplished by minimization of the functional: ##EQU27##
This criterion would give an accurate prediction on the training database. But to make predictions for the overall population, the generalized rule must be as stable as possible.
In a preferred embodiment of the present invention, assuming a generalized rule of type 4 has been found, then for each record i.epsilon.I.sub.0.sup.(1), we use this rule to predict y=0, based on two factors:
1) none of conditions C.sub.jk+1.sup.(1), . . . , C.sub.jK.sup.(1) is fulfilled; and
2) at least one of conditions C.sub.j1.sup.(0), . . . C.sub.jk.sup.(0) is fulfilled.
Preferably, a subset of rules {C.sub.j1.sup.(0), . . . , C.sub.jk.sup.(0) } is found such that both of the above-mentioned factors are not fulfilled for a maximum number of records belonging to subset I.sub.1.backslash.I.sub.1 (i.e., at which y=1). The first factor is not fulfilled for records i.epsilon.I.sub.1.backslash.I.sub.1 of the training database in accordance with equation (32), and this fact is sufficient to preclude making an erroneous prediction. In order to increase the stability of the generalized rule on future data from the overall population, it is necessary to find rules corresponding to set I.sub.0.sup.(1) of records with a high probability that y=0.
In one preferred embodiment, this purpose is achieved by solving a weighted covering problem with a minimized criterion: ##EQU28##
The drawback of criterion (38) is that it does not change its value when rules with p.sub.j =1, covering a small subset of I.sub.0.sup.(1), are entered in a solution. Note that criterion (34) is equal to criterion (38) when .beta.=1.
Preferably, to overcome the above-mentioned drawback of criterion (38), we slightly increase the value of coefficient . |