Scalable set oriented classifier5899992Abstract A method, apparatus, and article of manufacture for a computer implemented scaleable set-oriented classifier. The scalable set-oriented classifier stores set-oriented data as a table in a relational database. The table is comprised of rows having attributes. The scalable set-oriented classifier classifies the rows by building a classification tree. The scalable set-oriented classifier determines a gini index value for each split value of each attribute for each node that can be partitioned in the classification tree. The scalable set-oriented classifier selects an attribute and a split value for each node that can be partitioned based on the determined gini index value corresponding to the split value. Then, the scalable set-oriented classifier grows the classification tree by another level based on the selected attribute and split value for each node. The scalable set-oriented classifier repeats this process until each row of the table has been classified in the classification tree. Claims What is claimed is: Description BACKGROUND OF THE INVENTION
TABLE 1
______________________________________
Training Set
salary age credit rating
______________________________________
65K 30 Safe
15K 23 Risky
75K 40 Safe
15K 28 Risky
100k 55 Safe
60K 45 Safe
62K 30 Risky
______________________________________
The rows of Table 1 (also known as relations or tuples) reflect the past experience of an organization extending credit. The salary and age columns of Table 1 represent attributes of the examples, and the credit rating column of Table 1 represents a class that will be used to classify the examples. From the examples, the SLIM classifier 114 generates a classification tree 200 as illustrated in FIG. 2. The SLIM classifier 114 generates the classification tree 200 to classify the examples in the training set based on the credit rating class. The credit rating class can have values of either "safe" or "risky." The classification tree 200 has a root node 205 with decision "age<=30". When the age attribute of an example is less than or equal to 30, branch 212 is followed to node 220 with decision "salary<=62K". When the salary attribute of the example is less than or equal to 62K, branch 222 is followed to leaf node 240. Leaf node 240 indicates that an example whose age attribute is less than or equal to 30 and whose salary attribute is less than or equal to 62K falls into the risky class. When the salary attribute of the example is greater than 62K, branch 224 is followed to leaf node 240 that indicates that the example falls into the risky class. When the age attribute of a row is greater than 30, branch 214 is followed to node 230, which is a leaf node that indicates the example falls into the safe class. Additionally, when the age attribute of an example is greater than 30, branch 214 is followed to leaf node 230 that indicates the example falls into the safe class. A classification tree 200 is built by the SLIM classifier 114 in two phases: a growth phase and a pruning phase. In the growth phase, the tree is built by recursively partitioning the data in the training set until each partition is either "pure" (all members belong to the same class) or sufficiently small (a parameter set by a user). Each node in the classification tree 200 that contains a decision (i.e., split test) reflects the partitioning that has occurred. The form of the split test that is used to partition the data in the classification tree depends on the type of the attribute used in the split test. Split tests for a numerical attribute A are of the form value(A).ltoreq.x, where x is a value in the domain of A. Split tests for a categorical attribute A are of the form value(A) .epsilon. S, where S is a subset of the domain of A. The SLIM classifier 114 uses a classification tree 200 with binary split tests as described in Mehta and Shafer. One skilled in the art would recognize that the classification tree need not be binary. After the classification tree 200 has been fully grown, it is pruned to remove the noise to obtain the final classification tree 200. The pruning method used by the present invention is the one described in Shafer. The growth phase is computationally more expensive than the pruning phase. During the growth phase, the SLIM classifier 114 accesses the training set multiple times; while during the pruning phase, the SLIM classifier 114 only accesses the fully grown classification tree 200. Therefore, the SLIM classifier 114 focuses on the growth phase. The following pseudocode provides an overview of the growth phase performed by the SLIM classifier 114: GrowTree (TrainingSet DETAIL) Initialize tree T, with all rows of the DETAIL table in the root; while not(all leafs in T are STOP nodes) {for each attribute i, form the dimension table DIM.sub.i ; evaluate gini index for each non-STOP leaf at each split value with respect to attribute i; for each non-STOP leaf, get the overall best split value for it; partition each row and grow the tree for one more level according to the best split value; mark all small or pure leafs as STOP nodes;} return T; First, the SLIM classifier 114 initializes a DETAIL table, containing a row for each example in the training set, and the classification tree 200. Then, until each of the nodes is pure or sufficiently small, the SLIM classifier 114 performs the following procedure. First, for each attribute of an example, a DIM.sub.i table is generated. Next, a gini index value is determined for each distinct value (i.e., split value) of each attribute in each leaf node that is to be partitioned. Then, the split value with the lowest gini index value is selected for each leaf node that is to be partitioned for each attribute i. The best split value for each leaf node that is to be partitioned in the classification tree 200 is determined by choosing the attribute with a split value that has the lowest corresponding gini index value for that leaf node. After the best split value is determined, the classification tree 200 is grown by another level. Finally, the nodes that are pure or sufficiently small are marked as "STOP" nodes to indicate that they are not to be partitioned any further. Data Structures In Mehta, a method called SLIQ is proposed as a scalable classifier. The key data structure used in SLIQ is a class list whose size is linear in the number of examples in the training set. Shafer shows that since the class list must be memory-resident, it puts a hard limitation on the size of the training set that the method can handle. Shafer proposes the new data structures: attribute list and histograms. Although it is no longer necessary for the attribute list to be memory-resident, the histograms must be in memory to insure good performance. While the size of a histogram for a numerical attribute may be small, the size of the histogram for a categorical attribute is linear in #distinct.sub.-- value * #distinct.sub.-- class, which could be large. Also, to perform the split in Shafer, a hash table is used. The size of such a hash table is in fact linear in the number of examples of the training set. When the hash table is too large to fit in memory, splitting is done in multiple steps. In each step, it appears the entire attribute list needs to be accessed. Therefore, Shafer's method does not achieve real linear scalability with respect to the number of examples in the training set. This was confirmed from the time per example measurement for the method. Instead of being flat, the method described by Shafer grows with the number of examples. In the SLIM classifier 114, all information needed to evaluate the split values and perform the partition is stored in rows of a table in a relational database 150. Therefore, memory allocation issues need not be handled by the SLIM classifier 114 alone. The SLIM classifier 114 uses a data structure that relates the rows of the table to the growing classification tree 200. The SLIM classifier 114 assigns a unique identification number to identify each node in the classification tree 200. When loading the data from the training set into the relational database 150, the SLIM classifier 114 adds a leaf.sub.-- num column to the DETAIL table. For each example in the training set, leaf.sub.-- num indicates which leaf node in the current classification tree 200 to which it belongs. When the classification tree 200 grows, the leaf.sub.-- num column is updated to indicate that the example is moved to a new node by applying the split in the current node. There is a one-to-one mapping between leaf.sub.-- num values and leaf nodes in the classification tree 200. If such a mapping is stored in the rows of the DETAIL table, it will be very expensive to access the corresponding leaf node for any row when the table is not memory resident. By examining the mapping carefully, it is seen that the cardinality of the leaf.sub.-- num column is the same as the number of leaf nodes in the classification tree, which is not huge at all, regardless of the size of the training set. Therefore, the mapping is stored indirectly in a leaf node list (LNL). A LNL is a static array that is used to relate the leaf.sub.-- num value in the table to the identification number assigned to the corresponding node in the classification tree 200. By using a labeling technique, the SLIM classifier 114 insures that at each tree growing stage, the nodes always have the identification numbers 0 through N-1, where N is the number of nodes in the tree. LNL›i! is a pointer to the node with identification number i. Now, for any row in the table, the SLIM classifier 114 can get the leaf node it belongs to from its leaf.sub.-- num value and LNL at anytime, and, hence, get the information in the node (e.g. split test, number of examples belonging in this node, and the class distribution of examples belonging in this node). To insure the performance of the SLIM classifier 114, LNL is the only data structure that needs to be memory resident. The size of LNL is equal to the number of nodes in the tree, which is not large at all and which can certainly be stored in memory all the time. The Gini Index Formula A splitting index is used to choose from alternative splits for each node. Several splitting indices have been proposed. The SLIM classifier 114 uses the gini index, originally proposed in Breiman. The SLIM classifier 114 uses the gini index, instead of another index, because in both Mehta and Shafer, it gave acceptable accuracy. It can be shown the accuracy of the SLIM classifier 114 is at least as good as those published in Mehta and Shafer. For a data set S containing m examples from n classes, gini(S) is defined as: ##EQU1## where p.sub.i is the relative frequency of class i in S. If a split divides S into two subset S.sub.1 and S.sub.2, whose size is m.sub.1 and m.sub.2 respectively, the index of the divided data gini.sub.split (S) is given by: ##EQU2## Computing the gini index is the most expensive part of the method, since to find the best split value for a node, the SLIM classifier 114 needs to evaluate the gini index value for each attribute at each possible split value for each non-STOP leaf node. The attribute containing the split value achieving the lowest gini index value is then chosen to split the node, as was done in Breiman. Scalable Supervised Learning Irregardless of Memory (SLIM) FIG. 3 is a flow diagram illustrating the general logic performed by the SLIM classifier 114. In step 310, the SLIM classifier 114 initializes the data set table. The examples of the training set are stored in a relational database 112 using a table with the following schema: DETAIL(attr.sub.1, attr.sub.2, . . . , attr.sub.N, class, leaf.sub.-- num), where attr.sub.i is the ith attribute, class is the classifying attribute, and leaf.sub.-- num indicates the leaf node in the classification tree 200 to which the row belongs. When the classification tree grows, the leaf.sub.-- num value of each example in the training set is updated. Assuming that there are N other attributes besides the class attribute, the cardinality of the class attribute set is n. Table 2 illustrates the DETAIL table for the training set illustrated in Table 1:
TABLE 2
______________________________________
DETAIL
attr.sub.1
attr.sub.2 class leaf.sub.-- num
______________________________________
65K 30 Safe 0
15K 23 Risky 0
75K 40 Safe 0
15K 28 Risky 0
100K 55 Safe 0
60K 45 Safe 0
62K 30 Risky 0
______________________________________
In step 320, the classification tree 200 is initialized. At this stage, the classification tree 200 contains only a root node with all examples belonging to this root node. Step 330 is a decision step in which the SLIM classifier 114 determines whether all nodes in the classification tree 200 are STOP nodes. That is, the SLIM classifier 114 determines whether each node is either pure or sufficiently small. When all nodes are STOP nodes, the SLIM classifier 114 has completed classifying the training set and the classification tree 200 is pruned in step 360. Otherwise, the SLIM classifier 114 continues in step 340 to select the best split value for each non-STOP leaf node. The classification tree is grown in step 350. FIG. 4 is a flow chart illustrating the steps performed to select the best split value for each non-STOP leaf node, as identified in step 340. In step 410, the SLIM classifier 114 generates a DIM.sub.i table for each attribute. In particular, once for every level of the tree, for each attribute attr.sub.i, the SLIM classifier 114 generates a DIM.sub.i table with the schema DIM.sub.i (leaf.sub.-- num, class, attr.sub.i, count) using the following simple select statement on the DETAIL table: INSERT INTO DIM.sub.i SELECT leaf.sub.-- num, class, attr.sub.i, count(*) FROM DETAIL WHERE leaf.sub.-- num.noteq.STOP GROUP BY leaf.sub.-- num, class, attr.sub.i Although the number of distinct values in the DETAIL table could be huge, the maximal number of rows in DIM.sub.i is no greater than #leaf.sub.-- in.sub.-- tree * #distinct.sub.-- values.sub.-- on.sub.-- attr.sub.i * #distinct.sub.-- class, which is very likely to be of the order of several hundreds. In the case that #distinct.sub.-- values.sub.-- on.sub.-- attr.sub.i is very big, preprocessing is suggested to further discretize it. Also, DETAIL could refer to data either in a table or a file (e.g., on magnetic tape). In case of a file, DETAIL resolves to an execution of a user defined function (e.g. fread in UNIX). D. Chamberlin, personal communication, ›hereinafter Chamberlin!. When such dimension tables are formed for every dimension, it is easy to visualize the database schema as a star schema. Thus many innovations related to data warehousing are now applicable to improve performance. G. Larry, Articles on Datawarehousing, http://pwp.starnetinc.com/larryg/articles.html, 1996, ›hereinafter Larry!, which is incorporated by reference herein. Once, the DIM.sub.i tables are generated, the SLIM classifier 114 determines the gini index value for each attribute at each possible split value of the attribute i by performing a series of SQL operations which only involve accessing the DIM.sub.i tables. For one attribute i, its DIM.sub.i table may be created in one pass over the DETAIL table. It is straightforward to schedule one query per dimension (i.e., attribute). Completion time is still linear in the number of dimensions. Commercial DBMSs store data in essentially row major sequence. Thus, I/O efficiencies may be obtained if it is possible to create dimension tables for all attributes in one pass over the DETAIL table. Concurrent scheduling of the queries populating the DIM tables is the simple approach. Existing buffer management schemes that rely on I/O latency appear to synchronize access to the DETAIL table for the different attributes. The idea is that one query piggybacks onto another query's I/O data stream. Results from early experiments are encouraging. J. B. Sinclair, Rice University, personal communication, ›hereinafter Sinclair!. It is also possible for SQL to be extended to ensure that not only I/O is optimized but also processor 102 utilization. Taking liberty with SQL standards, the following query is written as a proposed SQL operator: SELECT FROM DETAIL INSERT INTO DIM.sub.1 {leaf.sub.-- num, class, attr.sub.1, count(*) WHERE predicate GROUP BY leaf.sub.-- num, class, attr.sub.1 } INSERT INTO DIM.sub.2 {leaf.sub.-- num, class, attr.sub.2, count(*) WHERE predicate GROUP BY leaf.sub.-- num, class, attr.sub.2 } . . . INSERT INTO DIM.sub.N {leaf.sub.-- num, class, attr.sub.N, count(*) WHERE predicate GROUP BY leaf.sub.-- num, class, attrN} The new operator forms multiple groupings concurrently, and may allow further optimization. For each non-STOP leaf node in the tree, possible split values for attribute i are all distinct values of attr.sub.i among the examples which belong to this leaf node. For each possible split value, the SLIM classifier 114 needs to get the class distribution for the two parts partitioned by this value to compute the corresponding gini index. In step 430, the SLIM classifier 114 collects such distribution information in two tables, UP and DOWN. The UP table with the schema. UP(leaf.sub.-- num, attri, class, count) could be generated by performing a self-outer-join on DIM.sub.i using the following SQL query: INSERT INTO UP SELECT d.sub.1.node.sub.-- num, d.sub.1.attr.sub.i, d.sub.1.class, SUM(d.sub.2. count) FROM (FULL OUTER JOIN DIM.sub.i d.sub.1, DIM.sub.i d.sub.2 ON d.sub.1. leaf.sub.-- num=d.sub.2.leaf.sub.-- num AND d.sub.2.attr.sub.i <=d.sub.1 attr.sub.i AND d.sub.1.class=d.sub.2.class GROUP BY d.sub.1.leaf.sub.-- num, d.sub.1.attri, d.sub.1.class) Similarly, the DOWN table could be generated by just changing the <=to> in the ON clause. Also, the SLIM classifier 114 can obtain the DOWN table by using the information in the leaf nodes and the count column in the UP table without doing join on DIMI again. In case the outer-join operator is not supported, by performing simple set operations such as EXCEPT and UNION, the SLIM classifier 114 can form a view DIM.sub.i with the same schema as DIM.sub.i first. For each possible split value on attribute i and each possible class label of each node, there is a row in DIM.sub.i that gives the number of rows belonging to this leaf node that have such a value on attribute i and such a class label. Note that DIM.sub.i is a superset of DIM.sub.i and the difference between them are those rows with a count 0. After DIM.sub.i is generated, the SLIM classifier 114 performs a self-join on DIM.sub.i to create the UP table as follow: INSERT INTO UP SELECT d.sub.1.node.sub.-- num, d.sub.1.attri, d.sub.1.class, SUM(d.sub.2. count) FROM DIM.sub.i d.sub.1, DIM.sub.i d.sub.2 WHERE d.sub.1.leaf.sub.-- num=d.sub.2.leaf.sub.-- num AND d.sub.2.attr.sub.i <=d.sub.1.attr.sub.i AND d.sub.1 class=d.sub.2.class GROUP BY d.sub.1.leaf.sub.-- num, d.sub.1.attri, d.sub.1.class The UP and DOWN tables contain all the information the SLIM classifier 114 needs to compute the gini index at each possible split value for each current leaf node in the classification tree 200, but the SLIM classifier 114 needs to rearrange them in some way before the gini index is calculated. In step 440, the SLIM classifier 114 obtains classification information. The following intermediate view could be formed for all possible classes k: CREATE VIEW C.sub.k-- UP(leaf.sub.-- num, attr.sub.i, count) AS SELECT leaf.sub.-- num, attr.sub.i, count FROM UP WHERE class=k Similarly, the SLIM classifier 114 defines view C.sub.K-- DOWN from the DOWN table. In step 450, the SLIM classifier 114 calculates the gini index for each possible split value for attribute i. Now a view GINI.sub.-- VALUE that contains all gini index values at each possible split value is generated. Taking the liberty with SQL syntax, the following query is written: CREATE VIEW GINI-VALUE(leaf.sub.-- num, attr.sub.i, gini) AS SELECT u.sub.1.leaf.sub.-- num, u.sub.1.attr.sub.i, f.sub.gini FROM C.sub.1-- UP u.sub.1, . . . , C.sub.n-- UP u.sub.n, C.sub.1-- DOWN d.sub.1, . . . , C.sub.n-- DOWN d.sub.n WHERE u.sub.1.attr.sub.i = . . . =u.sub.n. attr.sub.i =d.sub.1. attr.sub.i = . . . =d.sub.n. attr.sub.i AND u.sub.1.leaf.sub.-- num= . . . =u.sub.n.leaf.sub.-- num=d.sub.1.leaf.sub.-- num= . . . =d.sub.n.leaf.sub.-- num where f.sub.gini is a function of u.sub.1.count, . . . , u.sub.n.count, d.sub.1.count, . . . , d.sub.n.count. In step 460, for each non-STOP leaf node, the SLIM classifier 114 selects the best split value for attribute i. The SLIM classifier 114 creates a table with the schema MIN.sub.-- GINI(leaf.sub.-- num, attr.sub.-- name, attr.sub.-- value, gini): INSERT INTO MIN.sub.-- GINI SELECT leaf.sub.-- num,: i, attr.sub.i, min(gini) Note the transformation for the table name DIM.sub.i to column value i and column name attr.sub.i. FROM GINI.sub.-- VALUE a WHERE a.gini=SELECT min(gini) FROM GINI.sub.-- VALUE b WHERE a.leaf.sub.-- num=b.leaf.sub.-- num) GROUP BY leaf.sub.-- num The MIN.sub.-- GINI table contains the best split value and the corresponding gini index value for each leaf node of the classification tree 200 with respect to attribute i. The SLIM classifier 114 repeats the above procedure for all attributes. Once that is done, the MIN.sub.-- GINI table contains the best split value for each non-STOP leaf node with respect to all attributes. Step 470 is a decision step in which the SLIM classifier 114 determines whether all attributes have been selected. If not all attributes have been selected, the SLIM classifier 114 continues at step 420 to perform the procedure for the remaining attributes. If all attributes have been selected, the SLIM classifier 114 continues at step 480. In step 480, the SLIM classifier 114 selects the best split value for each non-STOP leaf node. The overall best split value for each non-STOP leaf node is obtained from executing the following query: CREATE VIEW BEST.sub.-- SPLIT(leaf.sub.-- num, attr.sub.-- name, attr.sub.-- value, gini) AS SELECT leaf.sub.-- num, attr.sub.-- name, attr.sub.-- value, min(gini) FROM MIN.sub.-- GINI a WHERE a.gini=(SELECT min(gini) FROM MIN.sub.-- GINI b WHERE a.leaf.sub.-- num b.leaf.sub.-- num) GROUP BY leaf.sub.-- num Categorical Attributes For a categorical attribute i, the SLIM classifier 114 forms DIM.sub.1 in the same way as for a numerical attribute. DIM.sub.i contains all the information the SLIM classifier 114 needs to compute the gini index for any subset splitting. In fact, It is an analog of the count matrix in Shafer, but formed with set-oriented operators. A possible split is any subset of the set that contains all the distinct attribute values. If the cardinality of attribute i is m, the SLIM classifier 114 needs to evaluate the splits for all the 2.sup.m subsets. Those subsets and their related counts can be generated in a recursive way. The schema of the table that contains all the k-sets is S.sub.k-- IN(leaf.sub.-- num, class, v.sub.1, v.sub.2, . . ., v.sub.k, count). Obviously DIM.sub.i =S.sub.1-- IN.S.sub.k-- IN is then generated from S.sub.1-- IN and S.sub.k-1-- IN as follows: INSERT INTO S.sub.k-- IN SELECT p.leaf.sub.-- num, p.class, p.v.sub.1, . . . , P.v.sub.k-1 q.v.sub.1, p.count+q.count FROM (FULL OUTER JOIN S.sub.k-1-- IN p, S.sub.1-- IN q ON p.leaf.sub.-- num=q.leaf.sub.-- num AND p.class=q.class AND q.v.sub.1 >p.v.sub.k-1) The SLIM classifier 114 generates the S.sub.k-- OUT table in a similar way as the SLIM classifier 114 generates the DOWN table from the UP table. Then the SLIM classifier 114 treats S.sub.k-- IN and S.sub.k-- OUT exactly as DOWN and UP for numerical attribute to compute the gini index for each k-set split. The SLIM classifier 114 does not need to evaluate all the subsets. The SLIM classifier 114 only needs to compute the k-sets for k=1, 2, . . . , .left brkt-bot.m/2.right brkt-bot. and thus saves time. Partitioning Once the best split values have been found for each leaf node, the leaf nodes are split into two child nodes. FIG. 5 is a flow diagram illustrating the steps performed to grow the classification tree 200 as identified in step 350. In step 510, the SLIM classifier 114 updates the leaf.sub.-- num values in the DETAIL table as follows: UPDATE DETAIL SET leaf.sub.-- num=partition(attr.sub.1, . . . , attr.sub.N, class, leaf.sub.-- num) The following is pseudocode for the user defined function partition: partition(row r). Use the leaf.sub.-- num value of r to locate the tree node which r belongs to through LNL; Get the best split from node n; Apply the split to r; Return a new leaf.sub.-- num according the result of the split test and update r in DETAIL; The partition function applies the current tree to the original training set. If updating the whole DETAIL table is expensive, the update is avoided by just replacing leaf.sub.-- num by the partition function in the statement forming DIM.sub.i. Therefore, there is no need to store leaf.sub.-- num in the DETAIL table. Instead, leaf.sub.-- num can be computed from the attribute values of each row. In step 520, the STOP nodes are determined. These are the nodes that are pure or sufficiently small. In step 530, the SLIM classifier 114 marks STOP nodes in the DETAIL table to indicate that these nodes are not to be partitioned further. In step 540, the SLIM classifier 114 updates the leaf node list. EXAMPLE The SLIM classifier 114 is illustrated by an example. The example training set is the same as the data in Table 1. Initially, the SLIM classifier 114 loads the training set and initializes the classification tree 200 and the leaf node list. The DETAIL table is shown in Table 2 above. Next, the SLIM classifier 114 finds the best split value for the root node. To do this, the SLIM classifier 114 evaluates the gini index values for each split value of each attribute. For illustration purposes, the procedure for finding the best split value will be shown using the salary attribute. First, a dimension table is generated for the salary attribute. Table 3 illustrates a sample DIM.sub.i table for the salary attribute.
TABLE 3
______________________________________
DIM.sub.i
leaf.sub.-- num
attr.sub.1 class count
______________________________________
0 15 2 2
0 60 1 1
0 62 2 1
0 65 1 1
0 75 1 1
0 100 1 1
______________________________________
Second, in order to be able to generate the UP and DOWN tables, a DIM.sub.i table is generated. This example assumes that the outer join operation is not available. Table 4 illustrates a sample DIM.sub.i table.
TABLE 4
______________________________________
Dim.sub.i
leaf.sub.-- num
attr.sub.1 class count
______________________________________
0 15 1 0
0 62 1 0
0 60 2 0
0 65 2 0
0 75 2 0
0 100 2 0
0 60 1 1
0 65 1 1
0 75 1 1
0 100 1 1
0 62 2 1
0 15 2 2
______________________________________
Third, the SLIM classifier 114 collects distribution information by generating the UP and DOWN tables. Tables 5 and 6 illustrate these tables.
TABLE 5
______________________________________
UP
leaf.sub.-- num
attr.sub.1 class count
______________________________________
0 15 1 0
0 15 2 2
0 60 1 1
0 60 2 2
0 62 1 1
0 62 2 3
0 65 1 2
0 65 2 3
0 75 1 3
0 75 2 3
0 100 1 4
0 100 2 3
______________________________________
Fourth, the SLIM classifier 114 obtains classification information by generating the C.sub.k views. Tables 7-10 illustrate these views.
TABLE 7
______________________________________
C.sub.i-- UP
leaf.sub.-- num attr.sub.1
count
______________________________________
0 15 0.0
0 60 1.0
0 62 1.0
0 65 2.0
0 75 3.0
0 100 4.0
______________________________________
Fifth, the SLIM classifier 114 generates the GINI.sub.-- VALUE view with the gini index values of each split value of attribute i. Table 11 illustrates this view.
TABLE 11
______________________________________
GINI.sub.-- VALUE
leaf.sub.-- num attr.sub.1
count
______________________________________
0 15 0.22856
0 60 0.40474
0 62 0.21428
0 65 0.34284
0 75 0.42856
______________________________________
Sixth, the SLIM classifier 114 generates the MIN.sub.-- GINI view for the salary attribute. Table 12 illustrates this view.
TABLE 12
______________________________________
MIN.sub.-- GINI
after attr.sub.1 is evaluated
leaf.sub.-- num
attr.sub.-- name
attr.sub.-- value
gini
______________________________________
0 1 62 0.21428
______________________________________
At this point, the MIN.sub.-- GINI table contains the best split value with respect to the salary attribute. Then, the above procedure is repeated for the age attribute, and one or more rows is added to the MIN.sub.-- GINI table. Table 13 illustrates the updated table.
TABLE 13
______________________________________
MIN.sub.-- GINI after
attr.sub.1 and attr.sub.2 are evaluated
leaf.sub.-- num
attr.sub.-- name
attr.sub.-- value
gini
______________________________________
0 1 62 0.21428
0 2 30 0.21428
______________________________________
Normally, at this point, the SLIM classifier 114 selects the best split value based on the split value of an attribute with the lowest corresponding gini index value. Because both attributes achieve the same gini index value in this example, either one can be selected. The SLIM classifier 114 stores the best split values in each leaf node of the tree( the root node in this phase). According to the best split value found, the SLIM classifier 114 grows the tree and partitions the training set. The partition is reflected as the leaf.sub.-- num changes in the DETAIL table. Also, any new grown node that is pure or sufficiently small is marked and reassigned a special leaf.sub.-- num value STOP so that the SLIM classifier 114 does not need to process it any more. The updated DETAIL table is shown in Table 14.
TABLE 14
______________________________________
DETAIL after phase 2
attr.sub.1
attr.sub.2 class leaf.sub.-- num
______________________________________
65K 30 Safe 1
15K 23 Risky 1
75K 40 Safe 2.fwdarw.STOP
15K 28 Risky 1
100K 55 Safe 2.fwdarw.STOP
60K 45 Safe 2.fwdarw.STOP
62K 30 Risky 1
______________________________________
After this, the SLIM classifier 114 follows the above procedure for the DETAIL table until all the nodes in the classification tree 200 become STOP nodes. The final classification tree 200 is shown in FIG. 2, and the final DETAIL table is shown in Table 15.
TABLE 15
______________________________________
Final DETAIL
attr.sub.1
attr.sub.2 class leaf.sub.-- num
______________________________________
65K 30 Safe 4.fwdarw.STOP
15K 23 Risky 3.fwdarw.STOP
75K 40 Safe STOP
15K 28 Risky 3.fwdarw.STOP
100K 55 Safe STOP
60K 45 Safe STOP
62K 30 Risky 3.fwdarw.STOP
______________________________________
Experimental Results There are two important metrics to evaluate the quality of a: classification accuracy and classification time. Since Mehta and Shafer are the only published papers on a scalable classifier dealing with large training sets, experimental results from the SLIM classifier 114 are compared with their method. Although the SLIM classifier 114 uses different methodology to build the classifier, the SLIM classifier 114 uses the same measurement(gini index) to choose the best split value for each node. Also, the SLIM classifier 114 grows the classification tree 200 in a breath first fashion and prunes the classification tree 200 using the same pruning method as Mehta and Shafer. This insures the SLIM classifier 114 generates the same classification tree as that produced by Mehta and Shafer for the same training set. The accuracy of SPRINT and SLIQ are discussed in Mehta and Shafer. For scaling experiments, a prototype was run on large data sets. The main cost of the SLIM classifier 114 is that it needs to access DETAIL N times(N is the number of attributes) at each level of the growth of the classification tree 200. It is recommended that future DBMSs 108 support multiple GROUP BY statements so the DETAIL table can be accessed only once regardless of the number of attributes. Due to the lack of a classification benchmark, the synthetic database is used that was proposed in R. Agrawal, T. Imielinski, and A. Swami, Database Mining: A Performance Perspective, IEEE Transactions on Knowledge and Data Engineering, December 1993, ›hereinafter Agrawal!, which is incorporated by reference herein. In this synthetic database, each row consists of nine attributes as shown in Table 16.
TABLE 16
______________________________________
Description of the synthetic data
attribute value
______________________________________
salary uniformly distributed from 20K to
150K.
commission salary .gtoreq. 74K .fwdarw. commission = 0 else
uniformly distributed from 10K to 75K
age uniformly distributed from 20 to 80
loan uniformly distributed from 0 to 500K
elevel uniformly chosen from 0 to 4
car uniformly chosen from 1 to 20
zipcode uniformly chosen from 9 available
zipcode
hvalue uniformly distributed from 0.5k100000
to 1.5k100000 where k .epsilon. {0 . . . 9}
depends on zipcode
hyear uniformly distributed from 1 to 30
______________________________________
Ten classification functions are proposed in Agrawal to produce databases with different complexities. The prototype is run using function 2, described below. Two classes of databases can be generated: Group A and Group B. The description of the predicate for group A is shown below. Function 2-Group A ((age<40) (50K.ltoreq.salary.ltoreq.100K))V ((40.ltoreq.age<60) (75K.ltoreq.salary.ltoreq.125K))V ((age.gtoreq.60) (25K.ltoreq.salary.ltoreq.75K) Experiments were conducted using IBM's DB2 RDBMS 108. Training sets with sizes ranging from 0.5 million rows to 3 million rows were used. The experimental results indicate that the SLIM classifier 114 achieves linear scalability with respect to the training set size. Moreover, the time per example curve stays flat when the training size increases. This is the first flat curve seen for any classifier built for large data sets. The SLIM classifier 114 exhibits properties of a truly linear classifier. It scales in such a way that the time per example remains the same. This desirable property of linear scaling may be attributed to conversion of classification(a data mining problem) into a multi-dimensional analysis problem and to exploitation of true DBMS technology. Additionally, it was found that attribute names in DETAIL became table names for the dimension tables and that it was an attributes value pair that determined the next processing step. Conclusion The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
|
Same subclass Same class Consider this |
||||||||||
