Scalable system for expectation maximization clustering of large databases6263337Abstract In one exemplary embodiment the invention provides a data mining system for use in finding clusters of data items in a database or any other data storage medium. Before the data evaluation begins a choice is made of the number M of models to be explored, and the number of clusters (K) of clusters within each of the M models. The clusters are used in categorizing the data in the database into K different clusters within each model. An initial set of estimates for a data distribution of each model to be explored is provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory buffer whose size is determined by the user or operating system depending on available memory resources. Data contained in the data buffer is used to update the original model data distributions in each of the K clusters over all M models. Some of the data belonging to a cluster is summarized or compressed and stored as a reduced form of the data representing sufficient statistics of the data. More data is accessed from the database and the models are updated. An updated set of parameters for the clusters is determined from the summarized data (sufficient statistics) and the newly acquired data. Stopping criteria are evaluated to determine if further data should be accessed from the database. Claims We claim: Description FIELD OF THE INVENTION
TABLE 1
CaseID AGE INCOME CHILDREN CARS
1 30 40 2 2
2 26 21 0 1
3 18 16 0 1
4 57 71 3 2
5 41 73 2 3
6 67 82 6 3
7 75 62 4 1
8 21 23 1 1
9 45 51 3 2
10 28 19 0 0
Extended EM Procedure of FIGS. 8A and 8B An extended EM procedure 120 (FIGS. 8A and 8B) takes the contents of the three data structures RS, DS, CS and data from an existing model stored in the data structure of FIG. 6D and produces a new model. The new model is then stored in place of the old model. The procedure 120 uses the existing model to create 202 an Old_Model in a data structure like that of FIG. 6D, then determines 204 the length of the pointer arrays of FIGS. 6A-6C and computes 206 means and covariance matrices from the Old_Model SUM, SUMSQ and M data. The set of Old_Model means and covariance matrices are stored as a list of length K where each element of the list includes two parts: 1) a vector of length n (called the "mean") which stores the mean of the corresponding Gaussian or cluster 2) a matrix of size n.times.n (called the "CVMatrix") which stores the values of a covariance matrix of the corresponding Gaussian or cluster. The structure holding the means and covariance matrices are referred to below as "Old_SuffStats". To compute the matrix CVMatrix for a given cluster from the sufficient statistics SUM, SUMSQ and M (in FIG. 6D), the extended EM procedure computes an outer product defined for 2 vectors OUTERPROD(vector1,vector2). The OUTERPROD operation takes 2 vectors of length n and returns their outer product, or the n.times.n matrix with an entry in row h and column j being vector1(h)*vector2(j). A DETERMINANT function computes the determinant of a matrix. The procedure 200 also uses a function, INVERSE that computes the inverse of a matrix. A further function TRANSPOSE returns the transpose of a vector (i.e. changes a column vector to a row vector). The function EXP(z) computes the exponential e.sup.z. A function `ConvertSuffStats` calculates 206 the mean and covariance matrix from the sufficient statistics stored in a cluster model (SUM,SUMSQ,M) [Mean,CVMatrix]=ConvertSuffStats(SUM,SUMSQ,M) Mean=(1/M)*SUM: MSq=M*M; OutProd=OUTERPROD(SUM,SUM); CVMatrix=(1/MSq)*(M*SUMSQ-3*OutProd); The data structures of FIG. 6A-6D are initialized 100 before entering the FIG. 4 processing loop. In order for the extended EM procedure 120 to process a first set of data read into the memory, the MODEL data structure of FIG. 6D that is copied into Old_Model is not null. An initial set of cluster means is presented to the process. One procedure is to randomly choose the means and place them in the vector `Sum` and setting M=1.0. For a clustering number K=2 for the data format from Table 1, assume the SUM vector is given as Table 2 for these two clusters.
TABLE 2
SUM
Cluster # AGE INCOME CHILDREN CARS
1 55 50 2.5 2
2 30 38 1.5 2
Also, arbitrary values are chosen as diagonal members of the starting model's covariance matrices. Each cluster has a covariance matrix associated with it. The diagonal values of the starting matrices are chosen to range in size from 0.8 to 1.2 and are needed to start the process. Initially the float M of each cluster is set to 1.0. For a vector having dimension n=4, the covariance matrix is a 4.times.4 matrix. Assume cluster number 1 is assigned diagonal entries of the covariance matrix A (Below) yielding a determinant of 0.9504:
A A.sup.-1
.8 0.0 0.0 0.0 1.25 0.0 0.0 0.0
0.0 1.2 0.0 0.0 0.0 .83 0.0 0.0
0.0 0.0 0.9 0.0 0.0 0.0 1.1 0.0
0.0 0.0 0.0 1.1 0.0 0.0 0.0 0.9
Note the off diagonal elements are assumed to be zero. These facilitates the process of determining the inverse matrix A.sup.-1 as well as the determinant. In a similar fashion the matrix for cluster 2 is designated as B and has a determinant of 0.8448:
B B.sup.-1
1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
0.0 1.05 0.0 0.0 0.0 .95 0.0 0.0
0.0 0.0 .85 0.0 0.0 0.0 1.18 0.0
0.0 0.0 0.0 .95 0.0 0.0 0.0 1.05
A function designated `GAUSSIAN` defined below is used at a step 212 (FIG. 8A) to compute the height of the Gaussian curve above a given data point, where the Gaussian has mean=Mean and covariance matrix=CVMatrix. [height]=GAUSSIAN(x,Mean,CVMatrix) normalizing_constant=(2*PI) (n/2)*SQRT(DET(CVMatrix)); CVMatrixInv=INVERSE(CVMatrix); Height=(1/normalizing_constant)*exp(-(1/ 2)*(TRANSPOSE(x-Mean))*CVMatrixInv*(x-Mean)); Note, mathematically, the value of GAUSSIAN for a given cluster for the datapoint x is: ##EQU1## After resetting 208 a New_Model data structure (similar to FIG. 6D) to all zeros, each point of the RS data structure is accessed 210 and used to update the New_Model. A contribution is determined for each of the data points in RS by determining the weight of each point under the old model. A weight vector has k elements weight(1), weight(2), . . . weight(k) where each element indicates the normalized or fractional assignment of the data point to the corresponding cluster. Recall that each data record contributes to all of the K clusters that were set up during initialization. The mean and co-variant matrix structures allow a height contribution for each record (for j=1, . . . , r) to be determined at step 212 of the extended EM procedure 120 for each cluster (for l=1, . . . , k). This height contribution is then scaled to form a weight contribution that takes into account a factor M.sub.1 /N where N is the total number of data records read thus far from the database. Normalizing the weight factor is performed at a step 214 of the procedure. This step is performed by a procedure (referenced below) called UpdateModel(New_Model,DataPoint,vWeight) At a step 216 an outer product is calculated for the relevant vector data point in RS. An update step 218 loops through all k clusters and update the new model data structure by adding the contribution for each data point:
for 1 = 1, . . . ,k
New_Model(1).SUM = New_Model(1).sum +
vWeight(1)*center;
New_Model(1).SUMSQ = New_Model(1).SUMSQ +
vWeight(1)*OuterProd;
New_Model(1).M_1 = New_Model(1).M_1 + vWeight(1);
End for
Consider the ten records of table 1. The fourth record has attributes of Age=45, Income=71 k, Children=3, and Cars=2. FIG. 7 is a graph depicting the two clusters for the `children` attribute for clusters A and B corresponding to the covariance matrices A and B summarized above. The mean for cluster B is 1.5 and the standard deviation is 0.9220=sqrt(0.85). For cluster A the mean is 2.5 and the standard deviation is 0.9487=sqrt(0.9). Note CaseId 4 has a value of children=3. This data point has a height of 0.1152 under Gaussian B and a height of 0.3660 under Gaussian A. The weighting factor for this point is therefor with respect to Gaussian A (0.3360)/(0.1152+0.3360)=0.7606. One can assign 23.94% of CaseId No. 4 to cluster B and 76.06% of CaseId No. 4 to cluster A. The normalizing factor for computing the heights is 0.026. The process of updating the sufficient statistics for the New_Model continues for all points in the RS data structure. On the first pass through the procedure, the data structures DS and CS are null and the RS structure is made up of data read from the database. Typically a portion of the main memory of the computer 20 (FIG. 1) is allocated for the storage of the data records in the RS structure. On a later iteration of the processing loop of FIG. 4, however, the data structures DS, CS are not null. Consider the data points in table 1 again. Assume that the two clusters G1 and G2 of FIG. 5 represent two data clusters after a number of iterations for the age attribute of the table 1 data. After multiple data gathering steps the means of the clusters are 39 and 58 yrs respectively. To free up space for a next iteration of data gathering from the database, some of the data in the structure RS is summarized and stored in one of the two data structures CS or DS. (FIGS. 6A, 6B) To define which data points can be safely summarized or compressed, the invention sets up a Bonferroni confidence interval (CI) which defines a multidimensional "box" whose center is the current mean for the K Gaussians defined in the MODEL (FIG. 6D). In one dimension this confidence interval is a span of data both above and below a cluster mean. The confidence interval can be interpreted in the following way: one is confident, to a given level, that the mean of a Gaussian will not lie outside of the CI if it was re-calculated over a different sample of data. A detailed discussion of the determination of the Bonferroni confidence interval is found in Appendix A of this application. Returning to the data of Table 1, CaseId 4 has an age value of 57 years and CaseId 5 has an age of 41. These values of the age attribute fall within a one dimensional confidence interval of the means and therefore are compressed into the RS data structures for the two clusters wherein CaseId 4 is associated or belongs to the cluster G2 and CaseId 5 is associated or belongs to the cluster G1. In performing this compression on actual data a vector distance over all dimensions of the data is calculated and a confidence interval box is determined. To identify smaller, "dense" regions on the set of data not compressed into the DS dataset. The following process is adopted. Identify many "candidate" subsets of "dense" data points. Then these "candidates" are filtered to make sure that they satisfy a specified "density" criterion. Two of these so-called subclusters SUB1, SUB2 are shown in FIG. 5. Of the candidates that remain after this filtering procedure, we merge the two nearest candidates. If the resulting merger still satisfies the "density" criterion, we keep merged candidate, otherwise it is discarded. The candidate set is determined by running a classic K-means algorithm on the small number of data points remaining in RS after data points have been compressed into DS. This K-means procedure determines a number of subclusters. The subclusters are then kept and merged if the maximum standard deviation along the coordinate axes is less than a threshold. Consider the subdluster SUB2. This subcluster is characterized by a Gaussian curve that has its own mean and covariance matrix. The mean appears to be about 67 years which is the same as the age attribute for CaseId no. 6. Enough other records fall within the region of this subcluster to warrant summarization of the data from a multiple number of records within this subcluster SUB2. Other records that cannot be classed in one of the Subdlusters are maintained as individual records in RS. This might be the case for CaseId 3 having an age of 18 which is considerably lower than the mean of cluster G! and is not in close proximity to any identified subcluster.. Returning to FIGS. 8A and 8B, the addition of the two datasets CS and DS mean that on subsequent iterations, after each of the single points has been updated a branch 220 is taken to begin the process of updating the New_Model data structure for all the subclusters in CS. A contribution is determined for each of the data points in CS by determining 230 the weight of each subcluster under the old model. First a center vector for the subcluster is determined 230 from the relation center=(1/CS(j).M)*CS(j). SUM. A weight vector has K elements weight(1), weight(2), . . . weight(k) where each element indicates the normalized or fractional assignment of a given subcluster to a cluster. This weight is determined 232 for each cluster and the weight factor is normalized at a step 234 of the procedure. This step is performed by a procedure (referenced later in this application) called UpdateModel(New_Model, SuffStat, vWeight). Note this function is same as procedure UpdateModel introduced above for data points, but this one takes sufficient statistics (CS members, and DS members as its second argument). At a step 236 an outer product is calculated for the relevant vector of the subdluster. An update step 238 for the subcluster of CS loops through all k clusters:
for 1 = 1, . . . ,k
New_Model(1).SUM = New_Model(1).sum + weight(1)*center *
CS(j).M;
New_Model(1).SUMSQ =
New_Model(1).SUMSQ + Weight(1)* OuterTempProd *
CS(j).M;
New_Model(1).M_1 = New_Model(1).M_1 + weight(1) *
CS(j).M;
End for
When the contribution of all the subclusters whose sufficient statistics are contained in CS have been used to update the New_Model, a branch 240 is taken to update the New_Model using the contents of the data structure DS. A center for each of the k entries of DS is determined 250 from the relation center=(1/DS(j).M)*DS(j).SUM. A weight of this `point` is then determined under the Old_Model and the weight is normalized 254. The contributions of each of the subclusters is then added 260 to the sufficient statistics of the New_Model: An exemplary embodiment of the Procedure UpdateModel (New_Model, SuffStat, vWeight) to work with DS members (sufficient stats):
for 1 = 1, . . . ,k
New_Model(1).SUM = New_Model(1).sum + weight(1)*center *
DS(j).M;
New_Model(1).SUMSQ =
New_Model(1).SUMSQ + Weight(1)*OuterTempProd *
DS(j).M;
New_Model(1).M_1 = New_Model(1).M_1 + weight(1) *
DS(j).M;
End for
After the New_Model has been updated at the step 260 for each of the k clusters, the extended EM procedure tests 265 whether a stopping criteria has been met. This test begins with an initialization of two variables CV_dist=0 and mean_dist=0. For each cluster a new co-variance matrix is calculated and a distance from the old mean and the new mean as well as a distance between the new and old covariance matrices is determined. These values are totaled for all the clusters:
For 1 = 1, . . . ,k
[New_Mean, New_CVMatrix] = ConvertSuffStats-
(New_Model(1).SUM,
New_model(1).SUMSQ,
New_Model(1).M_1);
mean_dist = mean_dist + distance(Old_SuffStats-
(1).Mean,New_mean);
CVDist = CV_dist + distance(Old_SuffStats-
(1).CVMatrix, New_CVMatrix);
End for
The stopping criteria determines whether the sum of these two numbers is less than a stopping criteria value: ((1/(2*k))*mean_dist+CV_dist)<stop_tol If the stopping criteria is met the New_Model becomes the Model and the procedure returns 268. Otherwise the New_Model becomes the old model and the procedure branches 270 back to recalculate another New_model from the then existing sufficient statistics in RS, DS, and CS. Three more detailed explanations of alternate procedures for computing the elements of the data set DS are presented below. The Third is our preferred embodiment: First Embodiment of DS Data Set Compression Let Current_Means denote the set {x.sup.1, x.sup.2, . . . ,x.sup.k and Current_Data denote the set {x.sup.1,x.sup.2, . . . ,x.sup.m. Assume that the sets of sufficient statistics DS, CS, RS also keep track of the number of data represented in each. Initially the so called Compress_Set CS is empty. For each cluster l=1, . . . , k: Determine the CI interval L.sup.l,U.sup.l on the mean of the l-th Gaussian, by one of the 2 methods discussed in Appendix A, for instance. For each element in the dataset RS x.sup.i, Let w.sup.i.epsilon.R.sup.k be the vector of probabilistic assignments of data point x.sup.i to the K Gaussians. Compute the perturbed centers x.sup.1,x.sup.2, . . . x.sup.k for perturbed Gaussians by solving: ##EQU2## Let w.sup.i.epsilon.R.sup.k be the vector of probabilistic assignments of data point x.sup.i to the K perturbed Gaussians. If .parallel.w.sup.i -w.sup.i.parallel.<.tau., then place (x.sup.i, w.sup.i) in Discard_Set. Remove x.sup.i from Current_Data RS. Update DS by computing the sufficient statistics from Discard_Set. DS is then computed by determining sufficient statistics of a set of probability distributions "best" fitting the Compress_Set. Second Embodiment of DS Data Set Compression In accordance with a second exemplary embodiment of the present invention a certain percentage of points "nearest" to the Current_Means of the K Gaussians is summarized by a set of sufficient statistics. The percentage may be user-defined or adapted from the current state of the clustering and dynamically changing memory requirements, for instance. The process begins by presenting a Mahalanobis distance and showing how it is related to the probability that a particular data point was generated by one of the K Gaussians in the mixture model. The Mahalanobis distance from a data point x to the mean (center) of a multivariate Gaussian with mean x and covariance matrix S is: D(x,x)=(x-{overscore (x)}).sup.T S.sup.-1 (x-x) We note the multivariate data point x and the mean x are assumed to be column vectors. The superscript `T` denotes transpose and the superscript "-1" applied to the matrix S denotes matrix inversion. The Mahalanobis distance is related to the value of the probability density function p(x), assuming the multivariate Gaussian with mean and covariance matrix specified above: ##EQU3## Here .vertline.S.vertline. denotes the determinant of the covariance matrix S. The expression above states that the larger the Mahalanobis distance of a given data point to a cluster center, the less likely it is that the given cluster generated the data point. A given percentage of the newly sampled data points, determined to be most likely as generated by the K clusters, are compressed. The actual percentage of points to compress, denoted by the parameter p may be user-defined or determined based on the current clustering and memory limitations. Let Current_Means denote the set {x.sup.1, x.sup.2, . . . , x.sup.k and Current_Data be made up of New_Data, denoted as {x.sup.1, x.sup.2, . . . , x.sup.m and Old_Data (we are only concerned with compression of New_Data, Old_Data has been compressed on a prior iteration). Initially Compress_Set=empty For each cluster l=1, . . . , k: Set New_Data(l) to the be the subset of New_Data that is nearest to current mean x.sup.l as measured by the Mahalanobis distance. Set DS(l) to be the set of sufficient statistics representing compression near the current mean x.sup.1. Set CS(l) to be the set of sufficient statistics for all subclusters nearest the current mean x.sup.1 as measured by the Mahalanobis distance. For each element data point x.sup.i,l in New_Data(l), Set D(x.sup.i,l,x.sup.l) to be the Mahalanobis distance from the data point to the current mean x.sup.l. Set r.sup.l to be a threshold on the Mahalanbis distance over each data element in Current_Data(l) so that p percent of the elements of Current_Data(l) have a Mahalanobis distance less than r.sup.l (this operation is easily done by sorting the list of D(x.sup.i,l, x.sup.l) values). If D(x.sup.i,l,x.sup.l)<r.sup.l, then add data point x.sup.i,l to Discard_Set and remove it from Current_Data. Update DS by computing the sufficient statistics of SUM and SUMSQ (FIG. 6A) from the Compress_Set. In general, DS is computed by determining sufficient statistics of a set of probability distributions "best" fitting the Discard_Set. Third Embodiment of DS Data Set Compression In accordance with a third exemplary embodiment of the present invention, a certain percentage of points that are "most likely" under each cluster are moved to the DS set of that cluster. This embodiment is similar to the second, except instead of ranking data by Mahalanobis distance, we rank it by likelihood value. Likelihood is the height of the Gaussian at the data point. The procedure proceeds as follows: 1. For each data point, determine which cluster it is most likely under (which Gaussian has the highest curve at the data point). Call this Temporary Membership. 2. For each cluster, rank its Temporary Members by their likelihood values, then move to DS the top P % of the rank. P is a value specified by the user. Note that step 2 can be equivalently performed by finding the nearest P % of the temporary members of a cluster when the distance measure is the Mahalanobis Distance introduced above. Calculate CS Data Structure Assume that the set RS consists of singleton data points and the points compressed by embodiments 1 and 2 above have been removed from the RS dataset and have been summarized in the DS data set. In the one dimensional case described previously this could include, for example, CaseId 4 and CaseId 5. Let m be the number of singleton data elements left in RS. Set CS_New=empty. Set k' to be number of subcluster candidate to search for. Randomly choose k' elements from RS to use as an initial starting point for classic K-means. Run classic K-means over the data in RS with the initial point. This procedure will determine k' candidate subclusters. Set up a new data structure CS_New to contain the set of sufficient statistics for the k' candidate subclusters determined in this manner. For each set of sufficient statistics in CS_New, if the number of data points represented by these sufficient statistics is below a given threshold, remove the set of sufficient statistics from CS_New and leave the data points generating these sufficient statistics in RS. For each set of sufficient statistics in CS_New remaining, if the maximum standard deviation along any dimension of the corresponding candidate subdluster is greater than a threshold .beta., remove the set of sufficient statistics from CS_New and keep the data points generating these sufficient statistics in RS. Set CS_Temp=CS_New.orgate.CS. Augment the set of previously computed sufficient statistics CS with the new ones surviving the filtering in steps 6 and 7. For each set of sufficient statistics s (corresponding to a sub-cluster) in CS_Temp Determine the s', the set of sufficient statistics in CS_Temp corresponding to the nearest subcluster to the subcluster represented by s. If the subdluster formed by merging s and s', denoted by merge(s, s') is such that the maximum standard deviation along any dimension is less than .beta., then add merge(s, s') to CS_Temp and remove s and s' from CS_Temp. Set CS=CS_Temp. Remove from RS all points that went into CS. (RS=RS-CS.) Note that the function merge (s, s') simply computes the sufficient statistics for the sub-cluster summarizing the points in both s and s'. Stopping Criteria at Step 140 The scalable Expectation Maximization analysis is stopped (rather than suspended) and a resultant model output produced when the test 140 of FIG. 4 indicates the Model is good enough. Two alternate stopping criteria (other than a scan of the entire database) are used. A first stopping criteria defines a probability function p(x) to be the quantity ##EQU4## where x is a data point or vector sampled from the database and 1) the quantity M(1) is the scalar weight for the 1th cluster, (The number of data elements from the database sampled so far represented by cluster 1) 2) N is the total number of data points or vectors sampled thus far, and 3) g(x.vertline.1) is the weighting factor of the data point for the 1th cluster. This weighting factor is determined from the SUM and SUMSQ and M(1) data for a given cluster as outlined previously. Now define a function f(iter) that changes with each iteration. ##EQU5## The summation in the function is over all data points and therefore includes the subclusters in the data structure CS, the summarized data in DS and individual data points in RS. When the values of p(x.sub.i) are calculated, the probablity function of a subcluster is determined by calculating the weighting factor in a manner similar to the calculation at step 232. Similarly the weighting factor for the k elements of DS are calculated in a manner similar to the step 252 in FIG. 8B. Consider two computations during two successive processing loops of the FIG. 4 scalable EM analysis. Designate the calculations for these two iterations as f.sub.z and f.sub.z-1. Then a difference parameter d.sub.z =f.sub.z -f.sub.z-1. Evaluate the maximum difference parameter over the last r iterations and if no difference exceeds a stopping tolerance ST then the first stopping criteria has been satisfied and the model is output. A second stopping criteria is the same as the stopping criteria outlined earlier for the extended EM procedure 120. Each time the Model is updated K cluster means and covariance matrices are determined. The two variables CV_dist and mean_dist are initialized. For each cluster k the newly determined covariance matrix and mean are compared with a previous iteration for these paramaters. A distance from the old mean and the new mean as well as a distance between the the new and old covariance matrices is determined. These values are totaled for all the clusters:
For 1 = 1, . . . ,k
[New_Mean, New_CVMatrix] = ConvertSuffStats(
New_Model(1).SUM, New_model(1).SUMSQ,
New_Model(1).M_1);
mean_dist = mean_dist +
distance(Old_SuffStats(1).Mean,New_mean);
CVDist = CV_dist + distance(Old_SuffStats(1).CVMatrix,
New_CVMatrix);
End for
The stopping criteria determines whether the sum of these two numbers is less than a stopping criteria value: ((1/(2*k))*mean_dist+CV_dist)<stop_tol Multiple Model Embodiment In accordance with an alternate embodiment of the present invention, the process of FIG. 4 is supplemented with a model optimizer. In accordance with this embodiment, and as depicted in FIG. 2, a multiple number of different clustering models S are simultaneously generated by the computer 20. In FIG. 10 one sees that the multiple model embodiment is implemented with an array S of pointers m.sub.1 . . . m.sub.s where each pointer points to a different model data structure. The model data structure is depicted in FIG. 6D. In this embodiment the structure CS and RS are shared by the multiple models. Each of the models m.sub.s is initialized with a different set of centroid vectors (value of `sum`, M=1) for the K different clusters of the model. When data is gathered at the step 110, that data is used to update each of the S models. An extended EM procedure for the multiple model process takes into account the multiple model aspects of the structures DS and CS that is performed on each of the S models. On a first iteration through the FIG. 4 process there is no DS or CS dataset for any of the models so that all data is in the RS dataset. A given data point r.sub.s in the data set RS is compressed into the dataset DS.sub.j for only one of the S models even though it may be close enough to a centroid to compress into a DS dataset for multiple models. The data point r.sub.s is assigned to the set DS of the model whose centroid is closest to the point r.sub.s. DS structures for all the S models are determined by compressing data points into the appropriate DS data structure. The CS data structures for each of the models are then determined from the points remaining in RS. When performing the extended EM procedure 120, however, the CS sufficient statistics must be augmented with the sufficient statistics contained in the DS data structures of the other models. When performing the extended EM procedure to update a given model mj, the subclusters in CS must be augmented with the DS structures from the other models. Specifically, when updating model m.sub.j, the extended EM procedure considers the augmented set CS.sub.j =CS U DS.sub.1 U DS.sub.2 . . . DS.sub.j-1 U DS.sub.j+1 U . . . DS.sub.s when performing the loop 230 of FIG. 7. If a data point is compressed into DS, it enters the DS set of only one model at the step 240, hence there is no double counting of data. The multiple model analysis can be performed until one of the models satisfies the stopping criteria at the step 140. An alternate system would continue to compute all the models until each model reaches a stopping criteria. Additionally, the scalable EM process could be performed until a certain percentage of the models have reached a stopping criteria. The multiple model implementation shares data structures between models and performs calculations on certain data unique to a given model. This analysis is susceptible to parallel processing on a computer 20 having multiple processing units 21. Pseudo-Code for Multiple Model Embodiment The following presents pseudo-code for an exemplary embodiment for the multiple-model, scalable clustering. The previous discussion makes it clear how the various embodiments can be implemented by modifying the pseudo-code below.
MAIN CLUSTERING LOOP:
Sub GeneralizedCluster (MODELs,ReinitEmptySeeds,StopTol,StdTol)
DSs = {}
Open(DataSource)
While (True) Do {
OldMODELs = MODELs
SEM (MODELs,DSs,CS,RS,ReinitEmptySeeds,StopTol,StdTol)
If TotalRowsRead < Length(DataSource) Then
EmThreshold (MODELs,DSs,RS,Confidence)
UpdateCS (CS,RS,StdTol)
End If
While Not StopCriteria (OldMODELs,MODELs,StopTol)
Close(DataSource)
For Each Model In MODELs Do
Save(Model)
Next Model
End Sub
The Algorithm GeneralizedCluster above calls two subroutines; SEM ()
and
EmThreshold (). These are defined below. These two algorithms also
reference other
routines which are listed here:
[vWeight,s] = ComputeEMWeights(Model,Mean)
[void] = UpdateModel(Model,SuffStat,vWeight)
[void] = UpdateModel(Model,DataPoint,vWeight)
[void] = UpdateSeedCache(SeedCache,Mean,RawScore)
[SuffStat] = GetBestCandidate(SeedCache)
[void] = UpdateCs(...)
UpdateCS() routine is described in copending application serial no.
09/040,219 to
Fayyad et al. Which is incorporated herein by reference.
The UpdateModel() functions only differ in the second argument and
have been
explained previously. The procedure UpdateSeedCache() maintains a cache of
candidate
seeds for new clusters if the main loop at some point decides to reset one
of the clusters to
a new starting point (e.g. cluster goes empty). It manages a list of
candidate points.
The procedure GetBestCandidate(SeedCache) returns the next
candidate from the
SeedCache. This criterion can be implemented in a variety of ways, but in
the exemplary
embodiment it returns the point that is assigned the lowest likelihood
given the existing
model (set of clusters). That is the point in SeedCache that has the lowest
sum of heights
of Gaussians from this model. This point represents a point that fits least
well to the
current model.
Having described all the references procedures, we now list the main
algorithms:
Sub SEM (MODELs,DSs,CS,RS,ReinitEmptySeeds,StopTol,StdTol)
For ModelIndex = 1 To Length(MODELS) Do
DoModelAgain:
M = MODELS[ModelIndex]
RestartModel = False
NewSeedCache = {}
For Iteration = 0 to INFINITY Do
For each X in RS Do
[vWeight,s] = ComputeEMWeights(M,X.mean);
UpdateModel (Mnew,X,vWeight);
if ReinitEmptySeeds
UpdateSeedCache (NewSeeds,X,vWeight,s);
Next X
For each X in CS Do
[vWeight,s] = ComputeEMWeights(M,X.mean);
UpdateModel (Mnew,X,vWeight);
if ReinitEmptySeeds
UpdateSeedCache (NewSeeds,X,vWeight,s);
Next X
For each DS in DSs Do
For each X in DS Do
[vWeight,s] = ComputeEMWeights(M,X.mean);
UpdateModel (Mnew,X,vWeight);
Next X
Next DS
If ModelDiff(Mnew,M) < StopTol Then
Break;
End If
If ReinitEmptySeeds Then
For each Cluster in M Do
If Cluster Is Empty
ResetModel = True
End If
Next Cluster
If ResetModel Then
For each Cluster in M Do
If Cluster Is Empty Then
Candidate = GetBestCandidate(NewSeedCache)
Cluster.Mean = Candidate.Mean
Candidate.Remove
End If
Cluster.Covariance = {1,1,1,...}
Next Cluster
End If
End If
M = Mnew
Next Iteration
M = Mnew
// at this point, M has converged but it may contain centroids
// which are duplicates of other centroids. we remove duplicate
// centroids at this point since they simply take up space in the
// model which could be used by viable centroid candidates.
// We separate this check out from other post-convergence
// since we allow for a variety of alternate embodiments
// to implement various ways of handling this case. see code
//below for other conditions which may arise.
If CS Is Not Empty Then
NewSeedCache = {}
For each X in CS Do
[vWeight,s] ComputeEMWeights(M,X.mean)
UpdateSeedCache (NewSeedCache,X,vWeight,s)
Next X
For I = 1 to Length (M) Do
For J = I+1 to Length(M) Do
If Cluster M[I] is indistinguishable from Cluster M[j]
Then
Candidate = GetBestCandidate(NewSeedCache)
M[J].Mean = Candidate.Sum
M[J].Covariance = Candidate.Covariance
M[J].Weight = Candidate.Weight
Candidate.Remove
RestartModel = True
End If
Next J
Next I
End If
If RestartModel Then
Goto DoModelAgain
End If
Next ModelIndex
End Sub
EmThreshold() is a scaleable clustering implementation of the third
embodiment of
thresholding. Note that other embodiment implementations are described
(Mahalanobis
Distance, and Worst-case perturbation), but the pseudocode is restricted to
this
embodiment.
Sub EmThreshold (MODELS,DSs,RS,Confidence)
RSDist = Array of { -1, -1, -1, +INFINITY }
For ModelIndex = 1 To Length(MODELs) Do
M = MODELs[ModelIndex]
For RSIndex = 1 To Length(RS) Do
[vWeight,S] = ComputeEMWeights(M,RS[RSIndex].Mean)
ClusterNum = GetMaxPos(vWeights)
If S < RSDist(ClusterNum).S Then
RSDist[RSIndex].RSIndex = RSIndex
RSDist[RSIndex].ModelIndex = ModelIndex
RSDist[RSIndex].ClusterNum = ClusterNum
RSDist[RSIndex].S = S
End If
Next RSIndex
Next ModelIndex
RSDist = Sort(RSDist,Ascending,By S)
For I = 1 To Floor(Confidence*Length(RS)) Do
RSIndex = RSDist[I].RSIndex
ModelIndex = RSDist[I].ModelIndex
ClusterNum = RSDiSt[I].ClusterNum
DS = DSs[ModelIndex]
DS[ClusterNum].Sum += RS[RSIndex].Mean
DS[ClusterNum].SumSq += RS[RSIndex].Mean*RS[RSIndex].Mean
DS[ClusterNum].Weight ++
Next I
End Sub
Computer System With reference to FIG. 1 an exemplary data processing system for practicing the disclosed data mining engine invention includes a general purpose computing device in the form of a conventional computer 20, including one or more processing units 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that helps to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media, The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may also be used in the exemplary operating environment. A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. While the present invention has been described with a degree of particularity, it is the intent that the invention include all modifications and alterations from the disclosed implementations falling within the spirit or scope of the appended claims. Appendix A The following development of the Bonferroni inequality, used to determine the multidimensional confidence interval on the mean follows from page 12 of [2] G. A. F. Seber. Multivariate Observations. John Wiley & Sons, New York, 1984. A conservative procedure for determining the multidimensional confidence interval on the mean vector of a set of multivariate observations is always available using the Bonferroni inequality: ##EQU6## Where E.sub.i is the complement of E.sub.i. If we use the critical level of .alpha./r for each test, then P(E.sub.j)=1-.alpha./r=>P(E.sub.j)=.alpha./r ##EQU7## Hence, in our application, let E.sub.j be the event that the j-th element of the l-th current mean lies in the between the values L.sub.j.sup.l (lower bound) and U.sub.j.sup.l (upper bound), or specifically, L.sub.j.sup.l.ltoreq.x.sub.j.sup.l.ltoreq.U.sub.j.sup.l. Here x.sub.j.sup.l is the j-th element of the l-th current mean. Here the values of L.sub.j.sup.l and U.sub.j.sup.l define the 100(1-.alpha./r)% confidence interval on x.sub.j.sup.l which is computed as: ##EQU8## N is the number of singleton data points represented by cluster l, including those that have already been compressed in earlier iterations and uncompressed data points. S.sub.j.sup.l is an estimate of the variance of the l-th cluster along dimension j. Let L.sup.l,U.sup.l.epsilon.R.sup.n be the vectors of lower and upper bounds on the mean of cluster l. The invention assigns data points to Gaussians in a probabalistic fashion. Two different techniques are proposed for determining the integer N, the number of singleton data points over which the Gaussian mean is computed. The first way is motivated by the EM Gaussian center update formula which is computed over all of the data processed so far (whether it has been compressed or not), hence in the first variant of the Bonferroni CI computation we take N to be the number of data elements processed by the Scalable EM algorithm so far. The second variant is motivated by the fact that although the EM Gaussian center update is over all data points, each data point is assigned probabilistically to a given Gaussian in the mixture model, hence in the second variant of the Bonferroni computations we take N to be the rounded integer of the sum of the probabilistic assignments over all data points processed so far. The Bonferroni CI formulation assumes that the Gaussian centers, computed over multiple data samples of size N are computed as ##EQU9## This is true for the classic K-means algorithm, but is only guaranteed to be true for the first iteration of the EM algorithm. Hence a distribution other than the t distribution may better fit the assumptions on the distribution of the Gaussian center as computed by the EM algorithm. This would result in a different formula for the Bonferroni CI. After determining the confidence intervals on the K Gaussian means L.sup.l,U.sup.l.epsilon.R.sup.n,l=1, . . . , k, one technique perturbs the means so that the resulting situation is a "worst case scenario" for a given singleton data element. Assuming that the data point is x.sup.i, we propose solving the following optimization problem for determining the perturbed cluster means and corresponding probabilistic assignment of data point x.sup.i to the K perturbed Gaussians: ##EQU10## Here f(x.sup.l) is the probabilistic assignment of data point x.sup.i to the Gaussian centered at x.sup.l, more specifically: ##EQU11## The perturbation becomes a more general optimization problem and the procedure used in the K-mean case is a special case of the the solution of this problem when 0/1 assignments are made between points and clusters.
|
Same subclass Same class Consider this |
||||||||||
