Scalable system for K-means clustering of large databases6012058Abstract In one exemplary embodiment the invention provides a data mining system for use in evaluating data in a database. Before the data evaulation begins a choice is made of a cluster number K for use in categorizing the data in the database into K different clusters and initial guesses at the means, or centriods, of each cluster are provided. Then a portion of the data in the database is read from a storage medium and brought into a rapid access memory. Data contained in the data portion is used to update the original guesses at the centroids of each of the K clusters. Some of the data belonging to a cluster is summarized or compressed and stored as a summarization of the data. More data is accessed from the database and assigned to a cluster. An updated mean for the clusters is determined from the summarized data and the newly acquired data. A stopping criteria is evaluated to determine if further data should be accessed from the database. If further data is needed to characterize the clusters, more data is gathered from the database and used in combination with already compressed data until the stopping criteria has been met. Claims We claim: Description FIELD OF THE INVENTION
TABLE 1
______________________________________
CaseID AGE INCOME CHILDREN
CARS
______________________________________
1 30 40 2 2
2 26 21 0 1
3 18 16 0 1
4 45 71 3 2
5 41 73 2 3
6 67 82 6 3
7 75 62 4 1
8 21 23 1 1
9 45 51 3 2
10 28 19 0 0
______________________________________
Table 2 below tabulates mean values chosen for a starting point assuming K=3 for performing the scalable K-means clustering analysis on the data of table 1.
TABLE 2
______________________________________
Cluster # AGE INCOME CHILDREN
CARS
______________________________________
1 55 50 2.5 2
2 30 38 1.5 2
3 20 24 1 1
______________________________________
An important concept of the present invention is the summarization or compression of data points of type SDATA contained in the dataset RS (TABLE 1) sampled from the database into the two data structures DS, CS to allow more sampling of data from the database. During each processing iteration of the FIG. 4 flowchart the scalable K-means analysis calls an `extended K-means` procedure 120 that utilizes the compressed data as well as the remaining data samples contained in the dataset RS. On a first iteration through the FIG. 4 process the set DS (FIG. 6A) is empty. Updates to the set DS are performed at the step 130 for each cluster of the range, i=1, . . . K. For each cluster i the processor determines which singleton data elements (elements of the set RS of type SDATA), assigned to cluster i will nor change cluster membership over future data samples. These points will be used to augment the i-th element of the set DS which contains the sufficient statistics summarizing these singleton points. These points are removed from the set RS and used to update the sufficient statistics for the i-th cluster of the set DS. Two conceptual data structures help explain a first embodiment of the method of updating the set DS. This first embodiment is referred to as the Worst Case Analysis method. These conceptual data structures hold the upper and lower bounds defining an n-dimensional (n=# attributes) confidence interval (CI) on the means or centroids of the K clusters computed so far. A list structure designated LOWER is a vector of k elements (one for each cluster) where each element points to a vector of n elements (floats) holding the lower bounds for each attribute of the CI on the mean of the corresponding cluster. For example LOWER(3). LowVec(2) is the value of the lower bound on the CI for the third cluster along dimension 2. A second structure designated UPPER is a vector of k elements (one for each cluster) where each element points to a vector of n elements (floats) holding the upper bounds for the CI on the mean or centroid of the corresponding cluster. Singleton Points (Elements of RS) not changing cluster assignment when the K cluster centers are perturbed, within their respective confidence intervals in a worst-case fashion, can be summarized by adding them to the set DS and removing them from RS. Appendix A is a summarization of the Worst Case Analysis that defines LOWER and UPPER as well as the process of updating of the set DS using the Worst Case Analysis. A second embodiment of the process of updating the dataset DS is referred to as the Threshold Analysis. For this second embodiment a data structure is used that helps sort the singleton elements of RS (of type SDATA) by their Mahalanobis distance (See Duda and Hart, Pattern Classification and Scene Analysis referenced above) to a given cluster center. A structure RSDist is a list of r elements (r is the number of singleton data elements in the set RS) where each element in RSDist points to 2 objects: 1) float called "MahalDist" which holds the Mahalanobis distance of the corresponding element in RS to the nearest cluster center and 2) an integer indicating the cluster whose center is nearest to the given point in RS, called "ClustAssign". Appendix C summarizes the calculation of the Mahalanobis distances. A third embodiment for updating the dataset DS is based upon the distance of a data point from the then current mean. If this distance for a point rs contained in RS is less than a specified distance, then the point is removed from RS and added to DS by updating the sufficient statistics of DS to include the rs point. FIG. 5 depicts a clustering in one dimensional, for example, income data from the table 1 list of data. Three clusters K1, K2, K3 are made up of much larger numbers of data vectors SDATA. The data structure DS that summarizes data for the cluster designated K1 is centered within the generally gaussian shaped curve depicting that cluster. Regardless of the technique used, certain data safely `belongs` to the cluster K1 so it is safe to compress or summarize the data in this range in the form of the sufficient statistics contained in the DS (discard region) data structure. Subclusters in dataset CS After the compression of data into the DS data structure, there is still a fairly large amount of data (SDATA) contained to the left and the right of the centroid of K1 in FIG. 5 that neither the worst case analysis (Appendix A) nor the threshold analysis (Appendix B) identifies for compression into the set DS. These point fall within the `compression region` for cluster 1. The present invention also summarizes at least some of this data in the form of a dataset designated CS. An exemplary process for determining the CS dataset is summarized in the psuedocode of Appendix C. Briefly, a dataset RSNew is made up of the dataset RS after removing the set DS from the original set RS. The process of determining the new CS data set begins by merging and removing any singleton points from RSNew into CS which can safely be merged without violating a specified "density" criteria. For each data point in RSNew, the appendix C procedure finds the CS subcluster which is closest to it. If the data point can be merged into CS without violating the specified "density criteria" of the CS subcluster, then the data point is merged into that particular CS subcluster and removed from RSNew. If it doesn't satisfy the criteria, then the data point is left in RSNew. The process of determining the data set CS then continues by finding a set of "dense" subclusters within the set RSNew. This is done by performing a traditional K-means analysis on the data in the set RSNew using a cluster number K' (Kprime) greater than K, the number of clusters used in performing the scalable K-means analysis. The set of sufficient statistics (Sum, Sumsq, and M) for the K' subclusters found by this procedure are appended to the current set of sufficient statistics in the dataset CS. Hence CS is augmented by K' elements to produce a number c of subclusters. This augmented list is then filtered and elements are merged (if possible), reducing the size of the list. If the number of data points (M in the data structure CS) is less than a threshold value, (MinPoints in Appendix C) the data from this analysis is not clustered and the data is instead kept in the set RSNew. Furthermore only dense clusters are kept in CS. For each subcluster remaining after the threshold value of points criteria has been checked, if the maxiumum measure of spread computed from (SUMSQ) along any of the n dimensions (attributes) of the candidate subcluster is less than a threshold (StdTol in appendix C) the data from that subcluster is also left in RSNew and not summarized in CS. These two criteria remove the elements of CS corresponding to subclusters which are either too small (in terms of number of points) or too `spread out`. A final step is the combining of subclusters using hierarchical agglomerative clustering. An attempt is made to merge two elements of the dataset CS. If the larger, merged subcluster still satisfies the `spread` criteria discussed above, these two subclusters are removed from the set CS, and the larger subcluster representing these two smaller subclusters is added to the dataset CS. This process continues until the merging process produces no larger subclusters that still satisfy the "dense" criteria. The data structure CS contains c elements that are then used in the K-means analysis. Extended K-means Procedure An extended K-means procedure 120 includes looping constructs for updating the current model that are summarized in the flow chart of FIG. 7. Certain utility functions are needed to perform this extended K-means procedure. A function ModelCopy(ModelNew, ModelOrig ) copies the model ModelOrig into ModelNew. A function Length(DataStructure) returns the length of the pointer array for the data structures of FIG. 6 so that for example, Length(CS)=c and Length(RS)=r. Zero(Model ) takes the data structure for the model in FIG. 6D and sets all elements to 0.0. A function Distance2Norm(point1, point2) measures the distance between two vectors point1 and point2. The extended K-means procedure of FIG. 7 begins with a step 200 of copying the existing model into an old model data structure. The process next determines 202 the length of the RS, DS, and CS data structures of FIGS. 6A-6C and returns the values k, c, and r. The data structure NewModel is then zeroed or initialized 204. The process updates the NewModel until a test 206 indicates a stopping criteria has been met. If the stopping criteria has not been met, the process saves 208 the new model in the old model data structure and branches 210 to zero a next subsequent new model. The test 206 is similar to the test 140 described in conjunction with the scalable K-means process overview. After initialization of a new model, that model is updated in a loop 220 that updates the model for each of the r vectors in the dataset RS. The loop gathers data 222 a point at a time from the set RS and determines 224 what cluster to assign that data point to. This is done by finding the distance from the data point to the mean of each of the then existing means of the old model. By reference to FIG. 6D it is seen that the model includes the Sum for a given cluster K and therfore the mean or centroid of each dimension is given by this value divided by the scalar M for the cluster. Once the closest cluster is found, the New model Sum and SumSq component for that closest cluster is updated by adding the data point to the vector Cluster(closest).sum and then squaring the components and adding them to the Cluster(closest).SumSq vector. The scalar M for a cluster is incremented by one for each point added to that cluster. Once the loop over the r vectors is completed, the procedure updates the model based on the compressed statistics stored in the c subclusters found in the data structure CS. On an initial loop through the FIG. 4 scalable K-means process there are no CS or DS structures. Table 3 below indicates the contents of the model (FIG. 6D) after the RS portion of the extended K-means process on data points of Table 1.
TABLE 3
______________________________________
SUM
Cluster #
AGE INCOME CHILDREN CARS M
______________________________________
1 228 288 15 9 4
2 75 91 5 4 2
3 93 79 1 3 4
______________________________________
SUMSQ
Cluster # AGE INCOME CHILDREN
CARS
______________________________________
1 51984 8244 225 81
2 5625 8281 25 16
3 8649 6241 1 9
______________________________________
The ten records of TABLE 1 will fit in memory, and a conventional K-means analysis is possible. For a large data base containing millions of records, the ten records constitute only a part of one data gathering step. Table 4 below lists a K-means clustering of data performed on the ten records with K=3.
TABLE 4
______________________________________
Cluster # AGE INCOME CHILDREN
CARS
______________________________________
1 57 72 3.75 2.25
2 37.5 45.5 2.5 2
3 23.25 19.75 0.25 0.75
______________________________________
To free up computer memory for gathering more of the millions of records some of the ten records shown in Table 1 are candidates to be compressed into the data structures CS, DS. The cluster averages for the income attribute of the ten records are labeled in the FIG. 5 depiction. Record number 10 has an income of `19` and for this one dimension falls safely within the DS (discard region) centered around the cluster K1 in FIG. 5. Visualizing the situation over many attributes becomes more difficult but the techniques summarized in the appendices deal with vectors and identify records within RS for compression. Record number 8 in Table 1 has an income of `23`. Assume this record does not fall within the DS region and therefore becomes a candidate for inclusion in the CS (compress) dataset. Note, the cluster mean for the second cluster K2 is at an income of 45.5 k dollars. Data falling between the two means of 19.75 and 45.5 typically will not be classed in either the DS or the CS dataset. It is retained in RS and used on the next iteration to perform the clustering. After the initial iteration of the FIG. 4 process, the CS and DS structures contain sufficient statistics, and the extended K-means procedure of FIG. 7 must take into account this data in determining the new model when the procedure is again called at the step 120 of the scalable K-means analysis. To update the model based on the sufficient statistics contained in the dataset CS the FIG. 7 procedure executes a loop 230 over each of the subclusters c in CS and determines which of the K clusters in the Model (FIG. 6D) that subcluster is closest to. Assume subcluster p is closest to cluster q. When this fact is discovered the sufficient statistics of cluster q are updated by adding the contents subcluster(p).sum to cluster(q).sum and the statistics subcluster(p).sumsq to cluster(q).sumsq. Additionally, the value of subcluster(p).M for the subcluster is added to the to the value cluster(q).M. At a step 240 the extended K-means procedure updates the NewModel for the clusters summarized in DS. There is no need to search for the cluster nearest the clusters in DS since the elements of DS will always (or are assumed to be always) assigned to the same cluster. The step 240 merely loops over the clusters in DS and adds their sufficient statistics to the new model of FIG. 6D. (NewModel(l).Sum+=DS(l). Sum, NewModel(l).SumSq+=DS(l).SumSq and NewModel(l).M+=DS(l).M) Once the contributions of CS and DS are added the stopped criteria is checked 206 to see if the procedure has converged to a solution. In one exemplary embodiment a variable CenterDist is set to zero and for each of the clusters K, a distance between the centroid of the old model and the centroid of the new model is determined and added to the CenterDist variable. Once all K distances have been calculated, and added together the CenterDist value is divided by the number of clusters K and compared to a value `StopTol` which is used as a measure of how stable the model has become. If the value of CenterDist is smaller than the value `StopTol` then the procedure returns, otherwise the procedure branches back to recalculate the model using the same data in RS, CS, and DS but with a different "old model". Stopping Criteria at the Step 140 Each time the procedure 120 returns, the RS, DS and CS data struactures are updated and the test of the stopping criteria 140 is performed Three alternative stopping criteria are proposed for use in the scalable K-mean procedure. (methods 1 and 2 of the Appendix D pseudocode summarize two of these criteria) A first method terminates the analysis if the difference between the K-means, measured in some norm, over a given number of data samples is below a given tolerance. A second method terminates if the difference in an "energy" function (measure of distortion) minimized by the k-mean analysis falls below a given tolerance over a fixed number of data samples. A third terminates if the number of data samples from the database is exhausted. A fourth stopping criteria is actually a suspension of the scalable K-means analysis rather than stopping. Note that if storage permits, the most general storage scheme would keep in main memory the last z models, hence easily allowing the plug-in of either stopping criteria 1 or 2 by easily computing either PrevModelDiff (in the case that the first stopping criteria is chosen, (see appendix D) from these z models or by computing PrevEnergyDiff (in the case that the second stopping criteria is chosen, see appendix D). As seen in FIG. 1 the computer 20 includes a monitor 47 for displaying a user interface. A suitable interface for monitoring the clustering analysis of FIG. 4 includes a reference to the amount of data as a percentage of the entire database 10 that has been used in defining the model shown in FIG. 6D. This interface allows the user to activate a button to suspend the operation of updating the model as well as adjusting the stopping criteria (Appendix D). The ability to suspend allows the database to be updated and then clustering can be resumed without resorting to a completely new analysis. This ability is particularly advantageous when clustering large databases where obtaining even a part of the data can take significant time. User Interface FIGS. 9-13 illustrate user interface screens that are depicted on a monitor 47 as data is clustered. Turning to FIG. 9, this screen 300 illustrates a clustering process as that clustering takes place. A progress bar 302 indicates what portion of the entire database has been clustered and a text box 304 above the progress bar 302 indicates how many records have been evaluated. In a center portion of the screen 300 two graphs 310, 312 illustrate clustering parameters as a function of iteration number and cluster ID respectively. The first graph 310 illustrates progress of the clustering in terms of iteration number which is displayed in the text box 314. The iteration number refers to the number of data gathering steps that have occurred since clustering was begun. In the FIG. 9 depiction an energy value for the clustering is calculated as defined in Appendix D method 2. As the clustering continues the energy decreases until a stopping criteria has been satisfied. In the graph 310 of FIG. 9 sixteen iterations are depicted. The second graph 312 at the bottom of the screen is a graph of clustering parameters as a function of cluster number. In the depiction shown there are ten clusters (shown in the text box 316) and the minimum covariance for each of these ten clusters is shown. Covariance is defined from the model data (FIG. 6D) for a given cluster and a given dimension by the relation: SumSq/M-Sum*Sum/M.sup.2 A plot of minimum covariance is therefore a plot of the dimension (1 . . . n) for a given cluster model having the least or minimum covariance. A drop down list box 320 allows the user to select other indications of covariance. By selecting a maximum for this parameter, a depiction of the dimension of the model having the maximum covariance (FIG. 10) for each of the ten clusters is shown in the bar graph 312. An average covariance bar graph (FIG. 12) indicates the average of the covariance for each cluster over all cluster dimensions. A different user selection via the drop down list box 320 (FIG. 13) shows the weight M for the ten clusters. In a similar manner, a dropdown list box 322 allows different cluster parameters such as model difference (Appendix D, method 1) to be plotted on the graph 310. A row 326 of command buttons at the bottom of the screen allow a user to control a clustering process. A parameter screen button 330 allows the user to view a variety of clustering parameters on a screen (not shown). By accessing this screen the user can determine for example a maximum number of records or tuples that can be brought into memory to the data set RS in a given iteration. As an example, the user could indicate that this maximum value is 10,000 records. As outlined above, as the clustering process is performed data is summarized in DS, CS, and stored in RS. If a number of 10,000 records is chosen as the maximum, the system limits the number of new data that can be read based upon the number of subclusters in the data set CS. Designate the number as ROWSMAX, then the amount of data records that can be currently stored in RS (Rscurrent) is ROWSMAX-2*c where c is the number of subclusters in CS. A progress bar 332 indicates a proportion of data buffers that are currently being used to store RS and CS datasets. This is also depicted as a percentage in a text box 334. Current data regarding subclustering into the dataset CS is depicted in a panel 340 of the screen. Text boxes 342, 344, 346, 348 in this panel indicate a number of subclusters c, and average, minimum and maximum variances for the subclusters using the above definition of variance. A last text box 350 indicates the average size of the subclusters in terms of data points or tuples in the subclusters. Additional command buttons allow the user to interact with the clustering as it occurs. A stop button 360 stops the clustering and stores the results to disk. A continue button 362 allows the process to be suspended and resumed by activating a resume button 364. A generate batch button 366 allows the user to generate a clustering batch file which can be executed as a separate process. Finally a close button 368 closes this window without stopping the clustering. Multiple Model Embodiment In accordance with an alternate embodiment of the present invention, the process of FIG. 4 is supplemented with a model optimizer. In accordance with this embodiment, a multiple number of different clustering models S are simultaneously generated by the computer 20. Turning to FIG. 8 one sees the data structure of this implementation. There is an array S pointers m.sub.1 . . . m.sub.s where each pointer points to a different model data structure. The model data structure is depicted in FIG. 6D. In this embodiment the structure CS and RS are shared by the multiple models. Each of the models m.sub.s is initialized with a different set of centroid vectors (value of `sum`, M=1) for the K different clusters of the model. When data is gathered at the step 110, that data is used to update each of the S models. An extended K-means procedure for the multiple model process takes into account the multiple model aspects of the structures DS and CS is perfrormed on each of the S models. On a first iteration through the FIG. 4 process there is no DS or CS dataset for any of the models so that all data is in the RS dataset. A given data point r.sub.s in the data set RS is compressed into the dataset DS.sub.j for only one of the S models even though it may be close enough to a centroid to compress into a DS dataset for multiple models. The data point r.sub.s is assigned to the set DS of the model whose centroid is closest to the point r.sub.s. DS structures for all the S models are determined by compressing data points into the appropriate DS data structure. The CS data structures for each of the models are then determined from the points remaining in RS. When performing the extended K-means procedure 120, however, the CS sufficient statistics must be augmented with the sufficient statistics contained in the DS data structures of the other models. When performing the extended K-means procedure to update a given model m.sub.j, the subclusters in CS must be augmented with the DS structures from the other models. Specifically, when updating model m.sub.j, the extended K-means procedure considers the augmented set CS.sub.j =CS U (union) DS.sub.1 U DS.sub.2 . . . DS.sub.j-1 U DS.sub.j+1 U . . . DS.sub.s when performing the loop 230 of FIG. 7. If a data point is compressed into DS, it enters the DS set of only one model at the step 240, hence there is no double counting of data. The multiple model analysis can be performed until one of the models satisfies the stopping criteria at the step 140. An alternate system would continue to compute all the models until each model reaches a stopping criteria. Additionally, the scalable K-means process could be performed until a certain percentage of the models have reached a stopping criteria. The multiple model implementation shares data structures between models and performs calculations on certain data unique to a given model. This analysis is susceptible to parallel processing on a computer 20 having multiple processing units 21. Computer System With reference to FIG. 1 an exemplary data processing system for practicing the disclosed data mining engine invention includes a general purpose computing device in the form of a conventional computer 20, including one or more processing units 21, a system memory 22, and a system bus 23 that couples various system components including the system memory to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that helps to transfer information between elements within the computer 20, such as during start-up, is stored in ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk, not shown, a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the computer 20. Although the exemplary environment described herein employs a hard disk, a removable magnetic disk 29 and a removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROM), and the like, may also be used in the exemplary operating environment. A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the computer 20 through input devices such as a keyboard 40 and pointing device 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers. The computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet. When used in a LAN networking environment, the computer 20 is connected to the local network 51 through a network interface or adapter 53. When used in a WAN networking environment, the computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. While the present invention has been described with a degree of particularity, it is the intent that the invention include all modifications and alterations from the disclosed implementations falling within the spirit or scope of the appended claims. The following APPENDICES describe components of the scalable K-means analysis
__________________________________________________________________________
Appendix A - Worst Case Analysis
Assume that the following functions are available:
[tValue]= ComputeT( alpha, DegreesFreedom): computes the t-statistics
for the
given value of alpha (1-ConfidenceLevel, in our case) and
DegreesFreedom (# of
points in a cluster -1 in our case).
DSCopy( DSNew, DSOrig): copies DSOrig into DSNew, DSNew is altered,
DSOrig
remains the same.
[WeightVec]= ComputeWeightVec( DataPoint, Model): for K-Means, returns
the
{0, 1} weight vector indicating the cluster to which DataPoint is
assigned.
AddRS( DataPoint, RS): appends singleton DataPoint to the end of RS, RS
is
increased by 1 point.
The following functions determine the new sets DS and RS from the old
sets, the current
Model, and given ConfidenceLevel which is a number between 0 and 1.
[DSNew, RSNew] = WorstCaseAnalysis( DS, RS, Model, Confidence Level)
RSNew = empty;
k = Length( Model);
r = Length(RS);
// first determine values for LOWER and UPPER
alpha = 1-ConfidenceLevel;
for1 = 1, . . ., k
{
Mean = Model(1).GetMean( );
CVDiag = Model(1).GetCVDiag( );
TValue = ComputeT( (alpha/n), Model(1).GetNum( )-1 ); // correct t-value
For j = 1, . . ., n
{
LOWER(1).LowVec(j) = Mean(j) - (TValue)*
sqrt( CVDiag(j)/(Mode1(1).GetNum( )));
UPPER(1).UpVec(j) = Mean(j) + (TValue)*
Sqrt( CVDiag(j)/(Model(1).GetNum)));
}
}
// Copy DS into DSNew
DSCopy( DSNew, DS);
// for each singleton element in r, perform the worst-case "jiggle" and
determine
// if the cluster assignment for this data point changes, if so keep in
RS, if not, put
// inDSNew
for j = 1, . . ., r
{
DataPoint = RS(j).RSVec;
// Determine the cluster to which this data point is assigned
[TrueWeightVec] = ComputeWeightVec( DataPoint, Model);
// Zero out the perturbed model
Zero( PerturbModel);
// determine the perturbed model for this data point
for 1 = 1, . . . , k
{
Mean = Model(1).GetMean( );
If ( TrueWeighVec(1) == 1.0) // DataPoint is assigned to cluster 1
{
// the pertubed model center is as far away from Datapoint
// as possible
for h = 1, . . ., n
{
if ( Datapoint(h) >= Mean(h))
{
PerturbModel(1).Center(h) =
LOWER(1).LowVec(h);
}
else
{
PerturbModel(1).Center(h) =
UPPER(1).UpVec(h);
}
}
}
else
{
// the data point is not assigned to model 1, move the
// perturbed
// center as close to data point as possible
for h = 1, . . . , n
{
case ( DataPoint(h) >= UPPER(1).UpVec(h))
{
PerturbModel(1).Center(h) =
UPPER(1).UpVec(h);
}
case (Datapoint(h) <= LOWER(1).LowVec(h))
{
PerturbModel(1).Center(h) =
LOWER(1).LowVec(h);
}
otherwise
{
PertubModel(1).Center(h) = Datapoint(h);
}
}
}
}
// at this point the perturbed model has been determined for the given
data
// point
// determine the assignement of this point under the perturbed model
[PerturbWeightVec] = ComputeWeightVec( Datapoint, PerturbModel);
// determine if the assignments are the same. If so, update the correct
// DSNew
// component. If not, put the point in RSNew
for 1 = 1, . . ., k
{
if ( TrueWeightVec(1) == 1.0) and (PerturbWeightVec(1) == 1.0)
{
DSNew(1).Sum += DataPoint;
DSNew(1).Num ++;
}
if ((TrueWeightVec(1) == 1.0) and (PerturbWeightVec(1) == 0.0))
or ((TrueWeightVec(1) == 0.0) and
(PerturbeWeightVec == 1.0))
{
AddRS( Datapoint, RSNew);
}
}
}
return DSNew, RSNew;
}
Appendix B-Mahalanobis Threshold Analysis
Assume that the following functions are available:
[WeightVec] = ComputeWeightVec( DataPoint, Model): in the k-Mean case,
returns
the {0, 1} weight vector indicating the cluster to which Datapoint is
assigned
AddRS( Datapoint, RS): appends singleton Datapoint to the end of RS, RS
is
increased by 1 point.
RSDistSort( RSDist): sorts the list RSDistSort from smallest to
greatest by the values
in MahalDist.
[Dist] = DistanceMahalanobis( DataPoint, Center, CVDiagonal): computes
the
Mahalanobis distance from Datapoint to Center with the given covariance
diagonal
CVDiagonal.
[integer] = Floor(float): returns the integer obtained by rounding the
float down to the
nearest integer.
DSCopy(DSNew, DSOrig): copies DSOrig into DSNew, DSNew is altered,
DSOrig
remains the same.
The method:
[DSNew, RSNew] = Threshold( DS, RS, Model, Percentage)
{
// Percentage is the percentage of points to compress with this function
DSCopy( DSNew, DS);
RSNew = empty;
Initialize(RSDist); // initialize the RSDist structure
k = Length( Model);
r = Length( RS);
// Fill the fields of the RSDist structure
For j = 1, . . ., r
{
// DataPoint = RS(j).RSVec;
[WeightVec] = ComputeWeightVec( DataPoint, Model);
for 1 = 1, . . ., k
{
if ( WeightVec(1) == 1.0)
{
// DataPoint is assigned to cluster 1
RSDist(j).RSIndex = j;
RSDist(j).ClustAssign = l;
RSDist(j).MahalDist = DistanceMahalanobis( DataPoint,
Model(1).GetMean( ), Model(1).GetCVDiag( ));
}
}
}
RSDistSort( RSDist ); // do the sorting
// determine the number of points to compress
CompressNum = Floor( r*Percentage);
For j = 1, . . ., r
{
DataPoint = RS(RSDist(j).RSIndex).RSVec;
if (j <= CompressNum)
{
DSNew( RSDis(j).ClustAssign ).Sum += DataPoint;
DSNew(RSDist(j).ClustAssign)SumSq
+= DataPoint*DataPoint;
DSNew( RSDist(j).ClustAssign ).Num ++;
}
else
{
AddRS( DataPoint, RSNew);
}
}
return DSNew, RSNew;
}
Appendix C SubCluster data set CS
We assume that the following functions are availale:
[Model] = VanillaKMean( Model, Data, StopTol): takes initial values for
Model
and updates the model with the Data until the model ceases to changes,
within
StopTol. The updated model is returned in Model.
[CSMerged] = MergeCS( CSElem1, CSElem2): take the sufficient statistics
for 2
sub-clusters, CS1 and CS2, and computes the sufficient statistics for
the sub-cluster
formed by merging CS1 and CS2.
AddRS( DataPoint, RS): appends singleton DataPoint to the end of RS, RS
is
increased by 1 point.
[SubModel] = RandomSample( RSNew, kPrime): randomly chooses kPrime
elements from RSNew and uses these as the initial points for the
vanilla k-mean
algorithm. The elements are stored in SubModel.
[WeightVec] = ComputeWeightVec( DataPoint, Model): computes the {0, 1}
weight
vector with k elements. DataPoint is assigned to cluster j if the j-th
element of the
WeightVec is 1, the other elements are 0.
Append( integer, integerList): appends the integer to the end of
integerList.
RemoveCSCandidates( IndexList, CS): removes the elements specified in
IndexList
from CS and returns the altered, smaller list in CS.
[BigCSList] = AppendCS( CSList1, CSList2): creates the BigCSList by
appending
CSList2 to the end of CSList1.
[SubcPartner, SubcPartnerInd]= FindMergePartner( Index, CS): finds the
element
(subcluster) in CS that has center nearest to CS(Index) (a different
subcluster) and
returns this element in SubcPartner and its index.
10.
AppendCSEnd( CS, CSElement): appends CSElement to the end of the CS
list
[CSIndex]= FindCSMatch(DataPoint, CS): finds the cluster in CS to which
DataPoint
is closest and, if DataPoint was merged into that cluster, the
"density" criteria would still
be satisfied. If no such cluster exists, then this routine returns
NULL.
[CSMergedElem] = MergeCS(DataPoint,CSElem): merges a singleton data
point into
a CS cluster and returns the merged Cluster.
The subclustering method:
[CSNew, RSNewer] = CSUpdate( CS, RSNew, StdTol, PointsPerSubClust,
MinPoints,
StopTol)
{
// STDTol is a scalar which defines the "dense" criteria discussed above
// a subcluster is deemed "dense" if the square root of the maximum
element of
// its covariance matrix is less than StdTol
// PointPerSubClust is an integer which used to determine the number of
// secondary
// subclusters to search for in RSNew. The number of sub-cluster to
search for is
// (# of points in RSNew)/(PointsPerSubClust)
// Minpoints is an integer specialing the minimum number of points
allowed in a
// subcluster. If a sub-cluster does not have have this number of points,
the points
// remain as singletons are are placed in RSNewer.
// StopTol is a tolerance for the vanilla k-mean algorithm
// prefiltering
// filter as many points from rs into cs as possible now
r = length(RSNew);
for i = 1, . . ., r
{
DataPoint = RSNew(i).RSVec;
[WeightVec]= computeWeightVec(DataPoint, CS);
for j = 1, . . ., CS.NumElems
{
if (WeightVec(j) == 1.0) {
if (OkToMerge(DataPoint,CS(j))){
CS(j) = Merge(DataPoint, CS(j));
RSNew.Remove(i);
}
}
}
}
CSCopy( CSNew, CS); // copy CS into CSNew
RSNewer = empty;
// determine the number of singleton data elements
r = length( RSNew);
// kPrime = the number of "dense" regions in RSNew to search for
kPrime = floor( r/ PointsPerSubClust);
// choose the starting point for the vanilla k-Mean algorithm as kPrime
random
// elements
|
Same subclass Same class Consider this |
||||||||||
